Expand your knowledge of hardware, software and supercomputing

Tech Support Advisory: Yum updates fail from slurm package conflicts

When performing a yum update or dnf update on your system, the update may fail with messages about conflicts between Slurm packages. This is caused by the addition of new Slurm packages in upstream repos that collide with custom packages installed by ACT. The errors may look like some of the following: Transaction check error: […]

Fixing Firewall Zones In CentOS 7.5

As of CentOS 7.5, the use of ZONE=<zone> no longer works in /etc/sysconfig/network-scripts/ifcfg-* files. The most notable side-effect of this is that all nodes that accessed the Internet through the head node will no longer be able to do so until this is remedied. The new way of setting up zones in the firewall is […]

Upgrading Firmware when Adding InfiniBand to an Existing Fabric

A customer recently asked, “When adding a new InfiniBand switch to an existing fabric, should the firmware on the existing switches be upgraded to the version of the firmware on the new switch before connecting the new switch?” It is not required for all switches in an InfiniBand network to have matching firmware.  Since adding […]

Updating firmware on your ACT Intel system

ACT’s servers based on Intel chassis can now be updated more easily than before. We provide a package in our YUM repository that includes firmware updates and scripts to apply them. Here is how to do it. Make sure you have the ACT repo enabled. Run yum repolist and look for a repo named “ACT […]

Finding the serial number of your ACT system

When contacting support, it’s best to find the serial number of your ACT system and have it handy when you open a ticket via the website, via email, or when calling in.  Providing the serial number allows us to quickly look up your system, and see the configuration of your system which is often relevant […]

Check the status of an LSI raid card battery backup unit

Checking the status of your RAID card’s battery backup unit (BBU) is simple with the following MegaCli command: $ MegaCli64 -AdpBbuCmd -a<adapter#/ALL> In the following example, we have a single controller present and pass the -a0 argument to select it. [root@localhost ~]# MegaCli64 -AdpBbuCmd -a0 BBU status for Adapter: […]
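As a minimal sketch of the syntax above, the shell function below prints the BBU status command for a given adapter instead of running it, so the command can be checked before use. The function name bbu_status_cmd is a hypothetical helper, not part of MegaCli; it uses only the -AdpBbuCmd and -a options shown in the excerpt.

```shell
#!/bin/sh
# Hypothetical helper (not part of MegaCli): prints the BBU status command
# for one adapter number, or for ALL adapters, without executing it.
# Usage: bbu_status_cmd <adapter#|ALL>
bbu_status_cmd() {
    case $1 in
        ALL) ;;
        # Anything empty or containing a non-digit is rejected.
        ''|*[!0-9]*) echo "adapter must be a number or ALL" >&2; return 1 ;;
    esac
    printf 'MegaCli64 -AdpBbuCmd -a%s\n' "$1"
}

# Print the command for the first controller and for all controllers.
bbu_status_cmd 0
bbu_status_cmd ALL
```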

Setup ACT Breakin hardware diagnostics tool as a grub boot option

Breakin is Advanced Clustering Technologies’ stress-test and hardware diagnostics tool. It is extremely useful for detecting errors on your system because it stress tests the hardware at the same time, creating a more realistic test environment. This guide is best used for head nodes and workstations that do not have a built-in […]

Server doesn’t POST – Determining if a DIMM, CPU, or Motherboard is faulty

In this example, we will troubleshoot a server that fully powers on but does not POST. The three most common reasons why a server will not POST are a bad DIMM, a bad CPU, or a bad motherboard. The main objective is to start with a minimum number of components in the server, […]

What is a kernel panic?

A kernel panic is a message displayed by the Linux kernel upon detecting an internal system error from which it cannot recover. Kernel panics are often caused by software errors, but they can also be an indicator of hardware issues. The two most common types of kernel panics are: Kernel panic: VFS: Unable to mount root fs […]

What do I need to do when replacing a motherboard?

After replacing a failed motherboard, steps need to be taken to allow the network configuration in Linux to work without disruption. Here, we outline the steps to take on an Enterprise Linux system. Console access is required for the node getting the replacement; the local steps can be taken as soon as the motherboard is replaced […]

Sync users across nodes

Any time you add a new user on your cluster’s head node or make changes to an existing user, you will need to synchronize those changes across the entire cluster. Advanced Clustering makes this a simple task by using our act_authsync utility. This utility takes all system user configuration files and pushes them out to […]

Replacing an LSI raid disk with MegaCli

If you have identified a failed or failing disk, it is possible to replace it using the MegaCli utility. In the example below, we will cover replacing a failed disk in a RAID 5 array that has three disks total. The first thing we want to check is the status of our RAID 5 array. [root@raid log]# MegaCli64 […]

Test a compute node’s hardware with Breakin

Clusters built by Advanced Clustering Technologies make it easy to set compute nodes to boot into our Breakin utility, which stress tests the machine. This is an easy way to test a node for hardware errors. To set a compute node to boot to Breakin from the head node: $ […]

How to locate a physical disk in an LSI raid array

The MegaCli command line utility can be used to locate a physical disk in an LSI RAID array by blinking the disk’s activity LED. The blinking will continue until directed to stop. Syntax: MegaCli64 -PdLocate <-start|-stop> -physdrv[<enclosure#>:<disk#>] -a<adapter#> In this example, we will locate disk 0 on adapter 0: [root@localhost MegaCli]# ./MegaCli64 -PdLocate -start -physdrv[252:0] […]
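The syntax above can be sketched as a small shell function that assembles a -PdLocate command line from its parts and prints it rather than running it. The function name pdlocate_cmd and its argument order are illustrative assumptions, not part of MegaCli; only the options shown in the syntax line are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of MegaCli): builds a -PdLocate command line
# from the syntax shown above and prints it without executing anything.
# Usage: pdlocate_cmd <start|stop> <enclosure#> <disk#> <adapter#>
pdlocate_cmd() {
    action=$1
    enclosure=$2
    disk=$3
    adapter=$4
    case $action in
        start|stop) ;;
        *) echo "action must be 'start' or 'stop'" >&2; return 1 ;;
    esac
    # Enclosure, disk, and adapter must all be non-negative integers.
    for n in "$enclosure" "$disk" "$adapter"; do
        case $n in
            ''|*[!0-9]*) echo "not a number: '$n'" >&2; return 1 ;;
        esac
    done
    printf 'MegaCli64 -PdLocate -%s -physdrv[%s:%s] -a%s\n' \
        "$action" "$enclosure" "$disk" "$adapter"
}

# Print the commands that start and stop blinking disk 0
# in enclosure 252 on adapter 0.
pdlocate_cmd start 252 0 0
pdlocate_cmd stop 252 0 0
```

Printing the command first makes it easy to confirm the enclosure and disk numbers before blinking an LED, and a matching stop command can be generated with the same arguments.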

RAM – Checking for errors

Run Breakin: It can be difficult to tell whether a memory error is related to hardware or software. To help determine this, we suggest running the ACT Breakin utility to rule out software-related errors. See Breakin for compute nodes, or Breakin for head nodes and CentOS workstations. Run memtest86+: memtest86+ is a free utility […]

Repairing a corrupted SGE database

Note: Understanding the cause of sgemaster failing to start is important.  Before running these steps, there should be some indication of a database corruption issue in the logs.  These logs are located in /act/sge/default/spool/qmaster/messages.  A typical corruption error message may look like this: 03/07/2015 17:34:07| main|head|E|couldn’t open berkeley database “sge”: (22) Invalid argument 03/07/2015 17:34:07| […]

Using the ACT Yum Repo

Advanced Clustering Technologies maintains a software repository called actrepo for our ACT Utilities and other commonly used cluster software. To access the ACT yum repo, install the actrepo RPM with the command for your release:
CentOS 5: $ rpm -Uvh http://lab.advancedclustering.com/yum/centos5/actrepo-1.0-centos5.noarch.rpm
CentOS 6: $ rpm -Uvh http://lab.advancedclustering.com/yum/centos6/actrepo-1.0-centos6.noarch.rpm
CentOS 7: $ yum -y install http://lab.advancedclustering.com/yum/actel7/actrepo-7.0-el7.noarch.rpm
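For scripted setups, the per-release commands above could be wrapped in a small lookup. This is only a sketch: the function name actrepo_install_cmd is hypothetical, and it simply prints the command listed above for a given CentOS major version so it can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: prints the actrepo install command listed above for a
# given CentOS major version. It only prints the command; it installs nothing.
# Usage: actrepo_install_cmd <5|6|7>
actrepo_install_cmd() {
    base=http://lab.advancedclustering.com/yum
    case $1 in
        5) echo "rpm -Uvh $base/centos5/actrepo-1.0-centos5.noarch.rpm" ;;
        6) echo "rpm -Uvh $base/centos6/actrepo-1.0-centos6.noarch.rpm" ;;
        7) echo "yum -y install $base/actel7/actrepo-7.0-el7.noarch.rpm" ;;
        *) echo "no actrepo package listed for CentOS '$1'" >&2; return 1 ;;
    esac
}

# Print the install command for CentOS 7.
actrepo_install_cmd 7
```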

An Easier Way to Back Up Your HPC Cluster

Last month we reviewed the importance of making backups. Perhaps the simplest form of backup can occur by taking an image of the head node. Today, Advanced Clustering Technologies releases an update to the Cloner utility that makes this a whole lot easier.  The new cloner_usb command will create a bootable USB key which can restore […]

Installing Libraries for Python Outside of System Directories

Python is being used more frequently in HPC applications. Whether a job is being run by the scheduler or pre/post-processing on login nodes, there’s a chance you may run into it. With Python comes the need for libraries. Installing the libraries in system directories normally isn’t possible, but there is a good solution for that. […]

Taking Compute Nodes Down for Maintenance

When taking a compute node down for any reason, it’s good to first take that node out of any job queues in which it may be a member. A node that comes up temporarily may start new jobs, only to be shut down again, killing the user’s job. Here’s how to safely pull a node out of service […]

Pinpoint a failed drive in your array

If you see that your LSI RAID array has a failed disk, but you’re not sure which physical disk in the machine it is, use the MegaCli command line utility to flash the drive’s LEDs: Command syntax: MegaCli64 -PdLocate <-start|-stop> -physdrv[<enclosure#>:<disk#>] -a<adapter#> In this example, we will locate disk 0 on adapter 0 (the first […]

Getting package information

By using the ‘rpm’ command (RPM Package Manager), it is possible to get a lot of information about installed packages on your system. To start, say we want to see if we have a specific package installed on our system. We can search all the currently installed packages for a package named ‘actutil’ by: […]

Viewing your system’s event log through IPMI

If your system has IPMI (Intelligent Platform Management Interface), it can be useful to pull its system event log when encountering odd behavior. If you have a cluster installed with our act_utils software tools, you can use the act_ipmi_log command (replace “node01” with the hostname of the machine you wish to query): $ act_ipmi_log -n […]

What type of power receptacles do I have?

NEMA Receptacle Types: There are many different types of NEMA power receptacles and plugs. If you already have power receptacles installed at your site and want to determine what type of NEMA plug you have, below you will find the most common types of NEMA receptacles for the PDUs we sell at […]

Using VNC to Speed Up Slow X-forwarded Sessions

Most of you know that you can use X-forwarding built into SSH to run a graphical application on a remote host:
laptop$ ssh -X head.mycluster
head$ firefox &
(Firefox session displays on your laptop, running on the remote host)
But sometimes these programs run very slowly over the network. Firefox can be slow to render, […]
