ACT knowledge base

10 most recent knowledge base articles

Troubleshooting OpenMPI Invocation Problems

OpenMPI works with a large number of transport mechanisms, from shared memory on the local machine, to IP over Ethernet, or even RDMA over InfiniBand. With default settings, when you start your program using mpirun, OpenMPI will choose the best interface available. Unfortunately, the logic isn’t foolproof, and sometimes you will hit snags and your …

 

Standard Cluster – InfiniBand Networking

This is the InfiniBand configuration for most of the clusters we build.

 

Getting package information

Using the ‘rpm’ command (RPM Package Manager), it is possible to get a lot of information about the packages installed on your system. To start, say we want to see whether a specific package is installed. We can search all the currently installed packages for a package named ‘actutil’ by: …

 

Changing Contents in a File in Every Node

Occasionally you may want to change a single string inside a file that is on every compute node. If the file were the same on every node, you could change it in one place and then copy it out like so: $ act_cp -g nodes /path/to/file Some config files are unique to each …

 

Checking and Clearing InfiniBand Errors

An easy way to check for errors on your entire cluster IB network is to run the command ‘ibcheckerrors’. It prints any errors it finds, which can range from a port being down (even just unplugged temporarily) to transmission errors. After troubleshooting any errors you find, you can clear out the error counters with the command …

 

Use act_locate to identify a node

Most Advanced Clustering chassis are equipped with a large locator LED on the front that, when turned on, makes a node easy to identify. If you’re remotely attempting to notify a technician as to which compute node needs work, you can simply run the following command from your head node: $ act_locate …

 

Pinpoint a failed drive in your array

If you see that your LSI RAID array has a failed disk, but you’re not sure which physical disk in the machine it is, use the MegaCli command line utility to flash the drive’s LEDs: Command syntax: MegaCli64 -PdLocate <-start|-stop> -physdrv[<enclosure#>:<disk#>] -a<adapter#> In this example, we will locate disk 0 on adapter 0 (the first …
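
As a rough illustration of the command syntax quoted above, the sketch below assembles a MegaCli64 -PdLocate command line from its parts and prints it without running it. The helper name pd_locate_cmd is hypothetical, and the enclosure number in the usage line is a placeholder; the enclosure ID on your controller will differ.

```shell
# Sketch only: build (but do not run) a MegaCli64 -PdLocate command line,
# following the syntax shown in the excerpt above.
# pd_locate_cmd is a hypothetical helper name, not part of MegaCli.
pd_locate_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: pd_locate_cmd <start|stop> <enclosure#> <disk#> <adapter#>" >&2
        return 2
    fi
    # Only the two actions named in the syntax line are accepted.
    case "$1" in
        start|stop) ;;
        *) echo "action must be start or stop" >&2; return 2 ;;
    esac
    # Print the assembled command so it can be reviewed before use.
    printf 'MegaCli64 -PdLocate -%s -physdrv[%s:%s] -a%s\n' "$1" "$2" "$3" "$4"
}
```

For example, `pd_locate_cmd start 32 0 0` prints a command that would start flashing disk 0 on adapter 0, assuming (hypothetically) that the enclosure ID is 32. Printing rather than executing lets you check the enclosure and disk numbers first.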

 

Viewing your system’s event log through IPMI

If your system has IPMI (Intelligent Platform Management Interface), it can be useful to pull its system event log when encountering odd behavior. If you have a cluster installed with our act_utils software tools, you can use the act_ipmi_log command (replace “node01” with the hostname of the machine you wish to query): $ act_ipmi_log -n …

 

Checking InfiniBand

If one of your machines has an InfiniBand device installed and you want to know what state the device is in, you can use the “ibstat” command. The output of “ibstat” shows a lot of information, but the two main lines you should look at are: State: Active Physical state: LinkUp The “State” line can …
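
The check for those two lines can be scripted. The following is a minimal sketch, not part of any Advanced Clustering tool: the function name check_ib_state is hypothetical, and it only tests whether the text it reads contains a “State: Active” line and a “Physical state: LinkUp” line. On a machine with several ports it does not confirm that both lines belong to the same port.

```shell
# Sketch only: read ibstat-style text on stdin and report whether it contains
# both "State: Active" and "Physical state: LinkUp".
# check_ib_state is a hypothetical helper name.
check_ib_state() {
    local input
    # Read stdin once so both checks look at the same text.
    input=$(cat)
    # The State pattern is anchored to the start of the line (after optional
    # whitespace) so that it does not match the "Physical state:" line.
    if printf '%s\n' "$input" | grep -Eq '^[[:space:]]*State:[[:space:]]*Active[[:space:]]*$' &&
       printf '%s\n' "$input" | grep -Eq '^[[:space:]]*Physical state:[[:space:]]*LinkUp[[:space:]]*$'; then
        echo "OK"
    else
        echo "CHECK"
        return 1
    fi
}
```

A typical use would be `ibstat | check_ib_state`, which prints OK when both lines are present and CHECK (with a non-zero exit status) otherwise.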

 

Diagnose hardware issues with Advanced Clustering’s Breakin

If you suspect hardware problems, our clusters come with a testing facility that can test one or more nodes. Using Advanced Clustering’s Breakin software can help you look for and diagnose potential hardware issues. This software is a stress-test suite developed in-house since there were no other tools available that provided this level of rigorous …

 

Advanced Clustering Technologies