Upgrading Firmware when Adding InfiniBand to an Existing Fabric
A customer recently asked, “When adding a new InfiniBand switch to an existing fabric, should the firmware on the existing switches be upgraded to the version of the firmware on the new switch before connecting the new switch?” It is not required for all switches in an InfiniBand network to have matching firmware. Since adding […]
Updating firmware on your ACT Intel system
ACT’s servers based on Intel chassis can now be updated more easily than before. We provide a package in our YUM repository that includes firmware updates and scripts to apply the updates. Here is how to do it. Make sure you have the ACT repo enabled. Run yum repolist and look for a repo named “ACT […]
Check the status of an LSI raid card battery backup unit
Checking the status of your raid card’s battery backup unit (BBU) is a simple process using the following MegaCli command: $ MegaCli64 -AdpBbuCmd -a<adapter#/ALL> In the following example we have a single controller present and will pass the -a0 argument to select the controller. [[email protected] ~]# MegaCli64 -AdpBbuCmd -a0 BBU status for Adapter: […]
Setup ACT Breakin hardware diagnostics tool as a grub boot option
Breakin is Advanced Clustering Technologies’ stress-test and hardware diagnostics tool. It is extremely useful for detecting errors on your system while stress testing the hardware at the same time in order to create a more realistic test environment. This guide is best used for head nodes and workstations that do not have a built-in […]
Server doesn’t POST – Determining if a DIMM, CPU, or motherboard is faulty
In this example we will troubleshoot a server that fully powers on but does not POST. The three most common reasons a server will not POST are a bad DIMM, a bad CPU, or a bad motherboard. The main objective is to start with a minimum number of components in the server, […]
What is a kernel panic?
A message displayed by the Linux kernel upon detecting an internal system error from which it cannot recover. Kernel panics are often software errors, but they can also be an indicator of hardware issues. Common types of kernel panics The two most common types of kernel panics are: Kernel panic: VFS: Unable to mount root fs […]
Replacing an LSI raid disk with MegaCli
If you have identified a failed or failing disk, it is possible to replace it using the MegaCli utility. In the example below we will cover replacing a failed disk in a three-disk raid 5. The first thing we want to check is the status of our raid 5. [[email protected] log]# MegaCli64 […]
Test a compute node’s hardware with Breakin
Clusters built by Advanced Clustering Technologies make it easy to set compute nodes to boot into our Breakin utility to stress test the machine. This is an easy way to test the node for hardware errors. To set a compute node to boot to Breakin from the head node: $ […]
How to locate a physical disk in an LSI raid array
The MegaCli command line utility can be used to locate a physical disk in an LSI raid array by blinking the disk’s activity LED. The blinking will continue until directed to stop. Syntax: MegaCli64 -PdLocate <-start|-stop> -physdrv[<enclosure#>:<disk#>] -a<adapter#> In this example we will locate disk 0 on adapter 0: [[email protected] MegaCli]# ./MegaCli64 -PdLocate -start -physdrv[252:0] […]
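As a rough sketch of the syntax quoted in the teaser above, the helper below assembles a -PdLocate command line from its parts and prints it rather than running it. The function name pd_locate_cmd and its argument order are hypothetical and not part of MegaCli; only the flags shown in the excerpt are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of MegaCli): print a -PdLocate command line
# built from the syntax quoted in the excerpt. It does not run MegaCli64.
pd_locate_cmd() {
    action="$1"; enclosure="$2"; disk="$3"; adapter="${4:-0}"
    case "$action" in
        start|stop) ;;  # the only two actions named in the syntax line
        *)
            echo "usage: pd_locate_cmd <start|stop> <enclosure#> <disk#> [adapter#]" >&2
            return 1
            ;;
    esac
    printf 'MegaCli64 -PdLocate -%s -physdrv[%s:%s] -a%s\n' \
        "$action" "$enclosure" "$disk" "$adapter"
}
```

Printing the command first gives you a chance to review the enclosure and disk numbers before anything touches the controller.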
RAM – Checking for errors
Run BreakIn It can be difficult to tell if a memory error is related to hardware or software. To help determine this we suggest running the ACT breakin utility to remove any possibility of software related errors. Breakin for compute nodes Breakin for head nodes and CentOS workstations Run memtest86+ memtest86+ is a free utility […]
Pinpoint a failed drive in your array
If you see that your LSI RAID array has a failed disk, but you’re not sure which physical disk in the machine it is, use the MegaCli command line utility to flash the drive’s LEDs: Command syntax: MegaCli64 -PdLocate <-start|-stop> -physdrv[<enclosure#>:<disk#>] -a<adapter#> In this example, we will locate disk 0 on adapter 0 (the first […]
What type of power receptacles do I have?
NEMA Receptacle Types There are many different types of NEMA power receptacles and plugs. If you already have power receptacles installed at your site and want to determine what type of NEMA plug you have, below you will find the most common types of NEMA receptacles for the PDUs we sell at […]
How to identify and prevent overheating
How to identify and prevent overheating Symptoms of Overheating Turning off on its own Freezing Frequent memory errors Most commonly a computer that is overheating will turn off unexpectedly, and repeat the behavior shortly after being turned back on. What causes this behavior is that the CPU temperatures are always monitored and the system will […]
Identifying Issues with Network Connectivity
Network connectivity can cover many different areas, and diagnosing which area your problem lies in is the first step to fixing the problem. Below we will cover multiple steps for identifying a problem. Verify connections and LEDs Verify that the network cable is properly connected to the back of the computer and at the switch. […]
Standard Cluster – InfiniBand Fabric
This is the InfiniBand configuration for most of the HPC clusters we build.
How do I rack a .5U blade or a 2U Flex Chassis?
1U blade or 2U Flex Chassis installation & removal PLEASE NOTE: The pictorial illustrations in this FAQ show a 2U Flex chassis, however the same procedures are applicable to the 1U blade except for the fact that the 1U chassis is 1U shorter in height, uses a different size rear mounting bracket, and has fewer […]
Checking InfiniBand
If one of your machines has an InfiniBand device installed and you want to know what state the device is in, you can use the “ibstat” command. The output of “ibstat” shows a lot of information, but the two main lines you should look at are: State: Active Physical state: LinkUp The “State” line can […]
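A minimal sketch of how the two lines named above could be checked automatically. The function name ib_ports_ok is hypothetical, and the sketch assumes the ibstat output contains lines of the form "State: ..." and "Physical state: ..." as quoted in the excerpt. It reads text on stdin so that it can be tried on saved output as well.

```shell
#!/bin/sh
# Hypothetical helper: read ibstat-style text on stdin and succeed only if
# every "State:" line says Active and every "Physical state:" line says LinkUp.
# Assumes the line format quoted in the excerpt above.
ib_ports_ok() {
    awk '
        /^[[:space:]]*State:/          { n++; if ($2 != "Active") bad++ }
        /^[[:space:]]*Physical state:/ { if ($3 != "LinkUp") bad++ }
        END { if (n == 0 || bad > 0) exit 1 }
    '
}

# Example use on a machine with ibstat installed:
#   ibstat | ib_ports_ok && echo "all ports Active/LinkUp"
```

The check fails when no "State:" line is found at all, so empty or unexpected output is not mistaken for a healthy link.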
Installing NVIDIA Drivers on RHEL or CentOS 7
Most users of NVIDIA graphics cards prefer to use the drivers provided by NVIDIA. These more fully support the capabilities of the card when compared to the nouveau driver that is included with the distribution. These are the steps to install the NVIDIA driver and disable the nouveau driver. Prepare your machine yum -y update yum […]
Checking and Clearing InfiniBand Errors
An easy way to check for errors on your entire cluster IB network is to run the command ‘ibcheckerrors’. This will print any errors, which can range from a port being down (even just unplugged temporarily) to transmission errors. After troubleshooting any errors you find, you can clear out the error counters with the command […]
Replacing an LSI raid card with a pre-configured raid array
Newer LSI raid cards (apparently depending on their current firmware version) will auto-import raid configurations from previous raid cards. On older cards, however, you have to import the disks’ ‘foreign’ configuration. To check whether your raid array was automatically imported by your new raid card, you can run the following command: $ MegaCli64 […]
Create a raid array with MegaCli64
Note: The following assumes that you have attached new drives to a newly installed LSI raid controller. The first thing to do is to get a list of all the drives attached to the raid controller. The LSI raid controllers identify/label their attached disks by an ‘Enclosure ID’ and the drive […]
How to expand an existing LSI raid array using MegaCli
Warning: You should ALWAYS make a backup of all of your information on the raid array before performing any of these steps. The exact commands to do this vary with your current configuration and the number of disks in the raid. Before adding the disks you need to get a feel for your current setup by […]
How to update the date/time on LSI Raid cards using MegaCli
Setting the date/time on your controller is advised to keep system logs in sync. Although this is normally done by the drivers after bootup, we can do this manually with the MegaCli tool using the following syntax: MegaCli64 -AdpSetTime yyyymmdd hh:mm:ss -a<adapter#> yyyy is year in 4 digit format: 2013 mm is month in 2 […]
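As an illustration of the yyyymmdd hh:mm:ss format described above, the sketch below prints (but does not run) an -AdpSetTime command filled in from the host’s current clock. The function name adp_set_time_cmd is hypothetical, and using the host clock as the time source is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print a MegaCli64 -AdpSetTime command for the given
# adapter, using the host clock in the yyyymmdd hh:mm:ss form from the excerpt.
# It only prints the command; it does not contact the controller.
adp_set_time_cmd() {
    adapter="${1:-0}"
    printf 'MegaCli64 -AdpSetTime %s -a%s\n' \
        "$(date '+%Y%m%d %H:%M:%S')" "$adapter"
}
```

Calling adp_set_time_cmd 0 prints a line you can review and then paste into a root shell.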
What is Cli64?
Cli64 is a (poorly named) proprietary tool developed by Areca that provides reporting AND management functions from userspace. If installed from the ACT repo the binary is located at /usr/local/bin/cli64. The default password for the controller is 0000. [[email protected] ~]# cli64 ? Copyright (c) 2004-2011 Areca, Inc. All Rights Reserved. Areca CLI, Version: 1.86, Arclib: […]
My Areca raid controller is beeping; how do I make it STOP?
WARNING – Only continue with this operation after the cause of the alarm has been identified. First we must authenticate to the controller by passing a password; the default is 0000. [[email protected] ~]# cli64 set password=0000 GuiErrMsg: Success. [[email protected] ~]# Now we can mute the beeping! [[email protected] ~]# cli64 sys beeper p=0 GuiErrMsg: Success. [[email protected] […]