Upgrading Firmware when Adding InfiniBand to an Existing Fabric
A customer recently asked, “When adding a new InfiniBand switch to an existing fabric, should the firmware on the existing switches be upgraded to the version of the firmware on the new switch before connecting the new switch?” It is not required for all switches in an InfiniBand network to have matching firmware. Since adding […]
Updating firmware on your ACT Intel system
ACT’s servers based on Intel chassis can now be updated more easily than before. We provide a package in our YUM repository that includes firmware updates and scripts to apply the updates. Here is how to do it. Make sure you have the ACT repo enabled. Run yum repolist and look for a repo named “ACT […]
Check the status of an LSI raid card battery backup unit
Checking the status of your raid card's battery backup unit (BBU) is simple with the following MegaCli command: $ MegaCli64 -AdpBbuCmd -a<adapter#/ALL> In the following example we have a single controller present and will pass the -a0 argument to select the controller. [root@localhost ~]# MegaCli64 -AdpBbuCmd -a0 BBU status for Adapter: […]
Setup ACT Breakin hardware diagnostics tool as a grub boot option
Breakin is Advanced Clustering Technologies' stress-test and hardware diagnostics tool. It is extremely useful for detecting errors on your system because it stress tests the hardware at the same time, creating a more realistic test environment. This guide is best used for head nodes and workstations that do not have a built-in […]
Server doesn’t POST – Determining if a DIMM, CPU, or Motherboard is faulty
In this example we will troubleshoot a server that fully powers on but does not POST. The three most common reasons a server will not POST are a bad DIMM, a bad CPU, or a bad motherboard. The main objective is to start with a minimal number of components in the server, […]
What is a kernel panic?
A kernel panic is a message displayed by the Linux kernel upon detecting an internal system error from which it cannot recover. Kernel panics are often caused by software errors, but they can also be an indicator of hardware issues. Common types of kernel panics The two most common types of kernel panics are: Kernel panic: VFS: Unable to mount root fs […]
Replacing an LSI raid disk with MegaCli
If you have identified a failed or failing disk, it is possible to replace it using the MegaCli utility. In the example below we will cover replacing a failed disk in a raid 5 that has three disks total. The first thing we want to check is the status of our raid 5. [root@raid log]# MegaCli64 […]
Test a compute node’s hardware with Breakin
Clusters built by Advanced Clustering Technologies let you easily set compute nodes to boot into our Breakin utility to stress test the machine. This is an easy way to test the node for hardware errors. To set a compute node to boot to Breakin from the head node: $ […]
How to locate a physical disk in an LSI raid array
The MegaCli command line utility can be used to locate a physical disk in an LSI raid array by blinking the disk's activity LED. The blinking will continue until directed to stop. Syntax: MegaCli64 -PdLocate <-start|-stop> -physdrv[<enclosure#>:<disk#>] -a<adapter#> In this example we will locate disk 0 on adapter 0: [root@localhost MegaCli]# ./MegaCli64 -PdLocate -start -physdrv[252:0] […]
RAM – Checking for errors
Run Breakin It can be difficult to tell if a memory error is related to hardware or software. To help determine this we suggest running the ACT Breakin utility to remove any possibility of software-related errors. Breakin for compute nodes Breakin for head nodes and CentOS workstations Run memtest86+ memtest86+ is a free utility […]
Categories
- Getting Support (5)
- Hardware (35)
- Areca Raid Arrays (3)
- InfiniBand (10)
- LSI Raid Arrays (9)
- NVIDIA Graphics Cards (1)
- Racks (1)
- Troubleshooting (8)
- Software (11)
- ACT Utilities (5)
- HPC apps & benchmarks (1)
- Linux (3)
- Schedulers (3)
- SGE / Grid Engine (1)
- TORQUE (1)
- Tech Tips (16)