Failed hardware can demote a productive cluster to a piece of data center decoration. At Advanced Clustering, we take many steps to ensure that all hardware (computers, switches, consoles) arrives in working order.
Once assembly is complete, each computer is verified to power on. Each machine is then checked programmatically to confirm that it contains the correct hardware, including CPUs, RAM, hard drives, and add-in cards. Breakin, our stress-test tool, is then started on each system and runs for 24 to 48 hours. This is the stage at which we see the largest number of component failures. Breakin's verbose display allows our technicians to track down the offending hardware quickly.
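As a rough illustration of what a programmatic hardware check involves, the sketch below compares the component counts on a build sheet against the counts detected on a node. It is not Breakin or the tooling used at our facility; the function name, component keys, and numbers are all hypothetical, and how the detected values are gathered is left out.

```python
from typing import Dict, List


def find_mismatches(expected: Dict[str, int], detected: Dict[str, int]) -> List[str]:
    """Return one readable line for every component whose count differs."""
    problems = []
    # Walk the union of keys so that missing and unexpected components both show up.
    for component in sorted(set(expected) | set(detected)):
        want = expected.get(component, 0)
        have = detected.get(component, 0)
        if want != have:
            problems.append(f"{component}: expected {want}, detected {have}")
    return problems


# Hypothetical build sheet and probe result for a single node (illustrative values only).
expected = {"cpu_sockets": 2, "ram_gib": 256, "hard_drives": 4, "add_in_cards": 1}
detected = {"cpu_sockets": 2, "ram_gib": 224, "hard_drives": 4, "add_in_cards": 1}

for line in find_mismatches(expected, detected):
    print(line)
```

A node passes only when the returned list is empty; anything else points the technician at the component to inspect.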
After surviving Breakin, the computers have their OS installed and are integrated into their cluster. At this point we run the machines as a cluster. First, we check the performance of each network with Intel MPI Benchmarks. This benchmark thoroughly stresses the network and allows us to spot a poorly performing switch, network card, or host computer.
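One simple way to spot a poor performer in benchmark results is to compare each node against the cluster median. The sketch below assumes a bandwidth figure per node has already been extracted from the benchmark output; the node names, the numbers, and the 80% tolerance are made up for illustration and are not our acceptance criteria.

```python
from statistics import median
from typing import Dict, List


def flag_slow_nodes(bandwidth_mb_s: Dict[str, float], tolerance: float = 0.8) -> List[str]:
    """Return the nodes whose bandwidth falls below tolerance * cluster median."""
    cutoff = tolerance * median(bandwidth_mb_s.values())
    return sorted(node for node, bw in bandwidth_mb_s.items() if bw < cutoff)


# Hypothetical per-node bandwidth in MB/s (illustrative values, not measurements).
results = {"node01": 11800.0, "node02": 11750.0, "node03": 6200.0, "node04": 11820.0}
print(flag_slow_nodes(results))
```

Using the median rather than the mean keeps a single bad node from dragging the reference value down with it.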
The second test we run is HPCC. Depending on the processor type, either Intel's MKL or AMD's ACML math library is used. We generate an hpccinf.txt for the run with our online tool. This test stresses the motherboard, CPUs, RAM, and the network interfaces. This portion of our testing is second only to Breakin in finding component errors.
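The HPL portion of an HPCC input file takes, among other parameters, a problem size N, a block size NB, and a P x Q process grid. The sketch below applies a common rule of thumb for choosing them: size the N x N matrix of 8-byte doubles to fill a fraction of total memory, round N down to a multiple of NB, and pick the most nearly square grid with P no larger than Q. This is an assumption about reasonable defaults, not a description of what our online tool does; the 80% memory fraction and the block size of 192 are illustrative.

```python
import math
from typing import Tuple


def hpl_problem_size(total_mem_gib: float, block_size: int = 192,
                     mem_fraction: float = 0.8) -> int:
    """Largest N (a multiple of block_size) whose N x N double matrix fits the budget."""
    usable_bytes = total_mem_gib * 1024**3 * mem_fraction
    n = int(math.sqrt(usable_bytes / 8))  # 8 bytes per double-precision element
    return n - n % block_size


def process_grid(num_procs: int) -> Tuple[int, int]:
    """Most nearly square factorization P x Q of num_procs with P <= Q."""
    p = math.isqrt(num_procs)
    while num_procs % p:
        p -= 1
    return p, num_procs // p


# Hypothetical cluster: 1024 GiB of RAM in total, 48 MPI processes.
n = hpl_problem_size(1024)
p, q = process_grid(48)
print(f"N={n} NB=192 P={p} Q={q}")
```

Leaving some memory unused keeps the operating system and MPI buffers from pushing the run into swap, which would distort the results.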
Lastly, if requested, we will run a sample job provided by the end user or allow remote access to the cluster so the user can run their code directly. This ensures that the cluster is stable for the user and that the cluster is performing as expected.
Advanced Clustering has these in-depth testing procedures in place so that we can deliver working, turn-key clusters. The extra time and care taken at our facility reduce headaches and downtime for our customers by delivering a tested system every time.