
RAM – Checking for errors

Run Breakin

It can be difficult to tell whether a memory error is caused by hardware or by software. To help determine this, we suggest running the ACT Breakin utility, which lets you rule out software-related causes.

Run memtest86+

memtest86+ is a free utility that tests the system's RAM by writing to it and reading the data back. If your system does not already have memtest86+ as a boot option, you can add it in CentOS by doing the following (both commands need root privileges):

$ sudo yum install memtest86+
$ sudo memtest-setup

This installs memtest86+ and runs the initial setup that adds it to the GRUB boot options. When you are ready to run the test, reboot the machine and select the Memtest86+ entry from the GRUB boot menu.
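Before rebooting, it can be worth confirming that a memtest entry actually landed in the boot configuration. The following is a minimal sketch, not part of memtest86+ itself: the function name find_memtest_entry is hypothetical, and the two default paths are assumptions about where legacy GRUB and GRUB 2 keep their configuration on CentOS. Pass your own paths if your layout differs, and run it as root if the files are not world-readable.

```shell
# Hypothetical helper: report which of the given GRUB config files
# mention memtest. The default paths are assumptions (legacy GRUB and
# GRUB 2 locations on CentOS); pass your own paths if they differ.
find_memtest_entry() {
    [ "$#" -gt 0 ] || set -- /boot/grub/grub.conf /boot/grub2/grub.cfg
    found=1
    for cfg in "$@"; do
        # Skip config files that are missing or unreadable.
        [ -r "$cfg" ] || continue
        if grep -qi 'memtest' "$cfg"; then
            echo "memtest entry found in $cfg"
            found=0
        fi
    done
    # Exit status 0 if at least one file mentions memtest, 1 otherwise.
    return "$found"
}

# Example usage:
# find_memtest_entry || echo "no memtest entry found"
```

If nothing is found, the boot menu may need to be regenerated before the entry appears; check the output of memtest-setup for instructions.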

Check system logs

Memory-related errors can appear in many different ways. The following files and commands are good places to look for them.

$ less /var/log/messages
$ cat /var/log/mcelog
$ dmesg
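Reading these logs end to end can be slow, so a keyword filter helps as a first pass. The sketch below is illustrative only: the function name scan_mem_errors is hypothetical, and the keyword list is an assumption about how memory faults are commonly reported, not an exhaustive set. Treat an empty result as "nothing matched these keywords", not as proof that the memory is healthy.

```shell
# Hypothetical helper: print lines from the given log files that match
# common memory-error keywords. The keyword list is an assumption;
# adjust it for your hardware and kernel.
scan_mem_errors() {
    pattern='edac|machine check|mce:|hardware error'
    for log in "$@"; do
        # Skip files that are missing or unreadable on this system.
        [ -r "$log" ] || continue
        # grep exits non-zero when nothing matches; that is not an error here.
        grep -iE -H -- "$pattern" "$log" || true
    done
}

# Example usage with the files named above, plus the kernel ring buffer:
# scan_mem_errors /var/log/messages /var/log/mcelog
# dmesg | grep -iE 'edac|machine check|mce:|hardware error'
```

Each matching line is prefixed with the name of the file it came from, which makes it easier to see where a given error was recorded.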

If your DIMMs have ECC capability, the edac-util program can read information from the EDAC (Error Detection and Correction) drivers in the kernel. It uses the files exported by these drivers to report corrected and uncorrected errors, which is also useful for narrowing down which DIMM the errors are coming from.

$ edac-util -v
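The files that the EDAC drivers export can also be read directly, which is handy on a machine where edac-util is not installed. This is a minimal sketch under stated assumptions: the function name edac_counts is hypothetical, and it assumes the EDAC sysfs tree is at /sys/devices/system/edac/mc with one mcN directory per memory controller, each holding ce_count (corrected) and ue_count (uncorrected) totals. It reports per-controller totals only, so it is coarser than edac-util.

```shell
# Hypothetical helper: print corrected (ce_count) and uncorrected
# (ue_count) error totals for each memory controller under the EDAC
# sysfs tree. The base directory is a parameter so it can be overridden.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # With no controllers (or no EDAC driver loaded) the glob stays
        # literal and is skipped here.
        [ -d "$mc" ] || continue
        ce=$(cat "$mc/ce_count" 2>/dev/null || echo "?")
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo "?")
        echo "$(basename "$mc"): corrected=$ce uncorrected=$ue"
    done
}

# Example usage:
# edac_counts
```

No output means no memory controller directories were found, which usually indicates that no EDAC driver is loaded for the platform.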

If you are unsure about any of the output from the utilities above, send it to support@advancedclustering.com and we will gladly look it over for you.
