RAM – Checking for errors
It can be difficult to tell whether a memory error is caused by hardware or software. To help determine this, we suggest running the ACT breakin utility to rule out software-related errors.
memtest86+ is a free utility that tests writing to and reading from the system's RAM. If your system does not already have memtest86+ as a boot option, you can add it in CentOS as follows:
$ sudo yum install memtest86+
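This installs memtest86+ and, on releases where the package handles the initial setup, adds it to the boot options in GRUB. If the entry does not appear, run memtest-setup (shipped with the package) as root, and on GRUB 2 systems regenerate the GRUB configuration afterwards. When you are ready to run the test, reboot the machine and select the Memtest86+ entry from the GRUB boot menu.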
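Before rebooting, it can be useful to confirm that a Memtest86+ entry is actually present in the GRUB configuration. The following is a minimal sketch, not part of memtest86+ itself: the function names are hypothetical, and the two configuration paths are assumptions that vary with the CentOS release and firmware type.

```shell
#!/bin/sh
# Sketch: report whether a GRUB configuration file mentions memtest.
# Function names are illustrative; they are not provided by any package.

has_memtest_entry() {
    # $1: path to a GRUB configuration file
    [ -r "$1" ] && grep -qi 'memtest' "$1"
}

check_memtest_boot_entry() {
    # Candidate paths are assumptions: legacy GRUB and GRUB 2 on BIOS systems.
    for cfg in /boot/grub/grub.conf /boot/grub2/grub.cfg; do
        if has_memtest_entry "$cfg"; then
            echo "memtest entry found in $cfg"
            return 0
        fi
    done
    echo "no memtest entry found" >&2
    return 1
}

# Example (run as root so the files are readable):
#   check_memtest_boot_entry
```

Reading the GRUB configuration usually requires root. A missing entry only means the string was not found in the files checked, so a system that keeps its configuration elsewhere needs the path adjusted.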
Check system logs
Memory-related errors can appear in many different ways. The following files are good places to look for them.
$ less /var/log/messages
$ cat /var/log/mcelog
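Paging through a large log by hand is slow, so filtering for likely memory-related lines first can help. The sketch below assumes a small set of search strings (EDAC, mce, "Machine check", "Hardware Error", "Out of memory"); this list is an illustration rather than a complete catalogue of kernel messages, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: list lines in a log file that look memory-related.
# The pattern list is an assumption, not an exhaustive set of kernel messages.
MEM_PATTERN='edac|mce|machine check|hardware error|out of memory'

scan_mem_errors() {
    # $1: path to a log file
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
    # -i: ignore case, -n: show line numbers, -E: extended regular expression.
    # grep exits non-zero when nothing matches, and so does this function.
    grep -inE "$MEM_PATTERN" "$1"
}

# Example (run as root so the log is readable):
#   scan_mem_errors /var/log/messages | less
```

An empty result does not prove the memory is healthy; it only means none of the assumed strings appeared in that file.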
If your DIMMs have ECC capability, the edac-util program can read information from the EDAC (Error Detection and Correction) drivers in the kernel. It uses files exported by these drivers to report corrected and uncorrected errors, which can also help narrow down which DIMM the errors are coming from.
$ edac-util -v
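The files exported by the EDAC drivers can also be read directly, which is handy when edac-util is not installed. This is a minimal sketch that assumes the kernel exposes per-controller counters as ce_count and ue_count under /sys/devices/system/edac/mc; the function name is hypothetical, and the base directory can be overridden with an argument.

```shell
#!/bin/sh
# Sketch: print corrected (CE) and uncorrected (UE) error counts
# for each EDAC memory controller found under the base directory.

report_edac_counts() {
    # $1: optional base directory (assumed default is the usual sysfs location)
    base="${1:-/sys/devices/system/edac/mc}"
    found=0
    for mc in "$base"/mc*; do
        # Skip anything that does not expose a readable corrected-error counter.
        [ -r "$mc/ce_count" ] || continue
        found=1
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo "?")
        echo "$(basename "$mc"): corrected=$ce uncorrected=$ue"
    done
    if [ "$found" -eq 0 ]; then
        echo "no EDAC memory controllers found under $base" >&2
        return 1
    fi
}

# Example:
#   report_edac_counts
```

If no controllers are reported, the EDAC driver for the platform may not be loaded, or the DIMMs may not support ECC.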
If you are unsure about any of the output from the utilities above, you can send it to [email protected] and we will gladly look it over for you.