How to identify and prevent overheating

Symptoms of Overheating

  • Turning off on its own
  • Freezing
  • Frequent memory errors

Most commonly, an overheating computer turns off unexpectedly and does so again shortly after being turned back on. This happens because CPU temperatures are monitored continuously, and the system shuts itself down immediately if they get too high.
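Before opening the chassis, it can help to read the temperatures the operating system already exposes. The following is a minimal sketch, assuming a Linux system that provides the kernel's thermal sysfs interface under /sys/class/thermal; the zones that appear, and what they measure, vary by platform. The function names millideg_to_c and show_thermal_zones are made up for this example.

```shell
#!/bin/sh
# Sketch: print the temperature the Linux kernel reports for each thermal zone.
# Assumes /sys/class/thermal exists; zone names and counts vary by platform.

# Convert millidegrees Celsius (the unit sysfs uses) to degrees, one decimal.
# Expects a non-negative integer argument.
millideg_to_c() {
    printf '%d.%d\n' "$(( $1 / 1000 ))" "$(( ($1 % 1000) / 100 ))"
}

show_thermal_zones() {
    for zone in /sys/class/thermal/thermal_zone*; do
        # Skip if the glob matched nothing or the reading is not accessible.
        [ -r "$zone/temp" ] || continue
        label=$(cat "$zone/type" 2>/dev/null)
        raw=$(cat "$zone/temp" 2>/dev/null) || continue
        # Skip empty, negative, or otherwise non-numeric readings.
        case $raw in
            ''|*[!0-9]*) continue ;;
        esac
        printf '%s (%s): %s C\n' "${zone##*/}" "${label:-unknown}" "$(millideg_to_c "$raw")"
    done
}

show_thermal_zones
```

On a system with no thermal zones the script prints nothing. A reading that keeps climbing while the system is under load, followed by an unexpected shutdown, is consistent with the symptoms described above.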

Check that all fans are working

Place the server where you can easily remove the chassis cover and watch all the fans while powering the server on. Look for fans that are not spinning at all, as well as fans that are spinning more slowly than the others. You can typically hear when fans are spinning at different speeds.

A fan that is not spinning correctly needs to be replaced, unless blowing out a buildup of dust corrects the problem.
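If the server has a BMC that reports fan speeds over IPMI, the same comparison can be made without opening the chassis. The sketch below assumes ipmitool is installed and that `ipmitool sdr type Fan` prints pipe-separated lines ending in a reading such as `5400 RPM`; sensor names differ between boards. The function name flag_slow_fans and the 70 percent default threshold are arbitrary choices for this example, not vendor recommendations.

```shell
#!/bin/sh
# Sketch: read fan sensor lines on stdin and list fans that spin noticeably
# slower than the fastest fan. Lines without an RPM reading are ignored.
# $1: threshold as a percentage of the fastest fan (default 70, arbitrary).
flag_slow_fans() {
    awk -F'|' -v pct="${1:-70}" '
        BEGIN { n = 0; max = 0 }
        $NF ~ /RPM/ {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the name
            rpm = $NF + 0                       # numeric prefix of " 5400 RPM"
            names[n] = name; rpms[n] = rpm; n++
            if (rpm > max) max = rpm
        }
        END {
            for (i = 0; i < n; i++)
                if (rpms[i] < max * pct / 100)
                    printf "%s: %d RPM (fastest fan: %d RPM)\n", names[i], rpms[i], max
        }'
}

# Typical use on a server with a BMC (requires ipmitool and IPMI access):
#   ipmitool sdr type Fan | flag_slow_fans 70
```

A stopped fan that still reports 0 RPM is flagged as well. Fans in different positions can run at different speeds by design, so a flagged fan is a reason to look at it, not proof that it has failed.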

Re-seat CPU heat sinks

Sometimes when a server is moved, the contact between the CPU and heat sink is disrupted. To be sure, completely remove the heat sinks, check that the thermal paste is still intact, and firmly re-seat them. If you are not sure whether contact is being made, a good test is to:

  • completely remove all current thermal paste and re-apply it to the heat sink
  • re-seat and then remove the heat sink again
  • check to see if the thermal paste was spread out on the CPU by the heat sink making contact

Clean with canned air

Look for a buildup of dust and blow it out with canned air. Pay special attention to the heat sinks, the fans, and the area around the base of the CPUs. This is good practice for regular preventative maintenance.

Compute node air ducts

Advanced Clustering Half-U compute nodes have a fitted air duct that guides air from the fans over the CPUs. For proper airflow, keep the air duct in place whenever the node is powered on.

Check temperature with act_sensors

If your server is part of a cluster with the act_dir package installed, you can use the act_sensors utility to check temperatures across the cluster.

To check temperatures on every node in the cluster:

$ act_sensors -a temps

To check a specific node:

$ act_sensors -n nodename temps
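
To check a handful of suspect nodes in one pass, the single-node form above can be wrapped in a loop. This is a minimal sketch that uses only the `-n nodename temps` invocation shown above; the function name check_nodes and the node names node01 and node02 are placeholders for this example.

```shell
#!/bin/sh
# Sketch: run the single-node act_sensors check for each node name given.
# Assumes act_sensors (from the act_dir package) is on the PATH.
check_nodes() {
    for node in "$@"; do
        echo "== $node =="
        act_sensors -n "$node" temps || echo "act_sensors failed for $node" >&2
    done
}

# Example call with placeholder node names:
#   check_nodes node01 node02
```

A failure on one node is reported on standard error and the loop continues with the next node.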

