Taking Compute Nodes Down for Maintenance

Before taking a compute node down for any reason, remove it from every job queue it belongs to. Otherwise, a node that comes up briefly during maintenance may start new jobs, only to be shut down again, killing those users' jobs. Here's how to safely pull a node out of service in the three schedulers our customers use most.

Grid Engine:
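Use qmod -d to disable or qmod -e to enable a queue instance, written as queuename@hostname. Use * as the queue name to match all queues on a host, and quote the pattern so the shell does not try to expand it. Examples:
Disable: qmod -d '*@node01'
Enable: qmod -e '*@node01'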
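
If you disable and enable hosts often, the two qmod calls can be wrapped in a small shell function. The following is only a sketch: the function name sge_set_host and the DRY_RUN variable are our own hypothetical conventions, not part of Grid Engine. By default it prints the command it would run, so you can check it before touching a live queue.

```shell
#!/bin/sh
# Hypothetical helper: disable or enable every queue instance on one host.
# DRY_RUN is an assumed convention (default 1): print the command, do not run it.
sge_set_host() {
    action=$1   # "disable" or "enable"
    host=$2
    case $action in
        disable) flag=-d ;;
        enable)  flag=-e ;;
        *) echo "usage: sge_set_host disable|enable <host>" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Show the pattern quoted, the way it should be typed at a prompt.
        echo "qmod $flag '*@$host'"
    else
        # Quoting keeps the shell from glob-expanding the asterisk.
        qmod "$flag" "*@$host"
    fi
}

sge_set_host disable node01
```

Run with DRY_RUN=0 to execute the command for real. Afterwards, qstat -f -q '*@node01' should show a d in the states column for each queue instance on the host. Jobs already running there continue until they finish.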

Slurm:
Modify the state with scontrol, specifying the node and the new state. You must provide a reason when disabling a node.
Disable: scontrol update NodeName=node[02-04] State=DRAIN Reason="Cloning"
Enable: scontrol update NodeName=node[02-04] State=RESUME
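
A drained node stops accepting new jobs, but jobs already running on it are allowed to finish, so it is worth confirming the drain has completed before powering the node off. Here is a minimal sketch of that check. The helper names slurm_is_drained and slurm_pending_nodes are hypothetical, and the sample input stands in for real sinfo output so the example runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when a sinfo state string says the node
# is fully drained (no jobs left), as opposed to still draining.
slurm_is_drained() {
    case $1 in
        drained*) return 0 ;;   # may carry a suffix such as "*" (not responding)
        *)        return 1 ;;   # draining, allocated, mixed, idle, ...
    esac
}

# Hypothetical helper: read "node state" pairs on stdin and print the nodes
# that are not yet drained.
slurm_pending_nodes() {
    while read -r node state; do
        slurm_is_drained "$state" || echo "$node"
    done
}

# On a real cluster the input would come from sinfo, for example:
#   sinfo -h -N -n 'node[02-04]' -o '%N %T' | slurm_pending_nodes
# Sample input is used here instead.
printf '%s\n' 'node02 drained' 'node03 draining' 'node04 drained*' | slurm_pending_nodes
```

With the sample input, only node03 is printed because it is still draining. On a live system, squeue -w 'node[02-04]' lists the jobs you are waiting on, and sinfo -R shows the reason recorded for each drained node.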

Torque:
The pbsnodes command is used to make a node unavailable/available in Torque.
Disable: pbsnodes -o node05
Enable: pbsnodes -r node05
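
Marking a node offline does not stop jobs already running on it, so check that it is empty before shutting it down. The sketch below reads the output of pbsnodes for one node and reports whether the node is offline with no jobs listed. The helper name torque_node_idle_offline is hypothetical, and the parsing assumes the usual "state = ..." and "jobs = ..." lines; verify it against the output of your Torque version. Sample text is used so the example runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `pbsnodes <node>` output on stdin and succeed if
# the node is offline and no jobs are listed.
# Assumes attribute lines of the form "state = ..." and "jobs = ...".
torque_node_idle_offline() {
    awk '
        $1 == "state" && /offline/          { offline = 1 }
        $1 == "jobs" && $2 == "=" && NF > 2 { busy = 1 }
        END { exit !(offline && !busy) }
    '
}

# On a real cluster: pbsnodes node05 | torque_node_idle_offline
# Sample output is used here instead.
sample='node05
     state = offline
     np = 8'

if printf '%s\n' "$sample" | torque_node_idle_offline; then
    echo "node05 is offline with no jobs listed"
fi
```

You can also run pbsnodes -l to list the nodes currently marked down or offline.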

Each of these schedulers offers many more control options for queues, hosts, and other objects. The commands above are a good starting point for maintaining individual nodes while keeping the rest of your cluster in production.
