Taking Compute Nodes Down for Maintenance

Expand your knowledge of hardware, software and supercomputing

Before taking a compute node down for any reason, remove it from every job queue it belongs to. Otherwise, a node that comes up briefly during maintenance may start new jobs, only to be shut down again, killing those users' jobs. Here's how to safely pull a node out of service in the three schedulers our customers use most often.

Grid Engine:
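Use qmod -d to disable and qmod -e to enable a queue instance, given as queuename@hostname. Use * in place of the queue name to match all queues on a host, and quote the pattern so the shell does not try to expand it. Examples:
Disable: qmod -d '*@node01'
Enable: qmod -e '*@node01'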
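
For repeated use, the two qmod calls could be wrapped in a small helper. The following is only a sketch: the function name sge_set_host and the RUN variable are hypothetical and not part of Grid Engine. By default it prints the command instead of running it, so it can be tried safely away from the cluster.

```shell
#!/bin/sh
# Hypothetical wrapper around qmod; only "qmod -d" and "qmod -e" come from
# the article. RUN defaults to "echo", so the command is printed (dry run).
# Set RUN to an empty string on a Grid Engine admin host to execute it.
RUN=${RUN-echo}

sge_set_host() {
    action=$1   # "disable" or "enable"
    host=$2     # for example node01
    case $action in
        disable) flag=-d ;;
        enable)  flag=-e ;;
        *) echo "usage: sge_set_host disable|enable HOST" >&2; return 2 ;;
    esac
    # The quoted wildcard reaches qmod unchanged, so qmod (not the shell)
    # matches every queue instance on the host.
    $RUN qmod "$flag" "*@$host"
}
```

Calling sge_set_host disable node01 prints the qmod command in dry-run mode. Running it as RUN= sge_set_host disable node01 would execute it instead.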

Slurm:
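Modify the state with scontrol update, specifying the node and the new state. A node in the DRAIN state accepts no new jobs, while jobs already running on it are allowed to finish. You must provide a reason when disabling a node.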
Disable: scontrol update NodeName=node[02-04] State=DRAIN Reason="Cloning"
Enable: scontrol update NodeName=node[02-04] State=RESUME
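
Because draining lets running jobs finish, it can help to confirm that a node is fully drained before powering it off. The sketch below wraps the two scontrol calls and adds a small check on a state string. The function names and the RUN variable are hypothetical, and the check assumes state strings of the form reported by sinfo, such as "draining" and "drained".

```shell
#!/bin/sh
# Hypothetical helpers; only the scontrol update commands come from the
# article. RUN defaults to "echo" (dry run: print the command only).
# Set RUN to an empty string to execute on a real cluster.
RUN=${RUN-echo}

slurm_drain() {
    # $1 = node list such as 'node[02-04]', $2 = reason text
    $RUN scontrol update "NodeName=$1" State=DRAIN "Reason=$2"
}

slurm_resume() {
    # $1 = node list to return to service
    $RUN scontrol update "NodeName=$1" State=RESUME
}

# Succeeds only when the state string starts with "drained".
# "draining" means jobs are still running on the node.
state_is_drained() {
    case $1 in
        drained*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

On a live cluster the state string could be taken from sinfo, for example state_is_drained "$(sinfo -h -n node02 -o %T)". Note that the dry-run output does not show shell quoting, so a reason containing spaces will look unquoted when printed.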

Torque:
Use the pbsnodes command to mark a node offline, which stops new jobs from being scheduled on it, and to bring it back into service.
Disable: pbsnodes -o node05
Enable: pbsnodes -r node05
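
The Slurm example above covers several nodes at once with a bracketed range. For Torque, a short loop can do the same by calling pbsnodes once per node. This is a minimal sketch: the function name torque_set_range and the RUN variable are hypothetical, and it assumes node names are a prefix followed by a two-digit number, as in node05.

```shell
#!/bin/sh
# Hypothetical helper; only "pbsnodes -o" and "pbsnodes -r" come from the
# article. RUN defaults to "echo" (dry run: print each command).
# Set RUN to an empty string to execute on a real Torque server.
RUN=${RUN-echo}

torque_set_range() {
    action=$1   # "offline" or "online"
    prefix=$2   # for example node
    i=$3        # first node number as a plain integer, for example 5
    last=$4     # last node number as a plain integer, for example 7
    case $action in
        offline) flag=-o ;;
        online)  flag=-r ;;
        *) echo "usage: torque_set_range offline|online PREFIX FIRST LAST" >&2
           return 2 ;;
    esac
    while [ "$i" -le "$last" ]; do
        # Build names such as node05 (assumed two-digit, zero-padded).
        $RUN pbsnodes "$flag" "$(printf '%s%02d' "$prefix" "$i")"
        i=$((i + 1))
    done
}
```

For example, torque_set_range offline node 5 7 prints one pbsnodes -o line each for node05, node06, and node07 in dry-run mode.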

Each of these schedulers offers many more control options for queues, hosts, and other objects. The commands above are a good starting point for servicing individual nodes while the rest of your cluster stays in production.

Get More Tech Tips
Visit the Advanced Clustering Technologies Knowledge Base for more tech tips from our HPC engineers.
