Expand your knowledge of hardware, software and supercomputing

Repairing a corrupted SGE database

Note: It is important to understand why sgemaster is failing to start. Before running these steps, confirm that the logs show signs of database corruption. The logs are located in /act/sge/default/spool/qmaster/messages. A typical corruption error message may look like this:

    03/07/2015 17:34:07| main|head|E|couldn't open berkeley database "sge": (22) Invalid argument
    03/07/2015 17:34:07| […]
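As a rough illustration of the log check described above, the sketch below scans the qmaster messages file for lines resembling the sample error. The function names and the regular expression are assumptions made for this example; the pattern is based only on the one message quoted here, and real corruption messages may be worded differently.

```python
import re

# Pattern derived from the single sample message quoted above (an assumption):
# an error-level ("|E|") line mentioning a failed Berkeley DB open.
# "couldn.t" matches either a straight or a typographic apostrophe.
CORRUPTION = re.compile(r"\|E\|.*couldn.t open berkeley database")


def find_corruption_errors(lines):
    """Return the log lines that look like Berkeley DB open failures."""
    return [line.rstrip("\n") for line in lines if CORRUPTION.search(line)]


def scan_messages(path="/act/sge/default/spool/qmaster/messages"):
    """Scan the messages file at `path` (default taken from the text above)."""
    with open(path, errors="replace") as fh:
        return find_corruption_errors(fh)
```

Calling `scan_messages()` on the head node would return any matching lines. An empty list suggests the startup failure may have a different cause that is worth investigating before attempting a repair.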

Taking Compute Nodes Down for Maintenance

When taking a compute node down for any reason, it's good practice to remove that node from any job queues of which it is a member. A node that comes up temporarily may start new jobs, only to be shut down again, killing those users' jobs. Here's how to safely pull a node out of service […]

Creating Groups of Nodes in TORQUE

Despite being a simple first in/first out (FIFO) scheduler, pbs_sched can use node properties to emulate host groups. This can be useful if you have different types of nodes that provide different types of resources. The nodes available in TORQUE are controlled by the file /var/spool/torque/server_priv/nodes. The most basic configuration simply lists the nodes and […]
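To make the idea of properties acting as host groups concrete, here is a minimal sketch that reads nodes-file text and groups node names by property. The node names (`node01` and so on) and the properties (`bigmem`, `ib`) are hypothetical, and the parsing rules are simplifying assumptions: tokens containing `=` (such as `np=8`) are treated as attributes, every other token after the node name is treated as a property, and `#` starts a comment.

```python
from collections import defaultdict


def group_nodes_by_property(text):
    """Map each property name to the list of nodes that carry it."""
    groups = defaultdict(list)
    for raw in text.splitlines():
        # Assumption: '#' begins a comment; blank lines are skipped.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, *tokens = line.split()
        for tok in tokens:
            # Tokens like "np=8" are attributes, not properties.
            if "=" not in tok:
                groups[tok].append(name)
    return dict(groups)


# Hypothetical nodes-file contents, for illustration only.
EXAMPLE_NODES = """\
node01 np=8 bigmem
node02 np=8 bigmem ib
node03 np=16 ib
"""

print(group_nodes_by_property(EXAMPLE_NODES))
```

With properties assigned this way, a job could then request a group of nodes by property, for example with something like `qsub -l nodes=1:bigmem`.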

Use our Breakin stress test and diagnostics tool to pinpoint hardware issues and component failures.
Check out our product catalog and use our Configurator to plan your next system and get a price estimate.
