Repairing a corrupted SGE database

Note: Before running these steps, confirm why sgemaster is failing to start. The logs, located in /act/sge/default/spool/qmaster/messages, should show some indication of database corruption. A typical corruption error message may look like this:

03/07/2015 17:34:07| main|head|E|couldn't open berkeley database "sge": (22) Invalid argument
03/07/2015 17:34:07| main|head|E|startup of rule "default rule" in context "berkeleydb spooling" failed
03/07/2015 17:34:07| main|head|C|setup failed

or

03/12/2015 13:07:08| main|head|E|couldn't open database environment for server "local spooling", directory "/act/sge/default/spool/spooldb": (-30974) DB_RUNRECOVERY: Fatal error, run database recovery
03/12/2015 13:07:08| main|head|E|startup of rule "default rule" in context "berkeleydb spooling" failed
03/12/2015 13:07:08| main|head|C|setup failed
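
As a quick check, the log can be searched for messages like the two examples above. The following is a minimal sketch, not an SGE tool: the function names find_bdb_errors and corrupt_db_names are made up for illustration, and the patterns only match the wording shown in the examples. The second function picks up the database name only from the first form of the error, since the DB_RUNRECOVERY form names a directory rather than a database.

```shell
# Hypothetical helpers (not part of SGE). Both read a qmaster messages file;
# the default path is the log location given in the note above.

# Print every line that looks like a Berkeley DB spooling error.
find_bdb_errors() {
    log="${1:-/act/sge/default/spool/qmaster/messages}"
    grep -E "couldn't open (berkeley database|database environment)|DB_RUNRECOVERY" "$log"
}

# Print the unique database names quoted in
# 'couldn't open berkeley database "NAME"' errors.
corrupt_db_names() {
    log="${1:-/act/sge/default/spool/qmaster/messages}"
    sed -n "s/.*couldn't open berkeley database \"\([^\"]*\)\".*/\1/p" "$log" | sort -u
}
```

For example, `find_bdb_errors | tail -n 5` shows the most recent matches, and `corrupt_db_names` lists which databases were named in them.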

If your filesystem ever fills up or the system crashes at the wrong time, your SGE database may become corrupted. In the error messages, note which database is mentioned. In our example errors, “sge” is the corrupted database. The steps below can usually repair the “sge” database so that SGE runs properly again. The same steps also work for the “sge_job” database; substitute its name for “sge”.

cd $SGE_ROOT/default/spool
cp -a spooldb spooldb.bak
cd spooldb
db_verify sge
db_recover
db_dump -f sge.out sge
mv sge sge.old
db_load -f sge.out sge
db_verify sge
chown -R sgeadmin. $SGE_ROOT/default/spool
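
Because the same sequence applies to both “sge” and “sge_job”, it can be wrapped in a small function that takes the database name as an argument. This is only a sketch under stated assumptions: run, repair_spool_db and DRY_RUN are hypothetical names, not part of SGE or Berkeley DB, and the commands are exactly the ones listed above. The first db_verify is treated as diagnostic, since it is expected to fail on a corrupted database; every other step stops the run if it fails.

```shell
# Sketch only: run, repair_spool_db and DRY_RUN are hypothetical names.
# Usage: repair_spool_db sge_job      (the database name defaults to "sge")
#        DRY_RUN=1 prints each command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

repair_spool_db() (
    set -e                                   # abort on the first failing step
    db="${1:-sge}"
    spool="${SGE_ROOT:?SGE_ROOT is not set}/default/spool"
    run cd "$spool"
    run cp -a spooldb spooldb.bak            # keep an untouched copy first
    run cd spooldb
    run db_verify "$db" || echo "db_verify reported problems, continuing" >&2
    run db_recover
    run db_dump -f "$db.out" "$db"
    run mv "$db" "$db.old"
    run db_load -f "$db.out" "$db"
    run db_verify "$db"                      # the rebuilt database should verify cleanly
    run chown -R sgeadmin: "$spool"          # ':' is the portable form of 'sgeadmin.'
)
```

Running `DRY_RUN=1 repair_spool_db sge_job` first prints the commands that would be executed, which makes it easy to review them before touching the spool directory. The function body runs in a subshell, so the directory changes do not affect the calling shell.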

If the above does not work, this alternative method may work instead.

cd /act/sge
./sge_inst -bak

This starts an interactive backup script. Accept the default answers. Optionally, choosing not to use tar/gzip makes the backup easier to inspect. The settings are saved to /act/sge/backup. To fix the database corruption, restore this backup with the following command:

./sge_inst -rst

This starts another interactive script, this time to restore from the backup. Answer all the questions; the default answers should be correct. sgemaster should then start normally.
