Troubleshooting OpenMPI Invocation Problems
OpenMPI supports a large number of transport mechanisms, from shared memory on the local machine to IP over Ethernet or even RDMA over InfiniBand. With default settings, when you start your program using mpirun, OpenMPI will choose the best interface available. Unfortunately, the selection logic isn’t foolproof, and sometimes your job will appear to hang without ever running your code.
The first step in troubleshooting OpenMPI invocation problems is to add the --debug-devel parameter, or -d for short. Unlike the --debug and --debugger arguments, which invoke parallel debuggers to debug user code, the --debug-devel option increases OpenMPI's log verbosity.
Adding this option to your mpirun command generates far more output: it shows which transport is being used and which addresses are involved, and it can give you hints on where to continue your investigation.
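As a rough illustration, the Python sketch below adds the debug flag to an existing mpirun command line and runs it with a timeout, so a launch that hangs still returns whatever output it produced. The helper names (build_debug_command, run_with_timeout), the 60-second default, and the example program ./my_app are assumptions made for this example, not part of OpenMPI.

```python
import subprocess


def build_debug_command(base_cmd, short_flag=False):
    """Return a copy of an mpirun argument list with the debug flag added.

    The flag goes directly after the launcher name so that it is parsed as
    an mpirun option rather than as an argument to the user program.
    """
    flag = "-d" if short_flag else "--debug-devel"
    return [base_cmd[0], flag] + list(base_cmd[1:])


def run_with_timeout(cmd, timeout_s=60):
    """Run cmd and return (returncode, combined stdout/stderr text).

    If the command does not finish within timeout_s seconds it is killed
    and returncode is None; any output captured so far is still returned.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        out = exc.output or ""
        # Depending on the Python version, partial output may arrive as bytes.
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return None, out
```

For example, build_debug_command(["mpirun", "-np", "4", "./my_app"]) yields ["mpirun", "--debug-devel", "-np", "4", "./my_app"]. Passing that list to run_with_timeout and getting a return code of None suggests the launch hung, and the captured text is the place to look for the transport and address details.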
It’s important to remember that whichever transport OpenMPI uses for MPI traffic, it always uses IP for the initial job tree setup. Even if you’re using InfiniBand, routing or MTU issues on your cluster’s IP network can prevent the MPI job from starting on each node.
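One way to look for such IP-level problems is to ping each node with fragmentation prohibited and a payload sized to the expected MTU. The sketch below assumes the Linux iputils ping (its -M do option sets the don't-fragment bit) and plain IPv4 headers without options; the function names and any hostnames you pass in are hypothetical.

```python
import subprocess

# 20-byte IPv4 header (no options) + 8-byte ICMP header
IP_ICMP_OVERHEAD = 28


def icmp_payload_for_mtu(mtu):
    """Largest ICMP payload that fits in a single unfragmented IPv4 packet."""
    return mtu - IP_ICMP_OVERHEAD


def build_ping_command(host, mtu=1500, count=1):
    """Build a Linux iputils ping command that forbids fragmentation."""
    return [
        "ping",
        "-c", str(count),      # number of echo requests
        "-W", "2",             # seconds to wait for each reply
        "-M", "do",            # set the don't-fragment bit
        "-s", str(icmp_payload_for_mtu(mtu)),
        host,
    ]


def check_hosts(hosts, mtu=1500):
    """Return {host: True/False} for whether a full-size ping succeeded."""
    results = {}
    for host in hosts:
        try:
            proc = subprocess.run(
                build_ping_command(host, mtu),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=10,
            )
            results[host] = proc.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # Treat a hung or missing ping binary as a failed check.
            results[host] = False
    return results
```

A node that answers with the default MTU of 1500 but fails with mtu=9000 on a network that is meant to carry jumbo frames points to an MTU mismatch somewhere along the path. A node that fails both checks is more likely a routing, name resolution, or firewall problem.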