C.6.1
Broken Intermediate Link
Sometimes message traffic passes through the fabric while other traffic appears to
be blocked. In this case, MPI jobs fail to run.
In large cluster configurations, switches may be attached to other switches in order
to supply the necessary inter-node connectivity. Problems with these inter-switch
(or intermediate) links are sometimes more difficult to diagnose than a failure of the
final link between a switch and a node. The failure of an intermediate link may allow
some traffic to pass through the fabric while other traffic is blocked or degraded.
If you encounter such behavior in a multi-layer fabric, check that all switch cable
connections are correct. Statistics for managed switches are available on a per-port
basis and may help with debugging. Contact your switch vendor for more information.
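If the OpenFabrics diagnostic utilities are installed on the hosts (typically the
infiniband-diags package; whether it is present depends on your installation), the
fabric topology and per-port error counters can also be inspected from a node. The
following commands are only a sketch of that approach; the LID (12) and port
number (1) are placeholders for the inter-switch port you want to examine:
$ ibnetdiscover > topology.out   # map the fabric, including inter-switch links
$ perfquery 12 1                 # read the error counters for port 1 of the device at LID 12
$ perfquery -R 12 1              # reset the counters after recording them
Error counts that keep climbing on an inter-switch port usually point to a bad
cable, connector, or port.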
C.7
Performance Issues
Performance issues that are currently being addressed are covered in this section.
C.7.1
MVAPICH Performance Issues
Performance tuning for MVAPICH over OpenFabrics over InfiniPath has not yet
been done. Improved performance will be delivered in future releases.
C.8
InfiniPath MPI Troubleshooting
Problems specific to compiling and running MPI programs are detailed below.
C.8.1
Mixed Releases of MPI RPMs
Make sure that all of the MPI RPMs are from the same release. mpirun reports an
error if components of the MPI RPMs come from different releases. The following
sample output shows the failure that occurs when mpirun from release 1.3 is used
with a 2.0 library:
$ mpirun -np 2 -m ~/tmp/x2 osu_latency
MPI_runscript-xqa-14.0: ssh -x> Cannot detect InfiniPath interconnect.
MPI_runscript-xqa-14.0: ssh -x> Seek help on loading InfiniPath interconnect driver.
MPI_runscript-xqa-15.1: ssh -x> Cannot detect InfiniPath interconnect.
MPI_runscript-xqa-15.1: ssh -x> Seek help on loading InfiniPath interconnect driver.
MPIRUN: Node program(s) exited during connection setup
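To verify which releases are installed, you can query the RPM database on each
node before running jobs. This is only a sketch; the grep pattern and the host
name node01 are examples, so adjust them to match the package names and nodes
at your site:
$ rpm -qa | grep -i -e infinipath -e mpi | sort   # list MPI and InfiniPath packages and versions on this node
$ ssh node01 'rpm -qa | grep -i -e infinipath -e mpi | sort'   # repeat the check on another node
Every node should report the same versions; reinstall any package that does not
match before rerunning the job.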