Q-Logic IB6054601-00 D Switch User Manual


 
C – Troubleshooting
InfiniPath MPI Troubleshooting
$ mpirun -np 2 -m ~/tmp/q -q 60 mpi_latency 1000000 1000000
MPIRUN: MPI progress Quiescence Detected after 9000 seconds.
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.
MPIRUN: Per-rank details are the following:
MPIRUN: Rank 0 (<nodename>) caused MPI progress Quiescence.
MPIRUN: Rank 1 (<nodename>) caused MPI progress Quiescence.
MPIRUN: both MPI progress and Ping Quiescence Detected after 120 seconds.
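When a long job log has to be searched for the ranks named in a quiescence report, a small shell function can pull them out. The following is a minimal sketch, assuming the job output was saved to a file and that the per-rank lines have the format shown in the example above. quiescent_ranks is a hypothetical helper name, not part of mpirun.

```shell
# Sketch: list the ranks that mpirun reported as quiescent.
# Assumption: per-rank lines look like
#   MPIRUN: Rank N (<node>) caused MPI progress Quiescence.
# Other releases may word the message differently.
quiescent_ranks() {
    # $1 is a file containing the saved mpirun output.
    # Prints one "rank node" pair per matching line.
    sed -n 's/^MPIRUN: Rank \([0-9][0-9]*\) (\(.*\)) caused MPI progress Quiescence\.$/\1 \2/p' "$1"
}
```

For example, if the output was saved with mpirun ... 2>&1 | tee job.log, then quiescent_ranks job.log prints one rank number and node name per line.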
Occasionally, a stray process continues to exist outside its context. mpirun checks
for stray processes and kills them when they are detected. The following is an example of
the type of message you will see in this case:
$ mpirun -np 2 -ppn 1 -m ~/tmp/mfast mpi_latency 500000 2000
iqa-38: Received 1 out-of-context eager message(s) from stray process PID=29745 running on host 192.168.9.218
iqa-35: PSM pid 10513 on host IP 192.168.9.221 has detected that I am a stray process, exiting.
2000 5.222116
iqa-38:1.ips_ptl_report_strays: Process PID=29745 on host IP=192.168.9.218 sent 1 stray message(s) and was told so 1 time(s) (first stray message at 0.7s (13%),last at 0.7s (13%) into application run)
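To follow up on a stray-process report, it can help to extract the PID and host address so that the node can be checked by hand. The following is a minimal sketch, assuming the job output was saved to a file and that the summary has the "Process PID=... on host IP=..." form shown in the example above. stray_processes is a hypothetical helper name, and grep -o is a widely available extension (GNU and BSD grep), not a POSIX option.

```shell
# Sketch: print "pid host-ip" for each stray process named in saved job output.
# Assumption: the summary contains "Process PID=<pid> on host IP=<addr>",
# possibly wrapped across lines.
stray_processes() {
    # $1 is a file containing the saved job output.
    # Join wrapped lines, keep only the PID/host fragments, and de-duplicate.
    tr '\n' ' ' < "$1" |
        grep -o 'Process PID=[0-9]* on host IP=[0-9.]*' |
        sed 's/Process PID=\([0-9]*\) on host IP=\(.*\)/\1 \2/' |
        sort -u
}
```

Each output line gives a PID and the address of the host it was running on, which can then be inspected on that host (for example, with ps) to confirm that the process has exited.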
The following should never occur. Please inform Support if it does:
Internal Error: NULL function/argument found:func_ptr(arg_ptr)
C.8.12.3 Driver and Link Error Messages Reported by MPI Programs
Two types of error messages are described below.
1. When the InfiniBand link fails during a job, a message will be reported once
per occurrence. The message will be similar to this:
ipath_check_unit_status: IB Link is down
This can happen when a cable is disconnected, when a switch is rebooted, or when
there are other problems with the link. The job will continue retrying until the
quiescence interval expires. See the mpirun -q option for information on
quiescence.
2. If a hardware problem occurs, an error similar to this will be reported:
infinipath: [error strings] Hardware error
This will cause the MPI program to terminate. The error string may provide
additional information about the problem. To narrow down the source of the
problem, examine syslog on the node that reported it.
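Examining syslog for the two message types above can be sketched as a short shell function. This is a minimal sketch under stated assumptions: /var/log/messages is an assumed default location (it varies by distribution and syslog configuration), the search strings come from the example messages above, and infinipath_syslog is a hypothetical helper name.

```shell
# Sketch: show the most recent InfiniPath-related entries in a syslog file.
# Assumptions: the log lives at /var/log/messages unless a path is given,
# and driver messages contain "infinipath" or "ipath_" as in the examples.
infinipath_syslog() {
    log=${1:-/var/log/messages}   # syslog file to search (assumed default)
    count=${2:-20}                # number of matching lines to show
    grep -i -E 'infinipath|ipath_' "$log" | tail -n "$count"
}
```

Run it on the node that reported the error, for example infinipath_syslog /var/log/messages 50. Reading the system log usually requires root privileges.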