Q-Logic IB6054601-00 D Switch User Manual


 
C – Troubleshooting
InfiniPath MPI Troubleshooting
If this file is not present or the node has not been rebooted after the infinipath
RPM has been installed, a failure message similar to this will be generated:
$ mpirun -m ~/tmp/sm -np 2 -mpi_latency 1000 1000000
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory
mpi_latency:
/fs2/scratch/infinipath-build-2.0/mpi-2.0/mpich/psm/src
mq_ips.c:691:
mq_ipath_sendcts: Assertion ‘rc == 0’ failed.
MPIRUN: Node program unexpectedly quit. Exiting.
You can check the ulimit -l value on all nodes by running ipath_checkout. A
warning is given if ulimit -l is less than 4096.
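The same comparison can be sketched by hand for a single node. This is a minimal
sketch and not part of InfiniPath: the function name memlock_ok is hypothetical, and
it assumes the limit is passed in the form printed by ulimit -l (a number, or the
word unlimited). It does not replace ipath_checkout, which checks all nodes.

```shell
#!/bin/sh
# Hypothetical helper (not an InfiniPath tool): compare a locked-memory
# limit, as printed by `ulimit -l`, against a minimum (default 4096, the
# threshold below which the text says ipath_checkout warns).
# Returns 0 if the limit is sufficient, 1 if it is too low,
# and 2 if the value is not a number and cannot be judged.
memlock_ok() {
    limit="$1"
    min="${2:-4096}"
    case "$limit" in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$limit" -ge "$min" ]
}

# Check the limit of the current shell on this node only.
current="$(ulimit -l)"
if memlock_ok "$current"; then
    echo "ulimit -l is $current: at or above 4096"
else
    echo "ulimit -l is $current: below 4096 or not readable"
fi
```

Run this in the same kind of shell session that starts the job, because the limit
can differ between login shells and other sessions.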
There are two possible solutions. If InfiniPath is not installed on the node
where you start the job, set this value as root:
# ulimit -l 65536
Or, if you have installed InfiniPath on the node, reboot it to ensure that
/etc/initscript is run.
C.8.12
Error Messages Generated by mpirun
The sections below describe the types of mpirun error messages. They fall into
these categories:
Messages from the InfiniPath Library
MPI messages
Messages relating to the InfiniPath driver and InfiniBand links
Messages generated by mpirun follow a general format:
program_name: message
function_name: message
Messages may also have different prefixes, such as ipath_ or psm_, which
indicate the part of the software in which the error occurred.
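The name: message layout can be split mechanically when scanning a long mpirun log.
The following is a minimal sketch under stated assumptions: the function name
message_prefix and the sample lines are hypothetical, the name field is taken to be
the text before the first colon followed by a space, and a prefix is reported only
when the name field starts with it.

```shell
#!/bin/sh
# Hypothetical helper: report which known prefix (ipath_ or psm_) the
# name field of a "name: message" line starts with, or "none".
message_prefix() {
    name="${1%%: *}"    # text before the first ": " (whole line if absent)
    case "$name" in
        ipath_*) echo "ipath_" ;;
        psm_*)   echo "psm_" ;;
        *)       echo "none" ;;
    esac
}

# Classify a few placeholder lines; these are illustrations of the
# layout, not real mpirun output.
while IFS= read -r line; do
    printf '%s -> %s\n' "$line" "$(message_prefix "$line")"
done <<'EOF'
program_name: message
ipath_example: message
psm_example: message
EOF
```

Real output may put a node or rank identifier in front of the function name, as in
the ipath_update_tid_err line shown earlier; such lines are reported as none by this
sketch and would need a looser match.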
C.8.12.1
Messages from the InfiniPath Library
These messages may appear in the mpirun output.
The first set are error messages, which indicate internal problems and should be
reported to Support.
Trying to cancel invalid timer (EOC)
sender rank rank is out of range (notification)
sender rank rank is out of range (ack)
Reached TIMER_TYPE_EOC while processing timers