B – Integration with a Batch Queuing System
B-4 IB6054601-00 D
The following command will terminate all processes using the InfiniPath
interconnect:
# /sbin/fuser -k /dev/ipath
For more information, see the man pages for fuser(1) and lsof(8).
NOTE: Run these commands as root to ensure that all processes are reported.
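Before killing anything, it can help to see which processes hold the device open. The following is a minimal sketch, assuming the device path /dev/ipath from the command above. The function name list_device_users and the existence check are illustrative and are not part of InfiniPath.

```shell
# Show which processes have the device open, without killing anything.
# list_device_users is a hypothetical helper name used only for this sketch.
list_device_users() {
    dev=$1
    if [ ! -e "$dev" ]; then
        echo "device not found: $dev"
        return 1
    fi
    # -v prints the user, PID, and command of each process using the file.
    /sbin/fuser -v "$dev"
}

# Run as root so that processes owned by other users are included.
list_device_users /dev/ipath || echo "nothing to report for /dev/ipath"
```

Once the list looks right, the same path can be passed to fuser -k as shown above.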
B.2 Lock Enough Memory on Nodes When Using SLURM
This information is identical to that provided in Appendix C.8.11; it is repeated
here for your convenience.
InfiniPath MPI requires the ability to lock (pin) memory during data transfers on each
compute node. This is normally done via /etc/initscript, which is created or
modified during the installation of the infinipath RPM (setting a limit of 64 MB
with the command "ulimit -l 65536").
Some batch systems, such as SLURM, propagate the user’s environment from the
node where you start the job to all the other nodes. For these batch systems, you
may need to make the same change on the node from which you start your batch
jobs.
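A quick way to see whether the node you submit from already has a sufficient limit is to compare the output of ulimit -l with the 65536 KB value mentioned above. This is a minimal sketch; the helper name memlock_ok and the variable REQUIRED_KB are assumptions made for illustration.

```shell
# Target locked-memory limit in KB (64 MB), the value set by "ulimit -l 65536".
REQUIRED_KB=65536

# memlock_ok LIMIT: succeed if LIMIT (as printed by "ulimit -l") is
# "unlimited" or a number of at least REQUIRED_KB. Hypothetical helper.
memlock_ok() {
    case "$1" in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as too low
    esac
    [ "$1" -ge "$REQUIRED_KB" ]
}

current=$(ulimit -l)
if memlock_ok "$current"; then
    echo "locked-memory limit OK: $current"
else
    echo "locked-memory limit too low: $current (want at least $REQUIRED_KB KB)"
fi
```

Run it in the same shell from which you start your batch jobs, since that shell's limit is the one that is propagated.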
If /etc/initscript is not present, or the node has not been rebooted since the
infinipath RPM was installed, a failure message similar to this will be generated:
$ mpirun -m ~/tmp/sm -np 2 -mpi_latency 1000 1000000
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory
mpi_latency:
/fs2/scratch/infinipath-build-1.3/mpi-1.3/mpich/psm/src
mq_ips.c:691:
mq_ipath_sendcts: Assertion ‘rc == 0’ failed.
MPIRUN: Node program unexpectedly quit. Exiting.
You can check the ulimit -l on all the nodes by running ipath_checkout. A
warning will be given if ulimit -l is less than 4096.
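If you want to see the raw value on each node, a simple loop over a hosts file can report it. This is not how ipath_checkout works; it is a rough sketch that assumes passwordless ssh, a remote shell that supports ulimit -l, and a hosts file with one host per line, optionally written as host:count. The helper names host_of and report_memlock are hypothetical.

```shell
# Print the host name from a hosts-file line, dropping an optional ":count" suffix.
# Assumption: one host per line, optionally "host:count"; adjust if your file differs.
host_of() {
    printf '%s\n' "${1%%:*}"
}

# Report "ulimit -l" for every host listed in the file given as $1.
report_memlock() {
    while IFS= read -r line; do
        case "$line" in ''|'#'*) continue ;; esac   # skip blank lines and comments
        host=$(host_of "$line")
        # -n keeps ssh from reading the rest of the hosts file on stdin.
        limit=$(ssh -n "$host" 'ulimit -l' 2>/dev/null) || limit="unreachable"
        echo "$host: $limit"
    done < "$1"
}
```

For example, report_memlock ~/tmp/sm would print one line per host in the hosts file used in the mpirun command above.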
There are two possible solutions. If infinipath is not installed on the node
where you start the job, set the limit as follows. You must be root to set it:
# ulimit -l 65536
Or, if you have installed infinipath on the node, reboot it to ensure that
/etc/initscript is run.