
Running MPI Jobs

One option is to produce a hostfile and feed it directly to the mpirun command of the appropriate MPI distribution. The disadvantage of this approach is that it does not integrate with Slurm, so it does not provide advanced features such as task affinity, accounting, etc.
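
For illustration, a minimal sketch of this approach, assuming the mpirun in question accepts a -machinefile flag (the exact flag name varies by distribution; some use -hostfile or -f):

# generate a hostfile from the Slurm allocation, one line per node
scontrol show hostnames $SLURM_JOB_NODELIST > hostfile.$SLURM_JOB_ID
# feed it to a generic mpirun; -machinefile is an assumption, adjust for your distribution
mpirun -machinefile hostfile.$SLURM_JOB_ID -np $SLURM_NTASKS $EXE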

Another option is to use the process manager built into Slurm and launch the MPI executable through the srun command. How to do this for various MPI distributions is described at http://slurm.schedmd.com/mpi_guide.html. Some MPI distributions' mpirun commands integrate with Slurm, in which case it is more convenient to use them instead of srun.

For the MPI distributions at CHPC, the following works (assuming the MPI program is internally threaded with OpenMP).

Intel MPI

module load [intel,gcc] impi
# for a cluster with Ethernet only, set network fabrics to TCP
setenv I_MPI_FABRICS shm:tcp
# for a cluster with InfiniBand, set network fabrics to OFA
setenv I_MPI_FABRICS shm:ofa
# on lonepeak owner nodes, use the TMI interface (InfiniPath)
setenv I_MPI_FABRICS shm:tmi

# IMPI option 1 - launch with PMI library - currently not using task affinity, use mpirun instead
setenv I_MPI_PMI_LIBRARY /uufs/CLUSTER.peaks/sys/pkg/slurm/std/lib/libpmi.so
#srun -n $SLURM_NTASKS $EXE >& run1.out
# IMPI option 2 - bootstrap
mpirun -bootstrap slurm -np $SLURM_NTASKS $EXE  >& run1.out
 

MPICH2

Launch MPICH2 jobs with mpiexec as explained at http://slurm.schedmd.com/mpi_guide.html#mpich2. That is:

module load [intel,gcc,pgi] mpich2
setenv MPICH_NEMESIS_NETMOD mxm # default is Ethernet, choose mxm for InfiniBand
mpirun -np $SLURM_NTASKS $EXE
 

OpenMPI

Use the mpirun command from the OpenMPI distribution. There is no need to specify a hostfile, as OpenMPI obtains the node list from Slurm. To run:

module load [intel,gcc,pgi] openmpi
mpirun --mca btl tcp,self -np $SLURM_NTASKS $EXE # for Ethernet-only clusters, such as the general lonepeak nodes
mpirun -np $SLURM_NTASKS $EXE # for clusters with InfiniBand

Note that OpenMPI supports multiple network interfaces, which allows a single MPI executable to run across all CHPC clusters, including the InfiniPath network on lonepeak.
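
As an illustrative sketch only, the network choice could be keyed off the node name inside a single job script; the lp hostname prefix for lonepeak is an assumption and should be adjusted to the clusters you use:

# pick the OpenMPI launch line based on the cluster network
if ( `hostname -s` =~ lp* ) then
  mpirun --mca btl tcp,self -np $SLURM_NTASKS $EXE  # Ethernet-only nodes
else
  mpirun -np $SLURM_NTASKS $EXE                     # InfiniBand clusters
endif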

MVAPICH2

An MVAPICH2 executable can be launched with the mpirun command (preferred) or with srun, in which case the --mpi=none flag is needed. To run multi-threaded code, set OMP_NUM_THREADS and MV2_ENABLE_AFFINITY=0 (so that the MPI tasks are not locked to a single core) before calling srun.

module load [intel,gcc,pgi] mvapich2
setenv OMP_NUM_THREADS 6 # optional number of OpenMP threads
setenv MV2_ENABLE_AFFINITY 0 # disable process affinity - only for multi-threaded programs
mpirun -np $SLURM_NTASKS $EXE # mpirun is recommended
srun -n $SLURM_NTASKS --mpi=none $EXE # srun is optional

 

Sample MPI job Slurm script

#!/bin/csh
#SBATCH --time=1:00:00       # 1 hour walltime, abbreviated by -t
#SBATCH --nodes=2            # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N   # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH -e slurm-%j.err-%N   # name of the stderr, using job and first node values
#SBATCH --ntasks=16          # number of MPI tasks, abbreviated by -n
#SBATCH --account=baggins    # account - abbreviated by -A
#SBATCH --partition=lonepeak # partition, abbreviated by -p

# set data and working directories
setenv WORKDIR $HOME/mydata
setenv SCRDIR /scratch/general/vast/$USER/$SLURM_JOBID
mkdir -p $SCRDIR
cp -r $WORKDIR/* $SCRDIR
cd $SCRDIR

# load appropriate modules, in this case Intel compilers, MPICH2
module load intel mpich2
# for MPICH2 over Ethernet, set communication method to TCP
# see below for network interface selection options for different MPI distributions
setenv MPICH_NEMESIS_NETMOD tcp

# run the program
# see below for ways to do this for different MPI distributions
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out

 

The #SBATCH directives denote the Slurm flags, which pass job requirements to Slurm. The rest of the script contains the instructions for running your job. Note that we are using the built-in $SLURM_NTASKS variable to denote the number of MPI tasks to run. In the case of a plain MPI job, this number should equal the number of nodes ($SLURM_NNODES) multiplied by the number of cores per node.
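
As a quick sanity check, this relationship can be verified inside the job script; the sketch below assumes uniform nodes and uses SLURM_CPUS_ON_NODE, which reports the core count of the node the batch script runs on:

# for a plain (non-threaded) MPI job on uniform nodes, tasks = nodes * cores per node
set EXPECTED = `expr $SLURM_NNODES \* $SLURM_CPUS_ON_NODE`
if ( $SLURM_NTASKS != $EXPECTED ) then
  echo "Warning: requested $SLURM_NTASKS tasks, but the allocation provides $EXPECTED cores"
endif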

Also note that some packages have not been built with MPI distributions that support Slurm, in which case you will need to specify the hosts to run on via a machinefile flag passed to the mpirun command of the appropriate MPI distribution. Please visit the package help page for details and the appropriate script(s) in these cases. Additional information on the creation of a machinefile is also given in the table below discussing Slurm environment variables.
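
As a hedged sketch of what such a machinefile launch may look like (one line per MPI task; the -machinefile flag name is an assumption and differs between distributions):

# build a machinefile with one line per MPI task in the allocation
srun hostname | sort > nodefile.$SLURM_JOB_ID
# hand it to an mpirun that is not Slurm-aware
mpirun -np $SLURM_NTASKS -machinefile nodefile.$SLURM_JOB_ID my_mpi_program > my_program.out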

For mixed MPI/OpenMP runs, you can either hard-code OMP_NUM_THREADS in the script or use logic such as that shown below to set it from the Slurm job information. When requesting resources, request the number of MPI tasks and the number of nodes to run on, not the total number of cores the MPI/OpenMP tasks will use.

#SBATCH -N 2
#SBATCH -n 4
#SBATCH -C "c12" # we want to run on uniform core count nodes

# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS
# set thread affinity to CPU socket
setenv KMP_AFFINITY verbose,granularity=core,compact,1,0

mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out
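
Note that KMP_AFFINITY is specific to binaries built with the Intel compilers. For codes built with GCC, the standard OpenMP affinity variables can be used instead, for example:

# standard OpenMP (4.0+) affinity controls, honored by the GNU and Intel runtimes
setenv OMP_PROC_BIND spread
setenv OMP_PLACES cores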

 

Alternatively, the Slurm directive option -c, or --cpus-per-task, can be used:

#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6

setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out

 

Note that if you use this on a cluster with nodes that have varying core counts, Slurm is free to pick any nodes that match the minimum requirements of your job, denoted by the #SBATCH directives, so the job's nodes may be undersubscribed (e.g. on ash, the above options would fully subscribe the 12-core nodes but undersubscribe the 20- or 24-core nodes).

 

"Access denied by pam_slurm_adopt" Error

We have received reports that when running MPI jobs under certain circumstances, specifically when the job has no initial setup and therefore starts immediately with the mpirun step, a race condition can occur in which the job tries to start before the worker nodes are ready, resulting in this error:

Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

 

In this case, the solution is to add a sleep before the mpirun:

sleep 30
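
For example, placed immediately before the MPI launch in the job script (a sketch reusing the placeholder program name from the sample script above):

# give the worker nodes a moment to finish job setup before launching
sleep 30
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out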

 

This issue stems from a tool CHPC uses on all clusters, pam_slurm_adopt, which captures processes started over ssh into the correct cgroup, processes that would otherwise land outside the cgroup. To do this, pam_slurm_adopt has the remote system return the node on which the mpirun/ssh call was made, confirms that the associated job also exists on the new node, and adopts the process into the cgroup. Processes initiated with ssh contrast with processes initiated by the srun command: srun goes through the usual Slurm paths, which spawn the remote process directly into the cgroup and do not require the same back-and-forth callbacks.

 

Using /scratch/local

Users can no longer create directories in the top level /scratch/local directory. Instead, as part of the Slurm job prolog (before the job is started), a job level directory, /scratch/local/$USER/$SLURM_JOB_ID, will be created. Only the job owner will have access to this directory. At the end of the job (slurm epilog), this job level directory will be removed.

These changes solve two main issues. First, they isolate jobs from one another, allow cleanup of the /scratch/local file system, and eliminate cases where a new job starts on a node with a full or nearly full /scratch/local. Second, they eliminate the situation where a job begins only to fail because it cannot write to /scratch/local due to hardware issues. Since the top-level directory creation now happens in the job prolog, any problem creating the job-level directory causes the node to be off-lined and the job to be moved to a new node.

As an example, here is a script for a single-node job making use of /scratch/local:

#!/bin/csh
#SBATCH --time=1:00:00       # walltime, abbreviated by -t
#SBATCH --nodes=1            # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N   # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH -e slurm-%j.err-%N   # name of the stderr, using job and first node values
#SBATCH --account=baggins    # account - abbreviated by -A
#SBATCH --partition=lonepeak # partition, abbreviated by -p

# set data and working directories
setenv WORKDIR $HOME/mydata
setenv SCRDIR /scratch/local/$USER/$SLURM_JOB_ID
cp -r $WORKDIR/* $SCRDIR
cd $SCRDIR

# load appropriate modules
module load yourappmodule

# run the program
myprogram myinput > myoutput

# copy output back to WORKDIR
cp $SCRDIR/myoutput $WORKDIR/.

 

If your script currently includes mkdir -p /scratch/local/$USER/$SLURM_JOB_ID, it will still run properly with this change. Depending on your program, you may want to ensure that any output necessary for restarting it is written to your home directory or group space instead of to the local scratch space.
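
For example, at the end of the job any restart file can be copied back alongside the output; mycheckpoint below is a placeholder for whatever restart file your program writes:

# copy the restart file to home/group space, since the per-job local scratch
# directory is removed when the job ends
cp $SCRDIR/mycheckpoint $WORKDIR/.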

 

A Note on Shared Nodes, Arrays, and Multiple Serial Calculations within one Job

On July 1, 2019 node sharing was enabled on all CHPC clusters in the general environment. This is the best option when a user has a single job that does not need an entire node. For more information see the Node Sharing page.

Please note that when a user has many calculations to submit at the same time, whether each needs a portion of a node or an entire node, there may be better options than submitting each calculation as a separate batch job; for these cases, see our page dedicated to running multiple serial jobs as well as our documentation on job arrays.

Last Updated: 9/17/24