SLURM Scheduler
SLURM is a scalable open-source scheduler used on a number of world-class clusters. To align CHPC with XSEDE and other national computing resources, CHPC has switched its clusters from the PBS scheduler to SLURM. There are several short training videos about Slurm and concepts such as batch scripts and interactive jobs.
- About Slurm
- Submitting batch script
- Running interactive jobs
- Running MPI jobs
- Running shared jobs and running multiple serial calculations within one job
- Multiple jobs using job arrays
- Automatic restarting of preemptable jobs
- Handy Slurm Information
- Moab/PBS to Slurm translation
- Information on Job priority scores
- How to determine which Slurm accounts you are in
- How to log into the nodes where the job runs
- Other good sources of Information
About Slurm
Slurm (Simple Linux Utility for Resource Management) is used for managing job scheduling on clusters. It was originally created at the Livermore Computing Center and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers.
Using Slurm
There is a hard limit of 72 hours for jobs on general cluster nodes and 14 days on owner cluster nodes.
You may submit jobs to the batch system in two ways:
- Submitting a script
- Submitting an interactive job
Submitting a script to Slurm:
The creation of a batch script
To create a batch script, use your favorite text editor to create a file that contains both instructions to SLURM and instructions on how to run your job. All instructions to Slurm are prefaced by #SBATCH. It is necessary to specify both the account and the partition in your jobs on all clusters EXCEPT tangent.
#SBATCH --account=<youraccount>
#SBATCH --partition=<yourpartition>
Accounts: Your account is usually your Unix group name, typically your PI's last name. If your group has owner nodes, the account is usually <unix_group>-<cluster_abbreviation> (where the cluster abbreviation is kp, lp, notch, or ash). There is also the owner-guest account; all users have access to this account to run on the owner nodes when they are idle. Jobs run as owner-guest are preemptable. Note that on the ash cluster, the owner-guest account is called smithp-guest.
Partitions: Partitions are cluster, cluster-freecycle, pi-cl, and cluster-guest, where cluster is the full name of the cluster and cl is the abbreviated form (kingspeak and kp, notchpeak and notch, ash and ash, lonepeak and lp, redwood and rw).
To view a list of the accounts and partitions that are available to you, run the command myallocation.
Examples
In the examples below, we will suppose your PI is Frodo Baggins and has owner nodes on kingspeak (and not on notchpeak):
- General user example on lonepeak (no allocation required):
#SBATCH --account=baggins
#SBATCH --partition=lonepeak
- General user on notchpeak with allocation (Frodo still has allocation available on notchpeak):
#SBATCH --account=baggins
#SBATCH --partition=notchpeak
- General user on notchpeak without allocation (Frodo has run out of allocation):
#SBATCH --account=baggins
#SBATCH --partition=notchpeak-freecycle
- To run on Frodo's owner nodes on kingspeak:
#SBATCH --account=baggins-kp
#SBATCH --partition=baggins-kp
- To run as owner-guest on notchpeak:
#SBATCH --account=owner-guest
#SBATCH --partition=notchpeak-guest
- To run as owner-guest on ash:
#SBATCH --account=smithp-guest
#SBATCH --partition=ash-guest
- To access notchpeak GPU nodes (you need to request addition to this account):
#SBATCH --account=notchpeak-gpu
#SBATCH --partition=notchpeak-gpu
- To access kingspeak GPU nodes (you need to request addition to this account):
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu
For more examples of SLURM job scripts see the CHPC MyJobs templates.
Constraints: Constraints, specified with the #SBATCH -C line, can be used to target specific nodes. Constraints are used in conjunction with node features, an extension that allows for finer-grained specification of resources (a combined example is given after the feature list below).
Features that we have specified are:
- Core count on the node: Features are requested with the --constraint or -C flag (described in the table below), and the core count is denoted as c#, e.g., c8, c12, c16, c20, c24, c28, c32. Features can be combined with logical operators, such as | for or and & for and. For example, to request 16- or 20-core nodes, use #SBATCH -C "c16|c20".
- Amount of memory per node: This constraint takes the form m#, where the number is the amount of memory in the node in GB, e.g., m32, m64, m96, m192. IMPORTANT: There is a difference between the memory constraint, #SBATCH -C m32, and the batch directive #SBATCH --mem=32000. With the memory batch directive you specify the number as it appears in the MEMORY entry of the si command, which is in MB. This makes the job eligible to run only on a node with at least this amount of memory, but it also restricts the job to using only this amount of memory, even if the node has more. With the constraint, the job will only run on a node with this feature, and the job will have access to all of the memory of that node.
- Node owner: This is given as either chpc, for the nodes available to all users (general nodes), or the group/center name for owner nodes. This can be used as a constraint to target specific group nodes that have low use as owner-guest, in order to reduce the chance of being preempted. In the output of the si command given below, the column NODES(A/I/O/T) provides the number of nodes that are allocated/idle/offline/total, allowing for the identification of owner nodes that are not being utilized. For example, to target nodes owned by group "ucgd", use -A owner-guest -p kingspeak-guest -C "ucgd". Historical usage (the past 2 weeks) of different owner node groups can be found at CHPC's constraint suggestion page.
- GPUs: For the GPU nodes, the specified features include the GPU line, e.g., geforce or tesla, and the GPU type, e.g., a100, 3090, or t4. There is additional information about specifying the GPUs requested for a job on CHPC's GPU and Accelerator page.
- Processor architecture: This is currently only used on notchpeak and redwood. It is useful for jobs where you want to restrict the processor architecture used by the job. Examples are bwk for Intel Broadwell, skl for Intel Skylake, csl for Intel Cascade Lake, icl for Intel Ice Lake, npl for AMD Naples, rom for AMD Rome, and mil for AMD Milan.
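For example, to run as owner-guest on kingspeak on nodes with at least 24 cores and 64 GB of memory, the directives could look like the hedged sketch below, which simply combines features listed above (adjust the feature names to the nodes you are actually targeting):
#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
#SBATCH -C "c24&m64"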
Below is a portion of the output of the si
command (see useful aliases section below for information on this command) run on
notchpeak, which provides a list of the features for each group of nodes:
notchpeak-gpu 3 2/1/0/3 2:16:2 188000 1800000 3-00:00:00 chpc,tesla,v100,skl,c32,m192
notchpeak-gpu 1 1/0/0/1 2:16:2 188000 1800000 3-00:00:00 chpc,geforce,2080ti,tesla,p40,skl,c32,m192 notch004
notchpeak-gpu 4 4/0/0/4 2:20:2 188000 1780000+ 3-00:00:00 chpc,geforce,2080ti,csl,c40,m192 notch[086-088,271]
notchpeak-gpu 1 1/0/0/1 8:6:2 252000 1760000 3-00:00:00 chpc,geforce,3090,mil,c48,m256 notch328
notchpeak-gpu 1 0/1/0/1 2:32:2 508000 1690000 3-00:00:00 chpc,geforce,3090,a100,rom,c64,m512 notch293
notchpeak-shared-short 2 1/1/0/2 2:26:2 380000 1800000 8:00:00 chpc,t4,csl,c52,m384 notch[308-309]
notchpeak-shared-short 2 0/2/0/2 2:32:2 508000 1700000 8:00:00 chpc,tesla,k80,npl,c64,m512 notch[081-082]
notchpeak* 4 4/0/0/4 2:16:2 92000 1800000 3-00:00:00 chpc,skl,c32,m96 notch[005-008]
notchpeak* 19 13/6/0/19 2:16:2 188000 1800000 3-00:00:00 chpc,skl,c32,m192 notch[009-018,035-043]
notchpeak* 7 4/3/0/7 2:20:2 188000 1800000 3-00:00:00 chpc,csl,c40,m192 notch[096-097,106-107,153-155]
notchpeak* 32 14/18/0/32 4:16:2 252000 3700000 3-00:00:00 chpc,rom,c64,m256 notch[172-203]
notchpeak* 2 1/1/0/2 2:16:2 764000 1700000 3-00:00:00 chpc,skl,c32,m768 notch[044-045]
notchpeak* 1 0/1/0/1 2:18:2 764000 7400000 3-00:00:00 chpc,skl,c36,m768 notch068
Reservations: Upon request (via an email to helpdesk@chpc.utah.edu) we can create reservations for users to guarantee node availability. Reservations are requested with the --reservation flag (abbreviated as -R) followed by the reservation name, which consists of a user name followed by a number, e.g., u0123456_1. Thus, to use an existing reservation in a job script, include #SBATCH --reservation=u0123456_1.
QOS: QOS stands for Quality of Service. While this is not normally specified, it is necessary in a few cases. One example is when a user needs to override the normal 3-day wall time limit. In this case, the user can request access to a special long QOS that we have set up for the general nodes of a cluster, cluster-long, which allows a longer wall time to be specified. To get access to the long QOS of a given cluster, send a request with an explanation of why you need the longer wall time to helpdesk@chpc.utah.edu.
For policies regarding reservations see the Batch Policies document.
Sample MPI job Slurm script
#!/bin/csh
#SBATCH --time=1:00:00       # walltime, abbreviated by -t
#SBATCH --nodes=2            # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N   # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH -e slurm-%j.err-%N   # name of the stderr, using job and first node values
#SBATCH --ntasks=16          # number of MPI tasks, abbreviated by -n
# additional information for allocated clusters
#SBATCH --account=baggins    # account - abbreviated by -A
#SBATCH --partition=lonepeak # partition, abbreviated by -p
#
# set data and working directories
setenv WORKDIR $HOME/mydata
setenv SCRDIR /scratch/general/vast/$USER/$SLURM_JOBID
mkdir -p $SCRDIR
cp -r $WORKDIR/* $SCRDIR
cd $SCRDIR
#
# load appropriate modules, in this case Intel compilers, MPICH2
module load intel mpich2
# for MPICH2 over Ethernet, set communication method to TCP
# see below for network interface selection options for different MPI distributions
setenv MPICH_NEMESIS_NETMOD tcp
# run the program
# see below for ways to do this for different MPI distributions
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out
The #SBATCH lines denote the SLURM flags; the rest of the script contains the instructions on how to run your job. Note that we are using the SLURM built-in $SLURM_NTASKS variable to denote the number of MPI tasks to run. In the case of a plain MPI job, this number should equal the number of nodes ($SLURM_NNODES) times the number of cores per node.
Also note that some packages have not been built with the MPI distributions that support Slurm, in which case you will need to specify the hosts to run on via the machinefile flag to the mpirun command of the appropriate MPI distribution. Please see the package help page for details and the appropriate script. Additional information on the creation of a machinefile is also given in the table of SLURM environment variables below.
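As a minimal sketch of this approach (the nodefile name is arbitrary, and the exact flag may be -machinefile or -hostfile depending on the MPI distribution):
srun hostname | sort > nodefile.$SLURM_JOBID
mpirun -np $SLURM_NTASKS -machinefile nodefile.$SLURM_JOBID my_mpi_program > my_program.out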
For mixed MPI/OpenMP runs, you can either hard-code OMP_NUM_THREADS in the script or use logic like that below to derive it from the Slurm job information. When requesting resources, ask for the number of MPI tasks and the number of nodes to run on, not for the total number of cores the MPI+OpenMP tasks will use.
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -C "c12" # we want to run on uniform core count nodes
# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS
# set thread affinity to CPU socket
setenv KMP_AFFINITY verbose,granularity=core,compact,1,0
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out
Alternatively, the SLURM option -c (or --cpus-per-task) can be used, for example:
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6
setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out
Note that if you use this on a cluster with nodes that have varying core counts, SLURM is free to pick any node, so the job's nodes may be undersubscribed (e.g., on ash, the above options would fully subscribe the 12-core nodes but undersubscribe the 20- or 24-core nodes).
NOTE specific to MPI jobs: We have received reports that when running MPI jobs under certain circumstances (specifically when the job does not have any initial setup and therefore starts with the mpirun step) a race condition can occur where the job tries to start before the worker nodes are ready, resulting in this error:
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.
In this case, the solution is to add a sleep before the mpirun:
sleep 30
More info on pam_slurm_adopt: This issue stems from a tool, pam_slurm_adopt, that CHPC uses on all clusters to capture processes started via ssh that would otherwise land outside the job's cgroup, and to place them into the right cgroup. To do this, pam_slurm_adopt has the remote node talk back to the node from which the mpirun/ssh call was made, to find out which job the remote call came from, check whether that job is also on the new node, and then adopt the process into the job's cgroup. srun, on the other hand, goes through the usual Slurm paths, which do not require the same back-and-forth callbacks because srun spawns the remote process directly into the cgroup.
New July 2020 - NOTE specific to use of /scratch/local: Users can no longer create directories in the top-level /scratch/local directory. Instead, as part of the Slurm job prolog (before the job is started), a job-level directory, /scratch/local/$USER/$SLURM_JOB_ID, will be created. Only the job owner will have access to this directory. At the end of the job, in the Slurm job epilog, this job-level directory will be removed.
As an example of a script for a single-node job making use of /scratch/local:
#!/bin/csh
#SBATCH --time=1:00:00       # walltime, abbreviated by -t
#SBATCH --nodes=1            # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N   # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH -e slurm-%j.err-%N   # name of the stderr, using job and first node values
#SBATCH --account=baggins    # account - abbreviated by -A
#SBATCH --partition=lonepeak # partition, abbreviated by -p
#
# set data and working directories
setenv WORKDIR $HOME/mydata
setenv SCRDIR /scratch/local/$USER/$SLURM_JOB_ID
cp -r $WORKDIR/* $SCRDIR
cd $SCRDIR
#
# load appropriate modules
module load yourappmodule
# run the program
myprogram myinput > myoutput
# copy output back to WORKDIR
cp $SCRDIR/myoutput $WORKDIR/.
Note that if your script currently does mkdir -p /scratch/local/$USER/$SLURM_JOB_ID, it will still run properly with this change. Also note that, depending on your program, you may want to run it such that any output necessary for restarting your program is written to your home directory or group space instead of to the local scratch space.
Job Submission using SLURM
In order to submit a job, one first has to log in to an interactive (login) node. The job submission is then done with the sbatch command in Slurm.
For example, to submit a script named script.slurm just type:
sbatch script.slurm
IMPORTANT: sbatch by default passes all environment variables to the compute node, which differs from the behavior in PBS (which started with a clean shell). If you need to start with a clean environment, you will need to use the following directive in your batch script:
#SBATCH --export=NONE
This will still execute .bashrc/.tcshrc scripts, but any changes you make in your
interactive environment will not be present in the compute session. As an additional
precaution, if you are using modules, you should use module purge
to guarantee a fresh environment.
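For example, a hedged sketch combining these two precautions (the modules shown are just those from the earlier example); the #SBATCH line goes with the other directives at the top of the script, and the module commands go in the script body:
#SBATCH --export=NONE
module purge
module load intel mpich2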
Checking the status of your job
To check the status of your job, use the squeue
command.
squeue
The most common arguments to squeue are -u u0123456, for listing only the jobs of user u0123456, and -j job#, for listing the job specified by the job number. Adding -l (for "long" output) gives more details.
Alternatively, from the accounting perspective, one can use the sacct command. This command accesses the accounting database and can give useful information about the resource usage of current and past jobs.
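As a hedged example (the job number is a placeholder and the format fields are just one common selection), to summarize a finished job:
sacct -j 1234567 --format=JobID,JobName,Partition,AllocCPUS,Elapsed,MaxRSS,State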
Interactive batch jobs
In order to launch an interactive session on a compute node do:
salloc --time=1:00:00 --ntasks 2 --nodes=1 --account=chpc --partition=notchpeak
The salloc flags can be abbreviated as:
salloc -t 1:00:00 -n 2 -N 1 -A chpc -p notchpeak
The above command by default passes all environment variables of the parent shell; therefore, the X window connection is preserved as well, allowing graphical applications, such as GUI-based programs, to run inside the interactive job.
CHPC cluster queues tend to be very busy; it may take some time for an interactive job to start. For this reason, in March 2019 we added two nodes in a special partition on the notchpeak cluster that are geared more towards interactive work. Job limits on this partition are 8 hours of wall time, a maximum of ten submitted jobs per user, and a maximum of two running jobs with a maximum total of 32 tasks and 128 GB of memory. To access this special partition, called notchpeak-shared-short, request both an account and a partition under this name, e.g.:
salloc -N 1 -n 2 -t 2:00:00 -A notchpeak-shared-short -p notchpeak-shared-short
Running MPI jobs
One option is to produce the hostfile and feed it directly to the mpirun command of the appropriate MPI distribution. The disadvantage of this approach is that it does not integrate with SLURM, and as such it does not provide advanced features such as task affinity, accounting, etc.
Another option is to use the process manager built into SLURM and launch the MPI executable through the srun command. How to do this for various MPI distributions is described at http://slurm.schedmd.com/mpi_guide.html. Some MPI distributions' mpirun commands integrate with Slurm, in which case it is more convenient to use them instead of srun.
For the MPI distributions at CHPC, the following works (assuming an MPI program internally threaded with OpenMP).
Intel MPI
module load [intel,gcc] impi
# for a cluster with Ethernet only, set network fabrics to TCP
setenv I_MPI_FABRICS shm:tcp
# for a cluster with InfiniBand, set network fabrics to OFA
setenv I_MPI_FABRICS shm:ofa
# on lonepeak owner nodes, use the TMI interface (InfiniPath)
setenv I_MPI_FABRICS shm:tmi
# IMPI option 1 - launch with PMI library - currently not using task affinity, use mpirun instead
setenv I_MPI_PMI_LIBRARY /uufs/CLUSTER.peaks/sys/pkg/slurm/std/lib/libpmi.so
#srun -n $SLURM_NTASKS $EXE >& run1.out
# IMPI option 2 - bootstrap
mpirun -bootstrap slurm -np $SLURM_NTASKS $EXE >& run1.out
MPICH2
Launch the MPICH2 jobs with mpiexec as explained in http://slurm.schedmd.com/mpi_guide.html#mpich2. That is:
module load [intel,gcc,pgi] mpich2
setenv MPICH_NEMESIS_NETMOD mxm # default is Ethernet, choose mxm for InfiniBand
mpirun -np $SLURM_NTASKS $EXE
OpenMPI
Use the mpirun command from the OpenMPI distribution. There's no need to specify the hostfile as OpenMPI communicates with Slurm in that regard. To run:
module load [intel,gcc,pgi] openmpi
mpirun --mca btl tcp,self -np $SLURM_NTASKS $EXE # in case of Ethernet network cluster, such as general lonepeak nodes.
mpirun -np $SLURM_NTASKS $EXE # in case of InfiniBand network clusters
Note that OpenMPI supports multiple network interfaces and as such it allows for single MPI executable across all CHPC clusters, including the InfiniPath network on lonepeak.
MVAPICH2
MVAPICH2 executables can be launched with the mpirun command (preferably) or with srun, in which case one needs to use the --mpi=none flag. To run multi-threaded code, make sure to set OMP_NUM_THREADS and MV2_ENABLE_AFFINITY=0 (to ensure that the MPI tasks don't get locked to a single core) before calling srun.
module load [intel,gcc,pgi] mvapich2
setenv OMP_NUM_THREADS 6 # optional number of OpenMP threads
setenv MV2_ENABLE_AFFINITY 0 # disable process affinity - only for multi-threaded programs
mpirun -np $SLURM_NTASKS $EXE # mpirun is recommended
srun -n $SLURM_NTASKS --mpi=none $EXE # srun is optional
Running shared jobs and running multiple serial calculations within one job
On July 1, 2019 node sharing was enabled on all CHPC clusters in the general environment. This is the best option when a user has a single job that does not need an entire node. For more information see the NodeSharing page.
Please note that in cases where a user has many calculations to submit at the same time, whether they need a portion of a node or an entire node, there may be better options than submitting each calculation as a separate batch job; for these cases see our page dedicated to running multiple serial jobs, as well as the next section on job arrays. A minimal sketch of the idea follows.
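In this sketch (program and input names are hypothetical, and only the relevant lines are shown; see the multiple serial jobs page for CHPC's recommended approaches), several serial calculations are started in the background inside one job and waited on before the job exits:
#SBATCH --ntasks=4
./myprogram input1.dat > out1.txt &
./myprogram input2.dat > out2.txt &
./myprogram input3.dat > out3.txt &
./myprogram input4.dat > out4.txt &
wait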
Multiple jobs using job arrays
Job arrays enable quick submission of many jobs that differ from each other only by an index. In this case Slurm provides the environment variable SLURM_ARRAY_TASK_ID, which serves as a differentiator between the jobs. For example, if our program takes input data input.dat, we can run it with 30 different input files, input[1-30].dat, using the following script, named myrun.slr:
#!/bin/tcsh
#SBATCH -J myprog # A single job name for the array
#SBATCH -n 1 # Number of tasks
#SBATCH -N 1 # All tasks on one machine
#SBATCH -p CLUSTER # Partition on some cluster
#SBATCH -A chpc # General CHPC account
#SBATCH -t 0-2:00 # 2 hours (D-HH:MM)
#SBATCH -o myprog%A%a.out # Standard output
#SBATCH -e myprog%A%a.err # Standard error
./myprogram input$SLURM_ARRAY_TASK_ID.dat
We then use the --array parameter to submit this script:
sbatch --array=1-30 myrun.slr
Apart from SLURM_ARRAY_TASK_ID, which is an environment variable unique to each job in the array, notice also %A and %a, which represent the job ID and the job array index, respectively. These can be used in the sbatch parameters to generate unique names.
You can also limit the number of jobs that can run simultaneously to n by adding %n after the end of the array range:
sbatch --array=1-30%5 myrun.slr
When submitting applications that utilize less than the full CPU count per node, please make sure to use the shared partitions, to allow multiple array jobs on one node. For more information see the NodeSharing page. Also see our document detailing various ways for running multiple serial jobs.
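For instance, a hedged sketch of the directives for a one-core array task on a shared partition (assuming a shared partition named notchpeak-shared, following the cluster-shared pattern, and that 4 GB of memory per task is sufficient; check myallocation for the partitions actually available to you):
#SBATCH --partition=notchpeak-shared
#SBATCH --ntasks=1
#SBATCH --mem=4G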
Automatic restarting of preemptable jobs
The owner-guest and freecycle queues tend to have quicker turnaround than the general queues. However, guest jobs may get preempted. If one's job is checkpointed (e.g., by saving particle positions and velocities in dynamics simulations, property values and gradients in minimizations, etc.), one can automatically restart a preempted job following this strategy:
- Right at the beginning of the job script, submit a new job with a dependency on the current job. This ensures that the new job will be eligible to run only after the current job is preempted (or finishes). Save the new job submission information into a file - this file contains the job ID of the new job, which we save into an environment variable NEWJOB:
sbatch -d afterany:$SLURM_JOBID run_ash.slr >& newjob.txt
set NEWJOB=`cat newjob.txt |cut -f 4 -d " "`
- In the simulation output, include a file that lists the last checkpointed iteration, time step, or other measure of the simulation progress. In our example below, we have a file called inv.append which, among other things, contains lines on simulation iterations, one per line.
- In the job script, extract the iteration number from this file and put it into the simulation input file (here called inpt.m). This input file will be used when the simulation is restarted. Since the file does not exist at the very start of the simulation, the first job will not append to the input file - and thus begins from the start:
set ITER=`cat $SCRDIR/$RUNNAME/work_inv/inv.append | grep Iter |tail -n 1 | cut -f 2 -d " " | cut -f 1 -d /`
if ($ITER != "") then
echo "restart=$ITER;" >> inpt.m
endif
- Run the simulation. If the job gets preempted, the current job will end here. If it runs to completion, then at the end of the job script make sure to delete the new job, identified by the environment variable NEWJOB, that was submitted when this job was started:
scancel $NEWJOB
In summary, the whole SLURM script (called run_ash.slr) would look like this:
#SBATCH all necessary job settings (partition, walltime, nodes, tasks)
#SBATCH -A owner-guest
# submit a new job dependent on the finish of the current job
sbatch -d afterany:$SLURM_JOBID run_ash.slr >& newjob.txt
# get this new job job number
set NEWJOB=`cat newjob.txt |cut -f 4 -d " "`
# figure out from where to restart
set ITER=`cat $SCRDIR/$RUNNAME/work_inv/inv.append | grep Iter |tail -n 1 | cut -f 2 -d " " | cut -f 1 -d /`
if ($ITER != "") then
echo "restart=$ITER;" >> inpt.m
endif
# copy input files to scratch
# run simulation
# copy results out of the scratch
# delete the job if the simulation finished
scancel $NEWJOB
Handy Slurm Information
Slurm User Commands
Slurm Command | What it does |
---|---|
sinfo | reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options. |
squeue | reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order. |
sbatch | is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks. |
scancel | is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step. |
sacct | is used to report job or job step accounting information about active or completed jobs. |
srun | is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation. |
spart | list partitions and their utilization |
pestat | list efficiency of cluster utilization on a per node, user, or partition basis. By default it prints the utilization of all cluster nodes; to select only the nodes utilized by a user, run pestat -u $USER. This command is very useful for checking whether your jobs are running efficiently. |
Useful Slurm aliases
Bash to add to .aliases file:
#SLURM Aliases that provide information in a useful manner for our clusters
alias si="sinfo -o \"%20P %5D %14F %8z %10m %10d %11l %32f %N\""
alias si2="sinfo -o \"%20P %5D %6t %8z %10m %10d %11l %32f %N\""
alias sq="squeue -o \"%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R\""
Tcsh to add to .aliases file:
#SLURM Aliases that provide information in a useful manner for our clusters
alias si 'sinfo -o "%20P %5D %14F %8z %10m %11l %32f %N"'
alias si2 'sinfo -o "%20P %5D %6t %8z %10m %10d %11l %32f %N"'
alias sq 'squeue -o "%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R"'
sview GUI tool
sview is a graphical user interface to view and modify the Slurm state. Run it by typing sview. It is useful for viewing partition and node characteristics and information on jobs. Right-clicking on a job, node, or partition allows you to perform actions on them; use this carefully so as not to accidentally modify or cancel your job.
Moab/PBS to Slurm translation
Moab/PBS to Slurm commands
Action | Moab/Torque | Slurm |
---|---|---|
Job Submission | msub/qsub | sbatch |
Job deletion | canceljob/qdel | scancel |
List all jobs in queue | showq/qstat | squeue |
List all nodes | | sinfo |
Show information about nodes | mdiag -n/pbsnodes | scontrol show nodes |
Job start time | showstart | squeue --start |
Job information | checkjob | scontrol show job <jobid> |
Reservation information | showres | scontrol show res (this option shows details); sinfo -T |
Moab/PBS to Slurm environmental variables
Description | Moab/Torque | Slurm |
---|---|---|
Job ID | $PBS_JOBID | $SLURM_JOBID |
node list | $PBS_NODEFILE | Generate a listing of 1 node per line: srun hostname | sort -u > nodefile.$SLURM_JOBID ; Generate a listing of 1 core per line: srun hostname | sort > nodefile.$SLURM_JOBID |
submit directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
number of nodes | | $SLURM_NNODES |
number of processors (tasks) | | $SLURM_NTASKS ($SLURM_NPROCS for backward compatibility) |
Moab/PBS to Slurm job script modifiers
Description | Moab/Torque | Slurm |
---|---|---|
Walltime | #PBS -l walltime=1:00:00 | #SBATCH -t 1:00:00 (or --time=1:00:00) |
Process count | #PBS -l nodes=2:ppn=12 | #SBATCH -n 24 (or --ntasks=24) |
Memory | #PBS -l nodes=2:ppn=12:m24576 | #SBATCH --mem=24576 It is also possible to specify memory per CPU with --mem-per-cpu; also see the constraint section above for additional information on its use. |
Mail options | #PBS -m abe | #SBATCH --mail-type=FAIL,BEGIN,END |
Mail user | #PBS -M user@mail.com | #SBATCH --mail-user=user@mail.com |
Job name and STDOUT/STDERR | #PBS -N myjob | #SBATCH -o myjob-%j.out-%N NOTE: The %j and %N are replaced by the job number and the node (first node if a multi-node job). This gives the stderr and stdout a unique name for each job. |
Account | #PBS -A owner-guest (optional in Torque/Moab) | #SBATCH -A owner-guest (or --account=owner-guest) |
Dependency | #PBS -W depend=afterok:12345 (run after job 12345 finishes correctly) | #SBATCH -d afterok:12345 (or --dependency=afterok:12345) |
Reservation | #PBS -l advres=u0123456_1 | #SBATCH -R u0123456_1 (or --reservation=u0123456_1) |
Partition | No direct equivalent | #SBATCH -p lonepeak (or --partition=lonepeak) |
Propagate all environment variables from terminal | #PBS -V | All environment variables are propagated by default, except for modules, which are purged at job start to prevent possible inconsistencies. One can either load the needed modules in the job script, or have them in their .custom.[sh,csh] file. |
Propagate specific environment variable | #PBS -v myvar | #SBATCH --export=myvar (use with caution as this will export ONLY the variable myvar) |
Target specific owner | #PBS -l nodes=1:ppn=24:ucgd -A owner-guest | #SBATCH -A owner-guest -p kingspeak-guest -C "ucgd" |
Target specific nodes | | #SBATCH -w notch001,notch002 (or --nodelist=notch001,notch002) |
Information about job priority
Note that this applies to the general resources and not to owner resources on the clusters.
The first and most significant portion of a job's priority is based on the account being used and whether it has an allocation or not. Jobs run with allocation have a base priority of 100,000. Jobs without allocation have a base priority of 1.
To this, there are additional values added for:
(1) Age (time a job spends in the queue) -- For the "Age" of a job we see a roughly linear growth of the priority until it hits a cap. The cap is a limit we put on how much time a job can accrue extra priority while waiting in the queue.
(2) Fairshare (how much of the system you have used recently) -- Fairshare is a factor based on the historical usage of a user. All things being equal, the user that has used the system less recently gets a priority bonus over the user that has used the system more recently. This factor behaves more exponentially than the other two.
(3) JobSize (how many nodes/cores your job is requesting) -- Job size is again a linear value based on the number of resources requested. It is fixed at submit time according to the requested resources.
At any point you can run 'sprio' and see the current priority as well as the source
of the priority (in terms of the three components mentioned above) for all idle jobs
in the queue on a cluster.
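For example, to see the priority breakdown in long format:
sprio -l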
UPDATE - 21 January 2020: With the new version of Slurm installed earlier this month, we have the ability to limit, on a per-QOS basis, the number of pending jobs per user that accrue priority based on the age factor. This limit has been set to 5. At the same time, we have set a limit on the number of jobs a user can submit per QOS to 1000 (this was already set for a few of the QOSes but has now been set for each QOS).
How to determine which Slurm accounts you are in
In order to see which accounts and partitions you can use, do:
sacctmgr -p show assoc user=<UNID>
The output of this command can be difficult to interpret; a wrapper to print a formatted version follows:
printf "%-15s%-25s%s\n" "Cluster" "Account" "Partition" && sacctmgr -p show assoc user=$USER | awk -F'|' 'NR>1 { printf "%-15s%-25s%s\n", $1, $2, $18 }' | sort
If you find yourself using this command often, you can create an alias by escaping some of the parameters. For example, in bash, add the following to your ~/.aliases file, then source ~/.aliases:
alias myslurmaccts='printf "%-15s%-25s%s\n" "Cluster" "Account" "Partition" && sacctmgr -p show assoc user=$USER | awk -F"|" "NR>1 { printf \"%-15s%-25s%s\n\", \$1, \$2, \$18 }" | sort'
There are a few exceptions to the account-partition mapping that will not be correct using the above method.
Another option is to use the myallocation command that we have developed for this purpose. However, there is a chance that we may have missed some of these exceptions in its logic, so if you notice any irregularity in your account-partition mapping, please let us know.
How to log into the nodes where the job runs
Sometimes it is useful to connect to the nodes where a job runs, for example to monitor whether the executable is running correctly and efficiently. The best way to do this is to ssh to the nodes that the job runs on; we allow users with jobs on compute nodes to ssh to those compute nodes. First, find the nodes where the job runs with the squeue -u $USER command, and then ssh to these nodes.
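For example (the node name here is just illustrative; use the node(s) listed in the NODELIST column of your own squeue output):
squeue -u $USER
ssh notch005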
Other good sources of information
- http://slurm.schedmd.com/pdfs/summary.pdf This is a two page summary of common SLURM commands and options.
- http://slurm.schedmd.com/documentation.html Best source for online documentation
- http://slurm.schedmd.com/slurm.html
- http://slurm.schedmd.com/man_index.html
- man <slurm_command> (from the command line)
- http://www.glue.umd.edu/hpcc/help/slurm-vs-moab.html A more complete comparison table between slurm and moab
- http://www.schedmd.com/slurmdocs/rosetta.pdf is a table of Slurm commands and their counterparts in a number of different batch systems