
Preemption

Preemption, or more specifically job preemption, refers to a currently running job being cancelled by Slurm because a higher-priority job has claimed its resources. Not all jobs are at risk of preemption; it only applies to certain partitions. Common questions about preemption are answered below:

Q: Is my job at risk of preemption?

A: Only jobs on particular partitions, such as the owner-guest partitions or college-owned nodes, are at risk of preemption. Even if your job is at risk of preemption, that does not mean it will be preempted; many users successfully complete jobs on owner nodes through the owner-guest partitions.
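If you are unsure whether a particular partition is preemptable, you can check its settings directly with scontrol (substitute the actual partition name for the placeholder); the PreemptMode field indicates whether and how jobs in that partition can be preempted:

    scontrol show partition <cluster>-guest | grep -i preempt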

 

Q: What partitions are at risk of preemption?

A: The most common partitions at risk of preemption on each cluster are the owner-guest partitions, because the users who own the nodes have priority on them. The CHPC operates in a 'condominium-style' mode: the CHPC aims to have all nodes, including owner nodes, available to run jobs in order to limit long queue times. The owners of a node have priority access to it. If the owners are not currently using their node(s), any CHPC user can run jobs on those nodes as a guest. However, these guest jobs run at a lower priority than owner jobs, and if the owner submits a job, it preempts any guest jobs currently running on those nodes. The CHPC provides usage summaries for owner nodes over the previous two weeks. This information can help you make informed decisions about submitting to the owner-guest partitions, especially when paired with Slurm constraints.

The other partitions where users commonly run a risk of preemption are partitions of nodes owned by colleges, such as the University of Utah School of Computing's purchased nodes hosted by the CHPC. If you belong to a group that owns a partition of nodes and have a question about the preemption logic set in place by Slurm, please contact us at helpdesk@chpc.utah.edu. We would be happy to answer your questions.
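As a rough sketch, a guest job on owner nodes sets the guest partition and the guest account in its batch directives (substitute the actual cluster name for <cluster>):

    #SBATCH --partition=<cluster>-guest
    #SBATCH --account=owner-guest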

 

Q: What factors increase my chances of preemption?

A: The biggest factor that increases your chance of preemption is the length of your job, since a user with a higher-priority job could submit one at any time; the longer your job runs, the larger the window during which it can be preempted. That said, many users successfully complete jobs on owner nodes using the owner-guest partitions. If you submit jobs to the owner-guest partitions, please take a look at the usage summaries for owner nodes.
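Where your workflow allows it, one practical way to limit that exposure is to break long runs into shorter, restartable jobs and request a correspondingly shorter walltime, for example:

    #SBATCH --time=08:00:00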

 

Q: What determines a higher priority job?

A: In relation to preemption, the biggest factor on the owner-guest partitions is group membership: jobs submitted by members of the group that owns the node(s) have higher priority than guest jobs.

As it relates to the college-owned nodes (e.g., the University of Utah School of Computing), priority is determined by the QOS associated with the partition. QOS-based preemption does not apply to everybody; more information on Slurm QOSs can be found here or by contacting the CHPC at helpdesk@chpc.utah.edu.
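To see how the QOSs on a cluster are configured, including their priority and preemption settings, a query along these lines can typically be run by any user:

    sacctmgr show qos format=Name,Priority,Preempt,PreemptMode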

 

Q: Will my job resume automatically?

A: Not all software allows jobs to resume mid-run. If the software you are using does allow this, you can find information on setting up your Slurm script to automatically re-queue your job at its stopping point here, or in the Automatic Restarting of Preemptable Jobs section below.

 

Owner Nodes

One service that the CHPC provides to our users is node management. Often, a PI will contact the CHPC to purchase a node or set of nodes that the CHPC then hosts. These nodes become part of our clusters and are placed into specific partitions: <pi-name>-<cluster-abbreviation>, <cluster>-guest, and <cluster>-gpu-guest. The members of the group that purchased the nodes have access to the <pi-name>-<cluster-abbreviation> partition. All other CHPC users can access those same nodes through the <cluster>-guest and <cluster>-gpu-guest partitions.

Users whose groups do not own nodes can access nodes owned by other groups through the <cluster>-guest and <cluster>-gpu-guest partitions. Jobs submitted to these partitions will run on the first available nodes that match the job's resource requirements.
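To see which nodes a guest partition contains, along with their core counts, memory, and node features (the feature tags include the owner group names used for constraints), a sinfo query like the following can help; the output format shown here is just one possibility:

    sinfo -p <cluster>-guest -o "%20N %8c %10m %40f"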

The CHPC keeps metrics for the utilization of owner nodes by the owners over the past two weeks. It should be noted that these graphs do not report owner node utilization by guest jobs. However, owner node utilization can be useful when trying to determine how long your job may wait in the queue. Additionally, users can take this information further by pairing it with Slurm constraints to request specific owner nodes. Information on using constraints to target specific owner nodes can be found below.

 

Constraints

Users can use Slurm constraints to target, as a guest, specific groups' nodes that have low owner use in order to reduce the chances of being preempted. For example, to target nodes owned by the group "ucgd", add #SBATCH -C "ucgd" to the job script. Historical usage (the past two weeks) of the different owner node groups can be found on CHPC's constraint suggestion page.

Multiple constraints can be specified at once with logical operators in Slurm directives. This can allow for submission to nodes owned by one of several owner groups at a time (which might help reduce queue times and increase the number of nodes available) as well as the specification of exact core counts and available memory.

To select from multiple owner groups' nodes, use the "or" operator; a directive like #SBATCH -C "group1|group2|group3" will select from nodes matching any of the constraints listed. By contrast, the "and" operator can be used to achieve further specificity. To request nodes owned by a group and with a particular amount of memory, for example, a directive like #SBATCH -C "group1&m256" could be used. This will only work where the combination is valid and the corresponding node features are associated with the nodes. To view the available node features, the sinfo aliases si and si2, documented on the Slurm page, are helpful.
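Putting these pieces together, the constraint-related directives of a guest job might look like the following sketch, where the group names are placeholders; check the actual node features with si or si2 before using them:

    #SBATCH --partition=<cluster>-guest
    #SBATCH --account=owner-guest
    #SBATCH -C "group1|group2|group3"

Replace the constraint string with an "and" expression such as "group1&m256" when you need a specific node configuration within a single owner group.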

When using the Slurm job manager in Open OnDemand, enter only the constraint string into the Constraints text entry, e.g. "group1|group2|group3".

 

Automatic Restarting of Preemptable Jobs

The owner-guest and freecycle queues tend to have quicker turnaround than the general queues, and all users can submit to owner nodes. However, guest jobs may be preempted: if the lab that owns a node starts a job on it, your guest job will automatically be cancelled.

The preempted jobs can be re-queued by adding the --requeue SLURM option in the job script. However, bear in mind that the job will start from the beginning, unless the calculation is capable of checkpointing and restarting from the checkpoint.

If the job is checkpointed, a user can automatically restart a preempted job following this strategy:

  1. Include the --requeue SLURM option in the SLURM parameters at the top of the job script:
    #SBATCH --requeue
  2. In the simulation output, write a file that records the last checkpointed iteration, time step, or other measure of the simulation's progress. In the example below, the file is called inv.append and, among other things, contains one line per simulation iteration.
  3. In the job script, extract the iteration number from this file and put it into the simulation input file (here called inpt.m). This input file will be used when the simulation is restarted. Since inv.append does not exist at the very start of the simulation, the first job will not append a restart line to the input file and will thus begin from the start.
    # extract the last checkpointed iteration (empty if inv.append does not exist yet)
    set ITER=`cat $SCRDIR/$RUNNAME/work_inv/inv.append | grep Iter | tail -n 1 | cut -f 2 -d " " | cut -f 1 -d /`
    if ($ITER != "") then
    echo "restart=$ITER;" >> inpt.m
    endif
  4. Run the simulation. If the job gets preempted, a new job is queued and will restart from the last checkpointed iteration (ITER).

In summary, the whole SLURM script (called run_ash.slr) would look like this:

#!/bin/tcsh
# ... #SBATCH directives for all other necessary job settings (partition, walltime, nodes, tasks)
#SBATCH -A owner-guest
#SBATCH --requeue

# figure out from where to restart
set ITER=`cat $SCRDIR/$RUNNAME/work_inv/inv.append | grep Iter | tail -n 1 | cut -f 2 -d " " | cut -f 1 -d /`
if ($ITER != "") then
echo "restart=$ITER;" >> inpt.m
endif

# copy input files to scratch
# run the simulation
# copy results out of the scratch

 

More Information on Slurm

Looking for more information on running Slurm at the CHPC? Check out these pages. If you have a specific question, please don't hesitate to contact us at helpdesk@chpc.utah.edu.

Setting up a Slurm Batch Script or Interactive Job Session

Slurm Priority Scoring for Jobs

MPI with Slurm

GPUs with Slurm

Running Independent Serial Calculations with Slurm

Accessing CHPC's Data Transfer Nodes (DTNs) through Slurm

Other Slurm Constraint Suggestions and Owner Node Utilization

Sharing Nodes Among Jobs with Slurm

Personalized Slurm Queries

Moab/PBS to Slurm

Last Updated: 9/20/24