Slurm Interactive Sessions with salloc
This documentation page provides instructions for launching and managing interactive computing sessions on CHPC clusters. It explains how to use the salloc command to request immediate access to compute nodes, highlights a dedicated partition
(notchpeak-shared-short) designed for shorter interactive tasks, and details methods for monitoring real-time
job performance.
Slurm Directives
To request interactive compute resources through Slurm, you must pass Slurm directives as flags to the salloc command. These Slurm directives will define the computational requirements of your work, which Slurm then uses to determine which resources to assign to your job.
The most commonly used directives are:
--time (-t): the maximum wall-clock time for the job
--nodes (-N): the number of nodes requested
--ntasks (-n): the number of tasks (cores) requested
--account (-A): the Slurm account to charge the job to
--partition (-p): the partition (queue) to run the job in
--qos: the quality of service (QoS) to run the job under
Unsure which Slurm account, partition, and QoS combination to use? Run the command mychpc batch to list the options available to you. You can also try our tool that helps users find which accounts, partitions, and qualities of service you can use when submitting jobs on Center for High Performance Computing systems.
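If you prefer standard Slurm tooling, the sacctmgr command can also list your account associations (a minimal sketch; the columns available depend on the site configuration):
sacctmgr show associations user=$USER format=Account,Partition,QOS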
Starting an Interactive Session with salloc
Submitting an interactive job via Slurm's salloc command works by passing Slurm directives as flags. For reference, these are the same directives you can use in a Slurm batch
script; with salloc, they are simply passed on the command line rather than on #SBATCH lines.
Below is an example where someone in the baggins account requests interactive access to lonepeak with 2 cores across 1 node.
salloc --time=02:00:00 --ntasks=2 --nodes=1 --account=baggins --partition=lonepeak
The salloc flags can be abbreviated as:
salloc -t 02:00:00 -n 2 -N 1 -A baggins -p lonepeak
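For comparison, the same resource request expressed at the top of a batch script would look like this (a sketch using the same illustrative account and partition names):
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --account=baggins
#SBATCH --partition=lonepeak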
Running the salloc command above will automatically ssh you into the compute node, allowing you to complete your work interactively.
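From that shell you can run your work directly, or launch parallel tasks with srun inside the allocation (a minimal sketch; ./my_program is a placeholder for your own executable):
srun -n 2 ./my_program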
Notchpeak-Shared-Short and Redwood-Shared-Short Partitions
CHPC cluster queues tend to be very busy; it may take some time for an interactive
job to start. For this reason, we have added two nodes to a special partition, notchpeak-shared-short, on the notchpeak cluster. These nodes are geared more towards interactive work.
Job limits on this partition are 8 hours of wall time, a maximum of ten submitted jobs
per user, and a maximum of two running jobs per user with a combined total of at most
32 tasks and 128 GB of memory.
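You can query the current partition settings yourself with a standard Slurm command (the fields shown depend on the site configuration):
scontrol show partition notchpeak-shared-short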
There is an equivalent partition in the Protected Environment called redwood-shared-short.
To access these special partitions, request both the account and the partition of the same name, e.g.:
salloc -N 1 -n 2 -t 2:00:00 -A notchpeak-shared-short -p notchpeak-shared-short
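In the Protected Environment, the equivalent request follows the same pattern with the redwood-shared-short account and partition:
salloc -N 1 -n 2 -t 2:00:00 -A redwood-shared-short -p redwood-shared-short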
Logging Onto Computational Nodes: Checking Job Stats
Sometimes it is useful to connect to the node(s) where a job is running to monitor the executable
and determine whether it is running correctly and efficiently. For this reason, we allow users
with active jobs to ssh to the compute nodes hosting those jobs. To determine the
name of your compute node(s), run the squeue -u $USER command, and then ssh to the node(s) listed.
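For example (a sketch; notch081 is a hypothetical node name, so substitute the value shown in the NODELIST column of your own squeue output):
squeue -u $USER
ssh notch081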
Once logged onto the compute node, you can run the top command to view CPU and memory usage of the node. If using GPUs, you can view GPU
usage through the nvidia-smi command.
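For example, to limit top to your own processes and to refresh the GPU statistics every few seconds (standard Linux and NVIDIA utilities; the 5-second interval is arbitrary):
top -u $USER
watch -n 5 nvidia-smi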