GPUs and Accelerators at the CHPC
The CHPC provides a range of compute nodes equipped with GPUs. These GPU-enabled nodes are available on the granite, notchpeak, kingspeak, lonepeak, and redwood clusters (the latter being part of the Protected Environment). This document outlines the hardware specifications, access procedures, and usage guidelines for these resources.
GPU Hardware Overview
The CHPC provides GPU devices on the granite (grn), notchpeak (np), kingspeak (kp), lonepeak (lp), and redwood (rw) clusters. Please note that the redwood cluster is part of the Protected Environment (PE).
In Table 1, we provide a summary of the different types of NVIDIA GPU devices, with the following columns:
- Architecture: the GPU generation or architecture to which a device belongs.
- Device: the model or name of the GPU.
- Compute Capability: defines the hardware features and supported instructions of the device.
- Global Memory: the amount of global memory available on the GPU.
- Area: recommended application area for the GPU.
- AI: indicates whether the GPU efficiently supports AI workloads.
- FP64: indicates whether the GPU supports native FP64 floating-point operations.
- Gres Type: the string to be specified with the SLURM --gres option to request this GPU type.
- Cluster: the cluster(s) where this GPU type is available.
| Architecture | Device | Compute Capability | Global Memory | Area | Gres Type | Cluster |
|---|---|---|---|---|---|---|
| Blackwell | RTX PRO 6000 Blackwell Max-Q | 12.0 | 96 GB | AI | rtxpr6000bl | grn |
| Blackwell | RTX PRO 4000 Blackwell | 12.0 | 24 GB | AI | rtxpr4000bl | grn |
| Grace-Hopper | GH200 | 9.0 | 96 GB | AI, FP64 | gh200 | grn |
| Hopper | H200 | 9.0 | 141 GB | AI, FP64 | h200, h200_1g.18gb, h200_2g.35gb, h200_3g.71gb | grn, rw |
| Hopper | H100 NVL | 9.0 | 96 GB | AI, FP64 | h100nvl, h100 | grn, np, rw |
| Ada Lovelace | L40S | 8.9 | 48 GB | AI | l40s | grn, np |
| Ada Lovelace | L40 | 8.9 | 48 GB | AI | l40 | np |
| Ada Lovelace | L4 Tensor Core | 8.9 | 24 GB | AI | l4 | grn |
| Ada Lovelace | RTX 6000 Ada | 8.9 | 48 GB | AI | rtx6000 | grn, np, rw |
| Ada Lovelace | RTX 5000 Ada | 8.9 | 32 GB | AI | rtx5000 | grn, rw |
| Ada Lovelace | RTX 4500 Ada | 8.9 | 24 GB | AI | rtx4500 | grn |
| Ada Lovelace | RTX 4000 Ada | 8.9 | 20 GB | AI | rtx4000ada | grn |
| Ada Lovelace | RTX 2000 Ada | 8.9 | 16 GB | AI | rtx2000 | grn |
| Ampere | A100-SXM4-80GB | 8.0 | 80 GB | AI, FP64 | a100 | np |
| Ampere | A100-PCIE-80GB | 8.0 | 80 GB | AI, FP64 | a100 | np, rw |
| Ampere | A100-SXM4-40GB | 8.0 | 40 GB | AI, FP64 | a100 | rw |
| Ampere | A100-PCIE-40GB | 8.0 | 40 GB | AI, FP64 | a100 | np |
| Ampere | A30 | 8.0 | 24 GB | AI, FP64 | a30 | rw |
| Ampere | A800 40GB Active | 8.0 | 40 GB | AI, FP64 | a800 | grn, np |
| Ampere | RTX A6000 | 8.6 | 48 GB | AI | a6000 | np |
| Ampere | A40 | 8.6 | 48 GB | AI | a40 | np, rw |
| Ampere | GeForce RTX 3090 | 8.6 | 24 GB | AI | 3090 | np |
| Ampere | RTX A5500 | 8.6 | 24 GB | AI | a5500 | np |
| Turing | GeForce RTX 2080 Ti | 7.5 | 11 GB | AI | 2080ti | np |
| Turing | Tesla T4 | 7.5 | 16 GB | AI | t4 | np |
| Volta | V100-PCIE-16GB | 7.0 | 16 GB | AI, FP64 | v100 | np |
| Volta | TITAN V | 7.0 | 12 GB | AI, FP64 | titanv | np |
| Pascal | Tesla P100-PCIE-16GB | 6.0 | 16 GB | AI, FP64 | p100 | kp |
| Pascal | Tesla P40 | 6.1 | 24 GB | AI | p40 | np |
| Pascal | GeForce GTX 1080 Ti | 6.1 | 11 GB | AI | 1080ti | lp, rw |
| Maxwell | GeForce GTX Titan X | 5.2 | 12 GB | AI | titanx | kp |
A detailed description of the features of each GPU type can be found here.
Getting access to CHPC's GPUs
To access CHPC GPU resources, you must have an active CHPC account. Access levels vary by cluster and node type:
1. General Nodes (CHPC-Owned)
- Free Access: Available to all users on Notchpeak, Kingspeak, Lonepeak, and Redwood.
- Allocation Required: Accessing Granite requires a formal allocation. Once an allocation is exhausted, "freecycle" mode is available but subject to preemption (interruption).
2. Owner Nodes (Group-Owned)
- Guest Access: All users can use idle owner nodes, but jobs will be preempted if the owner needs the resource.
- Owner Access: Granted to members of the purchasing group. Collaborators may request access by having the PI of the owner node email helpdesk@chpc.utah.edu.
3. Specialized Access
- One-U Responsible AI (RAI): University researchers with AI projects can request owner-mode access by emailing helpdesk@chpc.utah.edu.
- Shared-Short Partitions: notchpeak-shared-short and redwood-shared-short provide non-preemptible access for small jobs (max 8 cores, 8-hour limit).
Running SLURM Jobs Using GPUs
Access to GPU-enabled nodes is managed through SLURM job scheduling. You can access these nodes in three ways:
- By submitting a SLURM batch script using sbatch
- Interactively using salloc
- By using the CHPC's Open OnDemand web portal
The former two methods are described in detail in our SLURM documentation page. However, using GPUs requires specific SLURM options, and some standard options must be configured differently:
Required SLURM Flags:
- --gres=gpu[:<gres_type>]:<num_devices>: mandatory to request actual GPU hardware.
  - <gres_type> (optional): selects a specific model of GPU (e.g., a100 or rtx6000). If you leave this out, SLURM will assign any available GPU type in the partition.
  - <num_devices> (optional): specifies how many GPUs you need. If omitted, the system defaults to one GPU.
- --account & --partition: must be set to GPU-specific values.
- --qos: mandatory only when using the granite cluster.
- --mem: optional; use only if your job has specific memory requirements.
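Putting these flags together, a minimal batch script might look like the following sketch (the account and partition values mirror the notchpeak-gpu interactive example later in this page; my_gpu_program is a placeholder):

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --partition=notchpeak-gpu   # GPU-specific partition
#SBATCH --account=notchpeak-gpu    # GPU-specific account
#SBATCH --gres=gpu:a100:1          # one A100; drop ":a100" for any available GPU type

# run the program on the allocated GPU
./my_gpu_program
```

Submit it with sbatch as you would any other job script.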
MIG (Multi-Instance GPU support)
NVIDIA's MIG technology allows a single physical GPU to be partitioned into multiple isolated instances. Each MIG has its own memory, cache, and compute cores. A single GPU can support up to 7 MIGs.
MIG configurations available on the notchpeak and redwood clusters include:

| gres_type | Memory per MIG | Max MIGs per node |
|---|---|---|
| h200_1g.18gb | 18 GB | 56 |
| h200_2g.35gb | 35 GB | 16 |
| h200_3g.71gb | 71 GB | 8 |
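A MIG instance is requested with the same --gres syntax as a full GPU, using the gres_type values from the table above (a sketch):

```shell
#SBATCH --gres=gpu:h200_1g.18gb:1   # one 18 GB MIG instance of an H200
```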
Tip: If you don't care about the specific model and just want the first available GPU, use:
#SBATCH --gres=gpu:1
Cluster-Wide GPU Overview
To view all partitions and their associated features and gres_type values across all CHPC clusters (including both the general and protected environments), use the Slurm alias shortcut:
si2 -M all -a
The --mem Slurm option
By default, each job is allocated 2 GB of memory per requested core, which is the lowest common denominator across CHPC clusters. If you need more memory,
use the --mem option to specify the desired amount.
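For example, to request 32 GB of host memory for the job (a sketch; the amount is arbitrary):

```shell
#SBATCH --mem=32G
```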
Most jobs will not need a whole node's resources, but to request all of the memory on a node, set: --mem=0
Note: Open OnDemand simplifies the user experience. Once a user selects a combination of account, partition, and QoS, the list of available GPUs associated with that selection is automatically populated. The amount of CPU memory can still be adjusted as needed.
Interactive Jobs
The following command requests interactive access on the notchpeak-gpu partition using
two 3090 GPUs. The CPU memory allocated is 2 GB per task. If you need more memory,
add the --mem flag.
salloc --ntasks=2 --nodes=1 --time=01:00:00 --partition=notchpeak-gpu --account=notchpeak-gpu --gres=gpu:3090:2
Best Practices
1. How to select the right GPU
CHPC clusters offer a diverse range of GPU architectures. To select the most efficient resource for your research, consider these four primary factors:
1. Numerical Precision (FP64)
- The Question: Does your code require double-precision operations?
- The Recommendation: For scientific simulations requiring high mathematical accuracy, choose GPUs with native FP64 support (labeled "FP64" in Table 1). AI and deep learning tasks typically rely on FP32 or FP16 and do not require this.

2. Global Memory Requirements
- The Question: How much VRAM does your application need?
- The Recommendation: Memory demand is driven by dataset size and kernel complexity. In deep learning, your choice will be influenced by model parameters, batch sizes, and numerical precision. Ensure your selected GPU has enough global memory to prevent "Out of Memory" (OOM) errors.

3. AI Acceleration (Tensor Cores)
- The Question: Can your application leverage Tensor Cores?
- The Recommendation: If you are performing deep learning training or inference, prioritize GPUs with Tensor Core acceleration to significantly reduce computation time. Have a look at the hardware page to identify GPUs that support Tensor Core acceleration.

4. Compute Capability & Architecture
- The Question: Does your software require a specific GPU generation?
- The Recommendation: Some modern libraries require newer architectures. Check the Compute Capability (CC) listed in Table 1 to ensure compatibility with your software requirements.
Selection Tools
Once you have identified your requirements, use the following tools to find and target the right hardware:
- si or si2 commands: run these in your terminal to view available gres_type values and see exactly which nodes currently host those GPUs.
- --gres string: use the specific identifier (e.g., gpu:a100:1) in your SLURM script to request your chosen hardware.

Pro Tip: If your job doesn't have strict hardware requirements, a generic --gres=gpu:1 request will often result in a shorter queue time by allowing SLURM to pick the first available device.
2. Which GPUs are currently idle?
Use the freegpus utility to find idle resources in real time.

- Command: freegpus (scans all clusters by default).
- Output: displays the hardware type and the count of idle units, formatted as type (x count).

Note: A GPU may appear idle but remain unavailable if the node's host memory (RAM) is already fully utilized by other processes.

For advanced usage and flags, run freegpus --help.
3. Verifying GPU Allocation and Detection
NVIDIA provides the NVIDIA System Management Interface program (nvidia-smi), a powerful command-line tool built on the NVIDIA's NVML library. It offers a wide range of options for monitoring and managing GPU resources. Use the following commands to inspect your allocation:
| Command | Purpose |
|---|---|
| nvidia-smi -L | Lists all NVIDIA GPUs in the SLURM job/on the system along with their UUIDs (hardware identifiers). |
| nvidia-smi -q | Displays detailed information about each GPU, including usage, temperature, and memory. |
| nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu --format=csv | Outputs the GPU name, total memory, memory usage, and utilization percentage in CSV format. |
Checking Environment Variables
SLURM uses the CUDA_VISIBLE_DEVICES environment variable to restrict your job’s access to specifically assigned GPUs.
- How to check: run echo $CUDA_VISIBLE_DEVICES in your terminal or script.
- What it means:
  - List of IDs (e.g., 0,1): confirms these specific local GPU IDs are allocated to your job.
  - Empty output: indicates that no GPUs have been allocated to the current session.

A non-empty CUDA_VISIBLE_DEVICES output, together with the nvidia-smi -L listing, confirms that GPUs have been successfully allocated.
Note: In addition to CUDA_VISIBLE_DEVICES, there are other CUDA-related environment variables that may be relevant. For a complete
list, please have a look here.
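This check can be wrapped in a small helper for use inside job scripts, as in this sketch (the function name is our own, not a CHPC utility):

```shell
# Sketch: a guard to verify that SLURM actually handed the job one or
# more GPUs before launching any GPU work.
check_gpu_allocation() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        # empty or unset: no GPUs were assigned to this session
        echo "no GPUs allocated"
        return 1
    fi
    echo "allocated GPU IDs: ${CUDA_VISIBLE_DEVICES}"
}
```

In a batch script this could be followed by `check_gpu_allocation || exit 1` so a misconfigured job fails early instead of silently running CPU-only.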
4. How to Monitor...
Multiple GPU Jobs on a Shared Node
Each GPU job runs within its own cgroup. When a user has multiple GPU jobs running on the same node, logging into that node via SSH will place the user inside the cgroup of one of those jobs. As a result, tools like nvidia-smi only show information for that specific job, making it impossible to monitor the others.
To check the status of a different job, use the following command:
srun --pty --overlap --jobid $JOBID /usr/bin/nvidia-smi
Replace $JOBID with the job ID of the job you want to monitor.
GPU Performance within your SLURM Job
To monitor how effectively your job utilizes the GPU, add the following line to your
SLURM script
before launching your program executable:
/uufs/chpc.utah.edu/sys/installdir/chpcscripts/gpu/gpu-monitor.sh &
This command creates a file named $SLURM_JOBID.gpulog, in which GPU utilization data is recorded every 5 minutes.
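In context, the monitoring line sits just before the program launch in the batch script, as in this sketch (my_gpu_program is a placeholder):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# start the utilization logger in the background
/uufs/chpc.utah.edu/sys/installdir/chpcscripts/gpu/gpu-monitor.sh &

# then launch the GPU program as usual; utilization is logged to
# $SLURM_JOBID.gpulog every 5 minutes while the job runs
./my_gpu_program
```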
GPU Programming Environment and Performance
NVIDIA CUDA Toolkit
The NVIDIA CUDA Toolkit includes both the CUDA drivers and the CUDA Development Environment. Key components of the development environment include the nvcc compiler, GPU-accelerated libraries, and the debugging and profiling tools covered below.
To explicitly load a specific version of the CUDA development environment, use the
following command:
module load cuda/<version>
To view all available CUDA versions, run:
module spider cuda
Compiling with nvcc
The standard NVIDIA C/C++ compiler is called nvcc.
As shown earlier (see Table 1), we listed all CHPC GPU devices along with their architecture and generation. When
compiling CUDA code, it’s important to target the appropriate GPU architectures. Use
the following nvcc compiler flags to support all the current GPU architectures at CHPC:
-gencode arch=compute_52,code=sm_52 \
-gencode arch=compute_60,code=sm_60 \
-gencode arch=compute_61,code=sm_61 \
-gencode arch=compute_70,code=sm_70 \
-gencode arch=compute_75,code=sm_75 \
-gencode arch=compute_80,code=sm_80 \
-gencode arch=compute_86,code=sm_86 \
-gencode arch=compute_89,code=sm_89 \
-gencode arch=compute_90,code=sm_90
For more info on the CUDA compilation and linking flags, please have a look at CUDA C++ Programming Guide.
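As a concrete illustration, a compile line for a single source file (my_kernel.cu, a placeholder name) that targets only the Ampere and Hopper generations might look like:

```shell
# build for CC 8.0 (Ampere) and CC 9.0 (Hopper) only
nvcc -O2 -o my_program my_kernel.cu \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90
```

Restricting the target list shortens compile time but produces a binary that runs only on the listed architectures.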
Note: The Maxwell, Pascal, and Volta architectures are now feature-complete. While CUDA 12.x still supports building applications for these architectures, offline compilation and library support will be removed in the next major CUDA Toolkit release.
NVIDIA HPC Software Development Kit (SDK)
We offer the NVIDIA HPC SDK, a comprehensive toolkit for developing high-performance computing (HPC) applications that run on both CPUs and GPUs, supporting standard programming models like Fortran, OpenACC, OpenMP, and MPI.
The NVIDIA HPC SDK can be accessed by running:
module load nvhpc
Compiling using nvc, nvc++, nvfortran
The legacy PGI compilers (pgcc, pgc++, pgf90) have been succeeded by nvc, nvc++, and nvfortran. These modern compilers support C11, C++17, and Fortran 2003/2008, and are specifically optimized for NVIDIA GPUs and multicore CPUs via OpenACC and OpenMP.
Compilation Example
To compile CUDA code (.cu) while targeting specific GPU architectures (Compute Capabilities), combine the flags below.
Essential Flags
- -cuda: enables CUDA compilation.
- -fast: triggers aggressive performance optimizations.
- -Minfo=accel: provides detailed feedback on accelerator optimizations.
- -acc: required when combining CUDA with OpenACC.
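Combining these flags, a compile line might look like the following sketch (my_code.cu is a placeholder; the -gpu=cc80 option selects the target Compute Capability, here Ampere):

```shell
# compile CUDA C++ with the NVIDIA HPC SDK compiler, targeting CC 8.0
nvc++ -cuda -fast -Minfo=accel -gpu=cc80 -o my_program my_code.cu
```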
Libraries
The NVIDIA HPC SDK includes a suite of math libraries for offloading computations to the GPUs and communication libraries for high-speed data exchange between multiple GPUs. The math libraries are located in the math_libs subdirectory, while the communication libraries can be found in the comm_libs subdirectory.
To compile and link with, for example, cuBLAS, use the following flags:
- Compilation line: -I$NVROOT/math_libs/include
- Linking line: -L$NVROOT/math_libs/lib64 -Wl,-rpath=$NVROOT/math_libs/lib64 -lcublas
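Put together, a one-step build against cuBLAS might look like this sketch (my_blas.cu is a placeholder; $NVROOT is assumed to point at the HPC SDK root set up by the nvhpc module):

```shell
# compile and link a CUDA source against cuBLAS in one step
nvc++ -cuda -I$NVROOT/math_libs/include -o my_program my_blas.cu \
      -L$NVROOT/math_libs/lib64 -Wl,-rpath=$NVROOT/math_libs/lib64 -lcublas
```

The -Wl,-rpath option bakes the library path into the binary, so the program finds libcublas at run time without extra LD_LIBRARY_PATH settings.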
Debugging
The NVIDIA HPC SDK and CUDA distributions include a terminal-based debugger called cuda-gdb, which operates similarly to the GNU gdb debugger. For more information, see the cuda-gdb documentation.
To enable debugging information:
- For host (CPU) debugging:
nvc -g -o your_program your_code.c - For device (GPU) debugging:
nvc -cuda -g -G -o your_program your_code.cu
To detect out-of-bounds and misaligned memory access errors, use the cuda-memcheck tool (superseded by compute-sanitizer in CUDA 12 and later). Detailed usage instructions can be found in the cuda-memcheck documentation.
We also license the DDT debugger, which supports CUDA and OpenACC debugging. Due to its user-friendly graphical interface, we recommend DDT for GPU debugging. For guidance on using DDT, please look at our debugging page.
Profiling
Profiling is a valuable technique for identifying GPU performance issues, such as inefficient GPU utilization or suboptimal use of shared memory. NVIDIA CUDA provides a visual profiler, Nsight Systems (nsight-sys), and a command-line profiler, Nsight Compute (ncu).
Note:
We use the GPU hardware performance counters for GPU monitoring, which prevents profiling
by default. This issue typically manifests with the following error message:
$ ncu ./my-gpu-program
==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
To enable profiling, we added an additional gres option, nsight:1, to the GPU request. For example:
salloc -N 1 -n 4 -A owner-gpu-guest -p notchpeak-gpu-guest -t 1:00:00 --gres=gpu:a40:1,nsight:1
In the Open OnDemand interactive apps form, check the "Enable GPU profiling" checkbox, which appears when a GPU type is chosen.