GPUs and Accelerators at CHPC

The CHPC has a limited number of cluster compute nodes with GPUs. The older nodes, with Nvidia Tesla M2090 GPU cards, are part of the Ember cluster. More recent nodes with Tesla K80, GeForce TitanX, or Tesla P100 cards are part of the Kingspeak cluster, and the newest nodes with Tesla V100, Titan V, or GTX 1080 Ti cards are part of the Notchpeak cluster. This document describes the hardware, as well as access to and usage of these resources.

Hardware overview

Access and running jobs, node sharing

GPU programming environment

Installed GPU codes

Hardware Overview

CHPC has a limited number of GPU nodes on the ember, kingspeak and notchpeak clusters, as well as a standalone OpenPower server.

Nodes                    GPU type         GPU count   CPU core count   Host memory   Compute capability
cryosparc (restricted)*  GTX 1080 Ti      4           32               128 GB        6.1
notch060**               GTX 1080 Ti      8           16               96 GB         6.1
notch055***              Titan V          4           32               192 GB        7.0
notch001-notch003        Nvidia V100      3           32               192 GB        7.0
kp359-kp362****          Nvidia P100      2           28               256 GB        6.0
kp297 and kp298          GeForce TitanX   8           12               64 GB         5.2
kp299 and kp300          Nvidia K80       8           12               64 GB         3.7
em001-em011              Nvidia M2090     2           12               24 GB         2.0
OpenPower server         Nvidia K80       2           16 (Power8)      256 GB        3.7


* The cryosparc nodes are standalone servers owned by individual research groups. They are included here only for documentation purposes.

** notch060 is owned by a research group. Users outside the research group can only access the node in guest mode (in a similar manner to accessing the non-GPU owner nodes).

*** notch055 is owned by a research group. Users outside the research group can only access the node in guest mode (in a similar manner to accessing the non-GPU owner nodes).

**** The P100 GPU nodes are owned by the School of Computing (SOC). Users not from the SOC can only access these in guest mode, in a similar manner to accessing the non-GPU owner nodes.


Notchpeak has three nodes with three Tesla V100 cards each and one node with four Titan V cards.

The V100 is of the Volta architecture, released in late 2017. Each GPU has 16 GB of global memory. Peak double precision performance of a V100 is 7 TFlops. These nodes have two 16-core Intel Skylake generation CPUs (Gold 6130 @ 2.10 GHz) and 192 GB of RAM. We recommend using CUDA >= 9.1, which supports the 7.x compute capability.

The Titan V card (Volta architecture) has 80 streaming multiprocessors (SMs), 12 GB of HBM2 memory, 640 Tensor Cores and 5120 CUDA cores. Its performance specifications are: 6.9 TFlops double precision, 13.8 TFlops single precision, 27.6 TFlops half precision, and 110 TFlops tensor performance (deep learning). Please use CUDA >= 9.1, which supports the 7.x compute capability.

Notch060 is a node with 16 cores (Intel Xeon Silver CPU) and 96GB memory. It contains 8 GPU cards (GeForce GTX 1080 Ti) belonging to the Pascal generation. 


Kingspeak has two nodes with four Tesla K80 cards each, two nodes with eight GeForce TitanX cards each and four nodes with two Tesla P100 cards each.

The K80 is of the Kepler architecture, released in late 2014. Each K80 card consists of two GPUs, each with 12 GB of global memory; the K80 nodes therefore show 8 GPUs available in total. Peak double precision performance of a single K80 card is 1864 GFlops. The K80 nodes have two 6-core Intel Haswell generation CPUs and 64 GB of RAM.

The GeForce TitanX is of the next generation Maxwell architecture, and also has 12 GB of global memory per card.  An important difference from the K80 is that it does not have very good double precision performance (max ca. 200 GFlops), but it has great single precision speed (~ 7 TFlops). The TitanX nodes should be used for either single precision, or mixed single-double precision GPU codes. The TitanX nodes have two 6-core Intel Haswell generation CPUs and 64 GB of host RAM.

The Tesla P100 is of the Pascal architecture and has 16 GB of global memory per card. Each GPU contains 56 multiprocessors, each with 64 cores (3584 cores in total). ECC support is (currently) disabled. The system interface is a PCIe Gen3 bus. Per card, the double precision performance is 4.7 TFlops, the single precision performance is 9.3 TFlops, and the half precision performance is 18.7 TFlops. Each P100 node has two 14-core Intel Broadwell processors (E5-2680 v4 @ 2.4 GHz) and 256 GB of RAM.


The Ember cluster has eleven nodes which have two Tesla M2090 cards each. The M2090 is of the Fermi architecture (compute capability 2.0) that was released in 2011. Each card has 6 GB of global memory. Although relatively old, each card has a peak double precision floating point performance of 666 GFlops, still making it a good performer. The GPU nodes have two 6-core Intel Westmere generation CPUs and 24 GB of host RAM.

Access and running jobs, node sharing

The use of the GPU nodes does not affect your allocation (i.e. their usage does not count against any allocation your group may have). However, we have restricted them to users with a special GPU account, which needs to be requested; please e-mail us to request the GPU account. The GPU partition and account on ember are both named ember-gpu. On kingspeak, the partition and account names are kingspeak-gpu for the nodes containing the K80 and GeForce TitanX cards. On notchpeak, the partition and account are notchpeak-gpu.

The nodes with NVIDIA P100 cards are owner nodes; the partition name is kingspeak-gpu-guest and the account name is owner-gpu-guest. Jobs on the NVIDIA P100 nodes may be subject to preemption.

The nodes notch055 and notch060 are owned by research groups. Users outside the groups can use the GPU devices in guest mode; the account name is owner-gpu-guest and the partition name is notchpeak-gpu-guest.

One has to request the GPUs via a list of generic consumable resources (a.k.a. gres), e.g. #SBATCH --gres=gpu:k80:8. The gres notation is a colon-separated list, resource_type:resource_name:resource_count. In our case, the resource_type is always gpu, the resource_name is one of m2090, k80, titanx, p100, v100 or titanv, and the resource_count is the number of GPUs per node requested (1-8). Note that if your script has no --gres line, your job will not be assigned any GPUs.
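For illustration, a gres request for each of the main node types might look like one of the following lines (a sketch; use only one such line per job, and adjust the count to your needs):

```shell
#SBATCH --gres=gpu:m2090:2   # two M2090s on an ember-gpu node
#SBATCH --gres=gpu:titanx:4  # four TitanX cards on a kingspeak-gpu node
#SBATCH --gres=gpu:p100:1    # one P100 on a kingspeak-gpu-guest node
#SBATCH --gres=gpu:v100:3    # all three V100s on a notchpeak-gpu node
```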

Some programs are serial, or able to run on only a single GPU; other jobs perform best on one or a few GPUs and therefore cannot efficiently make use of all of the GPUs on a single node. In order to better utilize our GPU nodes, node sharing has been enabled for the GPU partitions. This allows multiple jobs to run on the same node, each job being assigned specific resources (number of cores, amount of memory, number of accelerators). The node resources are managed by the scheduler up to the maximum available on each node. Note that while efforts are made to isolate jobs running on the same node, there are still many shared components in the system, so a job's performance can be affected by other job(s) running on the node at the same time. If you are benchmarking, request the entire node even if your job will only make use of part of it.

Node sharing is engaged by requesting less than the full number of GPUs, cores and/or memory of a node. By default, each job gets 2 GB of memory per core requested (the lowest common denominator among our cluster nodes); to request a different amount of memory, use the --mem flag. To request exclusive use of the node, use --mem=0.

When node sharing is on (the default unless you ask for the full number of GPUs, cores or memory), the SLURM scheduler automatically sets task-to-core affinity, mapping one task per physical core. To find which cores are bound to the job's tasks, run:

cat /cgroup/cpuset/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/cpuset.cpus
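Within a job, Slurm also exports CUDA_VISIBLE_DEVICES with the indices of the GPUs assigned to the job, so CUDA programs only see their share of a shared node. A quick sketch for checking this from a job script (the ngpus variable is an illustrative helper, not a CHPC tool):

```shell
# Slurm sets CUDA_VISIBLE_DEVICES to the GPU indices granted to this
# job, e.g. "0,1" for a two-GPU request on a shared node.
echo "$CUDA_VISIBLE_DEVICES"

# Count the GPUs the job was given by splitting the list on commas:
ngpus=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
echo "This job has $ngpus GPU(s)"
```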

Below is a list of useful job modifiers:

Option                       Explanation
#SBATCH --gres=gpu:k80:1     request one K80 GPU
#SBATCH --mem=4G             request 4 GB of RAM
#SBATCH --mem=0              request all memory of the node; this option also ensures the node is in exclusive use by the job
#SBATCH --ntasks=1           request 1 core
#SBATCH --ntasks=12          request all cores of a 12-core node


An example script that requests two Ember nodes with two M2090 GPUs each, including all cores and all memory, running one GPU per MPI task, would look like this:

#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --mem=0
#SBATCH --partition=ember-gpu
#SBATCH --account=ember-gpu
#SBATCH --gres=gpu:m2090:2
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe

To request all 8 K80 GPUs on a kingspeak node, again using one GPU per MPI task, we would do:

#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=0
#SBATCH --partition=kingspeak-gpu
#SBATCH --account=kingspeak-gpu
#SBATCH --gres=gpu:k80:8
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe

As an example, the script below will get four GPUs, four CPU cores, and 8 GB of memory (the 2 GB per core default). The remaining GPUs, CPUs, and memory will then be accessible for other jobs.

#SBATCH --time=00:30:00             
#SBATCH --nodes=1    
#SBATCH --ntasks=4
#SBATCH --gres=gpu:titanx:4
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu

The script below will ask for 14 CPU cores, 100 GB of memory and 1 GPU card on one of the P100 nodes.

#SBATCH --time=00:30:00             
#SBATCH --nodes=1    
#SBATCH --ntasks=14
#SBATCH --gres=gpu:p100:1
#SBATCH --mem=100GB
#SBATCH --account=owner-gpu-guest
#SBATCH --partition=kingspeak-gpu-guest

To run a parallel interactive job with MPI, do not use the usual srun command, as this does not work correctly with the gres. Instead, use the salloc command, e.g.

salloc -n 1 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:1

This will allocate the resources to the job, but keeps the prompt on the interactive node. You can then use srun or mpirun commands to launch the calculation on the allocated compute node resources.
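Put together, an interactive MPI session might look like the following sketch (the program name and resource counts are placeholders; adjust the partition, account and gres to the node type you need):

```shell
# Hold an allocation for four tasks and four TitanX GPUs on one node;
# the prompt stays on the interactive node while the allocation is held.
salloc -n 4 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:4

# Launch the calculation onto the allocated node; myprogram.exe is a
# placeholder for your own executable.
mpirun -np $SLURM_NTASKS ./myprogram.exe

# Release the allocation when finished.
exit
```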

For serial jobs, utilizing one or more GPUs, srun is functional, e.g.

srun -n 1 -N 1 -A owner-gpu-guest -p kingspeak-gpu-guest --gres=gpu:p100:1 --pty /bin/bash -l

GPU programming environment

On all GPU nodes, Nvidia CUDA, PGI CUDA Fortran and the OpenACC compilers are installed. The default CUDA is 8.0, which at the time of writing is the most recent. The default CUDA installation is found at /usr/local/cuda, or by simply loading the CUDA module, module load cuda. The PGI compilers come with their own, quite recent, CUDA, and can be set up by loading the PGI module, module load pgi.

We recommend getting an interactive session on a compute node in order to compile CUDA code but, in a pinch, any interactive node should work, as CUDA is installed on the interactive nodes as well. The PGI compilers come with their own CUDA, so compiling anywhere you can load the PGI module should work.

To compile CUDA code so that it runs on all the types of GPUs that we have, use the following compiler flags: -gencode arch=compute_20,code=sm_20 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70. For more information on the CUDA compilation and linking flags, please see the nvcc documentation.
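As a sketch (the source file name myprogram.cu is hypothetical, and nvcc must be on your path, e.g. via module load cuda), a fat-binary build covering those architectures would look like:

```shell
# Compile one binary containing device code for every GPU generation
# at CHPC; nvcc embeds one code version per -gencode pair.
module load cuda
nvcc -O2 \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_37,code=sm_37 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_70,code=sm_70 \
     -o myprogram myprogram.cu
```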

The PGI compilers specify the GPU architecture with the -ta=tesla flag. If no further option is specified, the flag will generate code for all supported compute capabilities (at the time of writing cc20, cc30, cc35, cc50 and cc60). To target a specific GPU, -ta=tesla:cc20 can be used for the M2090, -ta=tesla:cc35 for the K80, -ta=tesla:cc50 for the TitanX and -ta=tesla:cc60 for the P100. To enable OpenACC, use the -acc flag. More information is available on the OpenACC website.
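For example (a sketch; saxpy.c is a hypothetical OpenACC source file):

```shell
# Build an OpenACC code with the PGI C compiler, targeting only the
# P100 (cc60); -Minfo=accel reports what the compiler offloaded.
module load pgi
pgcc -acc -ta=tesla:cc60 -Minfo=accel -o saxpy saxpy.c
```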

Good tutorials on GPU programming are available at the CUDA Education and Training site from Nvidia.

When running GPU code, it is worth checking the resources that the program is using, to ensure that the GPU is well utilized. For that, one can run the nvidia-smi command and watch the GPU and memory utilization. nvidia-smi is also useful to query and set various features of the GPU; see nvidia-smi --help for all the options that the command takes. For example, nvidia-smi -L lists the GPU card properties. On the TitanX nodes:

GPU 0: GeForce GTX TITAN X (UUID: GPU-cd731d6a-ee18-f902-17ff-1477cc59fc15)

Nvidia's CUDA distribution includes a terminal debugger named cuda-gdb. Its operation is similar to the GNU gdb debugger. For details, see the cuda-gdb documentation.

For out of bounds and misaligned memory access errors, there is the cuda-memcheck tool. For details, see the cuda-memcheck documentation.

Both the Totalview debugger that we used to license and the DDT debugger that we currently license support CUDA and OpenACC debugging. Due to their user-friendly graphical interfaces, we recommend them for GPU debugging. For information on how to use DDT or Totalview, see our debugging page.


Profiling can be very useful in finding GPU code performance problems, for example inefficient GPU utilization, use of shared memory, etc. Nvidia CUDA provides both a command line profiler (nvprof) and a visual profiler (nvvp). More information is in the CUDA profilers documentation.


Installed GPU codes

We have the following GPU codes installed:

Code name   Module name      Prerequisite modules     Sample batch script(s) location   Other notes
HOOMD       hoomd
VASP        vasp                                      /uufs/                            Per group license, let us know if you need access
AMBER       amber-cuda                                adapt CPU script
LAMMPS      lammps/10Aug15   intel/2016.0.109 impi/   adapt CPU script


If there is any other GPU code that you would like us to install, please let us know.

Some commercial programs that we have installed, such as Matlab, also have GPU support. Either try them or contact us.

Last Updated: 1/4/19