GPUs and Accelerators at CHPC

The CHPC has a limited number of cluster compute nodes with GPUs. The older nodes, which have NVidia Tesla M2090 GPU cards, are part of the Ember cluster. The most recent nodes, which have Tesla K80, GeForce TitanX, or Tesla P100 cards, are part of the Kingspeak cluster. This document describes the hardware, as well as access to and usage of these resources.

Hardware overview

Access and running jobs, node sharing

GPU programming environment

Installed GPU codes

Hardware Overview

CHPC has a limited number of GPU nodes on ember and kingspeak.

Nodes             GPU type         GPU count   CPU core count   Memory   Compute capability
kp359-kp362       NVIDIA P100      2           28               256 GB   6.0
kp297 and kp298   GeForce TitanX   8           12               64 GB    5.0
kp299 and kp300   NVIDIA K80       8           12               64 GB    3.5
em001-em011       NVIDIA M2090     2           12               24 GB    2.0

Kingspeak

Kingspeak has two nodes with four Tesla K80 cards each, two nodes with eight GeForce TitanX cards each and four nodes with two Tesla P100 cards each.

The K80 is of the Kepler architecture, released in late 2014. Each K80 card consists of two GPUs, each GPU having 12 GB of global memory; thus the K80 nodes show 8 total GPUs available. Peak double precision performance of a single K80 card is 1864 GFlops. The K80 nodes have two 6-core Intel Haswell generation CPUs and 64 GB of RAM.

The GeForce TitanX is of the next generation Maxwell architecture, and also has 12 GB of global memory per card. An important difference from the K80 is that it does not have very good double precision performance (max ca. 200 GFlops), but it has excellent single precision speed (~7 TFlops). The TitanX nodes should therefore be used for either single precision or mixed single-double precision GPU codes. The TitanX nodes have two 6-core Intel Haswell generation CPUs and 64 GB of host RAM.

The Tesla P100 is of the Pascal architecture, and has 16 GB of global memory per card. Each GPU card contains 56 multiprocessors with 64 cores each (3584 cores in total). ECC support is (currently) disabled. The system interface is a PCIe Gen3 bus. The double-precision performance per card is 4.7 TFlops, the single-precision performance is 9.3 TFlops, and the half-precision performance is 18.7 TFlops. The P100 nodes each have two 14-core Intel Broadwell processors (E5-2680 v4 running at 2.4 GHz) and 256 GB of RAM.

Ember

The Ember cluster has eleven nodes which have two Tesla M2090 cards each. The M2090 is of the Fermi architecture (compute capability 2.0) that was released in 2011. Each card has 6 GB of global memory. Although relatively old, each card has a peak double precision floating point performance of 666 GFlops, still making it a good performer. The GPU nodes have two 6-core Intel Westmere generation CPUs and 24 GB of host RAM.

Access and running jobs, node sharing

The use of the GPU nodes does not affect your allocation (i.e. their usage does not count against any allocation your group may have). However, we have restricted them to users with a special GPU account, which needs to be requested; please e-mail issues@chpc.utah.edu to request the GPU account. The GPU partition and account on ember are both named ember-gpu, while on kingspeak the partition and account names are kingspeak-gpu for the nodes containing the K80 and GeForce TitanX cards. The nodes with NVIDIA P100 cards are owner nodes; therefore, the partition name is kingspeak-gpu-guest and the account name is owner-gpu-guest. Jobs on the NVIDIA P100 nodes may be subject to preemption.

One has to request the GPUs via a list of generic consumable resources (a.k.a. gres), for example #SBATCH --gres=gpu:k80:8. The gres notation is a colon separated list of resource_type:resource_name:resource_count. In our case, the resource_type is always gpu, the resource_name is either m2090, k80, titanx, or p100, and the resource_count is the number of GPUs per node requested (1-8). Note that if you do not include a #SBATCH --gres= line, your job will not be assigned any GPUs.

Some programs are serial, or able to run only on a single GPU; other jobs perform better on one or a few GPUs and therefore cannot efficiently make use of all of the GPUs on a node. Therefore, in order to better utilize our GPU nodes, node sharing has been enabled for the GPU partitions. This allows multiple jobs to run on the same node, each job being assigned specific resources (number of cores, amount of memory, number of accelerators). The node resources are managed by the scheduler up to the maximum available on each node. It should be noted that while efforts are made to isolate jobs running on the same node, there are still many shared components in the system, so a job's performance can be affected by the other job(s) running on the node at the same time. If you are doing benchmarking, you will therefore want to request the entire node even if your job will only make use of part of it.

Node sharing is accessed by requesting fewer than the full number of GPUs, cores, and/or memory on the node; sharing can be done on the basis of any one of these resources, or all three. By default, each job gets 2 GB of memory per core requested (the lowest common denominator among our cluster nodes), so to request a different amount of memory you must use the --mem flag. To request exclusive use of the node, use --mem=0.

Below is a list of useful job modifiers:

Option                      Explanation
#SBATCH --gres=gpu:k80:1    request one K80 GPU
#SBATCH --mem=4G            request 4 GB of RAM
#SBATCH --mem=0             request all memory of the node; this option also ensures the node (and thus all of its cores) is in exclusive use by the job
#SBATCH --ntasks=1          request 1 core

 

An example script that requests two Ember nodes with two M2090 GPUs each, including all cores and all memory, and runs one GPU per MPI task, would look like this:

#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --mem=0
#SBATCH --partition=ember-gpu
#SBATCH --account=ember-gpu
#SBATCH --gres=gpu:m2090:2
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe

To request all 8 K80 GPUs on a kingspeak node, again using one GPU per MPI task, we would do:

#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=0
#SBATCH --partition=kingspeak-gpu
#SBATCH --account=kingspeak-gpu
#SBATCH --gres=gpu:k80:8
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe

As an example, the script below will request four GPUs, four CPU cores, and 8 GB of memory. The remaining GPUs, CPUs, and memory will then be accessible for other jobs.

#SBATCH --time=00:30:00             
#SBATCH --nodes=1    
#SBATCH --ntasks=4
#SBATCH --gres=gpu:titanx:4
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu

The script below will ask for 14 CPU cores, 100 GB of memory and 1 GPU card on one of the P100 nodes.

#SBATCH --time=00:30:00             
#SBATCH --nodes=1    
#SBATCH --ntasks=14
#SBATCH --gres=gpu:p100:1
#SBATCH --mem=100GB
#SBATCH --account=owner-gpu-guest
#SBATCH --partition=kingspeak-gpu-guest

To run an interactive job, do not use the usual srun command, as it does not work correctly with the gres. Instead, use the salloc command, e.g.

salloc -n 1 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:1

This allocates the resources to the job, but keeps the prompt on the interactive node. You can then use the srun or mpirun commands to launch the calculation on the allocated compute node resources.
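
For example, a GPU program (myprogram.exe is only a placeholder for your own executable) could then be launched on the allocated node as:

srun -n 1 ./myprogram.exe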

GPU programming environment

NVidia CUDA, PGI CUDA Fortran, and the OpenACC compilers are installed on all GPU nodes. The default CUDA is 8.0, which at the time of writing is the most recent. The default CUDA installation is found at /usr/local/cuda, or can be set up by simply loading the CUDA module, module load cuda. The PGI compilers come with their own, quite recent, CUDA and can be set up by loading the PGI module, module load pgi.

We recommend getting an interactive session on a compute node in order to compile CUDA code, but, in a pinch, any interactive node should work, as CUDA is installed on the interactive nodes as well. The PGI compilers come with their own CUDA, so compiling anywhere you can load the PGI module should work.

To compile CUDA code so that it runs on all four types of GPUs that we have, use the following compiler flags: -gencode arch=compute_20,code=sm_20 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60. For more information on the CUDA compilation and linking flags, please have a look at http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
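
As a sketch, a CUDA source file (myprogram.cu here is just a placeholder name) can be built so that the resulting binary runs on all four GPU types like this:

module load cuda
nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_37,code=sm_37 \
     -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 \
     -o myprogram myprogram.cu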

The PGI compilers specify the GPU architecture with the -tp=tesla flag. If no further option is specified, this flag generates code for all available compute capabilities (at the time of writing cc20, cc30, cc35 and cc50). To target each GPU specifically, -tp=tesla:cc20 can be used for the M2090, -tp=tesla:cc35 for the K80, and -tp=tesla:cc50 for the TitanX. To invoke OpenACC, use the -acc flag. More information on OpenACC can be obtained at http://www.openacc.org.
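
For illustration, with placeholder source files myprogram.c (OpenACC) and myprogram.cuf (CUDA Fortran), the PGI builds might look like:

module load pgi
pgcc -acc -tp=tesla -o myprogram_acc myprogram.c
pgfortran -tp=tesla:cc35 -o myprogram_cuf myprogram.cuf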

Good tutorials on GPU programming are available at the CUDA Education and Training site from NVidia.

When running GPU code, it is worth checking the resources that the program is using, to ensure that the GPU is well utilized. For that, one can run the nvidia-smi command and watch the GPU and memory utilization. nvidia-smi is also useful to query and set various features of the GPU; see nvidia-smi --help for all the options that the command takes. For example, nvidia-smi -L lists the GPU cards. On the TitanX nodes:

GPU 0: GeForce GTX TITAN X (UUID: GPU-cd731d6a-ee18-f902-17ff-1477cc59fc15)
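
To keep an eye on the GPU while a job is running, nvidia-smi can also be run in a loop, e.g. refreshing the output every 10 seconds:

nvidia-smi -l 10
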
Debugging

NVidia's CUDA distribution includes a terminal debugger named cuda-gdb. Its operation is similar to the GNU gdb debugger. For details, see the cuda-gdb documentation.
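
As a rough sketch (myprogram.cu again being a placeholder), the code first needs to be compiled with device debug information (the -G flag) before it can be run under cuda-gdb:

nvcc -g -G -o myprogram myprogram.cu
cuda-gdb ./myprogram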

For out of bounds and misaligned memory access errors, there is the cuda-memcheck tool. For details, see the cuda-memcheck documentation.
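
cuda-memcheck simply wraps the program invocation, for example (again with a placeholder executable):

cuda-memcheck ./myprogram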

The Totalview debugger that we license also supports CUDA and OpenACC debugging. Due to its user friendly graphical interface we recommend using Totalview for GPU debugging. For information on how to use Totalview, see our debugging page.

Profiling

Profiling can be very useful in finding GPU code performance problems, for example inefficient GPU utilization, use of shared memory, etc. NVidia CUDA provides both a command line profiler (nvprof) and a visual profiler (nvvp). More information is in the CUDA profilers documentation.
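
As a minimal sketch with a placeholder executable, nvprof can print a summary to the terminal, or write a profile file that can then be imported into the visual profiler:

nvprof ./myprogram
nvprof -o myprogram.prof ./myprogram
nvvp &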

 

Installed GPU codes

We have the following GPU codes installed:

HOOMD
  Module name: hoomd
  Prerequisite modules: gcc/4.8.5, mpich2/3.2.g
  Sample batch script(s): /uufs/chpc.utah.edu/sys/installdir/hoomd/2.0.0g-[sp,dp]/examples/

VASP
  Module name: vasp
  Prerequisite modules: intel, impi, cuda/7.5
  Sample batch script(s): /uufs/chpc.utah.edu/sys/installdir/vasp/examples
  Other notes: per group license, let us know if you need access

AMBER
  Module name: amber-cuda
  Prerequisite modules: gcc/4.4.7, mvapich2/2.1.g
  Sample batch script(s): adapt a CPU script

LAMMPS
  Module name: lammps/10Aug15
  Prerequisite modules: intel/2016.0.109, impi/5.1.1.109
  Sample batch script(s): adapt a CPU script

If there is any other GPU code that you would like us to install, please let us know.

Some commercial programs that we have installed, such as Matlab, also have GPU support. Either try them or contact us.

Last Updated: 5/8/17