CHPC has a limited number of GPU nodes on ember and kingspeak.
|Nodes||GPU type||GPU count||CPU core count||Memory||Compute capability|
|kp297 and kp298||GeForce TitanX||8||12||64GB||5.0|
|kp299 and kp300||Nvidia K80||8||12||64GB||3.5|
Kingspeak has two nodes with four Tesla K80 cards each, and two nodes with eight GeForce TitanX cards each.
The K80 is of the Kepler architecture, released in late 2014. Each K80 card consists of two GPUs, each GPU having 12 GB of global memory. Thus the K80 nodes will show 8 total GPUs available. Peak double precision performance of a single K80 card is 1864 GFlops. The K80 nodes have two 6-core Intel Haswell generation CPUs and 64 GB of RAM.
The GeForce TitanX is of the next generation Maxwell architecture, and also has 12 GB of global memory per card. An important difference from the K80 is that it does not have very good double precision performance (max ca. 200 GFlops), but it has great single precision speed (~ 7 TFlops). The TitanX nodes should be used for either single precision, or mixed single-double precision GPU codes. The TitanX nodes have two 6-core Intel Haswell generation CPUs and 64 GB of host RAM.
The Ember cluster has eleven nodes which have two Tesla M2090 cards each. The M2090 is of the Fermi architecture (compute capability 2.0) that was released in 2011. Each card has 6 GB of global memory. Although relatively old, each card has a peak double precision floating point performance of 666 GFlops, still making it a good performer. The GPU nodes have two 6-core Intel Westmere generation CPUs and 24 GB of host RAM.
The use of the GPU nodes does not affect your allocation (i.e. their usage does not
count against any allocation your group may have). However, we have restricted them
to users with a special GPU account, which needs to be requested. Please e-mail email@example.com to request the GPU account.
The GPU partition and account are both named cluster-gpu: on ember they are ember-gpu, while on kingspeak they are kingspeak-gpu.
GPUs must be requested via a list of generic consumable resources (a.k.a. gres), for example
#SBATCH --gres=gpu:k80:8. The gres notation is a colon separated list of
resource_type:resource_name:resource_count. In our case, the resource_type is always gpu,
the resource_name identifies the GPU model (for example k80 or titanx on kingspeak), and the
resource_count is the number of GPUs requested per node, from 1 to 8. Note that without a
--gres line such as the one above, your job will not be assigned any GPUs.
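The gres notation can be assembled mechanically. The following is a minimal sketch, assuming a hypothetical helper named gres_spec (it is not a Slurm or CHPC command) that joins the three fields and rejects GPU counts outside the 1-8 range described above:

```shell
#!/bin/bash
# gres_spec is a hypothetical helper, not part of Slurm.
# It builds resource_type:resource_name:resource_count for a GPU request.
gres_spec() {
    local name="$1" count="$2"
    # The GPU nodes described here expose at most 8 GPUs, so accept only 1-8.
    case "$count" in
        [1-8]) ;;
        *) echo "GPU count must be between 1 and 8" >&2; return 1 ;;
    esac
    # The resource type is always "gpu" for these nodes.
    echo "gpu:${name}:${count}"
}

# Print the directive to paste into a batch script.
echo "#SBATCH --gres=$(gres_spec k80 8)"
```

The printed line can be copied into the header of a batch script; the helper itself never talks to the scheduler.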
Some programs are serial, or able to run only on a single GPU; other jobs perform
better on a single or small number of GPUs and therefore cannot efficiently make use
of all of the GPUs on a single node. Therefore, in order to better utilize our GPU
nodes, node sharing has been enabled for the cluster-gpu partitions. This allows
multiple jobs to run on the same node, each job being assigned specific
resources (number of cores, amount of memory, number of accelerators). The node resources
are managed by the scheduler up to the maximum available on each node. It should be
noted that while efforts are made to isolate jobs running on the same node, there
are still many shared components in the system. Therefore a job's performance can
be affected by the other job(s) running on the node at the same time. If you are
doing benchmarking, you will want to request the entire node even if your job will
only make use of part of the node.
Node sharing is accessed by requesting less than the full number of GPUs, cores
and/or memory of a node; a job may share on any one of these resources or on all three.
To specify all cores, request as many cores as the node has (12 on the current GPU nodes).
By default, each job gets 2 GB of memory per core requested (the lowest common denominator
among our cluster nodes), so to request a different amount of memory you must use the
--mem flag. To request all of the memory of the node, use --mem=0; because this leaves
no memory for other jobs, it also gives the job exclusive use of the node.
Below is a list of useful job modifiers:
|#SBATCH --gres=gpu:k80:1||request one K80 GPU|
|#SBATCH --mem=4G||request 4 GB of RAM|
|#SBATCH --mem=0||request all memory of the node; this also keeps other jobs off the node|
|#SBATCH --ntasks=1||request 1 core|
An example script that would request two Ember nodes with 2xM2090 GPUs, including all cores and all memory, running one GPU per MPI task, would look like this:
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
To request all 8 K80 GPUs on a kingspeak node, again using one GPU per MPI task, we would do:
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
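The scheduler directives of this example did not survive in the text above, so the following is only a sketch of what such a request might look like, built from the partition, account and gres names given earlier. The file name k80_full_node.slurm, the one-hour time limit and the choice of eight tasks (one per GPU) are assumptions made for illustration.

```shell
# Write a batch script that asks for all 8 K80 GPUs of one kingspeak node.
# File name and time limit are illustrative, not CHPC defaults.
cat > k80_full_node.slurm <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=0
#SBATCH --time=1:00:00
#SBATCH --partition=kingspeak-gpu
#SBATCH --account=kingspeak-gpu
#SBATCH --gres=gpu:k80:8

# ... prepare scratch directory, etc

# One MPI task per GPU: SLURM_NTASKS is 8 for this request.
mpirun -np $SLURM_NTASKS myprogram.exe
EOF
```

The generated file would then be submitted with sbatch k80_full_node.slurm.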
As an example, using the script below will get four GPUs, four CPU cores, and 8GB of memory. The remaining GPUs, CPUs, and memory will then be accessible for other jobs.
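The script for this shared-node case is also missing from the text, so here is a hedged sketch of a request with that shape. It assumes the TitanX nodes on kingspeak (the text does not say which GPU type was used) and relies on the default of 2 GB of memory per core, so four cores yield 8 GB without an explicit --mem line. The file name and time limit are again illustrative.

```shell
# Write a batch script that takes half of a TitanX node and leaves the
# rest for other jobs. GPU type, file name and time limit are assumptions.
cat > titanx_shared.slurm <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=1:00:00
#SBATCH --partition=kingspeak-gpu
#SBATCH --account=kingspeak-gpu
#SBATCH --gres=gpu:titanx:4

# Memory is left at the default: 4 cores x 2 GB per core = 8 GB.
mpirun -np $SLURM_NTASKS myprogram.exe
EOF
```

Adding a line such as #SBATCH --mem=8G would state the same memory request explicitly.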
To run an interactive job, do not use the usual
srun command, as it does not work correctly with gres. Instead, use the
salloc command, e.g.
salloc -n 1 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:1
This allocates the resources to the job but keeps the prompt on the interactive node. You can then use srun or mpirun commands to launch the calculation on the allocated compute node resources.
All GPU nodes have NVidia CUDA and the PGI CUDA Fortran and OpenACC compilers installed.
The default CUDA version on the nodes is 6.0; we recommend using a newer version, which
at the time of writing is 7.5. It is accessible at
/usr/local/cuda-7.5, or by simply loading the CUDA module,
module load cuda/7.5. The PGI compilers come with their own, quite recent, CUDA and can be set up
by loading the PGI module,
module load pgi.
We recommend getting an interactive session on a compute node in order to compile CUDA code but, in a pinch, any interactive node should work, as CUDA is installed on the interactive nodes as well. Since the PGI compilers come with their own CUDA, compiling should work anywhere you can load the PGI module.
To compile CUDA code so that it runs on all three GPU types that we have, use the following compiler flags:
-gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_37,code=sm_37 -gencode=arch=compute_52,code=sm_52. If you use a newer CUDA, also consider using newer GCC compilers. We recommend gcc 4.9.2
(module load gcc/4.9.2). Note that CUDA 7.5 does not support GCC compilers 5.0 and higher. The CUDA programming
guide is available at http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
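Typing three -gencode entries by hand is error prone, so the flag list can be generated from the compute capabilities instead. This is a small sketch; the source file name myprogram.cu is hypothetical, and the command is printed rather than executed so that it can be tried on a machine without nvcc.

```shell
#!/bin/bash
# Compute capabilities targeted by the flags above:
# 2.0 (M2090), 3.7 (K80) and 5.2 (TitanX).
capabilities="20 37 52"

gencode_flags=""
for cc in $capabilities; do
    # One -gencode entry per architecture; nvcc bundles them in one binary.
    gencode_flags="$gencode_flags -gencode=arch=compute_${cc},code=sm_${cc}"
done
gencode_flags="${gencode_flags# }"   # drop the leading space

# Print the compile line (myprogram.cu is a placeholder source file).
echo "nvcc $gencode_flags -o myprogram.exe myprogram.cu"
```

Dropping a capability from the list, for example when a code will only ever run on kingspeak, shortens compile time and binary size.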
The PGI compilers specify the GPU architecture with the
-ta=tesla flag. If no further option is specified, the flag will generate code for all available
compute capabilities (at the time of writing cc20, cc30, cc35 and cc50). To be specific
for each GPU,
-ta=tesla:cc20 can be used for the M2090,
-ta=tesla:cc35 for the K80 and
-ta=tesla:cc50 for the TitanX. To enable OpenACC, use the
-acc flag. More information on OpenACC can be obtained at http://www.openacc.org.
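The per-GPU choice of PGI target can be captured in a small lookup. The sketch below assumes a hypothetical helper pgi_target_flag, uses PGI's -ta=tesla accelerator target option, and takes short model names (m2090, k80, titanx) purely as lookup keys; myprogram.c is a placeholder source file.

```shell
#!/bin/bash
# pgi_target_flag is a hypothetical helper that maps a GPU model to the
# PGI accelerator target for its compute capability.
pgi_target_flag() {
    case "$1" in
        m2090)  echo "-ta=tesla:cc20" ;;
        k80)    echo "-ta=tesla:cc35" ;;
        titanx) echo "-ta=tesla:cc50" ;;
        *)      echo "-ta=tesla" ;;   # no suffix: all available capabilities
    esac
}

# Print an OpenACC compile line for the K80 nodes instead of running it.
echo "pgcc -acc $(pgi_target_flag k80) -o myprogram.exe myprogram.c"
```

Falling back to plain -ta=tesla for unknown names keeps the resulting binary usable on every node type, at the cost of a larger executable.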
Good tutorials on GPU programming are available at the CUDA Education and Training site from NVidia.
When running GPU code, it is worth checking the resources that the program is
using, to ensure that the GPU is well utilized. For that, one can run the
nvidia-smi command and watch the memory and GPU utilization.
nvidia-smi is also useful for querying and setting various features of the GPU; see
nvidia-smi --help for all the options that the command takes. For example,
nvidia-smi -L lists the GPU card properties. On the TitanX nodes:
Titan Card: GPU 0: GeForce GTX TITAN X (UUID: GPU-cd731d6a-ee18-f902-17ff-1477cc59fc15)
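Output in this format is easy to process in a job script. The following sketch assumes a hypothetical helper count_gpus that counts the "GPU n:" lines printed by nvidia-smi -L, and then samples utilization and memory use once through nvidia-smi's CSV query interface. It falls back to a message when nvidia-smi is not installed.

```shell
#!/bin/bash
# count_gpus is a hypothetical helper: it reads `nvidia-smi -L` style
# output on standard input and prints the number of GPU lines.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true   # grep exits 1 on zero matches
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Number of GPUs that nvidia-smi lists on this node.
    nvidia-smi -L | count_gpus
    # One sample of GPU utilization and memory use per GPU, as CSV.
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
else
    echo "nvidia-smi not found; run this on a GPU node"
fi
```

Comparing the count with the number of GPUs requested through gres is a quick sanity check at the start of a job.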
NVidia's CUDA distribution includes a terminal debugger named
cuda-gdb. Its operation is similar to the GNU
gdb debugger. For details, see the cuda-gdb documentation.
For out of bounds and misaligned memory access errors, there is the cuda-memcheck tool. For details, see the cuda-memcheck documentation.
The Totalview debugger that we license also supports CUDA and OpenACC debugging. Due to its user friendly graphical interface we recommend using Totalview for GPU debugging. For information on how to use Totalview, see our debugging page.
Profiling can be very useful in finding GPU code performance problems, for example
inefficient GPU utilization, use of shared memory, etc. NVidia CUDA provides both
a command line profiler (nvprof) and a visual profiler (nvvp).
More information is in the CUDA profilers documentation.
We have the following GPU codes installed:
|Code name||Module name||Prerequisite modules||Sample batch script(s) location||Other notes|
|/uufs/chpc.utah.edu/sys/installdir/vasp/examples||Per group license, let us know if you need access|
|adapt CPU script|
|LAMMPS||lammps/10Aug15||intel/2016.0.109 impi/188.8.131.52||adapt CPU script|
If there is any other GPU code that you would like us to install, please let us know.
Some commercial programs that we have installed, such as Matlab, also have GPU support. Either try them or contact us.