General Cluster Information

Filesystems

NFS home directory

Your home directory, an NFS-mounted file system, is one choice for I/O, but it generally has the slowest I/O performance of the available storage spaces. It is visible to all nodes on the clusters through an auto-mounting system. The one exception is apexarch, whose protected environment has a separate, isolated home directory space.

 

NFS Scratch 

All interactive nodes have access to several NFS-mounted scratch file systems, including /scratch/general/lustre (700 TB), /scratch/kingspeak/serial (175 TB), and /scratch/lonepeak/serial (33 TB). See the individual cluster guides for information on which of these scratch file systems are mounted on the compute nodes of each cluster. Again, apexarch is the exception, having its own scratch file system.

 

Local disk (/scratch/local)

The local scratch space is storage unique to each individual node, accessible on that node through /scratch/local. It is cleaned aggressively, with files older than one week being scrubbed. This space is among the fastest, but it is not the largest, and the amount available varies between clusters. Users must remove all their files from /scratch/local at the end of their calculation.

It is a good idea to move data from one storage system to another as part of a job. At the start of the batch job, copy the data files from the home directory to the scratch space; at the end of the run, copy the output back to the home directory.

Keep in mind that ALL users must remove their own excess files, preferably within the batch job once the computation has finished. Files left in any scratch space get in the way of other users who are trying to run their own jobs. Delete any files stored outside your home directory as soon as they are no longer in active use.
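
The copy-in, run, copy-out, clean-up pattern described above can be sketched as a small shell function. This is only an illustration: the function name stage_and_run and all paths passed to it are hypothetical, and a real job would substitute its own input directory, scratch location, and program.

```shell
#!/bin/bash
# Sketch: stage input to scratch, run there, copy results back, then clean up.
# stage_and_run and its arguments are hypothetical names used for illustration.

stage_and_run() {
    local input_dir="$1"     # directory holding the input files (e.g. under $HOME)
    local output_dir="$2"    # directory that should receive the results
    local scratch_base="$3"  # scratch location, e.g. /scratch/local
    shift 3                  # the remaining arguments are the command to run

    # Create a private working directory inside the scratch space.
    local work_dir
    work_dir="$(mktemp -d "$scratch_base/job.XXXXXX")" || return 1

    # Copy the inputs to scratch; give up and clean up if the copy fails.
    cp -r "$input_dir"/. "$work_dir"/ || { rm -rf "$work_dir"; return 1; }

    # Run the command from inside the scratch directory and remember its status.
    local status=0
    ( cd "$work_dir" && "$@" ) || status=$?

    # Copy everything back; remove the scratch directory only if that succeeded.
    if mkdir -p "$output_dir" && cp -r "$work_dir"/. "$output_dir"/; then
        rm -rf "$work_dir"
    else
        echo "copy-back failed; files left in $work_dir" >&2
        return 1
    fi
    return "$status"
}
```

Inside a batch script this might be called as stage_and_run "$HOME/myinput" "$HOME/myoutput" /scratch/local ./my_program, where the directory and program names are placeholders.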

User Environment

CHPC currently supports two shells: bash and tcsh. In addition, we use the Lmod module system to control the user environment.

 

CHPC provides two login files for each shell choice. The first, .tcshrc or .bashrc, sets up a basic environment; the second, .custom.csh or .custom.sh, is where users can customize their environment. Details can be found on the modules documentation page. All new accounts are created using the modules framework, and all accounts, even those with the older style login files, have been enabled to make use of modules.
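
As an illustration of the kind of customization that could go in .custom.sh, the sketch below loads a list of modules at login. The module names gcc and openmpi and the function name load_custom_modules are placeholders, not a statement of what is installed; module load itself is a standard Lmod command.

```shell
# Hypothetical fragment for ~/.custom.sh (bash users).
# The module names below are placeholders; substitute modules that exist on the cluster.
CUSTOM_MODULES="gcc openmpi"

load_custom_modules() {
    # Do nothing on systems where Lmod's `module` command is not defined.
    command -v module >/dev/null 2>&1 || return 0
    for m in $CUSTOM_MODULES; do
        module load "$m"
    done
}

load_custom_modules
```

Keeping the list in one variable makes it easy to see what is loaded at login and to change it later.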

Applications

CHPC maintains a number of user applications as well as tools needed to install applications. For some of these we have software documentation pages, which can be found in the software documentation section. We also have a searchable database of the installed packages. Note that several of the packages we have installed (e.g., abaqus, ansys, charmm, comsol, star-CCM+) are licensed by individual groups and are therefore not accessible outside of those groups.

Historically, applications that are not cluster specific have been installed in /uufs/chpc.utah.edu/sys/pkg, whereas cluster-specific applications (most typically because they use a cluster-specific installation of MPI or another support package) are located in /uufs/$UUFSCELL/sys/pkg, where $UUFSCELL is kingspeak.peaks, ash.peaks, or ember.arches.

Moving forward, we are working to consolidate applications in /uufs/chpc.utah.edu/sys/installdir.

Batch System 

The batch implementation on the CHPC clusters is Slurm.

Any process that requires more than 15 minutes of run time must be submitted through the batch system. The command to submit a job to the batch system is sbatch. An interactive job can be started with srun, for example:

  • srun --pty -t 1:00:00 -n 4 -N 2 /bin/tcsh -l  

Other options for running jobs are given in the Slurm documentation.

Each cluster has a hard walltime limit that applies to jobs on general resources as well as to jobs run as guest on owner resources. On most clusters this limit is 72 hours, but please see the individual cluster guides for the specifics of a given cluster. If you find you need longer than this, please contact CHPC.

Users without an allocation, or those who have used all of their allocation, can still run jobs in "freecycle" mode. However, "freecycle" jobs are preemptable, i.e., they are subject to termination if a job with an allocation needs the resources.

We suggest that you start with some small runs to gauge how long the production runs will take. As a rule of thumb, request a wall time that is 10-15% larger than the actual run time. If you specify a wall time that is too short, you risk having your job killed before it finishes. If you specify a wall time that is too long, you may face a longer wait in the queue.
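
The 10-15% rule of thumb can be turned into a quick calculation. The helper below is a minimal sketch, not a CHPC tool: padded_walltime is a hypothetical function name, it takes a measured run time in seconds and an optional padding percentage (defaulting to 15), and it prints the padded time in the hours:minutes:seconds form accepted by the -t option.

```shell
#!/bin/bash
# Sketch: pad a measured run time by a percentage and print it as HH:MM:SS.
# padded_walltime is a hypothetical helper name used for illustration.

padded_walltime() {
    local seconds="$1"           # measured run time in seconds
    local pad_percent="${2:-15}" # padding in percent; defaults to 15

    # Integer arithmetic, rounded up to the next whole second.
    local total=$(( (seconds * (100 + pad_percent) + 99) / 100 ))

    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))
}

# A test run that took 10 hours (36000 seconds), padded by 15%:
padded_walltime 36000 15   # prints 11:30:00
```

The result could then be passed at submission time, for example sbatch -t "$(padded_walltime 36000)" myjob.sh, where myjob.sh is a placeholder script name. The padded value must still fit within the walltime limit of the cluster.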

Details of a typical batch script are given in the individual cluster user guides that follow.

Nodes with different core counts are present on most clusters. A job can request nodes with a particular core count by using Slurm constraints.
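
As a rough sketch of how a constraint might appear in a batch script, the fragment below uses the sbatch --constraint option. The feature name c16 is an assumption made for illustration; the feature names actually defined on each cluster are site specific and should be taken from the individual cluster guides.

```shell
#!/bin/bash
# Hypothetical batch script fragment. The feature name "c16" is a placeholder;
# check the cluster guide for the feature names that are actually defined.
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --constraint=c16

# SLURM_CPUS_ON_NODE is set by Slurm inside a job; fall back to "unknown" elsewhere.
cores="${SLURM_CPUS_ON_NODE:-unknown}"
echo "cores on this node: $cores"
```

Printing the core count at the start of a job is a cheap way to confirm that the job landed on the type of node that was intended. The features defined on a cluster can usually be listed with sinfo -o "%N %c %f", which shows node names, CPUs per node, and features.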

Individual Cluster User Guides

Last Updated: 1/4/18