R (Programming Language)
R is a programming language and software environment for statistical computing and graphics.
For use on the notchpeak, kingspeak, lonepeak, and ash clusters, as well as on Linux desktops, we have installed R from source, along with a number of external R libraries. If there is another library you want to use, please try to install it in your own environment first. If you run into trouble, feel free to ask us to perform the installation.
The currently supported version is 4.1.3 (Rocky 8). It was built with the GNU compilers and linked against Intel's threaded Math Kernel Library (MKL). The presence of MKL may result in a considerable speed-up compared to R builds that rely solely on non-optimized mathematical libraries. As a rule of thumb, programs that perform a lot of floating-point numerical calculations benefit from multi-threading the most.
By default we have turned off multi-threading by setting the environment variable OMP_NUM_THREADS to 1, i.e.
setenv OMP_NUM_THREADS 1 # Tcsh/Csh Shell
export OMP_NUM_THREADS=1 # Bash Shell
to facilitate easier use of parallel independent calculations. If you want to run R in a multi-threaded fashion (e.g. on a compute node), we strongly recommend not using more threads than there are physical cores on the node.
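To gauge whether MKL threading helps your workload, you can time a thread-friendly operation, such as a dense matrix multiplication, under different thread counts. A minimal sketch (the file name benchmark.R and the matrix size are arbitrary choices):
# benchmark.R -- dense matrix multiply, which MKL runs multi-threaded
n <- 4000
a <- matrix(rnorm(n * n), n, n)
b <- matrix(rnorm(n * n), n, n)
print(system.time(a %*% b))
Running it with OMP_NUM_THREADS=1 and again with a higher value (e.g. 8), then comparing the elapsed times, shows how much the linear algebra benefits from the extra threads.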
How to load R in your environment
You can obtain R in your environment by loading the R module, i.e.:
module load R
The command R --version returns the version of R you have loaded:
R --version
R version 4.1.3 (2022-03-10) -- "One Push-Up"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
The command which R returns the location where the R executable resides:
which R
/uufs/chpc.utah.edu/sys/installdir/r8/R/4.1.3/bin/R
We also maintain a number of older versions of R. You can list these versions with
the command "module spider R", and load a specific version with a command such as
"module load R/4.1.1".
Note: if you use a ~/.Rprofile file, it should be independent of the version of R, i.e. library paths should NEVER be set within this file.
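For example, a version-independent ~/.Rprofile could contain only general options (a minimal sketch; the repository choice is arbitrary):
# ~/.Rprofile -- version-independent settings only; do not call .libPaths() here
options(repos = c(CRAN = "https://cran.r-project.org"))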
Running an R batch script on the command line
There are several ways to launch an R script on the command line:
- Rscript yourfile.R
- R CMD BATCH yourfile.R
- R --no-save < yourfile.R
- ./yourfile2.R
The first approach (i.e. using the Rscript command) writes the output to stdout. The second approach (i.e. using the R CMD BATCH command) redirects its output into a file (in this case, yourfile.Rout). The third approach redirects the file yourfile.R to the standard input of the R executable; note that with this approach you must specify one of the following flags: --save, --no-save or --vanilla.
The R code can also be launched as a Linux script (fourth approach). In order to run it as a Linux script:
- Insert an extra line (#!/usr/bin/env Rscript) at the top of the file yourfile.R; the result is a new file, yourfile2.R.
- Alter the permissions of the R script (yourfile2.R) to make it executable, as shown below.
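For example (a minimal sketch; yourfile2.R is simply yourfile.R with the shebang line added at the top):
chmod +x yourfile2.R
./yourfile2.R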
The files seaice.R and seaice2.R can be used/seen as examples for yourfile.R and yourfile2.R, respectively. Note that the scripts seaice.R and seaice2.R require the data file sea-ice.txt.
Sometimes we need to feed arguments to the R script. This is especially useful when running parallel independent calculations: different arguments can be used to differentiate between the calculations, e.g. by feeding in different initial parameters. To read the arguments, one can use the commandArgs() function. E.g., if we have a script called myScript.R:
## myScript.R
args <- commandArgs(trailingOnly = TRUE)
rnorm(n=as.numeric(args[1]), mean=as.numeric(args[2]))
then we can call it with arguments, e.g.:
Rscript myScript.R 5 100
[1] 98.46435 100.04626 99.44937 98.52910 100.78853
Running an R batch script on the cluster (using SLURM)
In the previous section we described how to launch an R script on the command line. In order to run an R batch job on the compute nodes, we just need to create a SLURM script/wrapper "around" the R command line.
Below you will find the content of the corresponding Slurm batch script runR.sl:
#!/bin/bash
#SBATCH --time=00:10:00 # Walltime
#SBATCH --nodes=1 # Use 1 Node (Unless code is multi-node parallelized)
#SBATCH --ntasks=1 # We only run one R instance = 1 task
#SBATCH --cpus-per-task=32 # number of threads we want to run on
#SBATCH --account=owner-guest
#SBATCH --partition=notchpeak-guest
#SBATCH -o slurm-%j.out-%N
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@utah.edu # Your email address; note that #SBATCH lines do not expand shell variables, so put the address in literally
#SBATCH --job-name=seaIce
export FILENAME=seaice.R
export SCR_DIR=/scratch/general/nfs1/$USER/$SLURM_JOBID
export WORK_DIR=$HOME/TestBench/R/SeaIce
# Load R (version 4.1.3)
module load R
# Take advantage of all the threads (linear algebra)
# $SLURM_CPUS_ON_NODE returns actual number of cores on node
# rather than $SLURM_JOB_CPUS_PER_NODE, which returns what --cpus-per-task asks for
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
# Create scratch & copy everything over to scratch
mkdir -p $SCR_DIR
cd $SCR_DIR
cp -p $WORK_DIR/* .
# Run the R script in batch, redirecting the job output to a file
Rscript $FILENAME > $SLURM_JOBID.out
# Copy results over + clean up
cd $WORK_DIR
cp -pR $SCR_DIR/* .
rm -rf $SCR_DIR
echo "End of program at `date`"
We run the script under SLURM with: sbatch runR.sl
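After submission, the job can be monitored with the standard SLURM commands, e.g.:
squeue -u $USER     # list your queued and running jobs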
Running many independent R batch calculations as one job
We mentioned above that our R builds use the multi-threaded MKL library. Thread-based parallelization is useful for vectorized R programs, but not all workflows vectorize. Therefore, if one has many independent calculations to run, it is more efficient to run single-threaded R and use SLURM's capability of running independent calculations within a job in parallel. The SLURM script below (myRArr.sl) lets you run an independent R calculation on each core of a node. Note that you also need one or several scripts which perform the actual calculation; three files are involved: the SLURM script (myRArr.sl), the R wrapper script (rwrapper.sh; a sketch of such a wrapper follows the script below) and the actual R script (mcex.r).
#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --mail-user=$USER@utah.edu # Your email address; replace $USER by hand, as #SBATCH lines do not expand shell variables
#SBATCH -o out.%j
#SBATCH -e err.%j
#SBATCH --account=owner-guest
#SBATCH --partition=lonepeak-guest
#SBATCH --job-name=test-RArr
# Job Parameters
export EXE=./rwrapper.sh
export WORK_DIR=~/TestBench/Slurm/RMulti
export SCRATCH_DIR=/scratch/local/$USER/$SLURM_JOBID
export SCRIPT_DIR=$WORK_DIR/RFiles
export OUT_DIR=$WORK_DIR/`echo $UUFSCELL | cut -b1-4`/$SLURM_JOBID
# Load R
module load R
# Run an array of serial jobs
export OMP_NUM_THREADS=1
echo " Calculation started at:`date`"
echo " #$SLURM_TASKS_PER_NODE cores detected on `hostname`"
# Create the my.config.$SLURM_JOBID file on the fly
for (( i=0; i < $SLURM_TASKS_PER_NODE ; i++ )); \
do echo $i $EXE $i $SCRATCH_DIR/$i $SCRIPT_DIR $OUT_DIR/$i ; \
done > my.config.$UUFSCELL.$SLURM_JOBID
# Running a task on each core
cd $WORK_DIR
srun --multi-prog my.config.$UUFSCELL.$SLURM_JOBID
# Clean-up the root scratch dir
rm -rf $SCRATCH_DIR
echo " Calculation ended at:`date`"
Parallel R
The R environment itself is not parallelized, which is important to keep in mind when running on CHPC cluster nodes, which have at least 8 CPU cores. A typical unvectorized R program will run on a single core only.
The R installation detailed above can run certain workloads (mostly linear algebra) using multiple threads through the Intel Math Kernel Library (MKL). We recommend benchmarking your first run with OMP_NUM_THREADS=1, and then with a higher core count (e.g. OMP_NUM_THREADS=8 on an 8-core node), to see whether it achieves any speed-up.
If the multi-threading does not provide much speedup, or one needs to run on more than one node, some kind of parallelization of the R code is necessary. There are numerous R packages that implement various levels of parallelism; they are summarized at this CRAN page.
If the computational tasks are independent of each other, one can relatively simply use the foreach package, or parallelized versions of the *apply functions, which use the parallel package's multiple R workers. It is most common to set the number of R workers equal to the number of CPU cores (SLURM job tasks), and to set OMP_NUM_THREADS=1 to turn off the multi-threading. For running on a single compute node, here are the SLURM script example and R code example.
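As an illustration, here is a minimal single-node sketch (assumptions: the job requested a number of tasks equal to the desired worker count, and the calculation is a trivial stand-in for real work):
# load the parallel libraries
library(parallel)
library(foreach)
library(doParallel)
# one R worker per SLURM task; fall back to the local core count
nt <- Sys.getenv("SLURM_NTASKS")
nworkers <- if (nzchar(nt)) as.numeric(nt) else detectCores()
cl <- makeCluster(nworkers)
registerDoParallel(cl)
# run the independent calculations in parallel, one result per iteration
res <- foreach(i = 1:nworkers, .combine = c) %dopar% {
  mean(rnorm(1e6, mean = i))
}
print(res)
# shut down the workers
stopCluster(cl)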
To run on multiple cluster compute nodes, one also has to tell R what hosts to run on. This requires creating a list of hosts in the SLURM script, srun -n $SLURM_NTASKS hostname > hostlist.txt (as in this SLURM script), and, inside the R program, feeding this list to the makeCluster() function, as in the following example, which works on any number of cluster nodes:
# load the parallel libraries
library(parallel)
library(foreach)
library(doParallel)
# import hostlist
hostlist <- paste(unlist(read.delim(file="hostlist.txt", header=F, sep =" ")))
# launch the parallel R workers
cl <- makeCluster(hostlist)
registerDoParallel(cl)
# Fix bug in R < 4.0, give the workers path to optional R packages
# clusterEvalQ(cl,.libPaths("/uufs/chpc.utah.edu/sys/installdir/RLibs/3.5.2i"))
# run the parallel calculation
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
system.time({
  r <- foreach(icount(trials), .combine=rbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})
# shut down the parallel workers when done
stopCluster(cl)
Finally, some R libraries have their own internal parallelization. The quickest way to find out whether the library/function you are using can be run in parallel is to do a web search. For example, to find out if the library momentuHMM has any parallel options, we can search for "R momentuHMM parallel". The first hit is the library's manual; searching for the keyword "parallel" in the manual, we find a few functions that allow parallel processing.
RStudio
RStudio is an Integrated Development Environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, debugging, history and workspace management. For more information see the RStudio webpage.
RStudio is installed on Linux systems and can be invoked (after loading R) as follows:
module load RStudio
rstudio
Installing additional R packages
R library locations
R packages are installed in libraries. Before addressing the installation of R packages as such, we will first detail the hierarchical structure of the R libraries that are installed on the CHPC Linux systems.
The command .libPaths() returns the names of the libraries (directories) which are accessible to the R executable that has been loaded in your environment.
In the recently installed R distributions, there are three library levels:
- Core/Default Library
- Site Library
- User Libraries
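For example, with all three levels present, .libPaths() could return something like the following (output illustrative; the actual paths depend on the R version and your uNID):
> .libPaths()
[1] "/uufs/chpc.utah.edu/common/home/<uNID>/R/x86_64-pc-linux-gnu-library/4.1"
[2] "/uufs/chpc.utah.edu/sys/installdir/RLibs/4.1.1"
[3] "/uufs/chpc.utah.edu/sys/installdir/R/4.1.1/lib64/R/library"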
The Core & Default R Packages were installed in a sub directory of the main installation directory when a new version
of R was compiled. The location of the library can be retrieved by the .Library
command. Among the packages in this library we have "base", "datasets", "utils",
etc.
which R
/uufs/chpc.utah.edu/sys/installdir/r8/R/4.1.3/bin/R
R> .Library
[1] "/uufs/chpc.utah.edu/sys/installdir/r8/R/4.1.3/lib64/R/library"
The Site Library contains all the external packages that have been installed by the CHPC staff for a well-defined version of R, i.e. each version of R has its own Site Library (note that each R version may have been compiled with a different version of a compiler, different compiler flags, or for a different version of the OS). The location of the Site Library can be found within R using either .Library.site or Sys.getenv("R_LIBS_SITE"), or by invoking echo $R_LIBS_SITE in the shell.
echo $R_LIBS_SITE
/uufs/chpc.utah.edu/sys/installdir/RLibs/4.1.1/
R
>.Library.site
[1] "/uufs/chpc.utah.edu/sys/installdir/RLibs/4.1.1/"
>Sys.getenv("R_LIBS_SITE")
[1] "/uufs/chpc.utah.edu/sys/installdir/RLibs/4.1.1/"
The User Library is a subdirectory in the user's space (e.g. $HOME) where the user can install their own packages. Note that each major version of R for which you want to install your own packages should have its own user library directory. Modern R versions automatically set and create a user library location in the user's home directory.
Setting up your own library
Below we describe how to check whether the R User Library is active. Please note that a User Library is only compatible within a minor version of R; you will need to make a new User Library for each newer minor version (e.g. separate libraries for versions 3.5 and 3.6).
module load R
which R
/uufs/chpc.utah.edu/sys/installdir/R/4.1.1/bin/R
R
> Sys.getenv("R_LIBS_USER")
[1] "~/R/x86_64-pc-linux-gnu-library/4.1"
This shows that R/4.1.1 has a User Library located at ~/R/x86_64-pc-linux-gnu-library/4.1. If you see this, your User Library is set up. If you are using an older version of R that does not automatically set the User Library, proceed with the instructions below.
The set-up of the User Library goes through the following steps:
NOTE: Do these steps only if the User Library above is not defined. You don't need to do these steps if you use R >= 4.0.
- Load an R module:
module load R
- If you have not yet created your own "modules" directory, create one (e.g. ~/MyModules):
mkdir -p ~/MyModules
- Create an R subdirectory in ~/MyModules; it will contain all your own future R modules:
mkdir ~/MyModules/R
- Copy the R module from the CHPC modules directory into your own R module space. The R_VERSION environment variable is defined inside of the module file to denote the R version:
cp /uufs/chpc.utah.edu/sys/modulefiles/CHPC-18/Core/R/$R_VERSION.lua ~/MyModules/R/$R_VERSION.$USER.lua
- Create a new directory in your home directory where the new R packages, to be used with the CHPC executable of R, will be installed:
mkdir -p ~/RLibs/$R_VERSION
- Edit the newly created module, e.g. ~/MyModules/R/$R_VERSION.$USER.lua, to add the following line:
setenv("R_LIBS_USER",pathJoin(os.getenv("HOME"),"RLibs",myModuleVersion()))
- Unload the CHPC R module:
module unload R
- The new module can only be loaded once the newly created module directory is visible to LMOD, i.e. once it is inserted in the MODULEPATH environment variable. You can add it to the LMOD MODULEPATH as follows:
module use ~/MyModules
You can insert the module use ~/MyModules statement in your ~/.custom.sh or ~/.custom.csh file so that the new module is always visible.
We have now set up our own User Library installation directory where we can install packages that can be compiled with the same compiler that was used to build CHPC's version of R (i.e. the Intel compiler).
Installing packages in your environment
After setting up the module for our own version of R and the User Library, we can install packages in our own environment. To start, we first need to load our own version of R:
module load R.$USER
echo $R_LIBS_USER
The content of $R_LIBS_USER should refer to your newly created directory (see the steps in the previous section). Within your new R environment you can also use .libPaths() to see the paths to all your libraries.
We can now install libraries in two different ways:
- High-level, using install.packages() (invoked within R)
- Low-level, using R CMD INSTALL (invoked from a Linux shell)
High-Level Installation
The high-level installation is the easiest way to install packages. It is the preferred way when the package to be installed does not depend on C, C++, or Fortran libraries installed in non-traditional directories, and particularly when the R code is available via CRAN, the Comprehensive R Archive Network. The R function to be invoked is install.packages():
R
>library(maRketSim)
Error in library(maRketSim) : there is no package called ‘maRketSim’
>install.packages(c("maRketSim"),
lib=c(paste("/uufs/chpc.utah.edu/common/home/",Sys.getenv("USER"),"/RLibs/",Sys.getenv("R_VERSION"),sep="")),
repos=c("http://cran.us.r-project.org"),verbose=TRUE)
>library(maRketSim)
The library($PACKAGE) function tries to load a package $PACKAGE; if R can't find it, an error is printed to stdout. The install.packages() function has several flags. The lib flag needs to be followed by the directory where you want to install the package (this should be $R_LIBS_USER). From the installation output we notice that the install.packages() function calls the low-level installation command (R CMD INSTALL), which is discussed in the next section:
'/uufs/chpc.utah.edu/sys/installdir/R/4.1.1/lib64/R/bin/R CMD INSTALL -l \
'/uufs/chpc.utah.edu/common/home/$USER/RLibs/$R_VERSION' \
/tmp/RtmpH90XAY/downloaded_packages/maRketSim_0.9.2.tar.gz'
An alternative install function is used by the Bioconductor software repository. Bioconductor is the primary repository for R code for the life sciences, and uses the BiocManager::install() function:
BiocManager::install(pkgs)
Where "pkgs" is a character vector with one or more names of packages to be installed. This command, for example will install the Bioconductor DESeq2 package:
BiocManager::install("DESeq2")
BiocManager::install has a number of optional arguments. Run the command "?BiocManager::install" within R to see the complete documentation on the function.
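Since BiocManager::install() passes additional arguments through to install.packages(), you can, for example, direct the installation to your User Library (a sketch, assuming R_LIBS_USER is set as described above):
BiocManager::install("DESeq2", lib = Sys.getenv("R_LIBS_USER"))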
Low-Level Installation
The low-level installation is to be used when you need to install R packages that depend on external libraries installed in non-default locations. As an example, consider the package RNetCDF (already installed within CHPC's R). The installation of this package depends on the external libraries netcdf-c and udunits2. The commands to install the RNetCDF package in a User Library are (assuming the bash shell):
module load intel netcdf-c udunits
export PATH=$NETCDFC/bin:$PATH    # in tcsh: setenv PATH $NETCDFC/bin:$PATH
export PATH=$UDUNITS/bin:$PATH    # in tcsh: setenv PATH $UDUNITS/bin:$PATH
wget https://cran.r-project.org/src/contrib/RNetCDF_1.9-1.tar.gz
R CMD INSTALL --library=/uufs/chpc.utah.edu/common/home/$USER/RLibs/$R_VERSION \
  --configure-args="CPPFLAGS='-I$UDUNITS/include' \
    LDFLAGS='-Wl,-rpath=$NETCDFC/lib \
      -L$NETCDFC/lib -lnetcdf \
      -Wl,-rpath=$UDUNITS/lib \
      -L$UDUNITS/lib -ludunits2' \
    --with-nc-config=$NETCDFC/bin/nc-config" RNetCDF_1.9-1.tar.gz
R CMD INSTALL calls ./configure under the hood. The best way to tackle such an installation is to download the tar.gz file first, find the appropriate installation flags (different for each package!) and then feed those flags to the R CMD INSTALL command.
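For example, to discover which configure flags a package accepts before installing it, unpack the archive and inspect it (a minimal sketch; not every package ships a configure script):
tar xzf RNetCDF_1.9-1.tar.gz
cd RNetCDF
./configure --help    # lists the accepted configure arguments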
If you have trouble or questions, please send an email to helpdesk@chpc.utah.edu.
Potential Problems
Package installation with CHPC built R
The Intel compiler that we use to build R conflicts with gcc headers when using complex data types, resulting in an error similar to the one below when installing some R libraries:
/uufs/chpc.utah.edu/sys/installdir/intel/compilers_and_libraries_2018.1.163/linux/compiler/include/complex(310): error #308: member "std::complex<double>::_M_value" (declared at line 1337 of "/usr/include/c++/4.8.5/complex") is inaccessible
return __x / __y._M_value;
The workaround for this is to disable this diagnostic error by creating (or modifying) the file ~/.R/Makevars as follows:
CFLAGS += -wd308
CXXFLAGS += -wd308
CPPFLAGS += -wd308
PKG_CFLAGS += -wd308
PKG_CXXFLAGS += -wd308
PKG_CPPFLAGS += -wd308
Package installation in Open OnDemand RStudio Server App
The RStudio Server does not run X, the Linux graphical environment. Some R libraries require X to install, for example the library 'rpanel'. The symptom of this issue is an error message like:
Warning message:
In fun(libname, pkgname) : couldn't connect to display ":0"
Error in structure(.External(.C_dotTcl, ...), class = "tclObj") :
[tcl] couldn't connect to display ":0".
These packages have to be installed in a FastX terminal session, as follows:
- open a FastX terminal to one of our clusters
- load the OnDemand R module, e.g.: ml R/3.6.2-ood-geospatial
- start R and do the installation:
R
> install.packages('rpanel')
(answer 'yes' to use a personal library)