Using R on Savio

Loading R and Packages | Installing Packages | Parallel Processing | Parallel: Linear Algebra | Parallel: Single Node | Parallel: Multiple Nodes | Parallel: GPUs | Running R Interactively

This document describes how to use R, a language and environment for statistical computing and graphics, on the Savio high-performance computing cluster at the University of California, Berkeley.

Loading R and accessing R packages

To load R into your current software environment on Savio, at any shell prompt, enter:

module load r

Once you have loaded R, you can see which R packages are provided by the system by entering the following and looking for the section pertaining to R, which lists a variety of R packages:

module avail

To use one or more of the packages, load their relevant module(s) (e.g., ggplot2 in this case) before starting R:

module load ggplot2

Then in R, use library(ggplot2) as usual to load the package into R.

Installing additional R packages

You can also install additional R packages, such as those available on CRAN, that are not already available on the system. You'll need to install them into your home directory or your scratch directory.

First, enter module list to make sure that the Intel module is not loaded, as this can interfere with the R package installation process for packages that use C/C++/Fortran code. If it is loaded, simply do:

module unload intel

Then start R and use install.packages(...).

In the following example, we'll install the fields package for spatial statistics, which needs to compile some Fortran code as well as pull in some dependency packages. You can either set the directory in which to put the package(s) via the lib argument or follow the prompts provided by R to accept the default location (generally ~/R/x86_64-pc-linux-gnu-library/3.2). Here we'll use the default:

install.packages('fields')

Note that if you install packages somewhere other than the default location, e.g., via:

install.packages('fields', lib = '/scratch/users/myusername/R')

you will probably need to set the environment variable R_LIBS_USER to include the non-default location (e.g., setting it in your .bashrc file) so that R can find the packages.
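
For example, to use the non-default location shown above, you might add a line like the following to your ~/.bashrc file (the directory shown is just the illustrative path used above):

export R_LIBS_USER=/scratch/users/myusername/R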

Many R packages have dependencies on R packages already provided on the system, such as Rcpp, ggplot2, Rmpi, and dplyr. If you see that packages available on the system are being installed locally in your own directory when you try to install a package yourself, it's good practice to stop the installation and go back and load the modules for the available R packages before installing the package of interest. This avoids installing a second copy of the dependency.

In some cases an R package will require an external non-R package as a dependency. If it's available on the system, you may need to load the relevant Savio module via module load packagename. If it's not available on the system, you may be able to install the dependency yourself from its source code, or you can ask the Savio user consultants for assistance.
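
For example (the module and package names here are purely illustrative, not specific Savio recommendations), if an R package relied on an external library such as GSL and a corresponding module were available, installation might look like:

module load gsl   # hypothetical external dependency provided as a system module
module load r
R                 # then run install.packages('yourPackage') in R as usual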

Parallel processing in R on Savio

R provides several ways of parallelizing your computations. We describe them briefly here and outline their use below:

  1. Threaded linear algebra. R on Savio is already set up to use Intel's MKL package for linear algebra. MKL can automatically use multiple cores on a single machine, as described below.
  2. Multi-process parallelization on a single node. You can use functions provided in R packages such as foreach and parallel to run independent calculations across multiple cores on a single node.
  3. Multiple nodes. You can use functions provided in R packages such as foreach and the pbdR packages to run calculations across multiple nodes.

1. Threaded linear algebra

Here's how you submit a job to use threaded linear algebra. All you need to do is specify the number of threads you want via an environment variable; linear algebra operations in R will then use multiple cores automatically.

Example job script

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Specify one task:
#SBATCH --ntasks-per-node=1
#
# Number of processors for threading:
#SBATCH --cpus-per-task=20
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load r
R CMD BATCH --no-save job.R job.Rout

Note that here we make use of all the cores on the node (20 here, assuming use of the savio partition, whose nodes have 20 cores each) for the threaded linear algebra. In some cases using more cores can actually decrease performance, so it may be worth experimenting with your code to determine the best number of cores. You can also simply set MKL_NUM_THREADS to a fixed number.
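
As a minimal sketch of what job.R might contain (the file name and matrix size are just placeholders), a large matrix operation such as a crossproduct will automatically use the number of threads set via MKL_NUM_THREADS:

# job.R: example of threaded linear algebra via MKL
x <- matrix(rnorm(5000 * 5000), nrow = 5000)
system.time(result <- crossprod(x))  # t(x) %*% x, computed using multiple threads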

If you want to use a small number of threads and not have your job be charged for unused cores, you may want to run your job on one of Savio's High Throughput Computing (HTC) nodes (e.g., by selecting the savio2_htc partition) as follows:

Example job script

Here is an example job script to use this kind of parallelization on an HTC node:

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=savio2_htc
#
# Specify one task:
#SBATCH --ntasks=1
#
# Number of processors for threading:
#SBATCH --cpus-per-task=2
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load r
R CMD BATCH --no-save job.R job.Rout

2. Multi-process parallelization on a single node

Example R code

Here are the setup steps in your R code for using the foreach function, available in the foreach package (with the doParallel backend, as used below).

library(doParallel)
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
registerDoParallel(ncores)
out <- foreach(i = 1:nIts) %dopar% {
        # body of loop
}

Here's some R syntax to use the parallel apply functions, such as parSapply, parLapply, and mclapply, available in the parallel package.

library(parallel)
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
cl <- makeCluster(ncores)
result <- parSapply(cl, X, FUN)
stopCluster(cl)  # shut down the worker processes when done
See help(clusterApply) for more information.

Using mclapply would look like this:

ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
result <- mclapply(X, FUN, ..., mc.cores = ncores)
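
For instance, a toy usage might look like the following (X and FUN here are placeholders you'd replace with your own inputs and function):

library(parallel)
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
X <- 1:100
FUN <- function(i) mean(rnorm(1e6))  # toy computation done independently for each element
result <- mclapply(X, FUN, mc.cores = ncores)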

Example job script

Here is an example job script to use this kind of parallelization:

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
module load r
module load doParallel # needed for use of foreach+doParallel
R CMD BATCH --no-save job.R job.Rout

In some cases the R commands that set up parallelization may recognize the number of cores available on the machine automatically. In many cases, however, you will need to read an environment variable such as SLURM_CPUS_ON_NODE into R and pass it as an argument to the relevant R functions, as shown above.

3. Parallelization on multiple nodes

Example R code

It's possible to run foreach across multiple nodes without using MPI, which can simplify things. Here's the R code to do so (you will need to install the doSNOW package):

library(doSNOW)
ncoresPerNode <- as.numeric(Sys.getenv("SLURM_CPUS_ON_NODE"))
nodeNames <- strsplit(Sys.getenv("SLURM_NODELIST"), ",")[[1]]
machines <- rep(nodeNames, each = ncoresPerNode)
cl <- makeCluster(machines, type = "SOCK")
registerDoSNOW(cl)
out <- foreach(i = 1:nIts) %dopar% {
        # body of loop
}

To use foreach with MPI:

library(doMPI)
cl <- startMPIcluster() # by default will start one fewer worker process than total CPUs available
registerDoMPI(cl)
out <- foreach(i = 1:nIts) %dopar% {
        # body of loop
}

Using pbdR is a bit more involved, and doesn't lend itself to a single block of example or template code. However, the code examples provided in the pbdR documentation should be directly usable on Savio.

To use pbdR, you'll first need to install the pbdR packages. This, in turn, requires that you enter module load gcc; module load r openmpi before installation. You'll also need to use a job script similar to the example below.
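
As a minimal sketch just to verify that your pbdR installation and MPI setup work (this uses the pbdMPI package; it is not a template for real pbdR computations, which you should base on the pbdR documentation), a "hello world" script launched via mpirun might look like:

library(pbdMPI)
init()  # initialize MPI
comm.cat("Hello from rank", comm.rank(), "of", comm.size(), "\n", all.rank = TRUE)
finalize()  # shut down MPI cleanly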

Example job script

Here is an example job script to use this kind of parallelization:

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=partition_name
#
# Number of nodes for use case:
#SBATCH --nodes=2
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
module load r

### for foreach+doMPI ###
module load Rmpi
mpirun R CMD BATCH --no-save job-doMPI.R job-doMPI.Rout  

### for pbdR ###
mpirun Rscript job-pbd.R > job-pbd.Rout


### for foreach+doSNOW ###
R CMD BATCH --no-save job-snow.R job-snow.Rout

Running R jobs on Savio's GPU nodes with parallel computing code

Savio does not provide any R packages that take advantage of GPUs at the system level. However, a variety of R packages available on CRAN allow you to make use of GPUs from within R, as described in the GPU section of this Task View. You'll need to write, adapt, or use R code that has been written for GPU access based on these packages. To install such packages you'll generally need to load the CUDA module via module load cuda on a GPU node.

To run R jobs that contain parallel computing code on Savio's Graphics Processing Unit (GPU) nodes, you'll need to request one or more GPUs by including the --gres=gpu:x flag (where the value of 'x' is 1, 2, 3, or 4, reflecting the number of GPUs requested), and also request two CPUs for every GPU requested. These options go in the job script file you include in your sbatch command or as options to your srun command. For further details, please see the GPU example in the examples of job submissions with specific resource requirements.
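
For example (the partition name is a placeholder, as in the other job scripts above), the resource-request portion of a job script asking for one GPU and the corresponding two CPUs might look like:

#SBATCH --partition=partition_name
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2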

As well, in your R code, include commands that use the GPU. E.g., using the gmatrix package:

library(gmatrix)

x <- gmatrix(grnorm(8000*8000), 8000, 8000, dup = FALSE)

If you've requested use of multiple GPUs in your submission (each GPU node has 4 GPUs), in your R code you can switch between the GPUs, e.g., using the setDevice() function in the gmatrix package. It may be possible to use foreach to start up multiple processes but we have not developed template code for this case. Alternatively you could start four individual R jobs within your job script and make sure to set the device number, e.g., via setDevice(), to 0, 1, 2, 3, respectively, within each of those individual R jobs.
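
As a minimal sketch (assuming you've installed gmatrix and requested at least two GPUs), selecting a particular GPU before allocating objects on it might look like:

library(gmatrix)
setDevice(1)  # use the second GPU; devices are numbered 0-3
x <- gmatrix(grnorm(1000*1000), 1000, 1000, dup = FALSE)  # allocated on the currently selected GPU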

To check on the current usage (and hence availability) of each of the GPUs on your GPU node, you can use the nvidia-smi command from the Linux shell within an interactive session on that GPU node. Near the end of that command's output, the "Processes: GPU Memory" table will list the GPUs currently in use, if any. For example, in a scenario where GPUs 0 and 1 are in use on your GPU node, you'll see something like the following. (By implication from the output below, GPUs 2 and 3 are currently idle - not in use, and thus fully available - on this node.)

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     32699    C   .../modules/langs/r/3.2.5/lib64/R/bin/exec/R   729MiB |
|    1     32710    C   .../modules/langs/r/3.2.5/lib64/R/bin/exec/R   729MiB |
+-----------------------------------------------------------------------------+

Running R interactively (command line mode)

Step 1. Run an interactive shell

To use R interactively on Savio's compute nodes, you can use one of the following example commands (which use the long form of each option to srun) to run an interactive bash shell as a job on a compute node. You can then launch R from that shell, on that compute node, and work with it interactively.

(Note: the following commands are only examples, and you'll need to substitute your own values for some of the example values shown here; see below for more details.)

srun --unbuffered --partition=savio --qos=savio_normal --account=ac_scsguest --time=00:30:00 bash -i

(or, if you plan to use R interactively via a GUI; see below for more details on doing this)

srun --pty --partition=savio --qos=savio_normal --account=ac_scsguest --time=00:30:00 bash -i

For more information on running interactive SLURM jobs on Savio, please see Running Your Jobs.

Step 2: Run R from that shell

Once you're working on a compute node, your shell prompt will change to something like this (where 'n' followed by some number is the number of the compute node):

[myusername@n0033 ...]

At that shell prompt, you can then enter the following to load the R software module:

module load r

To start R for command line use, you can enter

R

R doesn't have a built-in GUI for Linux, but the popular RStudio development environment provides a very nice GUI. If you'd like to use RStudio on Savio, please contact us.