Using R on Savio

Loading R and Packages | Installing Packages | Parallel Processing | Parallel: Linear Algebra | Parallel: Single Node | Parallel: Multiple Nodes | Parallel: GPUs | Running R Interactively

This document describes how to use R, a language and environment for statistical computing and graphics, on the Savio high-performance computing cluster at the University of California, Berkeley.

Loading R and accessing R packages

To load R into your current software environment on Savio, at any shell prompt, enter:

module load r

Once you have loaded R, you can see which R packages are provided by the system by entering the following and looking for the section pertaining to R, which lists a variety of R packages:

module avail

To use one or more of the packages, load their relevant module(s) (e.g., ggplot2 in this case) before starting R:

module load ggplot2

Then in R, use library(ggplot2) as usual to load the package into R.

Installing additional R packages

You can also install additional R packages, such as those available on CRAN, that are not already available on the system. You'll need to install them into your home directory or your scratch directory.

First, enter module list to make sure that the Intel module is not loaded, as this can interfere with the R package installation process for packages that use C/C++/Fortran code. If it is loaded, simply do:

module unload intel

Then start R and use install.packages(...).

In the following example, we'll install the fields package for spatial statistics, which needs to compile some Fortran code as well as pull in some dependency packages. You can either set the directory in which to put the package(s) via the lib argument or follow the prompts provided by R to accept the default location (generally ~/R/x86_64-pc-linux-gnu-library/3.2). Here we'll use the default:

install.packages('fields')

Note that if you install packages somewhere other than the default location, e.g., via:

install.packages('fields', lib = '/scratch/users/myusername/R')

you will probably need to set the environment variable R_LIBS_USER to include the non-default location (e.g., setting it in your .bashrc file) so that R can find the packages.
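
For example, to use the non-default location shown above, you might add a line like the following to your ~/.bashrc file (the directory shown is just the illustrative path used above):

export R_LIBS_USER=/scratch/users/myusername/R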

Many R packages have dependencies on R packages already provided on the system, such as Rcpp, ggplot2, Rmpi, and dplyr. If you see that packages available on the system are being installed locally in your own directory when you try to install a package yourself, it's good practice to stop the installation and go back and load the modules for the available R packages before installing the package of interest. This avoids installing a second copy of the dependency.

In some cases an R package will require an external non-R package as a dependency. If it's available on the system, you may need to load the relevant Savio module via module load packagename. If it's not available on the system, you may be able to install the dependency yourself from its source code, or you can ask the Savio user consultants for assistance.
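
For example (the module and package names here are purely illustrative, not specific Savio recommendations), if an R package relied on an external library such as GSL and a corresponding module were available, installation might look like:

module load gsl   # hypothetical external dependency provided as a system module
module load r
R                 # then run install.packages('yourPackage') in R as usual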

Parallel processing in R on Savio

R provides several ways of parallelizing your computations. We describe them briefly here and outline their use below:

  1. Threaded linear algebra. R on Savio is already set up to use Intel's MKL package for linear algebra. MKL can automatically use multiple cores on a single machine, as described below.
  2. Multi-process parallelization on a single node. You can use functions provided in R packages such as foreach and parallel to run independent calculations across multiple cores on a single node.
  3. Multiple nodes. You can use functions provided in R packages such as foreach and the pbdR packages to run calculations across multiple nodes.

1. Threaded linear algebra

Here's how you submit a job to use threaded linear algebra. All you need to do is specify the number of threads you want via an environment variable; linear algebra operations in R will then use multiple cores automatically.

Example job script

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Specify one task:
#SBATCH --ntasks-per-node=1
#
# Number of processors for threading:
#SBATCH --cpus-per-task=20
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load r
R CMD BATCH --no-save job.R job.Rout

Note that here we make use of all the cores on the node (20 here, assuming use of the savio partition, whose nodes have 20 cores each) for the threaded linear algebra. In some cases using more cores can actually decrease performance, so it may be worth experimenting with your code to determine the best number of cores. You can also simply set MKL_NUM_THREADS to a fixed number.
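
As a minimal sketch of what job.R might contain (the file name and matrix size are just placeholders), a large matrix operation such as a crossproduct will automatically use the number of threads set via MKL_NUM_THREADS:

# job.R: example of threaded linear algebra via MKL
x <- matrix(rnorm(5000 * 5000), nrow = 5000)
system.time(result <- crossprod(x))  # t(x) %*% x, computed using multiple threads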

If you want to use a small number of threads and not have your job be charged for unused cores, you may want to run your job on one of Savio's High Throughput Computing (HTC) nodes (e.g., by selecting the savio2_htc partition) as follows:

Example job script

Here is an example job script to use this kind of parallelization on an HTC node:

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=savio2_htc
#
# Specify one task:
#SBATCH --ntasks=1
#
# Number of processors for threading:
#SBATCH --cpus-per-task=2
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load r
R CMD BATCH --no-save job.R job.Rout

2. Multi-process parallelization on a single node

Example R code

Here are the setup steps in your R code for using the foreach function, available in the foreach package (with the doParallel backend, as used below).

library(doParallel)
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
registerDoParallel(ncores)
out <- foreach(i = 1:nIts) %dopar% {
        # body of loop
}

Here's some R syntax to use the parallel apply functions, such as parSapply, parLapply, and mclapply, available in the parallel package.

library(parallel)
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
cl <- makeCluster(ncores)
result <- parSapply(cl, X, FUN)
stopCluster(cl)  # shut down the worker processes when done
See help(clusterApply) for more information.

Using mclapply would look like this:

ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
result <- mclapply(X, FUN, ..., mc.cores = ncores)
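
For instance, a toy usage might look like the following (X and FUN here are placeholders you'd replace with your own inputs and function):

library(parallel)
ncores <- as.numeric(Sys.getenv('SLURM_CPUS_ON_NODE'))
X <- 1:100
FUN <- function(i) mean(rnorm(1e6))  # toy computation done independently for each element
result <- mclapply(X, FUN, mc.cores = ncores)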

Example job script

Here is an example job script to use this kind of parallelization:

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
module load r
module load doParallel # needed for use of foreach+doParallel
R CMD BATCH --no-save job.R job.Rout

In some cases the R commands that set up parallelization may recognize the number of cores available on the machine automatically. In many cases, however, you will need to read an environment variable such as SLURM_CPUS_ON_NODE into R and pass it as an argument to the relevant R functions, as shown above.

3. Parallelization on multiple nodes

Example R code

It's possible to run foreach across multiple nodes without using MPI, which can simplify things. Here's the R code to do so (you will need to install the doSNOW package):

library(doSNOW)
ncoresPerNode <- as.numeric(Sys.getenv("SLURM_CPUS_ON_NODE"))
nodeNames <- strsplit(Sys.getenv("SLURM_NODELIST"), ",")[[1]]
machines <- rep(nodeNames, each = ncoresPerNode)
cl <- makeCluster(machines, type = "SOCK")
registerDoSNOW(cl)
out <- foreach(i = 1:nIts) %dopar% {
        # body of loop
}

To use foreach with MPI:

library(doMPI)
cl <- startMPIcluster() # by default will start one fewer worker process than total CPUs available
registerDoMPI(cl)
out <- foreach(i = 1:nIts) %dopar% {
        # body of loop
}

Using pbdR is a bit more involved, and doesn't lend itself to a single block of example or template code. However, the code examples provided in the pbdR documentation should be directly usable on Savio.

To use pbdR, you'll first need to install the pbdR packages. This, in turn, requires that you enter module load gcc; module load r openmpi before installation. You'll also need to use a job script similar to the example below.
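
As a minimal sketch just to verify that your pbdR installation and MPI setup work (this uses the pbdMPI package; it is not a template for real pbdR computations, which you should base on the pbdR documentation), a "hello world" script launched via mpirun might look like:

library(pbdMPI)
init()  # initialize MPI
comm.cat("Hello from rank", comm.rank(), "of", comm.size(), "\n", all.rank = TRUE)
finalize()  # shut down MPI cleanly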

Example job script

Here is an example job script to use this kind of parallelization:

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=partition_name
#
# Number of nodes for use case:
#SBATCH --nodes=2
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
module load r

### for foreach+doMPI ###
module load Rmpi
mpirun R CMD BATCH --no-save job-doMPI.R job-doMPI.Rout  

### for pbdR ###
mpirun Rscript job-pbd.R > job-pbd.Rout


### for foreach+doSNOW ###
R CMD BATCH --no-save job-snow.R job-snow.Rout

Running R jobs on Savio's GPU nodes with parallel computing code

Savio does not provide any R packages that take advantage of GPUs at the system level. However, a variety of R packages available on CRAN allow you to make use of GPUs from within R, as described in the GPU section of this Task View. You'll need to write, adapt, or use R code that has been written for GPU access based on these packages. To install such packages you'll generally need to load the CUDA module via module load cuda on a GPU node.

To run R jobs that contain parallel computing code on Savio's Graphics Processing Unit (GPU) nodes, you'll need to request one or more GPUs by including the --gres=gpu:x flag (where the value of 'x' is 1, 2, 3, or 4, reflecting the number of GPUs requested), and also request two CPUs for every GPU requested. These options go in the job script file you include in your sbatch command or as options to your srun command. For further details, please see the GPU example in the examples of job submissions with specific resource requirements.
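
For example (the partition name is a placeholder, as in the other job scripts above), the resource-request portion of a job script asking for one GPU and the corresponding two CPUs might look like:

#SBATCH --partition=partition_name
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2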

As well, in your R code, include commands that use the GPU. E.g., using the gmatrix package:

library(gmatrix)

x <- gmatrix(grnorm(8000*8000), 8000, 8000, dup = FALSE)

If you've requested use of multiple GPUs in your submission (each GPU node has 4 GPUs), in your R code you can switch between the GPUs, e.g., using the setDevice() function in the gmatrix package. It may be possible to use foreach to start up multiple processes but we have not developed template code for this case. Alternatively you could start four individual R jobs within your job script and make sure to set the device number, e.g., via setDevice(), to 0, 1, 2, 3, respectively, within each of those individual R jobs.
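
As a minimal sketch (assuming you've installed gmatrix and requested at least two GPUs), selecting a particular GPU before allocating objects on it might look like:

library(gmatrix)
setDevice(1)  # use the second GPU; devices are numbered 0-3
x <- gmatrix(grnorm(1000*1000), 1000, 1000, dup = FALSE)  # allocated on the currently selected GPU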

To check on the current usage (and hence availability) of each of the GPUs on your GPU node, you can use the nvidia-smi command from the Linux shell within an interactive session on that GPU node. Near the end of that command's output, the "Processes: GPU Memory" table will list the GPUs currently in use, if any. For example, in a scenario where GPUs 0 and 1 are in use on your GPU node, you'll see something like the following. (By implication from the output below, GPUs 2 and 3 are currently idle - not in use, and thus fully available - on this node.)

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     32699    C   .../modules/langs/r/3.2.5/lib64/R/bin/exec/R   729MiB |
|    1     32710    C   .../modules/langs/r/3.2.5/lib64/R/bin/exec/R   729MiB |
+-----------------------------------------------------------------------------+

Running R interactively (command line mode)

Step 1. Run an interactive shell

To use R interactively on Savio's compute nodes, you can use one of the following example commands (which use the long form of each option to srun) to run an interactive bash shell as a job on a compute node. You can then launch R from that shell, on that compute node, and work with it interactively.

(Note: the following commands are only examples, and you'll need to substitute your own values for some of the example values shown here; see below for more details.)

srun --unbuffered --partition=savio --qos=savio_normal --account=ac_scsguest --time=00:30:00 bash -i

(or, if you plan to use R interactively via a GUI; see below for more details on doing this)

srun --pty --partition=savio --qos=savio_normal --account=ac_scsguest --time=00:30:00 bash -i

For more information on running interactive SLURM jobs on Savio, please see Running Your Jobs.

Step 2: Run R from that shell

Once you're working on a compute node, your shell prompt will change to something like this (where 'n' followed by some number is the number of the compute node):

[myusername@n0033 ...]

At that shell prompt, you can then enter the following to load the R software module:

module load r

To start R for command line use, you can enter

R

R doesn't have a built-in GUI for Linux, but the popular RStudio development environment provides a very nice GUI. If you'd like to use RStudio on Savio, please contact us.