R

Introduction#

R is a programming language and software environment for statistical computing and graphics. It is similar to the S language and environment developed at Bell Laboratories. R provides a wide variety of statistical and graphical techniques and is highly extensible.

R on Sherlock#

R is available on Sherlock and the corresponding module can be loaded with:

$ ml R

For a list of available versions, you can execute ml spider R at the Sherlock prompt, or refer to the Software list page.

Using R#

Once your environment is configured (ie. when the R module is loaded), R can be started by simply typing R at the shell prompt:

$ R

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
[...]
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

For a listing of command line options:

$ R --help

Running a R script#

There are several ways to launch an R script on the command line, which will have different ways of presenting the script's output:

Method	Output
`Rscript script.R`	displayed on screen, on `stdout`
`R CMD BATCH script.R`	redirected to a `script.Rout` file
`R --no-save < script.R`	displayed on screen, on `stdout`

Submitting a R job#

Here's an example R batch script that can be submitted via sbatch. It runs a simple matrix multiplication example, and demonstrates how to feed R code as a HEREDOC to R directly, so no intermediate R script is necessary:

Rtest.sbatch

#!/usr/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=10G
#SBATCH --output=Rtest.log

# load the module
ml R

# run R code
R --no-save << EOF
set.seed (1)
m <- 4000
n <- 4000
A <- matrix (runif (m*n),m,n)
system.time (B <- crossprod(A))
EOF

You can save this script as Rtest.sbatch and submit it to the scheduler with:

$ sbatch Rtest.sbatch

Once the job is done, you should get a Rtest.out file in the current directory, with the following contents:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
[...]
> set.seed (1)
> m <- 4000
> n <- 4000
> A <- matrix (runif (m*n),m,n)
> system.time (B <- crossprod(A))
   user  system elapsed
  2.649   0.077   2.726

R packages#

R comes with a single package library in $R_HOME/library, which contains the standard and most common packages. This is usually in a system location and is not writable by end-users.

To accommodate individual user's requirements, R provides a way for each user to install packages in the location of their choice. The default value for a directory where users can install their own R packages is $HOME/R/x86_64-pc-linux-gnu-library/<R_version> where <R_version> depends on the R version that is used. For instance, if you have the R/3.5.1 module loaded, the default R user library path will be $HOME/R/x86_64-pc-linux-gnu-library/3.5.

This directory doesn't exist by default. The first time a user installs a package, R will ask if she wants to use the default location and create the directory.

Installing packages#

Install R packages in a standard shell session

Make sure to install your packages in a standard Sherlock shell session, not in an RStudio session.

To install a R package in your personal environment, the first thing to do is load the R module:

$ ml R

Then start a R session, and use the install.packages() function at the R prompt. For instance, the following example will install the doParallel package, using the US mirror of the CRAN repository:

$ R

R version 3.5.1 (2018-07-02) -- "Feather Spray"
[...]

> install.packages('doParallel', repos='http://cran.us.r-project.org')

It should give the following warning:

Warning in install.packages("doParallel", repos = "http://cran.us.r-project.org") :
  'lib = "/share/software/user/open/R/3.5.1/lib64/R/library"' is not writable
Would you like to use a personal library instead? (yes/No/cancel)
Would you like to create a personal library
‘~/R/x86_64-pc-linux-gnu-library/3.5’
to install packages into? (yes/No/cancel) y

Answering y twice will make R create a ~/R/x86_64-pc-linux-gnu-library/3.5 directory and instruct it to install future R packages there.

The installation will then proceed:

trying URL 'http://cran.us.r-project.org/src/contrib/doParallel_1.0.14.tar.gz'
Content type 'application/x-gzip' length 173607 bytes (169 KB)
==================================================
downloaded 169 KB

* installing *source* package ‘doParallel’ ...
** package ‘doParallel’ successfully unpacked and MD5 sums checked
** R
** demo
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (doParallel)

The downloaded source packages are in
        ‘/tmp/Rtmp0RHrMZ/downloaded_packages’
>

and when it's done, you should be able to load the package within R with:

> library(doParallel)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
>

Installing large packages#

Installing large R packages can sometimes be very time consuming. To speed things up, R can utilize multiple CPUs in parallel when the Ncpus=n option is added to the install.packages() command (where n is the number of CPUs you'd like to use).

For instance, you can get an interactive session with 4 CPU cores with sh_dev:

$ sh_dev -c 4
$ ml R
$ R
> install.packages("dplyr", repos = "http://cran.us.r-project.org", Ncpus=4)

Alternative installation path#

To install R packages in a different location, you'll need to create that directory, and instruct R to install the packages there:

$ mkdir ~/R_libs/
$ R
> install.packages('doParallel', repos='http://cran.us.r-project.org', lib="~/R_libs")

The installation will proceed normally and the doParallel package will be installed in $HOME/R_libs/.

Specifying the full destination path for each package installation could quickly become tiresome, so to avoid this, you can create a .Renviron file in your $HOME directory, and define your R_libs path there:

$ cat << EOF > $HOME/.Renviron
R_LIBS=~/R_libs
EOF

With this, whenever R is started, the $HOME/R_libs/ directory will be added to the list of places R will look for packages, and you won't need to specify this installation path when using install.packages() anymore.

Where does R look for packages?

To see the directories where R searches for packages and libraries, you can use the following command in R:

> .libPaths()

Sharing R packages

If you'd like to share R packages within your group, you can simply define $R_LIBS to point to a shared directory, such as $GROUP_HOME/R_libs and have each user in the group use the instructions below to define it in their own environment.

Setting the installation repository#

When installing a package, R needs to know from which repository the package should be downloaded. If it's not specified, it will prompt for it and display a list of available CRAN mirrors.

To avoid setting the CRAN mirror each time you run install.packages you can permanently set the mirror by creating a .Rprofile file in your $HOME directory, which R will execute each time it starts.

For instance, adding the following contents to your ~/.Rprofile will make sure that every install.packages() invocation will use the closest CRAN mirror:

## local creates a new, empty environment
## This avoids polluting the global environment with
## the object r
local({
  r = getOption("repos")
  r["CRAN"] = "https://cloud.r-project.org/"
  options(repos = r)
})

Once this is set, you only need to specify the name of the package to install, and R will use the mirror you defined automatically:

> install.packages("doParallel")
[...]
trying URL 'https://cloud.r-project.org/src/contrib/doParallel_1.0.14.tar.gz'
Content type 'application/x-gzip' length 173607 bytes (169 KB)
==================================================
downloaded 169 KB

Installing packages from GitHub#

R packages can be directly installed from GitHub using the devtools package. devtools needs to be installed first, with:

> install.packages("devtools")

And then, you can then install a R package directly from its GitHub repository. For instance, to install dplyr from tidyverse/dplyr:

> library(devtools)
> install_github("tidyverse/dplyr")

Package dependencies#

Sometimes when installing R packages, other software is needed for the installation and/or compilation. For instance, when trying to install the sf package, you may encounter the following error messages:

> install.packages("sf")
[...]
Configuration failed because libudunits2.so was not found. Try installing:...
[...]
configure: error: gdal-config not found or not executable.

This is because sf needs a few dependencies, like udunits and gdal in order to compile and install successfully. Fortunately those dependencies are already available as modules on Sherlock.

Whenever you see "not found" errors, you may want to try searching the modules inventory with module spider:

$ module spider udunits

----------------------------------------------------------------------------
  udunits: udunits/2.2.26
----------------------------------------------------------------------------
    Description:
      The UDUNITS package from Unidata is a C-based package for the
      programmatic handling of units of physical quantities.


    You will need to load all module(s) on any one of the lines below before
    the "udunits/2.2.26" module is available to load.

      physics

So for sf, in order to load the dependencies, exit R, load the udunits and gdal modules, and try installing sf again:

$ ml load physics udunits gdal geos
$ ml R/4.3.2
$ R
> install.packages("sf")

Getting dependencies right could be a matter of trial and error. You may have to load R, install packages, search modules, load modules, install packages again and so forth. Fortunately, R packages only need to be installed once, and many R package dependencies are already available as modules on Sherlock, you just need to search for them with module spider and load them.

And in case you're stuck, you can of course always send us an email and we'll be happy to assist.

Updating Packages#

To upgrade R packages, you can use the update.packages() function within a R session.

For instance, to update the doParallel package:

> update.packages('doParallel')

When the package name is omitted, update.packages() will try to update all the packages that are installed. Which is the most efficient way to ensure that all the packages in your local R library are up to date.

Centrally installed packages can not be updated

Note that attempting to update centrally installed packages will fail. You will have to use install.packages() to install your own version of the packages in your $HOME directory instead.

Removing packages#

To remove a package from your local R library, you can use the remove.packages() function. For instance:

> remove.packages('doParallel')

Examples#

Installing `devtools`#

devtools is a package that provides R functions that simplify many common tasks. While its core functionality revolves around package development, devtools can also be used to install packages, particularly those on GitHub.

Installing devtools is somewhat memory-intensive and has several dependencies. The following example shows how to run an interactive session with 4 CPUs, load the modules for the necessary dependencies, and install devtools for R version 4.2.0.

# Launch interactive dev session with 4 CPUs

$ sh_dev -c 4

# Load the required modules

$ ml purge
$ ml R/4.2.0
$ ml system harfbuzz fribidi
$ ml cmake libgit2
$ ml openssl

# Launch R and install devtools

$ R
> install.packages("devtools", repos = "http://cran.us.r-project.org", Ncpus=4)

Single node#

R has a couple of powerful and easy-to-use tools to parallelize your R jobs. doParallel is one of them. If the doParallel package is not installed in your environment yet, you can install it in a few easy steps.

Here is a quick doParallel example that uses one node and 16 cores on Sherlock (more nodes or CPU cores can be requested, as needed).

Save the two scripts below in a directory on Sherlock:

doParallel_test.RdoParallel_test.sbatch

# Example doParallel script

if(!require(doParallel)) install.packages("doParallel")
library(doParallel)

# use the environment variable SLURM_NTASKS_PER_NODE to set
# the number of cores to use
registerDoParallel(cores=(Sys.getenv("SLURM_NTASKS_PER_NODE")))

# bootstrap iteration example
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iterations <- 10000# Number of iterations to run

# parallel loop
# note the '%dopar%' instruction
parallel_time <- system.time({
  r <- foreach(icount(iterations), .combine=cbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})[3]

# show the number of parallel workers to be used
getDoParWorkers()

# execute the function
parallel_time

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --output=doParallel_test.log

# --ntasks-per-node will be used in doParallel_test.R to specify the number
# of cores to use on the machine.

# load modules
ml R/3.5.1

# execute script
Rscript doParallel_test.R

And then submit the job with:

$ sbatch doParallel_test.sbatch

Once the job has completed, the output file should contain something like this:

$ cat doParallel_test.out
[1] "16"
elapsed
  3.551

Bonus points: observe the scalability of the doParallel loop by submitting the same script using a varying number of CPU cores:

$ for i in 2 4 8 16; do
    sbatch --out=doP_${i}.out --ntasks-per-node=$i doParallel_test.sbatch
done

When the jobs are done:

$ for i in 2 4 8 16; do
    printf "%2i cores: %4.1fs\n" $i $(tail -n1 doP_$i.out)
done
 2 cores: 13.6s
 4 cores:  7.8s
 8 cores:  4.9s
16 cores:  3.6s

Multiple nodes#

To distribute parallel R tasks on multiple nodes, you can use the Rmpi package, which provides MPI bindings for R.

To install the Rmpi package, a module providing MPI library must first be loaded. For instance:

$ ml openmpi R
$ R
> install.packages("Rmpi")

Once the package is installed, the following scripts demonstrate a very basic Rmpi example.

Rmpi-test.RRmpi-test.sbatch

# Example Rmpi script

if (!require("Rmpi")) install.packages("Rmpi")
library(Rmpi)

# initialize an Rmpi environment
ns <- mpi.universe.size() - 1
mpi.spawn.Rslaves(nslaves=ns, needlog=TRUE)

# send these commands to the slaves
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( ns <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )

# all slaves execute this command
mpi.remote.exec(paste("I am", id, "of", ns, "running on", host))

# close down the Rmpi environment
mpi.close.Rslaves(dellog = FALSE)
mpi.exit()

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --output=Rmpi-test.log

## load modules
# openmpi is not loaded by default with R, so it must be loaded explicitly
ml R openmpi

## run script
# we use '-np 1' since Rmpi does its own task management
mpirun -np 1 Rscript Rmpi-test.R

You can save those scripts as Rmpi-test.R and Rmpi-test.sbatch and then submit your job with:

$ sbatch Rmpi-test.sbatch

When the job is done, its output should look like this:

$ cat Rmpi-test.log
        3 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 4 is running on: sh-06-33
slave1 (rank 1, comm 1) of size 4 is running on: sh-06-33
slave2 (rank 2, comm 1) of size 4 is running on: sh-06-33
slave3 (rank 3, comm 1) of size 4 is running on: sh-06-34
$slave1
[1] "I am 1 of 4 running on sh-06-33"

$slave2
[1] "I am 2 of 4 running on sh-06-33"

$slave3
[1] "I am 3 of 4 running on sh-06-34"

[1] 1
[1] "Detaching Rmpi. Rmpi cannot be used unless relaunching R."

GPUs#

Here's a quick example that compares running a matrix multiplication on a CPU and on a GPU using R. It requires submitting a job to a GPU node and the gpuR R package.

gpuR-test.RgpuR-test.sbatch

# Example gpuR script

if (!require("gpuR")) install.packages("gpuR")
library(gpuR)

print("CPU times")
for(i in seq(1:7)) {
    ORDER = 64*(2^i)
    A = matrix(rnorm(ORDER^2), nrow=ORDER)
    B = matrix(rnorm(ORDER^2), nrow=ORDER)
    print(paste(i, sprintf("%5.2f", system.time({C = A %*% B})[3])))
}

print("GPU times")
for(i in seq(1:7)) {
    ORDER = 64*(2^i)
    A = matrix(rnorm(ORDER^2), nrow=ORDER)
    B = matrix(rnorm(ORDER^2), nrow=ORDER)
    gpuA = gpuMatrix(A, type="double")
    gpuB = gpuMatrix(B, type="double")
    print(paste(i, sprintf("%5.2f", system.time({gpuC = gpuA %*% gpuB})[3])))
}

#!/bin/bash

#SBATCH --partition gpu
#SBATCH --mem 8GB
#SBATCH --gres gpu:1
#SBATCH --output=gpuR-test.log

## load modules
# cuda is not loaded by default with R, so it must be loaded explicitly
ml R cuda

Rscript gpuR-test.R

After submitting the job with sbatch gpuR-test.sbatch, the output file should contain something like this:

[1] "CPU times"
[1] "1  0.00"
[1] "2  0.00"
[1] "3  0.02"
[1] "4  0.13"
[1] "5  0.97"
[1] "6  7.56"
[1] "7 60.47"

[1] "GPU times"
[1] "1  0.10"
[1] "2  0.04"
[1] "3  0.02"
[1] "4  0.07"
[1] "5  0.39"
[1] "6  2.04"
[1] "7 11.59"

which shows a decent speedup for running on a GPU for the largest matrix sizes.

R