Programming in R

R Foundations 2024

Ella Kaye, Department of Statistics

November 4, 2024

Overview

  • More on data structures

  • Control flow and iteration functions

  • Efficient R programming

  • Writing functions (basics)

Introduction

Understanding the basics of R programming helps to improve analysis/reporting scripts and extend what we can do with R.

Good coding practice follows the DRY principle: Don’t Repeat Yourself. Rather than modifying copy-pasted code chunks, we might

  • write a custom function
  • use loops or iteration functions to perform multiple similar tasks

Custom functions can be used to provide convenient wrappers to complex code chunks as well as implement novel functionality.

More on data structures

Data structures revisited

For basic data analysis, our data is usually imported and we use high-level functions (e.g. from dplyr) to handle it.

For programming, we need to work with lower-level data structures and be able to

  • create basic objects
  • extract components
  • coerce one data type to another

Working with base R functions when programming also helps avoid dependencies, which is useful when writing packages.

Vectors

numeric(), character() and logical() can be used to initialize vectors of the corresponding type for a given length

x <- numeric(3)
x
[1] 0 0 0

Elements can be assigned by indexing the positions to be filled, e.g.

x[1] <- 4 # assign 4 to 1st element
x[-c(2, 3)] <- 4 # assign 4 to everying *except* 2nd and 3rd element

This is particularly useful when programming an iterative procedure.

as.logical(), as.numeric() and as.character() coerce to the corresponding type, producing NAs if coercion fails.

Logical vectors

Logical vectors are commonly used when indexing. The vector might be produced by a logical operator:

x <- c(1, 1, 2, 2, 2)
x > 1
[1] FALSE FALSE  TRUE  TRUE  TRUE
x[x > 1]
[1] 2 2 2

duplicated() is also useful here:

duplicated(x)
[1] FALSE  TRUE FALSE  TRUE  TRUE
!duplicated(x)
[1]  TRUE FALSE  TRUE FALSE FALSE

Numeric vectors

The are several convenience function for creating numeric vectors, notably seq() and rep().

As they are so useful there are fast shortcuts for particular cases

seq_len(4)
[1] 1 2 3 4
fruits <- c("apple", "pear", "banana")
seq_along(fruits) # a sequence from 1 to the length of x
[1] 1 2 3
rep.int(1:2, times = c(2, 3))
[1] 1 1 2 2 2

Character vectors

Character vectors may be used for creating names

x <- 3:5
names(x) <- paste0(LETTERS[1:3], 1229:1231)
x
A1229 B1230 C1231 
    3     4     5 
names(x)
[1] "A1229" "B1230" "C1231"

Names can be used as an alternative to numeric or logical vectors when indexing

x["B1230"]
B1230 
    4 

Matrices

A matrix is in fact also a vector, with an attribute giving the dimensions of the matrix

M <- matrix(1:6, 2, 3) # data, nrow, ncol
M
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
str(M)
 int [1:2, 1:3] 1 2 3 4 5 6
attributes(M)
$dim
[1] 2 3

The byrow argument is also useful:

N <- matrix(1:6, 2, 3, byrow = TRUE) # data, nrow, ncol
N
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Matrix functions

Useful functions for matrices include dim(), ncol(), nrow(), colnames() and rownames(). rbind() and cbind() can be used to row-bind or column-bind vectors.

Matrices enable computation via matrix algebra as well as row/column-wise operations.

Lists

Lists collect together items which may be different types or lengths. Like a vector, elements may be named.

results <- list(matrix = M, vector = x)
results
$matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

$vector
A1229 B1230 C1231 
    3     4     5 

Lists are often used to return the results of a function.

Indexing lists

Elements can be indexed by [ to return a list or [[ to return a single element, either by index or name:

results["vector"] # returns a list
$vector
A1229 B1230 C1231 
    3     4     5 
results[[2]] # returns a vector
A1229 B1230 C1231 
    3     4     5 

$ can be used to extract elements by name:

results$vector # equivalent to results[["vector"]]
A1229 B1230 C1231 
    3     4     5 

Data frames

Data frames are lists of variables of the same length and hence can often be treated as a matrix

x <- 1:3
dat <- data.frame(x = x, id = letters[1:3])
dat
  x id
1 1  a
2 2  b
3 3  c
dat[1]
  x
1 1
2 2
3 3
dat[[1]]
[1] 1 2 3
dat[1, 2]
[1] "a"

Your turn!

The lm function calls the “workhorse” function lm.fit to actually fit the model. Unlike lm, which works from a formula, lm.fit works from the model matrix and the response vector.

  1. Define a response y containing 10 numeric values. Define an explanatory variable z of the numbers 1 to 10.

  2. Use the function cbind() to create a matrix x with 1s in the first column and z in the second column.

  3. Fit a model using fit1 <- lm.fit(x, y). Use str to explore the structure of the results. Use $ to extract the coefficients.

  4. Create a second fit using lm(y ~ z). Use names to compare the results. Check the coefficients of the second fit are the same.

Your turn! (a solution)

# A possible solution

set.seed(1)
y <- sample(10)
z <- 1:10

x <- cbind(1, z) # we'll see this trick in a few slides time!

fit1 <- lm.fit(x,y)
str(fit1)
fit1$coefficients

fit2 <- lm(y ~ z)
names(fit1)
names(fit2)
fit2$coefficients

~ notation

The ~ notation can be used to specify a model formula, where the LHS is the response and the RHS are a collection of predictors, e.g. 

body_mass_g ~ bill_length_mm + flipper_length_mm.

This can be used as the formula argument when fitting a model, e.g. a linear model:

lm(body_mass_g ~ bill_length_mm + flipper_length_mm, data = penguins) 

In the formula, crucially, the + does not have to mean it is a linear additive model and should be read more as “this variable and this one etc” which the model function might use additively (e.g. lm) or might not (e.g. many ML models).

Control flow

Control structures

Control structures are the commands that make decisions or execute loops.

Conditional execution: if/else, switch

Loops: for, while, repeat

if/else

An if statement can stand alone or be combined with an else statement

x <- 1:3
if (all(x > 0)) {
    res <- mean(x)
} else {
    res <- mean(abs(x))
}
res
[1] 2

The condition must evaluate to logical vector of length one. The functions all(), any(), is.na(), is.null() and other is. functions are useful here.

Conditioning on equality

Using == may not be appropriate as it compares each element; identical() will test the whole object

x <- y <- 1:2
x == y
[1] TRUE TRUE
identical(x, y)
[1] TRUE

all.equal() will allow for some numerical tolerance.

z <- sqrt(2)
identical(z * z, 2)
[1] FALSE
all.equal(z * z, 2)
[1] TRUE

switch

The switch() function provides a more readable alternative to nested if statements

if (summary == "IQR") { 
    y <- IQR(x)
} else {
    if (summary == "range"){
        y <- range(x)
    } else y <- mean(x)
}
x <- 1:5
switch("range", # can enter an arg name or position
       IQR = IQR(x),
       range = range(x),
       mean(x))
[1] 1 5

The final unnamed argument is the default. Further examples

for

A for loop repeats a chunk of code, iterating along the values of a vector or list

x <- c("apple", "pear")
for (nm in x) print(nm)
[1] "apple"
[1] "pear"

Unassigned objects are not automatically printed; hence call to print().

for (i in seq_along(x)) {
    message("Element ", i, ": ", x[i])
}

seq_along() is used here rather than 1:length(x) as length(x) may be zero. message is used to print messages to the console.

while and repeat

The while loop repeats while a condition is TRUE

n_iter <- 1
while (n_iter < 3) {
    x <- x * 2
    n_iter <- n_iter + 1
}

The repeat loop repeats until exited by break

repeat {
    x <- x + 1
    if (max(x) > 10) break
}

break can be used in for or while loops too.

next can be used to skip to the next iteration.

Iteration functions

Iteration functions provide a general alternative to for loops. They are not necessarily faster, but can be more compact.

apply() applies a function over rows/columns of a matrix.

lapply(), sapply() and vapply() iterate over a list or vector. vapply() is recommended for programming as it specifies the type of return value

vapply(list(a = 1:3, b = 1:6), FUN = mean, FUN.VALUE = numeric(1))
  a   b 
2.0 3.5 

mapply() iterates over two or more lists/vectors in parallel.

Iteration function resources

  • Efficient R by Colin Gillespie and Robin Lovelace

  • The built-in help pages. You can directly access the examples using the example() function, e.g. to run the apply() examples, use example("apply").

  • This StackOverflow answer, describing when, where and how to use each of the functions.

  • This blog post by Neil Saunders

purrr

The purrr package (part of the tidyverse) provides alternatives to the apply family that have a simpler, more consistent interface with fixed type of return value.

# Split a data frame into pieces, 
# fit a model to each piece, summarise and extract R^2
library(purrr)

mtcars %>%
  split(.$cyl) %>% # base R
  map(~ lm(mpg ~ wt, data = .x)) %>% # returns a list
  map(summary) %>%
  map_dbl("r.squared") # returns a vector
        4         6         8 
0.5086326 0.4645102 0.4229655 

Note: We do need %>% not |> here. See here for more details.

Advantages of purrr

  • The first argument is always the data, so purrr works naturally with the pipe.

  • All purrr functions are type-stable. They always return the advertised output type (e.g. map() returns lists; map_dbl() returns double vectors), or they throw an error.

  • All map() functions either accept function, formulas (used for succinctly generating anonymous functions), a character vector (used to extract components by name), or a numeric vector (used to extract by position).

See the iteration chapter of R for Data Science for further examples and details

Efficient R programming

Growing objects

Adding to an object in a loop, e.g. via c() or cbind()

res <- NULL
for (i in 1:10000) res <- c(res, 1)

forces a copy to be made at each iteration. THIS IS BAD!

It is far better to create an object of the necessary size first

res <- numeric(10000)
for (i in seq_along(res)) res[i] <- 1

To initialise a list we can use

res <- vector(mode = "list", length = 100)

Benchmarking

There will usually be many ways to write code for a given task. To compare alternatives, we can benchmark the expression

library(rbenchmark)
benchmark({res <- NULL;
           for (i in 1:10000) res <- c(res, 1)})$elapsed
[1] 11.632
benchmark({res <- numeric(10000)
           for (i in seq_along(res)) res[i] <- 1})$elapsed
[1] 0.081

Note the BIG difference between growing and initialising a vector (the latter around 150 times faster in this case).

for loops revisited

Each loop has three components:

  1. The output: allocate sufficient space before you start the loop

  2. The sequence: this determines what you loop over

  3. The body: the code that does the work

See https://r4ds.had.co.nz/iteration.html#for-loops

Vectorization

Vectorization is operating on vectors (or vector-like objects) rather than individual elements.

Many operations in R are vectorized, e.g.

x <- 1:3
y <- 3:1
x == y
[1] FALSE  TRUE FALSE
log(x)
[1] 0.0000000 0.6931472 1.0986123
res <- list(a = 1:3, b = 1:6)
lengths(res)
a b 
3 6 

We do not need to loop through each element!

Recycling

Vectorized functions will recycle shorter vectors to create vectors of the same length

1:4 + 0:1 # 1+0, 2+1, 3+0, 4+1
[1] 1 3 3 5

This is particularly useful for single values

cbind(1, 3:4)
     [,1] [,2]
[1,]    1    3
[2,]    1    4

and for generating regular patterns

paste0(rep(1:3, each = 2), c("a", "b"))
[1] "1a" "1b" "2a" "2b" "3a" "3b"

Your turn!

  1. Write a for loop to compute the mean of every column of in mtcars, saving each to a preallocated vector

  2. Use lapply() with rnorm to generate a list of length 10 where the 1st item contains a vector of 1 sample from an \(N(0,1)\) distribution, the 2nd item contains a vector of 2 samples from an \(N(0,1)\) distribution up to the 10th item contains a vector of 10 samples from an \(N(0,1)\) distibution.

  3. Use lapply() with rnorm to generate a list of length 10, where the 1st item contains a vector of 5 samples from \(N(1,1)\), the 2nd item contains a vector of 5 samples \(N(2,1)\) and so on until you get 5 samples from \(N(10,1)\)

Your turn! (sample solutions)

out <- numeric(ncol(mtcars))

for (i in seq_len(ncol(mtcars))) {
  out[i] <- mean(mtcars[[i]])
}
lapply(1:10, rnorm)
lapply(1:10, rnorm, n = 5)

Vectorization and matrices

Vectorizations applies to matices too, not only through matrix algebra

M <- matrix(1:4, nrow = 2, ncol = 2)
M + M
     [,1] [,2]
[1,]    2    6
[2,]    4    8

but also vectorized functions

M <- M + 1.3
round(M)
     [,1] [,2]
[1,]    2    4
[2,]    3    5

Matrices and recycling: rows

Values are recycled down matrix, which is convenient for row-wise operations

M <- matrix(1:6, nrow = 2, ncol = 3)
M
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
M - 1:2
     [,1] [,2] [,3]
[1,]    0    2    4
[2,]    0    2    4

Matrices and recycling: columns

To do the same for columns we would need to explicitly replicate, which is not so efficient.

M - rep(1:3, each = 2)
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    1    2    3

Vectorization vs for loop

Operations that can be vectorized will be more efficient than a loop in R

M <- matrix(1:100000, nrow = 200, ncol = 500)
x <- 1:200
benchmark({for (i in 1:200){
             for (j in 1:500){
               M[i, j] <- M[i, j] - x[i]
             }
           }})$elapsed
[1] 0.743
benchmark({M - x})$elapsed
[1] 0.024

The latter is nearly 30 times faster!

Row/column-wise operations

Several functions are available implementing efficient row/column-wise operations, e.g. colMeans(), rowMeans(), colSums(), rowSums(), sweep()

M <- matrix(1:4, nrow = 2, ncol = 2)
rowMeans(M)
[1] 2 3

These provide an alternative to iterating though rows and columns in R (the iteration happens in C, which is faster).

The matrixStats package provides further “matricised” methods.

Top tips for efficient programming

A golden rule in R programming is to access the underlying C/Fortran routines as quickly as possible; the fewer functions calls required to achieve this, the better.

  • Be careful never to grow vectors

  • Vectorise code wherever possible

See Efficient Programming for more details and examples.

We will also expand on this topic in the Advanced R course next term.

Writing functions

Components of a function

Functions are defined by three components:

  • the name of the function
  • the arguments of the function, inside ( )
  • the body of the function that computes the result, inside { }

They are created using function()

t_statistic <- function(n) {
    x <- rnorm(n)
    y <- rnorm(n)
    t.test(x, y)$statistic
}

Naming functions

As with arguments, function names are important:

  • use a name that describes what it returns (e.g. t_statistic) or what it does (e.g. remove_na)

  • try to use one convention for combining words (e.g. snake case t_statistic or camel case tStatistic)

  • avoid using the same name as other functions

Specified arguments

specified arguments are those named in the function definition, e.g.  in rnorm()

args(rnorm)
function (n, mean = 0, sd = 1) 
NULL

the arguments are n, mean and sd.

mean and sd have been given default values in the function definition, but n has not, so the function fails if the user does not pass a value to n

rnorm()
Error in rnorm(): argument "n" is missing, with no default

Name and order of arguments

The user can pass objects to these arguments using their names or by supplying unnamed values in the right order

rnorm(5, 1, 10)
[1]   8.114923 -14.450177 -11.770238 -17.077964  -8.352875
rnorm(5, sd = 10)
[1] -10.551049  -4.638178   1.365143   9.578195  -1.467470

So naming and order is important! Some guidelines

  • put compulsory arguments first, e.g. data
  • put rarely used arguments last, e.g. tolerance setting
  • use short but meaningful argument names
  • if relevant, use the same argument names as similar functions

Using arguments

Arguments are used as objects in the function code.

An new environment is created each time the function is called, separate from the global workspace.

x <- 1
y <- 3
f <- function(x, y){
    a <- 1
    x <- x + a
    x + y
}
f(x, y)
[1] 5
x
[1] 1
a
Error in eval(expr, envir, enclos): object 'a' not found

Lexical scoping

If an object is not defined within the function, or passed in as an argument, R looks for it in the parent environment where the function was defined

x <- 1
y <- 3
f <- function(x){
    x + y
}
f(x)
[1] 4
rm(y)
f(x)
Error in f(x): object 'y' not found

It is safest (and best practice) to use arguments rather than depend on global variables!

Return values (single)

By default, functions return the object created by the last line of code

f <- function(x) {
    x <- x + 1
    log(x)
}

Alternatively return() can be used to terminate the function and return a given object

f <- function(x) {
    if (all(x > 0)) return(log(x))
    x[x <= 0] <- 0.1
    log(x)
}

Return values (multiple)

Multiple objects can be returned in a list:

mean_and_sd <- function(x) {
  res_mean <- mean(x, na.rm = TRUE)
  res_sd <- sd(x)
  
  list(mean = res_mean, sd = res_sd)
}

mean_and_sd(1:3)
$mean
[1] 2

$sd
[1] 1

We use a list so that we can return different types of objects simultaneously, and access them easily with $, e.g. mean_and_sd(1:3)$mean returns 2.

Your turn!

Write your own function, variance, to compute the variance of a numeric vector:

\[ Var(x) = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2 \]

Make use of R’s built in vectorisation.

Test it and compare your answer with the built-in var() function.

Your turn! (a solution)

variance <- function(x) {
  1/(length(x) - 1) * sum((x - mean(x))^2)
}

End matter

Resources

Material (very largely) inspired by and remixed from:

Additionally:

License

Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Extra stuff we probably won’t get to

(Use in Fuctions session of Advanced R course in Term 2 instead)

Unspecified Arguments

... or the ellipsis allow unspecified arguments to be passed to the function.

This device is used by functions that work with arbitrary numbers of objects, e.g. 

args(sum)
function (..., na.rm = FALSE) 
NULL
sum(1, 4, 10, 2)
[1] 17

It can also be used to pass on arguments to another function, e.g.

t_statistic <- function(x, g, ...) {
    t.test(x ~ g, ...)$stat
}

Using ...

Arguments passed to ... can be collected into a list for further analysis

means <- function(...){
    dots <- list(...)
    vapply(dots, mean, numeric(1), na.rm = TRUE)
}
x <- 1
y <- 2:3
means(x, y)
[1] 1.0 2.5

Similarly the objects could be concatenated using c()

Side Effects

A side-effect is a change outside the function that occurs when the function is run, e.g.

  • plot to the graphics window or other device
  • printing output to the console
  • write data to a file

A function can have many side-effects and a return value, but it is best practice to have a separate function for each task, e.g creating a plot or a table.

Writing to file is usually best done outside a function.