R Foundations Course
November 7, 2023
More on data structures
Control flow and iteration functions
Efficient R programming
Writing functions (basics)
Understanding the basics of R programming helps to improve analysis/reporting scripts and extend what we can do with R.
Good coding practice follows the DRY principle: Don’t Repeat Yourself. Rather than modifying copy-pasted code chunks, we might
Custom functions can be used to provide convenient wrappers to complex code chunks as well as implement novel functionality.
For basic data analysis, our data is usually imported and we use high-level functions (e.g. from dplyr) to handle it.
For programming, we need to work with lower-level data structures and be able to
Working with base R functions when programming also helps avoid dependencies, which is useful when writing packages.
numeric(), character() and logical() can be used to initialize vectors of the corresponding type for a given length
Elements can be assigned by indexing the positions to be filled, e.g.
This is particularly useful when programming an iterative procedure.
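As a minimal sketch of initialising and then filling a vector (the object name x and the values are illustrative, not taken from the original slides):

```r
# Initialise a numeric vector of length 3; numeric() fills it with zeros
x <- numeric(3)
x
#> [1] 0 0 0

# Assign elements by indexing the positions to be filled
x[1] <- 10
x[2:3] <- c(20, 30)
x
#> [1] 10 20 30
```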
as.logical(), as.numeric() and as.character() coerce to the corresponding type, producing NAs if coercion fails.
Logical vectors are commonly used when indexing. The vector might be produced by a logical operator:
duplicated() is also useful here:
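The original examples are not shown, so the following is only an illustrative sketch of logical indexing with made-up values:

```r
x <- c(1, 2, 2, 3, 1)

# A logical operator produces a logical vector the same length as x
x > 1
#> [1] FALSE  TRUE  TRUE  TRUE FALSE

# Use it to keep only the elements where the condition is TRUE
above_one <- x[x > 1]

# duplicated() flags repeats of earlier values; negate it to keep first occurrences
first_seen <- x[!duplicated(x)]
first_seen
#> [1] 1 2 3
```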
There are several convenience functions for creating numeric vectors, notably seq() and rep().
As they are so useful there are fast shortcuts for particular cases
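The shortcuts the slide refers to are not shown. Assuming they are the base R helpers seq_len(), seq_along() and rep_len(), a sketch might look like this:

```r
# General-purpose versions
s <- seq(1, 10, by = 3)          # 1 4 7 10
r <- rep(c("a", "b"), times = 2) # "a" "b" "a" "b"

# Shortcuts for common special cases (assumed to be the ones intended)
s_len   <- seq_len(4)                   # 1 2 3 4
s_along <- seq_along(c("x", "y", "z"))  # 1 2 3
r_len   <- rep_len(1:2, 5)              # 1 2 1 2 1
```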
Character vectors may be used for creating names
A1229 B1230 C1231
3 4 5
[1] "A1229" "B1230" "C1231"
Names can be used as an alternative to numeric or logical vectors when indexing
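The code that produced the output above is not shown. One way such a named vector could be built and indexed, assuming paste0() was used to construct the names, is:

```r
x <- c(3, 4, 5)

# Build the names from a character vector and a numeric sequence
names(x) <- paste0(c("A", "B", "C"), 1229:1231)
names(x)
#> [1] "A1229" "B1230" "C1231"

# Index by name instead of by position or logical vector
subset_by_name <- x[c("A1229", "C1231")]
```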
A matrix is in fact also a vector, with an attribute giving the dimensions of the matrix
Useful functions for matrices include dim(), ncol(), nrow(), colnames() and rownames(). rbind() and cbind() can be used to row-bind or column-bind vectors.
Matrices enable computation via matrix algebra as well as row/column-wise operations.
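A short sketch of these ideas, using a small illustrative matrix (the object names are assumptions):

```r
# A matrix is a vector plus a "dim" attribute
M <- matrix(1:6, nrow = 2)
attributes(M)  # a list with a single element, dim = c(2, 3)
dim(M)         # 2 3

# Bind vectors as rows or as columns
by_row <- rbind(1:3, 4:6)  # 2 x 3
by_col <- cbind(1:3, 4:6)  # 3 x 2

# Matrix algebra and row-wise computation
cross <- M %*% t(M)        # 2 x 2 matrix product
row_totals <- rowSums(M)   # 9 12
```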
Lists collect together items which may be different types or lengths. Like a vector, elements may be named.
$matrix
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
$vector
A1229 B1230 C1231
3 4 5
Lists are often used to return the results of a function.
Elements can be indexed by [ to return a list or [[ to return a single element, either by index or name:
$vector
A1229 B1230 C1231
3 4 5
A1229 B1230 C1231
3 4 5
$ can be used to extract elements by name:
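The list printed above could be built and indexed along the following lines. This is a sketch; the object name l is an assumption.

```r
x <- c(A1229 = 3, B1230 = 4, C1231 = 5)
l <- list(matrix = matrix(1:6, nrow = 2), vector = x)

# [ keeps the list structure: the result is a list of length one
sub_list <- l["vector"]

# [[ and $ extract the element itself
by_name  <- l[["vector"]]
by_index <- l[[2]]
by_dollar <- l$vector
```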
Data frames are lists of variables of the same length and hence can often be treated as a matrix
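To illustrate this dual nature, a small sketch with a made-up data frame:

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# List-style access: each variable is a list element
is.list(df)           # TRUE
score_col <- df$score

# Matrix-style access: rows and columns
one_value <- df[2, "score"]
dims <- dim(df)       # 3 2
```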
The lm function calls the “workhorse” function lm.fit to actually fit the model. Unlike lm, which works from a formula, lm.fit works from the model matrix and the response vector.
Define a response y containing 10 numeric values. Define an explanatory variable z of the numbers 1 to 10.
Use the function cbind() to create a matrix x with 1s in the first column and z in the second column.
Fit a model using fit1 <- lm.fit(x, y). Use str to explore the structure of the results. Use $ to extract the coefficients.
Create a second fit using lm(y ~ z). Use names to compare the results. Check the coefficients of the second fit are the same.
~ notation
The ~ notation can be used to specify a model formula, where the LHS is the response and the RHS is a collection of predictors, e.g. body_mass_g ~ bill_length_mm + flipper_length_mm.
This can be used as the formula argument when fitting a model, e.g. a linear model:
In the formula, crucially, the + does not have to mean it is a linear additive model and should be read more as “this variable and this one etc” which the model function might use additively (e.g. lm) or might not (e.g. many ML models).
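The original linear model example is not shown. As a stand-in, the sketch below uses the built-in mtcars data rather than the penguins variables named above, so the variable names here are a substitution:

```r
# A formula is an object in its own right
f <- mpg ~ wt + hp
class(f)     # "formula"
all.vars(f)  # "mpg" "wt" "hp"

# Pass it as the formula argument of a model function
fit <- lm(f, data = mtcars)
coef(fit)    # intercept plus one coefficient per predictor
```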
Control structures are the commands that make decisions or execute loops.
Conditional execution: if/else, switch
Loops: for, while, repeat
if/else
An if statement can stand alone or be combined with an else statement
The condition must evaluate to a logical vector of length one. The functions all(), any(), is.na(), is.null() and other is. functions are useful here.
Using == may not be appropriate as it compares each element; identical() will test the whole object
all.equal() will allow for some numerical tolerance.
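A minimal sketch of these points, with illustrative values:

```r
x <- c(1, 2, NA)

# any() reduces the element-wise test to a single TRUE/FALSE
if (any(is.na(x))) {
  msg <- "x has missing values"
} else {
  msg <- "x is complete"
}

a <- c(0.1 + 0.2, 1)
b <- c(0.3, 1)

a == b                              # element-wise: a logical vector of length two
exact <- identical(a, b)            # FALSE: 0.1 + 0.2 is not exactly 0.3
close <- isTRUE(all.equal(a, b))    # TRUE: equal within numerical tolerance
```

all.equal() returns a character description when the objects differ, so wrapping it in isTRUE() gives a safe condition for if.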
switch
The switch() function provides a more readable alternative to nested if statements
x <- 1:5
switch("range", # can enter an arg name or position
       IQR = IQR(x),
       range = range(x),
       mean(x))
[1] 1 5
The final unnamed argument is the default. Further examples
for
A for loop repeats a chunk of code, iterating along the values of a vector or list
Unassigned objects are not automatically printed inside a loop, hence the call to print().
seq_along() is used here rather than 1:length(x) as length(x) may be zero. message() is used to print messages to the console.
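The loop the slide describes is not shown. A sketch consistent with the description, using made-up values, might be:

```r
x <- c("apple", "banana", "cherry")
for (i in seq_along(x)) {
  message("Element ", i)  # goes to the console as a message
  print(x[i])             # explicit print() is needed inside a loop
}

# Why seq_along() is safer: with an empty vector, 1:length(x) counts down
empty <- character(0)
bad_seq  <- 1:length(empty)   # 1 0, so the loop body would run twice
good_seq <- seq_along(empty)  # integer(0), so the loop body never runs

n_iter <- 0
for (i in seq_along(empty)) n_iter <- n_iter + 1
```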
while and repeat
The while loop repeats while a condition is TRUE
The repeat loop repeats until exited by break
break can be used in for or while loops too.
next can be used to skip to the next iteration.
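These constructs can be sketched as follows (the counters and thresholds are illustrative):

```r
# while: repeat as long as the condition holds
i <- 0
while (i < 3) {
  i <- i + 1
}

# repeat: loop until break is reached
j <- 0
repeat {
  j <- j + 1
  if (j >= 3) break
}

# next and break inside a for loop: sum the odd numbers up to 7
total <- 0
for (k in 1:10) {
  if (k %% 2 == 0) next  # skip even numbers
  if (k > 7) break       # stop once we pass 7
  total <- total + k
}
total
#> [1] 16
```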
Iteration functions provide a general alternative to for loops. They are not necessarily faster, but can be more compact.
apply() applies a function over rows/columns of a matrix.
lapply(), sapply() and vapply() iterate over a list or vector. vapply() is recommended for programming as it specifies the type of return value
mapply() iterates over two or more lists/vectors in parallel.
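A brief sketch of each function on small illustrative inputs:

```r
M <- matrix(1:6, nrow = 2)
row_sums <- apply(M, 1, sum)  # over rows: 9 12
col_max  <- apply(M, 2, max)  # over columns: 2 4 6

l <- list(a = 1:3, b = 4:6)
as_list   <- lapply(l, mean)              # always returns a list
simplified <- sapply(l, mean)             # simplifies when it can
checked   <- vapply(l, mean, numeric(1))  # errors unless each result is one numeric

# mapply: iterate over two vectors in parallel
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)  # 5 7 9
```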
Efficient R by Colin Gillespie and Robin Lovelace
The built-in help pages. You can directly access the examples using the example() function, e.g. to run the apply() examples, use example("apply").
This StackOverflow answer, describing when, where and how to use each of the functions.
This blog post by Neil Saunders
The purrr package (part of the tidyverse) provides alternatives to the apply family that have a simpler, more consistent interface with fixed type of return value.
# Split a data frame into pieces,
# fit a model to each piece, summarise and extract R^2
library(purrr)
mtcars %>%
  split(.$cyl) %>%                       # base R
  map(~ lm(mpg ~ wt, data = .x)) %>%     # returns a list
  map(summary) %>%
  map_dbl("r.squared")                   # returns a vector
4 6 8
0.5086326 0.4645102 0.4229655
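Note: We do need %>% not |> here, because the . placeholder in split(.$cyl) is a feature of %>% that the native pipe does not support. See here for more details.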
The first argument is always the data, so purrr works naturally with the pipe.
All purrr functions are type-stable. They always return the advertised output type (e.g. map() returns lists; map_dbl() returns double vectors), or they throw an error.
All map() functions accept a function, a formula (used for succinctly generating anonymous functions), a character vector (used to extract components by name), or a numeric vector (used to extract by position).
See the iteration chapter of R for Data Science for further examples and details
Adding to an object in a loop, e.g. via c() or cbind(), forces a copy to be made at each iteration. THIS IS BAD!
It is far better to create an object of the necessary size first
To initialise a list we can use
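The code for this step is missing from the slide. A common approach, shown here as a sketch with illustrative names, is to preallocate with numeric() or vector("list", n):

```r
n <- 5

# Preallocate a numeric vector, then fill it in place
res <- numeric(n)
for (i in seq_len(n)) {
  res[i] <- i^2
}

# Preallocate a list of n empty (NULL) slots, then fill with [[
out <- vector("list", n)
for (i in seq_len(n)) {
  out[[i]] <- seq_len(i)
}
```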
There will usually be many ways to write code for a given task. To compare alternatives, we can benchmark the expression
[1] 10.873
[1] 0.074
Note the BIG difference between growing and initialising a vector (the latter around 150 times faster in this case).
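The benchmarked expressions are not shown. The sketch below compares the two strategies using base R's system.time() instead of a benchmarking package. The helper names grow() and prealloc() are hypothetical, and timings will differ between machines.

```r
n <- 1e4

# Grow the vector: c() copies the whole vector at every iteration
grow <- function(n) {
  x <- NULL
  for (i in seq_len(n)) x <- c(x, i)
  x
}

# Preallocate: the vector is created once and filled in place
prealloc <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- i
  x
}

t_grow <- system.time(g <- grow(n))["elapsed"]
t_pre  <- system.time(p <- prealloc(n))["elapsed"]

# Same values either way; only the cost differs
same_result <- isTRUE(all.equal(g, p))
```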
for loops revisited
Each loop has three components:
The output: allocate sufficient space before you start the loop
The sequence: this determines what you loop over
The body: the code that does the work
Vectorization is operating on vectors (or vector-like objects) rather than individual elements.
Many operations in R are vectorized, e.g.
[1] FALSE TRUE FALSE
[1] 0.0000000 0.6931472 1.0986123
a b
3 6
We do not need to loop through each element!
Vectorized functions will recycle shorter vectors to create vectors of the same length
This is particularly useful for single values
and for generating regular patterns
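As an illustration of vectorization and recycling (values chosen for the example, not taken from the slides):

```r
x <- 1:3
x == 2   # FALSE TRUE FALSE, no loop needed
log(x)   # element-wise logarithm

# A single value is recycled to the length of the longer vector
doubled <- 1:6 * 2                    # 2 4 6 8 10 12

# A shorter vector is recycled to generate a regular pattern
alternating <- 1:6 * c(1, -1)         # 1 -2 3 -4 5 -6
labels <- paste0(c("a", "b"), 1:4)    # "a1" "b2" "a3" "b4"
```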
Write a for loop to compute the mean of every column in mtcars, saving each to a preallocated vector
Use lapply() with rnorm to generate a list of length 10, where the 1st item contains a vector of 1 sample from an \(N(0,1)\) distribution, the 2nd item contains a vector of 2 samples from an \(N(0,1)\) distribution, and so on up to the 10th item, which contains a vector of 10 samples from an \(N(0,1)\) distribution.
Use lapply() with rnorm to generate a list of length 10, where the 1st item contains a vector of 5 samples from \(N(1,1)\), the 2nd item contains a vector of 5 samples from \(N(2,1)\), and so on until you get 5 samples from \(N(10,1)\)
Vectorization applies to matrices too, not only through matrix algebra
but also vectorized functions
Values are recycled down the columns of a matrix, which is convenient for row-wise operations
To do the same for columns we would need to explicitly replicate, which is not so efficient.
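A small sketch of how recycling interacts with the column-major layout of a matrix (the matrix and offsets are illustrative):

```r
M <- matrix(1:6, nrow = 2)  # rows: (1 3 5) and (2 4 6)

# Recycling runs down each column, so a vector of length nrow(M)
# lines up with the rows: subtract 1 from row 1 and 2 from row 2
row_adj <- M - c(1, 2)

# For a column-wise operation the values must be replicated explicitly:
# subtract 1, 3 and 5 from columns 1, 2 and 3
col_adj <- M - rep(c(1, 3, 5), each = nrow(M))

# sweep() expresses the same column-wise operation more directly
col_adj_sweep <- sweep(M, 2, c(1, 3, 5))
```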
for loop
Operations that can be vectorized will be more efficient than a loop in R
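library(rbenchmark) # assumed source of benchmark(); not shown in the original
M <- matrix(1:100000, nrow = 200, ncol = 500)
x <- 1:200
# Subtract x[i] from every element of row i, one element at a time
benchmark({
  for (i in 1:200) {
    for (j in 1:500) {
      M[i, j] <- M[i, j] - x[i]
    }
  }
})$elapsed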
[1] 0.638
[1] 0.022
The latter is nearly 30 times faster!
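The vectorized expression being compared is not shown. Assuming it is the recycled subtraction M - x, the sketch below checks that it gives the same result as the double loop and times both with base R's system.time(). Timings will vary by machine.

```r
M <- matrix(1:100000, nrow = 200, ncol = 500)
x <- 1:200

# Loop version: one element at a time
M_loop <- M
t_loop <- system.time(
  for (i in seq_len(nrow(M))) {
    for (j in seq_len(ncol(M))) {
      M_loop[i, j] <- M[i, j] - x[i]
    }
  }
)["elapsed"]

# Vectorized version: x is recycled down each column of M
t_vec <- system.time(M_vec <- M - x)["elapsed"]

same_result <- identical(M_loop, M_vec)
```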
Several functions are available implementing efficient row/column-wise operations, e.g. colMeans(), rowMeans(), colSums(), rowSums(), sweep()
These provide an alternative to iterating through rows and columns in R (the iteration happens in C, which is faster).
The matrixStats package provides further “matricised” methods.
A golden rule in R programming is to access the underlying C/Fortran routines as quickly as possible; the fewer function calls required to achieve this, the better.
Be careful never to grow vectors
Vectorise code wherever possible
See Efficient Programming for more details and examples.
We will also expand on this topic in the Advanced R course next term.
Functions are defined by three components: the name, the arguments (defined in ( )) and the body (the code, defined in { }).
They are created using function()
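As a sketch of a function definition: the name t_statistic is borrowed from the naming guidance in these slides, while the body and the mu argument are illustrative assumptions.

```r
# name         arguments, inside ( )
t_statistic <- function(x, mu = 0) {
  # body, inside { }
  (mean(x) - mu) / (sd(x) / sqrt(length(x)))
}

t_val <- t_statistic(c(1, 2, 3))
arg_names <- names(formals(t_statistic))  # "x" "mu"
```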
As with arguments, function names are important:
use a name that describes what it returns (e.g. t_statistic) or what it does (e.g. remove_na)
try to use one convention for combining words (e.g. snake case t_statistic or camel case tStatistic)
avoid using the same name as other functions
Specified arguments are those named in the function definition, e.g. in rnorm() the arguments are n, mean and sd.
mean and sd have been given default values in the function definition, but n has not, so the function fails if the user does not pass a value to n
The user can pass objects to these arguments using their names or by supplying unnamed values in the right order
[1] 1.1648809 -0.2815949 17.0627133 -13.0602562 -15.1622605
[1] 4.335067 -14.128370 -7.472567 -5.245040 -14.021370
So naming and order are important! Some guidelines
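To illustrate matching by name versus by position with rnorm(), here is a sketch. The seed and values are arbitrary and are not the calls that produced the output above.

```r
# Unnamed values are matched in order: n = 5, mean = 1, sd = 10
set.seed(1)
by_position <- rnorm(5, 1, 10)

# Named values can be given in any order
set.seed(1)
by_name <- rnorm(sd = 10, mean = 1, n = 5)

# Swapping unnamed values silently changes the meaning: mean = 10, sd = 1
set.seed(1)
swapped <- rnorm(5, 10, 1)

# n has no default, so omitting it is an error
fails_without_n <- inherits(try(rnorm(), silent = TRUE), "try-error")
```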
Arguments are used as objects in the function code.
A new environment is created each time the function is called, separate from the global workspace.
If an object is not defined within the function, or passed in as an argument, R looks for it in the parent environment where the function was defined
It is safest (and best practice) to use arguments rather than depend on global variables!
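A minimal sketch of why relying on objects outside the function is fragile (function and object names are illustrative):

```r
y <- 10

# Depends on y from the environment where the function was defined
add_global <- function(x) x + y
first_call <- add_global(1)   # 11

# Changing y elsewhere silently changes the function's result
y <- 100
second_call <- add_global(1)  # 101

# Safer: everything the function needs is passed as an argument
add_arg <- function(x, y) x + y
explicit_call <- add_arg(1, 10)  # 11, regardless of the outer y
```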
By default, functions return the object created by the last line of code
Alternatively return() can be used to terminate the function and return a given object
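For instance, return() is handy for exiting early. The function safe_log() below is a hypothetical example, not part of the original material.

```r
safe_log <- function(x) {
  # Exit early with NA if the input is not valid for log()
  if (any(x <= 0)) {
    return(NA_real_)
  }
  # Otherwise the value of the last expression is returned
  log(x)
}

early <- safe_log(-1)  # NA
usual <- safe_log(1)   # 0
```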
Multiple objects can be returned in a list:
mean_and_sd <- function(x) {
  res_mean <- mean(x, na.rm = TRUE)
  res_sd <- sd(x, na.rm = TRUE) # drop NAs here too, consistent with the mean
  list(mean = res_mean, sd = res_sd)
}
mean_and_sd(1:3)
$mean
[1] 2
$sd
[1] 1
We use a list so that we can return different types of objects simultaneously, and access them easily with $, e.g. mean_and_sd(1:3)$mean returns 2.
Write your own function, variance, to compute the variance of a numeric vector:
\[ Var(x) = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2 \]
Make use of R’s built in vectorisation.
Test it and compare your answer with the built-in var() function.
Material (very largely) inspired by and remixed from:
Additionally:
Efficient R, Chapter 3 by Colin Gillespie and Robin Lovelace
R for Data Science, by Hadley Wickham and Garrett Grolemund, chapters on iteration and functions
Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
(Use in Functions session of Advanced R course in Term 2 instead)
... or the ellipsis allows unspecified arguments to be passed to the function.
This device is used by functions that work with arbitrary numbers of objects, e.g.
It can also be used to pass on arguments to another function, e.g.
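The examples for both uses are missing. As a sketch, sum() and paste() are base functions that take arbitrary numbers of objects via ..., and the wrapper mean_rounded() below is a hypothetical function that passes ... on to mean():

```r
# Functions that accept an arbitrary number of objects through ...
total  <- sum(1, 2, 3)           # 6
joined <- paste("a", "b", "c")   # "a b c"

# Passing ... on to another function: extra arguments go to mean()
mean_rounded <- function(x, digits = 1, ...) {
  round(mean(x, ...), digits)
}

with_na_rm <- mean_rounded(c(1, NA, 4), na.rm = TRUE)  # 2.5
without    <- mean_rounded(c(1, NA, 4))                # NA
```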
...
Arguments passed to ... can be collected into a list for further analysis
means <- function(...) {
  dots <- list(...)
  vapply(dots, mean, numeric(1), na.rm = TRUE)
}
x <- 1
y <- 2:3
means(x, y)
[1] 1.0 2.5
Similarly the objects could be concatenated using c()
A side-effect is a change outside the function that occurs when the function is run, e.g.
A function can have many side-effects and a return value, but it is best practice to have a separate function for each task, e.g. creating a plot or a table.
Writing to file is usually best done outside a function.
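The side-effect examples are not shown. One way to keep computation and side-effects apart is sketched below; both function names are hypothetical.

```r
# Pure computation: returns a value, changes nothing outside the function
summarise_values <- function(x) {
  c(mean = mean(x), sd = sd(x))
}

# Side-effect only: prints a message to the console and
# returns its input invisibly so it can still be reused
report_values <- function(stats) {
  message("Mean: ", stats[["mean"]], ", SD: ", stats[["sd"]])
  invisible(stats)
}

stats <- summarise_values(1:3)
reported <- report_values(stats)
```

Any writing to file, e.g. with write.csv(), would then happen in the calling script rather than inside either function.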