Functions

R Programming

Ella Kaye and Heather Turner
Department of Statistics, University of Warwick

March 18, 2024

Overview

  • Functions basics
  • Beyond the basics
  • Error handling

Functions basics

Components of a function

Functions are defined by three components:

  • the name of the function
  • the arguments of the function, inside ( )
  • the body of the function that computes the result, inside { }

They are created using function()

t_statistic <- function(n) {
    x <- rnorm(n)
    y <- rnorm(n)
    t.test(x, y)$statistic
}

Naming functions

Function names are important:

  • use a name that describes what it returns (e.g. t_statistic) or what it does (e.g. remove_na)

  • try to use one convention for combining words (e.g. snake case t_statistic or camel case tStatistic)

  • avoid using the same name as other functions

Specified arguments

specified arguments are those named in the function definition, e.g.  in rnorm()

args(rnorm)
function (n, mean = 0, sd = 1) 
NULL

the arguments are n, mean and sd.

mean and sd have been given default values in the function definition, but n has not, so the function fails if the user does not pass a value to n

rnorm()
Error in rnorm(): argument "n" is missing, with no default

Name and order of arguments

The user can pass objects to these arguments using their names or by supplying unnamed values in the right order

rnorm(5, 1, 10)
[1]  5.327188 -2.238011 -5.107330  4.713281 11.541684
rnorm(5, sd = 10)
[1] -18.464723   9.855951  -1.633670  10.121746 -10.506900

So naming and order is important! Some guidelines

  • put compulsory arguments first, e.g. data
  • put rarely used arguments last, e.g. tolerance setting
  • use short but meaningful argument names
  • if relevant, use the same argument names as similar functions

Return values (single)

By default, functions return the object created by the last line of code

f <- function(x) {
    x <- x + 1
    log(x)
}

Alternatively return() can be used to terminate the function and return a given object

f <- function(x) {
    if (all(x > 0)) return(log(x))
    x[x <= 0] <- 0.1
    log(x)
}

Return values (multiple)

Multiple objects can be returned in a list:

mean_and_sd <- function(x) {
  res_mean <- mean(x, na.rm = TRUE)
  res_sd <- sd(x)
  
  list(mean = res_mean,
       sd = res_sd)
}

x <- 1:3
mean_and_sd(x)
$mean
[1] 2

$sd
[1] 1

Multiple return values, different types

symmetric_sums <- function(mat) {
  sym_mat <- mat + t(mat)
  sum <- sum(sym_mat)
  list(sym_mat = sym_mat, sum = sum)
}
mat <- matrix(1:9, 3, 3)
symmetric_sums(mat)
$sym_mat
     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    6   10   14
[3,]   10   14   18

$sum
[1] 90

RStudio helper

RStudio has a helper to turn code into a function:

  1. Select the lines of code that will become the body of the function.
  2. Select “Code” > “Extract Function” from the menu.
  3. Enter the name of the new function in the dialog box.
  4. Edit the arguments if required.
  5. Add/edit the last line to specify the return value.

Exercise files

https://github.com/Warwick-Stats-Resources/Advanced-R-exercises

  • You can get the files by creating a new project from version control in RStudio (if set up)

  • By going to the ‘Code’ button in the repo, then ‘Download ZIP’, then opening Advanced-R-exercises.Rproj.

Exercise 1

In the qq_norm chunk of exercises.Rmd there is some code to compute the slope and intercept of the line to add to a quantile-quantile plot, comparing sample quantiles against theoretical quantiles of a N(0, 1) distribution.

Turn this code into a function named qq_norm taking the sample data as an argument and returning the slope and intercept in a list.

Run this chunk to source the function, then run the normal-QQ chunk which uses the qq_norm function to compute parameters for an example plot.

Beyond the basics

Function environment

Arguments are used as objects in the function code.

A new environment is created each time the function is called, separate from the global workspace.

x <- 1
y <- 3
f <- function(x, y){
    a <- 1
    x <- x + a
    x + y
}
f(x, y)
[1] 5
x
[1] 1
a
Error in eval(expr, envir, enclos): object 'a' not found

Lexical scoping

If an object is not defined within the function, or passed in as an argument, R looks for it in the parent environment where the function was defined

x <- 1
y <- 3
f <- function(x){
    x + y
}
f(x)
[1] 4
rm(y)
f(x)
Error in f(x): object 'y' not found

It is safest (and best practice) to use arguments rather than depend on global variables!

Unspecified arguments

... (the ellipsis) allow unspecified arguments to be passed to the function.

This device is used by functions that work with arbitrary numbers of objects, e.g. 

args(sum)
function (..., na.rm = FALSE) 
NULL
sum(1, 4, 10, 2)
[1] 17

It can also be used to pass on arguments to another function, e.g.

t_statistic <- function(x, g, ...) {
    t.test(x ~ g, ...)$stat
}

Using ...

Arguments passed to ... can be collected into a list for further analysis

means <- function(...){
    dots <- list(...)
    vapply(dots, mean, numeric(1), na.rm = TRUE)
}
x <- 1
y <- 2:3
means(x, y)
[1] 1.0 2.5

Similarly the objects could be concatenated using c()

Side effects

A side-effect is a change outside the function that occurs when the function is run, e.g.

  • plot to the graphics window or other device
  • printing output to the console
  • write data to a file

A function can have many side-effects and a return value, but it is best practice to have a separate function for each task, e.g creating a plot or a table.

Writing to file is usually best done outside a function.

Exercise 2

Copy your qq_norm function to the qq chunk and rename it qq.

Add a new argument qfun to specify any quantile function (e.g. qt, qf, etc). Give it the default value qnorm.

In the body of qq, use qfun instead of qnorm to compute q_theory. Use ... to pass on arguments to qfun.

Run the qq chunk and test your function on the t-QQ chunk.

Using functions from other packages

In our own functions (outside of packages), it is possible to use library

scale_rows <- function(X){
    library(matrixStats)
    X <- X - rowMeans(X)
    X/rowSds(X)
}

But this loads the entire package, potentially leading to clashes with functions from other packages. It is better to use the import package:

scale_rows <- function(X){
    import::from(matrixStats, rowSds)
    X <- X - rowMeans(X)
    X/rowSds(X)
}
scale_rows(matrix(1:12, nrow = 3))

Custom ggplot

ggplot2, like dplyr and other tidyverse packages, uses non-standard evaluation, that is, it refers to variable names in a data frame as if they were objects in the current environment

ggplot(mtcars, aes(x = mpg, y= disp)) +
    geom_point()

To emulate this, we have to need to embrace arguments

ggscatter <- function(data, x, y){
    import::from(ggplot2, ggplot, aes, geom_point)
    ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
        geom_point()
}
ggscatter(mtcars, x = mpg, y = disp)

Externalizing function code

It is a good idea to separate function code from analysis code.

Put related functions together and source as required

source("modelFunctions.R")
source("plotFunctions.R")

The import package enables only necessary, top-level functions to be imported to the global workspace:

import::here(poissonModel, quasiPoissonModel, .from = "modelFunctions.R")

In either case, import::from commands can be put outside the function body to make the code easier to read.

Error handling

Sanity checks

To avoid mistakes, you may want to add some basic sanity checks

logit <- function(p){
    stopifnot(p > 0 & p < 1)
    log(p/(1 - p))
}
logit(2)
Error in logit(2): p > 0 & p < 1 is not TRUE
logit(0.5)
[1] 0

Error messages

Often the R messages can be quite obscure

zap <- function(x) if (max(x) < 1e7) 0 else x
x <- c(1, 2, NA)
zap(x)
Error in if (max(x) < 1e+07) 0 else x: missing value where TRUE/FALSE needed

More helpful error message can be implemented using stop

zap <- function(x) {
    if (any(is.na(x))) stop("missing values in x\nare", 
                            " not allowed")
    if (max(x) < 1e7) 0 else x
}
zap(x)
Error in zap(x): missing values in x
are not allowed

Warning messages

Warning messages should be given using warning()

safe_log2 <- function(x) {
    if (any(x == 0)) {
        x[x == 0] <- 0.1
        warning("zeros replaced by 0.1")
    }
    log(x, 2)
}
safe_log2(0:1)
[1] -3.321928  0.000000

Other messages can be printed using message().

Suppressing warnings

If a warning is expected, you may wish to suppress it

log(c(3, -1))
[1] 1.098612      NaN
x <- suppressWarnings(log(c(3, -1)))

All warnings will be suppressed however!

Similarly suppressMessages() will suppress messages.

Catching errors/warnings

The purrr package has various functions to catch issues.

possibly() lets you modify a function to return a specified value when there is an error

log("a")
Error in log("a"): non-numeric argument to mathematical function
library(purrr)
poss_log <- possibly(log, otherwise = NA)
poss_log("a")
[1] NA

safely() works in a similar way but returns a list with elements "result" and "error", so you can record the error message(s).

quietly() lets you modify a function to return printed output, warnings and messages along with the result.

traceback()

When an unexpected error occurs, there are several ways to track down the source of the error, e.g. traceback()

f1 <- function(x){ f2(x) }
f2 <- function(x){ x + qqqq }
f1(10)
Error in f2(x): object 'qqqq' not found
traceback()
2: f2(x) at #1
1: f1(10)

traceback() in RStudio

In RStudio, if Debug > On Error > Error Inspector is checked and the traceback has at least 3 calls, the option to show traceback is presented

f1 <- function(x){ f2(x) }
f2 <- function(x){ f3(x) }
f3 <- function(x){ x + qqqq }
f1(10)
Error in f3(x): object 'qqqq' not found

RStudio error inspector giving options to show traceback or rerun with debug

debugonce()

debugonce() flags a function for debugging the next time it is called

debugonce(f2)
f1(10)
Error in f3(x): object 'qqqq' not found
debugging in: f2(x)
debug at #1: {
    x + qqqq
}
Browse[2]> ls()
[1] "x"
Browse[2]> x
[1]> 10

When in debug mode type n or to step to the next line and c to continue to the end of a loop or the end of the function. Q quits the debugger.

Breakpoints

Stepping through a function line by line can be tedious. In RStudio we can set custom breakpoints in the source pane

Set breakpoint in RStudio

Source the code

Start debugging from breakpoints

RStudio’s Rerun with Debug

The Rerun with Debug option will rerun the command that created the error and enter debug mode where the error occurred.

Good points:

  • Easy to enter debug mode (when option shown)
  • Can click in Traceback pane to view objects at any point in the call stack

Bad points:

  • May have gone past source of error (use breakpoints instead)
  • May enter deeply nested function: use recover() to select an earlier entry point

Alternatively use options(error = recover), run code to debug, then set options(error = NULL).

Exercise 3

Open debug_practice.R and source the function f().

Try to run f(10) - there’s an error! Use traceback() to see which function call generated the error, then fix the problem.

Run f(10) again - there is another error! Can you fix this directly given the error message?

Try running f(1) - is the result what you expected? Use debugonce() to set debugging on f() and re-run f(1). Step through the function, printing each object as it is created to see what is happening.

Can you think how to improve the function? See if you can modify the function to give a sensible result for any integer.

End matter

References

License

Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Documenting and testing

Documenting functions

Comments help to record what a function does

# reorder x by grouping variable g
groupSort <- function(x, g) {
    ord <- order(g) #indices for ascending order of g
    x[ord]
}

The docstring package enables roxygen comments to be turned into a help file

library(docstring)
groupSort <- function(x, g) {
    #' Reorder a Vector by a Grouping Variable
    #'
    #' @param x a vector
    #' @param g a grouping variable
    ord <- order(g) #indices for ascending order of g
    x[ord]
}

View the help file as usual

?groupSort

HTML documentation generated by docstring

For fuller documentation, see the docstring vignette.

ADD LINK TO VIGNETTE

roxygen

ADD A SLIDE WITH SOME RESOURCES ABOUT ROXYGEN

Validation

When developing a function, we will want to validate its output.

A simple approach is to try different inputs

log_2 <- function(x){
    log(x, 2)
}
log_2(2^2)
[1] 2
log_2(2^0)
[1] 0

Doing this each time we change the function becomes tedious to check and error-prone as we miss important tests.

Unit testing

The testthat packages allows us to create a test suite:

ADD LINK

context("log_2 works correctly")

test_that("log_2 returns log to base 2", {
    expect_equal(log_2(2^3), 3)
    expect_equal(log_2(2^0), 0)
})

test_that("negative values give error", {
    expect_error(log_2(2^-1))
})

Running tests

If we save the tests in a file, e.g. tests.R, we can use test_file() to run and check all tests:

library(testthat)
test_file("tests.R")
√ | OK F W S | Context
x |  2 1     | log_2 works correctly
--------------------------------------------------------------------------------
tests.R:9: failure: negative values give error
`log_2(2^-1)` did not throw an error.
--------------------------------------------------------------------------------

== Results =====================================================================
OK:       2
Failed:   1
Warnings: 0
Skipped:  0

Your turn!

Copy the qq function to a new R script and save as functions.R. Add roxygen comments at the start of the function body to define a title and parameter documentation.

Run the documentation chunk of exercises.Rmd to view your documentation.

Open the tests.R script. Using expect_equal add some tests for the following

  • a sample of 100,000 from N(0, 1) gives approximately slope 1, intercept 0
  • a sample of 100,000 from N(0, 1/2) gives approximately slope 2, intercept 0
  • sample of 100,000 from N(2, 1) gives approximately slope 1, intercept -2

Use the tol argument in expect_equal to set a tolerance of 0.01.

Run the tests chunk of exercises.Rmd to run your tests with test_file. Try changing the expected tolerance to get a test to fail.