R Foundations 2024
October 24, 2024
Data types
Data structures
Data import and wrangling
The assignment operator in R is <-
We can create objects in R and assign them names:
Then we can inspect the objects we have created:
And use them further:
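The code for these three steps is not reproduced above. A minimal sketch, using illustrative object names (`height_cm`, `width_cm`, `area_cm2`) that are assumptions rather than the course's own, might look like this:

```r
# Create objects and assign them names with <-
height_cm <- 170
width_cm <- 40

# Inspect an object by typing its name (or wrapping it in print())
height_cm

# Use the objects in further calculations
area_cm2 <- height_cm * width_cm
area_cm2
```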
Object names cannot:
start with a number
contain certain characters such as , - or ?
contain a space (unless wrapped in backticks, but that is not best practice)
day_one
day_1
i_use_snake_case
other.people.use.periods
evenOthersUseCamelCase
The tidyverse has popularised the use of snake_case. Camel case is a better option for screen readers. The use of periods is discouraged because periods have other uses in R.
foo
bar
first_day_of_month
dayone
Assigning and environment pane.
There is an RStudio shortcut for <- which also puts spaces around it: Alt/⌥ + -
character: "a", "hello, world!"
double: 3, 3.14, pi
integer: 3L (the L tells R to store this as an integer)
logical: TRUE and FALSE
complex: 3+2i. N.B. you need to write 1i for \(\sqrt{-1}\).
raw: holds raw bytes (rarely used)
N.B. double and integer types are both numeric
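As a quick check of the types listed above, `typeof()` reports how R stores a value. This is a minimal sketch using base R only:

```r
typeof("hello, world!")  # "character"
typeof(3.14)             # "double"
typeof(3L)               # "integer"
typeof(TRUE)             # "logical"
typeof(3 + 2i)           # "complex"
typeof(as.raw(255))      # "raw"

# double and integer are both numeric
is.numeric(3.14)  # TRUE
is.numeric(3L)    # TRUE

# 1i is the imaginary unit: multiplying it by itself gives -1
minus_one <- 1i * 1i
minus_one
```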
NA: The value NA is given to any data which R knows to be missing. It is not a character string, i.e. it is different to "NA".
Inf: Positive infinity, e.g. the result of dividing a positive number by zero.
NaN: Not a number, e.g. attempting to find the logarithm of a negative number.
NULL: The null object. Often returned by expressions and functions whose value is undefined.
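The special values above can be explored directly at the console. A minimal sketch (the vector `x` is made up for illustration):

```r
x <- c(1.5, NA, 3)
is.na(x)               # FALSE TRUE FALSE
is.na("NA")            # FALSE: "NA" is an ordinary character string
mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # 2.25, missing values removed first

1 / 0    # Inf
-1 / 0   # -Inf
0 / 0    # NaN
log(-1)  # NaN, with a warning

is.null(NULL)  # TRUE
length(NULL)   # 0
```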
Data structures are the building blocks of R code.
In R, the main types of structures are
vectors
factors
matrices and arrays
lists
data frames
Focus today on vectors, factors and data frames.
A single number is a special case of a numeric vector. Vectors of length greater than one can be created using the concatenate function, c().
The elements of the vector must be of the same type: common types are numeric, character and logical.
There are built-in functions for getting information about vectors, e.g.
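The examples from this slide are not reproduced here. As a minimal sketch (the vector names and values are illustrative assumptions), vectors can be created with `c()` and inspected with functions such as `length()`, `typeof()` and `str()`:

```r
ages <- c(23, 35, 41, 19)     # numeric vector
first_names <- c("Ann", "Bob") # character vector

length(ages)        # 4
typeof(ages)        # "double"
class(first_names)  # "character"
str(ages)           # compact description of the object
summary(ages)       # minimum, quartiles, mean and maximum

# Mixing types coerces every element to a single common type
mixed <- c(1, "a", TRUE)
mixed               # "1" "a" "TRUE"
```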
There are some useful shortcuts for certain types of vector:
[1] 1 2 3 4 5
[1] 3.0 3.5 4.0 4.5 5.0
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
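The printed output above is consistent with calls like the following. The original code is not shown, so treat this as a likely reconstruction rather than the slide's exact code:

```r
1:5                  # integer sequence from 1 to 5
seq(3, 5, by = 0.5)  # sequence from 3 to 5 in steps of 0.5
LETTERS              # built-in vector of upper-case letters
```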
What do you think letters returns?
We subset vectors using []:
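A minimal sketch of the common ways to subset with `[]`, using a made-up vector `x`:

```r
x <- c(10, 20, 30, 40, 50)

x[2]        # second element: 20
x[c(1, 3)]  # first and third elements: 10 30
x[2:4]      # elements two to four: 20 30 40
x[-1]       # everything except the first element: 20 30 40 50
x[x > 25]   # elements meeting a condition: 30 40 50
```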
In your R-Foundations project from last week, create and save a new script called data-types.R
Look at the help page for the rep() function: ?rep
Starting with the vector x <- c(1,3,6), can you make the following patterns:
What does rep(x, 2, 2) give? Is it what you expected? Can you explain the output?
Factors are used to represent categorical data. They can be ordered or unordered.
Factors are stored as integers, and have labels associated with these unique integers. While factors usually look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
[1] apple apple pear
Levels: apple pear
Factor w/ 2 levels "apple","pear": 1 1 2
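The printed output above could come from code along these lines. This is a reconstruction: the object names `fruit`, `sizes` and `f` are assumptions, and the last two examples are added for illustration.

```r
fruit <- factor(c("apple", "apple", "pear"))
fruit              # prints the values and the levels
levels(fruit)      # "apple" "pear"
as.integer(fruit)  # 1 1 2, the underlying integer codes
str(fruit)

# Levels can be given in a chosen order instead of alphabetical order
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
levels(sizes)      # "small" "medium" "large"

# Careful when treating factors like numbers or strings:
# as.integer() returns the codes, not the labels
f <- factor(c("10", "20", "10"))
as.integer(f)                # 1 2 1
as.numeric(as.character(f))  # 10 20 10
```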
The forcats package from the tidyverse has many functions for dealing with factors.
Data sets are stored in R as data frames.
These are structured as a list of objects, typically vectors, of the same length. Each vector is one column (variable) of the data set.
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
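To illustrate the list-of-equal-length-vectors structure, here is a minimal sketch with a small made-up data frame. The `birds` object and its values are invented for illustration and are not taken from the penguins data.

```r
birds <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Adelie")),
  mass_g = c(3750, 5700, 3800),
  tagged = c(TRUE, FALSE, TRUE)
)

is.list(birds)  # TRUE: a data frame is a list of columns
length(birds)   # 3 columns
nrow(birds)     # 3 rows
birds$mass_g    # extract one column as a vector
str(birds)      # compact overview, in the same style as the output above
```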
From the tibble page:
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.
Spot the differences!
Using the Import Dataset dialog in RStudio we can import files stored locally or online in the following formats:
.txt/.csv via read_delim/read_csv from readr.
.xlsx via read_excel from readxl.
.sav/.por, .sas7bdat and .dta via read_spss, read_sas and read_stata respectively from haven.
Most of these functions also allow files to be compressed, e.g. as .zip.
It’s REALLY important to have good file names and paths, and a good project structure.
I leave you in the extremely capable hands of Danielle Navarro to take you thoroughly through best practices:
https://djnavarro.net/slides-project-structure/#1
I also HIGHLY recommend you check out the here package, which enables easy file referencing in project-oriented workflows.
The rio package provides a common interface to the functions used by Import Dataset as well as many others.
The data format is automatically recognised from the file extension. To read the data in as a tibble, we use the setclass argument.
See ?rio for the underlying functions used for each format and the corresponding optional arguments, e.g. the skip argument to read_excel to skip a certain number of rows.
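As a hedged sketch of this interface, the example below writes a tiny made-up CSV to a temporary file so that it is self-contained, then reads it back with `rio::import()`. In practice you would pass your own path instead of `tmp`. It assumes the rio and tibble packages are installed.

```r
library(rio)

# Write a small illustrative file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, mass_g = c(3750, 3800, 3250)),
          tmp, row.names = FALSE)

# The format is detected from the .csv extension;
# setclass = "tibble" returns a tibble rather than a plain data frame
dat <- import(tmp, setclass = "tibble")
class(dat)
dat
```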
Show both button (in Environment and from the file itself) and code
From file
data/penguins_lter.csv
From URL
Your turn!
The dplyr package (part of the tidyverse) provides the following key functions to operate on data frames:
filter()
arrange()
select()
mutate()
summarise()
They all take a data frame as their first argument. The subsequent arguments describe what to do with the data frame. The result is a new data frame.
filter(): pick rows based on values of observations.
# A tibble: 39 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 50 16.3 230 5700
2 Gentoo Biscoe 50 15.2 218 5700
3 Gentoo Biscoe 49 16.1 216 5550
4 Gentoo Biscoe 49.3 15.7 217 5850
5 Gentoo Biscoe 49.2 15.2 221 6300
6 Gentoo Biscoe 48.7 15.1 222 5350
7 Gentoo Biscoe 50 15.3 220 5550
8 Gentoo Biscoe 59.6 17 230 6050
9 Gentoo Biscoe 48.4 16.3 220 5400
10 Gentoo Biscoe 48.7 15.7 208 5350
# ℹ 29 more rows
# ℹ 2 more variables: sex <fct>, year <int>
variable names are unquoted
building blocks of conditions:
Building block | R code |
---|---|
Binary comparisons | > , < , == , <= , >= , != |
Logical operators | or \| , and & , not ! |
Value matching | e.g. x %in% 6:9 |
Missing indicator | e.g. is.na(x) |
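The building blocks in the table can be combined inside `filter()`. The sketch below uses a small invented tibble (`birds`) rather than the penguins data, and is not the code that produced the output shown earlier:

```r
library(dplyr)

birds <- tibble(
  species = c("Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  mass_g = c(3750, 5700, NA, 3500),
  year = c(2007, 2008, 2009, 2009)
)

# Binary comparison (rows where the comparison is NA are dropped)
heavy <- filter(birds, mass_g > 3600)
# Conditions separated by commas are combined with "and"
recent_gentoo <- filter(birds, species == "Gentoo", year >= 2008)
# Logical "or"
adelie_or_light <- filter(birds, species == "Adelie" | mass_g < 3600)
# Value matching
later_years <- filter(birds, year %in% 2008:2009)
# Missing indicator, negated to keep only rows with a known mass
known_mass <- filter(birds, !is.na(mass_g))
```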
select(): select variables (columns) in a dataset
# A tibble: 344 × 4
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<dbl> <dbl> <int> <int>
1 39.1 18.7 181 3750
2 39.5 17.4 186 3800
3 40.3 18 195 3250
4 NA NA NA NA
5 36.7 19.3 193 3450
6 39.3 20.6 190 3650
7 38.9 17.8 181 3625
8 39.2 19.6 195 4675
9 34.1 18.1 193 3475
10 42 20.2 190 4250
# ℹ 334 more rows
# A tibble: 344 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<dbl> <dbl> <int> <int> <int>
1 39.1 18.7 181 3750 2007
2 39.5 17.4 186 3800 2007
3 40.3 18 195 3250 2007
4 NA NA NA NA 2007
5 36.7 19.3 193 3450 2007
6 39.3 20.6 190 3650 2007
7 38.9 17.8 181 3625 2007
8 39.2 19.6 195 4675 2007
9 34.1 18.1 193 3475 2007
10 42 20.2 190 4250 2007
# ℹ 334 more rows
There are several other selectors. See ?dplyr::select or online for further details.
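A few common ways of selecting columns, sketched on a small invented tibble (`birds`). These are illustrative and not necessarily the calls that produced the output above:

```r
library(dplyr)

birds <- tibble(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 50),
  bill_depth_mm = c(18.7, 16.3),
  year = c(2007L, 2008L)
)

by_name <- select(birds, species, year)            # by name
by_range <- select(birds, bill_length_mm:year)     # a range of adjacent columns
dropped <- select(birds, -species)                 # everything except species
by_prefix <- select(birds, starts_with("bill"))    # by name prefix
by_type <- select(birds, where(is.numeric))        # by column type
```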
|> vs %>%
Pipes pass what comes before into an argument (by default the first) of what comes after.
Pipes are a powerful tool for clearly expressing a sequence of multiple operations.
|>
%>%
By default, a pipe takes what comes before and passes it to the first argument of what comes after.
So far, so good, but what if we want to pipe into a subsequent argument?
The two pipes use different placeholders (. for %>% and _ for |>), and with the native pipe the placeholder must be passed to a named argument.
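A minimal sketch of both pipes, using the built-in `mtcars` data as a stand-in (an assumption, not the slide's own example). The `_` placeholder needs R 4.2 or later, and `%>%` needs magrittr or a tidyverse package to be loaded:

```r
library(magrittr)  # provides %>%; also available after loading dplyr

x <- c(1, 4, 9)

# Both pipes pass x as the first argument of sqrt(), then of sum()
total_native <- x |> sqrt() |> sum()
total_magrittr <- x %>% sqrt() %>% sum()

# Piping into an argument other than the first:
# the native pipe uses _ and requires a named argument
fit_native <- mtcars |> lm(mpg ~ wt, data = _)
# the magrittr pipe uses . as its placeholder
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)
```

Both fits use the same data and formula, so their coefficients agree.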
There is an RStudio shortcut for the pipe which also puts spaces around it: Ctrl/⌘ + ⇧ + M.
This can be set to either %>% or |> in the RStudio preferences.
Go to Tools -> Global Options -> Code and check/uncheck the box for “Use native pipe operator”.
arrange(): change the ordering of rows
# A tibble: 344 × 3
species sex flipper_length_mm
<fct> <fct> <int>
1 Adelie female 172
2 Adelie female 174
3 Adelie female 176
4 Adelie female 178
5 Adelie male 178
6 Adelie female 178
7 Chinstrap female 178
8 Adelie <NA> 179
9 Adelie <NA> 180
10 Adelie male 180
# ℹ 334 more rows
# A tibble: 344 × 3
species sex flipper_length_mm
<fct> <fct> <int>
1 Adelie female 172
2 Adelie female 174
3 Adelie female 176
4 Adelie female 178
5 Adelie male 178
6 Adelie female 178
7 Adelie <NA> 179
8 Adelie <NA> 180
9 Adelie male 180
10 Adelie male 180
# ℹ 334 more rows
# A tibble: 344 × 3
species sex flipper_length_mm
<fct> <fct> <int>
1 Gentoo male 231
2 Gentoo male 230
3 Gentoo male 230
4 Gentoo male 230
5 Gentoo male 230
6 Gentoo male 230
7 Gentoo male 230
8 Gentoo male 230
9 Gentoo male 229
10 Gentoo male 229
# ℹ 334 more rows
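The code behind the three outputs above is not shown. As a hedged sketch on a small invented tibble (`birds`), the usual patterns are ascending order, ordering by several columns, and descending order with `desc()`:

```r
library(dplyr)

birds <- tibble(
  species = c("Gentoo", "Adelie", "Adelie", "Chinstrap"),
  flipper_length_mm = c(230L, 172L, 180L, 178L)
)

ascending <- arrange(birds, flipper_length_mm)           # smallest first
by_species <- arrange(birds, species, flipper_length_mm) # species, then length
descending <- arrange(birds, desc(flipper_length_mm))    # largest first
```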
mutate(): create and modify columns
penguins |>
  filter(species == "Gentoo") |>
  select(sex, flipper_length_mm) |>
  mutate(size = if_else(flipper_length_mm > 217, "big", "small"))
# A tibble: 124 × 3
sex flipper_length_mm size
<fct> <int> <chr>
1 female 211 small
2 male 230 big
3 female 210 small
4 male 218 big
5 male 215 small
6 female 210 small
7 female 211 small
8 male 219 big
9 female 209 small
10 male 215 small
# ℹ 114 more rows
penguins |>
  select(bill_length_mm) |>
  filter(!is.na(bill_length_mm)) |>
  mutate(bill_length_mm_cumsum = cumsum(bill_length_mm))
# A tibble: 342 × 2
bill_length_mm bill_length_mm_cumsum
<dbl> <dbl>
1 39.1 39.1
2 39.5 78.6
3 40.3 119.
4 36.7 156.
5 39.3 195.
6 38.9 234.
7 39.2 273
8 34.1 307.
9 42 349.
10 37.8 387.
# ℹ 332 more rows
summarise(): reduces multiple values down to a single summary
penguins |>
  group_by(species, sex) |>
  filter(!is.na(sex)) |>
  summarise(mean = mean(body_mass_g, na.rm = TRUE)) |> # give column a name
  ungroup() # best practice after group_by()
# A tibble: 6 × 3
species sex mean
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
penguins |>
  filter(!is.na(sex)) |>
  summarise(mean = mean(body_mass_g, na.rm = TRUE),
            .by = c(species, sex)) # new in dplyr 1.1.0, note the `.`
# A tibble: 6 × 3
species sex mean
<fct> <fct> <dbl>
1 Adelie male 4043.
2 Adelie female 3369.
3 Gentoo female 4680.
4 Gentoo male 5485.
5 Chinstrap female 3527.
6 Chinstrap male 3939.
Explore the wheels data!
Share your code on https://developer.r-project.org/etherpad/p/r-foundations-2024
Tidy Data Tutor lets you write R and Tidyverse code in your browser and see how your data frame changes at each step of a data analysis pipeline.
DEMO
Material inspired by and remixed from:
Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).