R Foundations Course
October 17, 2022
Data types
Data structures
Data import and wrangling
The assignment operator in R is <-
We can create objects in R and assign them names:
Then we can inspect the objects we have created:
And use them further:
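A minimal sketch of the three steps above (create, inspect, use). The object names and values are made up for illustration:

```r
# Create objects and give them names with the assignment operator
height_cm <- 172
first_name <- "Ada"

# Inspect an object by typing its name (or by calling print())
height_cm
print(first_name)

# Use existing objects to build new ones
height_m <- height_cm / 100
height_m
```

New objects also appear in the Environment pane in RStudio as soon as they are assigned.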
Object names cannot:
start with a number
contain certain characters like ,
contain a space (unless the name is wrapped in backticks, but that is not best practice)
snake_case is preferred in R, especially in the tidyverse.
Assigning and environment pane.
There is an RStudio shortcut for <- which also puts spaces around it: Alt/⌥ + -
character: "a", "hello, world!"
double: 3, 3.14, pi
integer: 3L (the L tells R to store this as an integer)
logical: TRUE
complex: 3+2i. N.B. need to write 1i for \(\sqrt{-1}\).
raw: holds raw bytes (rarely used)
N.B. double and integer types are both numeric
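One way to check the type of a value is typeof(). The sketch below uses only base R; the example values are arbitrary:

```r
typeof("hello, world!")  # "character"
typeof(3.14)             # "double"
typeof(3)                # "double" (no L suffix, so not an integer)
typeof(3L)               # "integer"
typeof(TRUE)             # "logical"
typeof(3+2i)             # "complex"
typeof(as.raw(255))      # "raw"

# double and integer are both numeric
is.numeric(3.14)  # TRUE
is.numeric(3L)    # TRUE
```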
NA: The value NA is given to any data which R knows to be missing. It is not a character string, i.e. it is different to "NA"
Inf: Positive infinity, e.g. the result of dividing a positive number by zero
NaN: Not a number, e.g. attempting to find the logarithm of a negative number
NULL: The null object. Often returned by expressions and functions whose value is undefined
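A short base R sketch of how these special values behave. The vector x is made up for illustration:

```r
x <- c(1, NA, 3)

is.na(x)               # FALSE  TRUE FALSE
mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # 2, after dropping the missing value
is.na("NA")            # FALSE: the string "NA" is not a missing value

1 / 0                      # Inf
0 / 0                      # NaN
suppressWarnings(log(-1))  # NaN (without suppressWarnings, R also warns)

is.null(NULL)  # TRUE
length(NULL)   # 0
```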
Data structures are the building blocks of R code.
In R, the main types of structures include
vectors
factors
matrices and arrays
data frames
Focus today on vectors, factors and data frames
A single number is a special case of a numeric vector. Vectors of length greater than one can be created using the concatenate function, c().
The elements of the vector must be of the same type: common types are numeric, character and logical.
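A sketch of c() in base R, including what happens when types are mixed: R silently coerces every element to a common type. The values are made up:

```r
ages <- c(23, 35, 41)       # numeric vector
pets <- c("cat", "dog")     # character vector
flags <- c(TRUE, FALSE)     # logical vector

# Mixing types forces coercion to a single type
mixed <- c(1, "a", TRUE)
mixed          # "1"    "a"    "TRUE"
typeof(mixed)  # "character"

c(1, TRUE, FALSE)  # 1 1 0: logicals become numbers
```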
There are built-in functions for getting information about vectors, e.g.
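A few such functions, as a sketch (this is a selection, not necessarily the set shown in the original slides):

```r
x <- c(2.5, 4, 1.5)

length(x)  # 3
class(x)   # "numeric"
typeof(x)  # "double"
str(x)     # compact summary of type, length and first values
sum(x)     # 8
```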
There are some useful shortcuts for certain types of vector:
[1] 1 2 3 4 5
[1] 3.0 3.5 4.0 4.5 5.0
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
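The three outputs above are consistent with the following calls. This is a reconstruction; the original code is not shown:

```r
one_to_five <- 1:5                  # integer sequence in steps of 1
halves <- seq(3, 5, by = 0.5)       # sequence with a custom step
upper <- LETTERS                    # built-in constant: upper-case alphabet

one_to_five
halves
upper
```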
What do you think letters returns?
We subset vectors using []
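A sketch of the most common ways to subset with [], using a made-up vector. Note that R counts from 1:

```r
x <- c(10, 20, 30, 40, 50)

x[1]        # 10: first element
x[c(2, 4)]  # 20 40: elements by position
x[2:4]      # 20 30 40: a range of positions
x[-1]       # 20 30 40 50: everything except the first element
x[x > 25]   # 30 40 50: elements where the condition is TRUE
```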
In your R-Foundations project from last week, create and save a new script called data-types.R
Look at the help page for the rep() function: ?rep
Starting with the vector c(1,3,6), can you make the following patterns:
Factors are used to represent categorical data. They can be ordered or unordered.
Factors are stored as integers, and have labels associated with these unique integers. While factors usually look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
[1] apple apple pear
Levels: apple pear
Factor w/ 2 levels "apple","pear": 1 1 2
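The output above can be reproduced with a sketch like the following (the object name fruit is an assumption). It also shows why factors need care when converted, using a made-up factor of years:

```r
fruit <- factor(c("apple", "apple", "pear"))
fruit              # prints the values and the levels
levels(fruit)      # "apple" "pear"
str(fruit)         # shows the underlying integer codes
as.integer(fruit)  # 1 1 2

# Pitfall: converting a factor straight to integer gives the codes, not the labels
years <- factor(c("2009", "2007", "2008"))
as.integer(years)                # 3 1 2
as.integer(as.character(years))  # 2009 2007 2008

# Ordered factor with levels given explicitly rather than alphabetically
size <- factor(c("small", "big", "small"),
               levels = c("small", "big"), ordered = TRUE)
size < "big"  # TRUE FALSE TRUE
```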
The forcats package from the tidyverse has many functions for dealing with factors.
Data sets are stored in R as data frames
These are structured as a list of objects, typically vectors, of the same length.
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
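To see the "list of equal-length vectors" idea on a small scale, here is a sketch with a hand-made data frame. The values are invented, not taken from the penguins data:

```r
df <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  body_mass_g = c(3750L, 5700L, 3500L)
)

is.list(df)      # TRUE: a data frame is a list under the hood
length(df)       # 2: one list element per column
nrow(df)         # 3: every column has the same length
df$body_mass_g   # extract one column as a vector
df[["species"]]  # the same, using list-style extraction
str(df)          # structure summary, as in the output above
```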
From the tibble page:
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.
Spot the differences!
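Two of the differences named in the quote can be sketched as follows, assuming the tibble package is installed. The data are made up:

```r
library(tibble)

df <- data.frame(`bill length` = c(39.1, 39.5), species = c("Adelie", "Gentoo"))
tb <- tibble(`bill length` = c(39.1, 39.5), species = c("Adelie", "Gentoo"))

# data.frame() changes the variable name; tibble() keeps it
names(df)  # "bill.length" "species"
names(tb)  # "bill length" "species"

# data.frame does partial matching; a tibble complains instead
df$spec  # returns the species column
tb$spec  # NULL, with a warning about an unknown column
```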
Using the Import Dataset dialog in RStudio we can import files stored locally or online in the following formats:
delimited text files via read_delim from readr.
.xlsx via read_excel from readxl.
.sav/.por, .sas7bdat and .dta via read_spss, read_sas and read_stata respectively from haven.
Most of these functions also allow files to be compressed, e.g. as .zip
It’s REALLY important to have good file names and paths, and a good project structure.
I leave you in the extremely capable hands of Danielle Navarro to take you thoroughly through best practices:
I also HIGHLY recommend you check out the here package, which enables easy file referencing in project-oriented workflows
The rio package provides a common interface to the functions used by Import Dataset as well as many others.
The data format is automatically recognised from the file extension. To read the data in as a tibble, we use the setclass argument.
See ?rio for the underlying functions used for each format and the corresponding optional arguments, e.g. the skip argument to read_excel to skip a certain number of rows.
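A minimal sketch of rio::import() with setclass, assuming the rio and tibble packages are installed. The file is a temporary CSV written just for this example, so the path and contents are hypothetical:

```r
library(rio)

# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, 4, 1.5)), csv_path,
          row.names = FALSE)

# The format is picked up from the .csv extension;
# setclass = "tibble" asks for a tibble rather than a plain data.frame
scores <- import(csv_path, setclass = "tibble")
class(scores)
scores
```

The same import() call works with a URL in place of a local path, provided the format can be recognised.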
From file
From URL
Your turn!
The dplyr package (part of the tidyverse) provides the following key functions to operate on data frames:
They all take a data frame as their first argument. The subsequent arguments describe what to do with the data frame. The result is a new data frame.
filter(): pick rows based on values of observations.
# A tibble: 39 × 8
species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Gentoo Biscoe 50 16.3 230 5700 male 2007
2 Gentoo Biscoe 50 15.2 218 5700 male 2007
3 Gentoo Biscoe 49 16.1 216 5550 male 2007
4 Gentoo Biscoe 49.3 15.7 217 5850 male 2007
5 Gentoo Biscoe 49.2 15.2 221 6300 male 2007
6 Gentoo Biscoe 48.7 15.1 222 5350 male 2007
7 Gentoo Biscoe 50 15.3 220 5550 male 2007
8 Gentoo Biscoe 59.6 17 230 6050 male 2007
9 Gentoo Biscoe 48.4 16.3 220 5400 male 2008
10 Gentoo Biscoe 48.7 15.7 208 5350 male 2008
# … with 29 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
variable names are unquoted
building blocks of conditions:
Building block | R code |
Binary comparisons | >, <, ==, <=, >=, != |
Logical operators | | (or), & (and), ! (not) |
Value matching | e.g. x %in% 6:9 |
Missing indicator | e.g. is.na(x) |
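Each building block in the table can be tried out with filter(). The sketch below assumes dplyr is installed and uses a small hand-made tibble called toy (hypothetical data, not the penguins dataset), so it does not reproduce the output shown earlier:

```r
library(dplyr)

toy <- tibble(
  species = c("Gentoo", "Gentoo", "Adelie", "Adelie", "Chinstrap"),
  sex = c("male", "female", "male", NA, "female"),
  bill_length_mm = c(50, 45.2, 39.1, NA, 46.5),
  year = c(2007, 2008, 2007, 2007, 2009)
)

# Binary comparisons; conditions separated by commas are combined with "and"
gentoo_males <- filter(toy, species == "Gentoo", sex == "male")

# Logical operators: "or"
adelie_or_long <- filter(toy, species == "Adelie" | bill_length_mm > 46)

# Value matching
later_years <- filter(toy, year %in% 2008:2009)

# Missing indicator, negated to keep rows where sex is known
known_sex <- filter(toy, !is.na(sex))
```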
select(): select variables (columns) in a dataset
# A tibble: 344 × 4
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<dbl> <dbl> <int> <int>
1 39.1 18.7 181 3750
2 39.5 17.4 186 3800
3 40.3 18 195 3250
4 NA NA NA NA
5 36.7 19.3 193 3450
6 39.3 20.6 190 3650
7 38.9 17.8 181 3625
8 39.2 19.6 195 4675
9 34.1 18.1 193 3475
10 42 20.2 190 4250
# … with 334 more rows
# A tibble: 344 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<dbl> <dbl> <int> <int> <int>
1 39.1 18.7 181 3750 2007
2 39.5 17.4 186 3800 2007
3 40.3 18 195 3250 2007
4 NA NA NA NA 2007
5 36.7 19.3 193 3450 2007
6 39.3 20.6 190 3650 2007
7 38.9 17.8 181 3625 2007
8 39.2 19.6 195 4675 2007
9 34.1 18.1 193 3475 2007
10 42 20.2 190 4250 2007
# … with 334 more rows
There are several other selectors. See ?dplyr::select or online for further details.
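A sketch of some common selectors, assuming dplyr is installed. The tibble toy is hand-made for illustration and borrows column names from the outputs above:

```r
library(dplyr)

toy <- tibble(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 50),
  bill_depth_mm = c(18.7, 16.3),
  year = c(2007L, 2007L)
)

by_name <- select(toy, bill_length_mm, bill_depth_mm)  # by name
by_range <- select(toy, bill_length_mm:year)           # a range of adjacent columns
by_suffix <- select(toy, ends_with("_mm"))             # by a pattern in the name
dropped <- select(toy, -species)                       # everything except species
numeric_only <- select(toy, where(is.numeric))         # by column type
```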
|> vs %>%
Pipes pass what comes before into an argument (by default the first) of what comes after.
Pipes are a powerful tool for clearly expressing a sequence of multiple operations.
By default, a pipe takes what comes before and passes it to the first argument of what comes after.
So far, so good, but what if we want to pipe into a subsequent argument?
The placeholders differ (. for %>% vs _ for |>), and with the native pipe the placeholder must be passed to a named argument.
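A sketch of both pipes, assuming the magrittr package is installed (it provides %>%) and R 4.2 or later (needed for the _ placeholder). The lm() example on the built-in cars data is chosen here only because its data argument is not the first one:

```r
library(magrittr)

x <- c(1, 4, 9)

# Default behaviour: the left-hand side becomes the first argument
native_total <- x |> sqrt() |> sum()
magrittr_total <- x %>% sqrt() %>% sum()
nested_total <- sum(sqrt(x))  # the same thing without pipes

# Piping into a later argument needs a placeholder
fit_magrittr <- cars %>% lm(dist ~ speed, data = .)
fit_native <- cars |> lm(dist ~ speed, data = _)  # placeholder must be named
```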
arrange(): change the ordering of rows
# A tibble: 344 × 3
species sex flipper_length_mm
<fct> <fct> <int>
1 Adelie female 172
2 Adelie female 174
3 Adelie female 176
4 Adelie female 178
5 Adelie male 178
6 Adelie female 178
7 Chinstrap female 178
8 Adelie <NA> 179
9 Adelie <NA> 180
10 Adelie male 180
# … with 334 more rows
# A tibble: 344 × 3
species sex flipper_length_mm
<fct> <fct> <int>
1 Adelie female 172
2 Adelie female 174
3 Adelie female 176
4 Adelie female 178
5 Adelie male 178
6 Adelie female 178
7 Adelie <NA> 179
8 Adelie <NA> 180
9 Adelie male 180
10 Adelie male 180
# … with 334 more rows
# A tibble: 344 × 3
species sex flipper_length_mm
<fct> <fct> <int>
1 Gentoo male 231
2 Gentoo male 230
3 Gentoo male 230
4 Gentoo male 230
5 Gentoo male 230
6 Gentoo male 230
7 Gentoo male 230
8 Gentoo male 230
9 Gentoo male 229
10 Gentoo male 229
# … with 334 more rows
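A sketch of three ways to use arrange(), assuming dplyr is installed. The tibble toy is hand-made, and these calls are illustrations rather than the exact code behind the outputs above:

```r
library(dplyr)

toy <- tibble(
  species = c("Gentoo", "Chinstrap", "Adelie", "Adelie"),
  flipper_length_mm = c(230L, 178L, 178L, 172L)
)

# Ascending order; tied rows keep their original relative order
ascending <- arrange(toy, flipper_length_mm)

# A second variable breaks ties
tie_broken <- arrange(toy, flipper_length_mm, species)

# desc() reverses the order
descending <- arrange(toy, desc(flipper_length_mm))
```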
mutate(): create and modify columns
penguins |>
  filter(species == "Gentoo") |>
  select(sex, flipper_length_mm) |>
  mutate(size = if_else(flipper_length_mm > 217, "big", "small"))
# A tibble: 124 × 3
sex flipper_length_mm size
<fct> <int> <chr>
1 female 211 small
2 male 230 big
3 female 210 small
4 male 218 big
5 male 215 small
6 female 210 small
7 female 211 small
8 male 219 big
9 female 209 small
10 male 215 small
# … with 114 more rows
penguins |>
  select(bill_length_mm) |>
  filter(!is.na(bill_length_mm)) |>
  mutate(bill_length_mm_cumsum = cumsum(bill_length_mm))
# A tibble: 342 × 2
bill_length_mm bill_length_mm_cumsum
<dbl> <dbl>
1 39.1 39.1
2 39.5 78.6
3 40.3 119.
4 36.7 156.
5 39.3 195.
6 38.9 234.
7 39.2 273
8 34.1 307.
9 42 349.
10 37.8 387.
# … with 332 more rows
summarise(): reduces multiple values down to a single summary
penguins |>
  group_by(species, sex) |>
  filter(!is.na(sex)) |>
  summarise(mean = mean(body_mass_g, na.rm = TRUE)) # give column a name
# A tibble: 6 × 3
# Groups: species [3]
species sex mean
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
Explore the wheels
Be ready to share some of your code in the chat
Tidy Data Tutor lets you write R and Tidyverse code in your browser and see how your data frame changes at each step of a data analysis pipeline.
Material inspired by and remixed from:
Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).