Data Types, Structures and Wrangling

R Foundations 2024

Ella Kaye, Department of Statistics

October 24, 2024

Overview

  • Data types

  • Data structures

  • Data import and wrangling

Assigning in R

The assignment operator in R is <-

We can create objects in R and assign them names:

x <- 1 + 3

Then we can inspect the objects we have created:

x
[1] 4

And use them further:

x + 5
[1] 9

Naming objects

Object names cannot:

  • start with a number

  • contain certain characters like , - ?

  • contain a space (unless the name is wrapped in backticks, ``, but that is not best practice)
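
As a quick check of the rules above, base R's make.names() shows how R would repair a name that is not syntactically valid. The candidate names below are made up for illustration; this is a sketch, not an exhaustive test of the naming rules.

```r
# make.names() converts arbitrary strings into syntactically valid names.
# A name that comes back unchanged was already valid.
candidates <- c("day_one", "1day", "day-one", "day one")
repaired <- make.names(candidates)
repaired

is_valid <- candidates == repaired
is_valid

# A name containing a space only works when wrapped in backticks
`day one` <- 1
`day one` + 1
```

Only day_one survives unchanged; the other three are altered because they start with a number, contain a hyphen, or contain a space.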

Object names should be:

  • meaningful yet concise
day_one
day_1
  • consistent
i_use_snake_case
other.people.use.periods
evenOthersUseCamelCase

The tidyverse has popularised the use of snake_case. Camel case is a better option for screen readers. The use of periods is discouraged because periods have other uses in R.

Object names should not be:

  • not meaningful
foo
bar
  • unnecessarily long or difficult to read
first_day_of_month
dayone
  • inconsistent

RStudio demo

Assigning and environment pane.

There is an RStudio shortcut for <- which also puts spaces around it:

Alt/Option + -

Data types and structures

Basic data types in R

  • character: "a", "hello, world!"

  • double: 3, 3.14, pi

  • integer: 3L (the L tells R to store this as an integer)

  • logical: TRUE and FALSE

  • complex: 3+2i. N.B. you need to write 1i (not just i) for \(\sqrt{-1}\).

  • raw: holds raw bytes (rarely used)

N.B. double and integer types are both numeric
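
To see these types in practice, typeof() reports how R stores a value. A minimal sketch using the example values from the list above:

```r
# typeof() reports the storage type of each value
types <- c(
  typeof("hello, world!"),  # character
  typeof(3.14),             # double
  typeof(3),                # double: a number without L is stored as a double
  typeof(3L),               # integer
  typeof(TRUE),             # logical
  typeof(3 + 2i)            # complex
)
types

# Both doubles and integers count as numeric
both_numeric <- is.numeric(3) && is.numeric(3L)
both_numeric
```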

Special values

  • NA: The value NA is given to any data which R knows to be missing. It is not a character string, i.e. it is different to "NA"

  • Inf: Positive infinity, e.g. the result of dividing a positive number by zero (dividing a negative number by zero gives -Inf)

  • NaN: Not a number, e.g. attempting to find the logarithm of a negative number

  • NULL: The null object. Often returned by expressions and functions whose value is undefined
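
Each special value has a matching test function. The following sketch shows how the values arise and how to detect them; the results noted in the comments are standard base R behaviour.

```r
# NA is a missing value, not the string "NA"
is.na(NA)      # TRUE
is.na("NA")    # FALSE

# Dividing a non-zero number by zero gives plus or minus infinity
1 / 0          # Inf
-1 / 0         # -Inf

# NaN arises from undefined numeric operations
0 / 0                       # NaN
suppressWarnings(log(-1))   # NaN (without suppressWarnings, R also warns)

# NULL is an object of length zero
length(NULL)   # 0
is.null(NULL)  # TRUE
```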

Data structures

Data structures are the building blocks of R code.

In R, the main types of structures are

  • vectors

  • factors

  • matrices and arrays

  • lists

  • data frames

Focus today on vectors, factors and data frames.

Vectors

A single number is a special case of a numeric vector. Vectors of length greater than one can be created using the concatenate function, c.

x <- c(1, 3, 6)
fruits <- c("apple", "pear")

The elements of the vector must be of the same type: common types are numeric, character and logical.
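
Because of this rule, c() converts mixed inputs to a single common type rather than failing. A small sketch of that coercion (the object names are chosen for illustration):

```r
# Mixing numbers, strings and logicals: everything becomes character
mixed_chr <- c(1, "a", TRUE)
mixed_chr
typeof(mixed_chr)

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
mixed_num <- c(1.5, TRUE, FALSE)
mixed_num
typeof(mixed_num)
```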

There are built-in functions for getting information about vectors, e.g.

length(fruits)
[1] 2

Creating vectors

There are some useful shortcuts for certain types of vector:

1:5
[1] 1 2 3 4 5
seq(from = 3, to = 5, by = 0.5)
[1] 3.0 3.5 4.0 4.5 5.0
LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"

What do you think letters returns?

Subsetting vectors

We subset vectors using []:

  • By position, starting at 1
letters[c(1, 5, 9, 15, 21)]
[1] "a" "e" "i" "o" "u"
  • By logical vector
x <- c(5, 3, 6, 1)
x[c(TRUE, FALSE, TRUE, FALSE)]
[1] 5 6
x[x > 4]
[1] 5 6
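
The last two examples give the same result because x > 4 is itself a logical vector. A short sketch that makes the intermediate step visible (keep is a name chosen for illustration):

```r
x <- c(5, 3, 6, 1)

# The comparison is evaluated first and yields a logical vector ...
keep <- x > 4
keep

# ... which is then used to pick out elements
x[keep]

# which() converts the logical vector to positions
which(keep)
```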

Your turn!

  • In your R-Foundations project from last week, create and save a new script called data-types.R

  • Look at the help page for the rep() function: ?rep

  • Starting with the vector x <- c(1,3,6), can you make the following patterns:

    • 1, 3, 6, 1, 3, 6
    • 1, 1, 3, 3, 6, 6
    • 1, 1, 3, 3, 6, 6, 1, 1, 3, 3, 6, 6
  • What does rep(x, 2, 2) give? Is it what you expected? Can you explain the output?

Factors

Factors are used to represent categorical data. They can be ordered or unordered.

Factors are stored as integers, and have labels associated with these unique integers. While factors usually look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
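
One place where the integer storage matters is a factor whose labels look like numbers. A minimal sketch of the pitfall, using made-up values:

```r
# A factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

# as.integer() returns the underlying integer codes, not the labels
codes <- as.integer(f)
codes

# Convert via character to recover the numbers shown in the labels
values <- as.numeric(as.character(f))
values
```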

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

fruits <- factor(c("apple", "apple", "pear"))
fruits
[1] apple apple pear 
Levels: apple pear
str(fruits)
 Factor w/ 2 levels "apple","pear": 1 1 2

The forcats package from the tidyverse has many functions for dealing with factors.
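
In base R, the default ordering can be overridden with the levels argument of factor(). The sketch below uses a made-up sizes vector; ordered = TRUE additionally makes comparisons between levels meaningful.

```r
sizes <- c("small", "large", "medium", "small")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(sizes))
default_levels

# Explicit levels give a meaningful order
sizes_ord <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)
levels(sizes_ord)

# With an ordered factor, comparisons follow the level order
below_large <- sizes_ord < "large"
below_large
```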

Data frames

Data sets are stored in R as data frames.

These are structured as a list of objects, typically vectors, of the same length.

library(tidyverse)
library(palmerpenguins)
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
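
The list-of-vectors structure can be checked directly. The sketch below uses a small made-up data frame rather than penguins so that it runs with base R alone.

```r
# A small made-up data frame
df <- data.frame(id = 1:3, fruit = c("apple", "pear", "orange"))

is.list(df)   # a data frame is a list ...
length(df)    # ... with one element per column
nrow(df)      # every column has the same length

# Columns can be extracted like list elements
df$fruit
df[["id"]]
```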

tibbles

From the tibble page:

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.
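
Two of the base data.frame behaviours that tibbles drop can be seen with base R alone. This is a sketch with made-up columns; by contrast, a tibble would keep the original column name, and partial matching with $ would return NULL with a warning.

```r
# Base data frames silently repair non-syntactic column names ...
df <- data.frame(`my numbers` = 1:3, fruits = c("apple", "pear", "orange"))
names(df)

# ... unless told not to
df_raw <- data.frame(`my numbers` = 1:3, check.names = FALSE)
names(df_raw)

# $ on a base data frame does partial matching: "fru" finds "fruits"
df$fru
```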

Creating data frames

`my numbers` <- 1:9
fruits <- rep(c("apple", "pear", "orange"), 3)
data.frame(`my numbers`, 
           fruits)
  my.numbers fruits
1          1  apple
2          2   pear
3          3 orange
4          4  apple
5          5   pear
6          6 orange
7          7  apple
8          8   pear
9          9 orange
tibble(`my numbers`, 
       fruits)
# A tibble: 9 × 2
  `my numbers` fruits
         <int> <chr> 
1            1 apple 
2            2 pear  
3            3 orange
4            4 apple 
5            5 pear  
6            6 orange
7            7 apple 
8            8 pear  
9            9 orange

Spot the differences!

Import dataset (button)

Using the Import Dataset dialog in RStudio, we can import files stored locally or online in the following formats:

  • .txt/.csv via read_delim/read_csv from readr.
  • .xlsx via read_excel from readxl.
  • .sav/.por, .sas7bdat and .dta via read_spss, read_sas and read_stata respectively from haven.

Most of these functions also allow files to be compressed, e.g. as .zip.

File names and paths and project structure

It’s REALLY important to have good file names and paths, and a good project structure.

I leave you in the extremely capable hands of Danielle Navarro to take you thoroughly through best practices:

https://djnavarro.net/slides-project-structure/#1

I also HIGHLY recommend you check out the here package, which enables easy file referencing in project-oriented workflows.

Import data (code)

The rio package provides a common interface to the functions used by Import Dataset as well as many others.

The data format is automatically recognised from the file extension. To read the data in as a tibble, we use the setclass argument.

library(rio)
penguins_lter <- import("data/penguins_lter.csv")
penguins_lter_tbl <- import("data/penguins_lter.csv", setclass = "tibble")

See ?rio for the underlying functions used for each format and the corresponding optional arguments, e.g. the skip argument to read_excel to skip a certain number of rows.

Import data demo

Show both button (in Environment and from the file itself) and code

Data wrangling

dplyr

The dplyr package (part of the tidyverse) provides the following key functions to operate on data frames:

  • filter()
  • arrange()
  • select()
  • mutate()
  • summarise()

They all take a data frame as their first argument. The subsequent arguments describe what to do with the data frame. The result is a new data frame.

Load packages

library(dplyr)
library(palmerpenguins)

filter(): pick rows based on values of observations.

filter(penguins, 
       species == "Gentoo", 
       bill_length_mm > 48 & bill_depth_mm > 15, 
       !is.na(sex))
# A tibble: 39 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           50            16.3               230        5700
 2 Gentoo  Biscoe           50            15.2               218        5700
 3 Gentoo  Biscoe           49            16.1               216        5550
 4 Gentoo  Biscoe           49.3          15.7               217        5850
 5 Gentoo  Biscoe           49.2          15.2               221        6300
 6 Gentoo  Biscoe           48.7          15.1               222        5350
 7 Gentoo  Biscoe           50            15.3               220        5550
 8 Gentoo  Biscoe           59.6          17                 230        6050
 9 Gentoo  Biscoe           48.4          16.3               220        5400
10 Gentoo  Biscoe           48.7          15.7               208        5350
# ℹ 29 more rows
# ℹ 2 more variables: sex <fct>, year <int>
  • variable names are unquoted

  • building blocks of conditions:

Building block        R code
Binary comparisons    >, <, ==, <=, >=, !=
Logical operators     or |, and &, not !
Value matching        e.g. x %in% 6:9
Missing indicator     e.g. is.na(x)
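
These building blocks are ordinary R expressions that return logical vectors, so they can be tried outside filter(). A sketch on a made-up numeric vector that includes a missing value:

```r
x <- c(3, 7, NA, 9, 12)

gt6 <- x > 6            # binary comparison; the NA stays NA
gt6
x > 6 & x != 9          # and
x < 5 | x > 10          # or
!(x > 6)                # not
in_range <- x %in% 6:9  # value matching; never returns NA
in_range
missing <- is.na(x)     # missing indicator
missing
```

filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped along with those where it is FALSE.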

select(): select variables (columns) in a dataset

select(penguins, bill_length_mm, bill_depth_mm)
# A tibble: 344 × 2
   bill_length_mm bill_depth_mm
            <dbl>         <dbl>
 1           39.1          18.7
 2           39.5          17.4
 3           40.3          18  
 4           NA            NA  
 5           36.7          19.3
 6           39.3          20.6
 7           38.9          17.8
 8           39.2          19.6
 9           34.1          18.1
10           42            20.2
# ℹ 334 more rows
select(penguins, bill_length_mm:body_mass_g)
# A tibble: 344 × 4
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
            <dbl>         <dbl>             <int>       <int>
 1           39.1          18.7               181        3750
 2           39.5          17.4               186        3800
 3           40.3          18                 195        3250
 4           NA            NA                  NA          NA
 5           36.7          19.3               193        3450
 6           39.3          20.6               190        3650
 7           38.9          17.8               181        3625
 8           39.2          19.6               195        4675
 9           34.1          18.1               193        3475
10           42            20.2               190        4250
# ℹ 334 more rows
select(penguins, starts_with("bill"))
# A tibble: 344 × 2
   bill_length_mm bill_depth_mm
            <dbl>         <dbl>
 1           39.1          18.7
 2           39.5          17.4
 3           40.3          18  
 4           NA            NA  
 5           36.7          19.3
 6           39.3          20.6
 7           38.9          17.8
 8           39.2          19.6
 9           34.1          18.1
10           42            20.2
# ℹ 334 more rows
select(penguins, where(is.numeric))
# A tibble: 344 × 5
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
            <dbl>         <dbl>             <int>       <int> <int>
 1           39.1          18.7               181        3750  2007
 2           39.5          17.4               186        3800  2007
 3           40.3          18                 195        3250  2007
 4           NA            NA                  NA          NA  2007
 5           36.7          19.3               193        3450  2007
 6           39.3          20.6               190        3650  2007
 7           38.9          17.8               181        3625  2007
 8           39.2          19.6               195        4675  2007
 9           34.1          18.1               193        3475  2007
10           42            20.2               190        4250  2007
# ℹ 334 more rows

There are several other selectors. See ?dplyr::select or online for further details.
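
As an illustration of some of those other selectors, here is a sketch using ends_with(), contains() and negation with !. The birds data frame is a hypothetical miniature stand-in for penguins, so the example needs only dplyr.

```r
library(dplyr)

# Hypothetical miniature data frame, standing in for penguins
birds <- data.frame(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 50.0),
  bill_depth_mm = c(18.7, 16.3),
  body_mass_g = c(3750L, 5700L)
)

mm_cols <- select(birds, ends_with("_mm"))        # names ending in "_mm"
mass_cols <- select(birds, contains("mass"))      # names containing "mass"
not_bill <- select(birds, !starts_with("bill"))   # everything except bill columns

names(mm_cols)
names(mass_cols)
names(not_bill)
```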

A note about pipes: |> vs %>%

Pipes pass what comes before into an argument (by default the first) of what comes after.

Pipes are a powerful tool for clearly expressing a sequence of multiple operations.

|>

  • The ‘native’ pipe, built into base R since v4.1 (May 2021)
  • Improved in v4.2 (April 2022)

%>%

  • Has been around in the magrittr package since 2014
  • Used throughout the tidyverse (though that is changing)
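
To see why pipes help with a sequence of operations, compare nested calls with the piped equivalent. A minimal sketch with a made-up vector, using the native pipe (so it assumes R 4.1 or later):

```r
x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested <- round(sqrt(sum(x)), 1)

# The same steps with the native pipe read top to bottom
piped <- x |>
  sum() |>
  sqrt() |>
  round(1)

nested
piped
```

Both versions compute the same value; the piped form lists the steps in the order they happen.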

Pipes: similarity

By default, a pipe takes what comes before and passes it to the first argument of what comes after.

log(2, base = 10)
[1] 0.30103
2 %>% log(base = 10)
[1] 0.30103
2 |> log(base = 10)
[1] 0.30103
paste("a", "b", "c")
[1] "a b c"
"a" %>% paste("b", "c")
[1] "a b c"
"a" |> paste("b", "c")
[1] "a b c"

Pipes: key difference

So far, so good, but what if we want to pipe into a subsequent argument?

log(2, 10) ## don't need to name the argument
[1] 0.30103
10 %>% log(2, .)
[1] 0.30103
10 |> log(2, base = _)
[1] 0.30103
paste("a", "b", "c")
[1] "a b c"
"b" %>% paste("a", ., "c")
[1] "a b c"
"b" |> paste("a", ..2 = _, "c")
[1] "a b c"

The placeholders differ (. for %>%, _ for |>), and with the native pipe the placeholder must be passed to a named argument.
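
When no convenient argument name exists, one alternative with the native pipe is to pipe into an anonymous function. This is a sketch of that workaround, not something the slides prescribe; the \(x) shorthand assumes R 4.1 or later and the _ placeholder assumes R 4.2 or later.

```r
# Placeholder passed to a named argument
via_placeholder <- 10 |> log(2, base = _)
via_placeholder

# An anonymous function lets the piped value go to any position,
# with no argument name required. Note the extra () that calls it.
via_lambda <- "b" |> (\(x) paste("a", x, "c"))()
via_lambda
```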

Pipe keyboard shortcut

There is an RStudio shortcut for the pipe which also puts spaces around it:

Ctrl/Cmd + Shift + M.

This can be set to either %>% or |> in the RStudio preferences.

Go to Tools -> Global Options -> Code and check/uncheck box for “Use native pipe operator”.

arrange(): change the ordering of rows

penguins |>
  select(species, sex, flipper_length_mm) |>
  arrange(flipper_length_mm)
# A tibble: 344 × 3
   species   sex    flipper_length_mm
   <fct>     <fct>              <int>
 1 Adelie    female               172
 2 Adelie    female               174
 3 Adelie    female               176
 4 Adelie    female               178
 5 Adelie    male                 178
 6 Adelie    female               178
 7 Chinstrap female               178
 8 Adelie    <NA>                 179
 9 Adelie    <NA>                 180
10 Adelie    male                 180
# ℹ 334 more rows
penguins |>
  select(species, sex, flipper_length_mm) |>
  arrange(species, flipper_length_mm)
# A tibble: 344 × 3
   species sex    flipper_length_mm
   <fct>   <fct>              <int>
 1 Adelie  female               172
 2 Adelie  female               174
 3 Adelie  female               176
 4 Adelie  female               178
 5 Adelie  male                 178
 6 Adelie  female               178
 7 Adelie  <NA>                 179
 8 Adelie  <NA>                 180
 9 Adelie  male                 180
10 Adelie  male                 180
# ℹ 334 more rows
penguins |>
  select(species, sex, flipper_length_mm) |>
  arrange(desc(flipper_length_mm))
# A tibble: 344 × 3
   species sex   flipper_length_mm
   <fct>   <fct>             <int>
 1 Gentoo  male                231
 2 Gentoo  male                230
 3 Gentoo  male                230
 4 Gentoo  male                230
 5 Gentoo  male                230
 6 Gentoo  male                230
 7 Gentoo  male                230
 8 Gentoo  male                230
 9 Gentoo  male                229
10 Gentoo  male                229
# ℹ 334 more rows
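
Ascending and descending keys can be combined in one arrange() call. The sketch below uses a hypothetical four-row data frame so the full result is visible; note where the missing value ends up.

```r
library(dplyr)

# Hypothetical miniature data frame, standing in for penguins
birds <- data.frame(
  species = c("Gentoo", "Adelie", "Adelie", "Gentoo"),
  flipper_length_mm = c(230L, NA, 181L, 215L)
)

# Species ascending, then longest flippers first within each species
sorted <- birds |>
  arrange(species, desc(flipper_length_mm))
sorted
```

arrange() places missing values at the end of their group, even when the column is wrapped in desc().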

mutate(): create and modify columns

penguins |>
  select(bill_length_mm) |>
  mutate(bill_length_mm_sq = bill_length_mm^2)
# A tibble: 344 × 2
   bill_length_mm bill_length_mm_sq
            <dbl>             <dbl>
 1           39.1             1529.
 2           39.5             1560.
 3           40.3             1624.
 4           NA                 NA 
 5           36.7             1347.
 6           39.3             1544.
 7           38.9             1513.
 8           39.2             1537.
 9           34.1             1163.
10           42               1764 
# ℹ 334 more rows
penguins |>
  filter(species == "Gentoo") |>
  select(sex, flipper_length_mm) |>
  mutate(size = if_else(flipper_length_mm > 217, "big", "small"))
# A tibble: 124 × 3
   sex    flipper_length_mm size 
   <fct>              <int> <chr>
 1 female               211 small
 2 male                 230 big  
 3 female               210 small
 4 male                 218 big  
 5 male                 215 small
 6 female               210 small
 7 female               211 small
 8 male                 219 big  
 9 female               209 small
10 male                 215 small
# ℹ 114 more rows
penguins |>
  select(bill_length_mm) |>
  filter(!is.na(bill_length_mm)) |>
  mutate(bill_length_mm_cumsum = cumsum(bill_length_mm))
# A tibble: 342 × 2
   bill_length_mm bill_length_mm_cumsum
            <dbl>                 <dbl>
 1           39.1                  39.1
 2           39.5                  78.6
 3           40.3                 119. 
 4           36.7                 156. 
 5           39.3                 195. 
 6           38.9                 234. 
 7           39.2                 273  
 8           34.1                 307. 
 9           42                   349. 
10           37.8                 387. 
# ℹ 332 more rows
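
A single mutate() call can create several columns, and later columns can use ones created earlier in the same call. A sketch with a hypothetical two-row data frame; the column names bill_ratio and long_bill and the 2.5 cut-off are made up for illustration.

```r
library(dplyr)

# Hypothetical miniature data frame, standing in for penguins
birds <- data.frame(
  bill_length_mm = c(39.1, 50.0),
  bill_depth_mm = c(18.7, 16.3)
)

# bill_ratio is created first, then used to define long_bill
birds_ratio <- birds |>
  mutate(
    bill_ratio = bill_length_mm / bill_depth_mm,
    long_bill = bill_ratio > 2.5
  )
birds_ratio
```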

summarise(): reduces multiple values down to a single summary

penguins |>
  summarise(mean(body_mass_g, na.rm = TRUE))  
# A tibble: 1 × 1
  `mean(body_mass_g, na.rm = TRUE)`
                              <dbl>
1                             4202.
penguins |>
  group_by(species, sex) |>
  filter(!is.na(sex)) |>
  summarise(mean = mean(body_mass_g, na.rm = TRUE)) |>  # give column a name
  ungroup() # best practice after group_by()
# A tibble: 6 × 3
  species   sex     mean
  <fct>     <fct>  <dbl>
1 Adelie    female 3369.
2 Adelie    male   4043.
3 Chinstrap female 3527.
4 Chinstrap male   3939.
5 Gentoo    female 4680.
6 Gentoo    male   5485.
penguins |>
  filter(!is.na(sex)) |>
  summarise(mean = mean(body_mass_g, na.rm = TRUE),
            .by = c(species, sex)) # new in dplyr 1.1.0, note the `.`
# A tibble: 6 × 3
  species   sex     mean
  <fct>     <fct>  <dbl>
1 Adelie    male   4043.
2 Adelie    female 3369.
3 Gentoo    female 4680.
4 Gentoo    male   5485.
5 Chinstrap female 3527.
6 Chinstrap male   3939.
penguins |>
  count(species, island)
# A tibble: 5 × 3
  species   island        n
  <fct>     <fct>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Dream        68
5 Gentoo    Biscoe      124

Shortcut for

penguins |>
  summarise(n = n(), .by = c(species, island))
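
summarise() can also compute several summaries at once. The sketch below uses a hypothetical five-row data frame and checks that the n column agrees with count(); the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical miniature data frame, standing in for penguins
birds <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  body_mass_g = c(3750, NA, 5700, 5550, 5850)
)

# Several summaries at once, one row per species
by_species <- birds |>
  summarise(
    n = n(),
    n_missing = sum(is.na(body_mass_g)),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    .by = species
  )
by_species

# count() gives the same n column
counts <- count(birds, species)
counts
```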

Your turn!

tidydatatutor.com

Tidy Data Tutor lets you write R and Tidyverse code in your browser and see how your data frame changes at each step of a data analysis pipeline.

DEMO

End matter

Additional resources

Sources

Material inspired by and remixed from:

License

Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).