R Foundations Course
October 10, 2022
Why use R?
Use RStudio to write and run R programmes
Create and start an R project
Use install.packages() to install packages
How to get help in R
See examples of data wrangling and visualisation
Data import
Data management and wrangling
Exploratory data analysis
Statistical modelling
Advanced statistics
Machine learning
Data visualisation
Reports, articles, dashboards, presentations, websites
Integrates well with other languages
Packages: share your code and use code from others
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar and data structures.
palmerpenguins is a dataset package, designed to be a great example for data exploration and visualisation.
It contains measurement data for 344 penguins, from three different species, collected from three islands in the Palmer Archipelago, Antarctica.
Another reason to love R is the community around it.
It prides itself on being friendly, diverse, helpful, and supportive.
R user groups, especially Warwick RUG
RLadies, especially RLadies Coventry
#RStats and #TidyTuesday on Twitter
RainbowR (LGBTQ+)
MiR (minorities in R)
An RStudio project is a context for working on a specific project
Keeps files well-organised
Automatically sets working directory to project root
Has separate workspace and command history
Works well with version control via git or svn
Create a project from a new or existing directory via the file menu or new project button
Switch project, or open a different project in a new RStudio instance via the project menu
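Because a project sets the working directory to the project root, scripts can refer to files by paths relative to that root. The following is a minimal sketch of that idea, assuming a hypothetical data/penguins.csv file inside the project; the folder and file names are illustrations only and are not part of the course materials.

```r
# Show the current working directory.
# In an RStudio project this is the project root.
getwd()

# Build a path relative to the project root.
# "data" and "penguins.csv" are hypothetical names used for illustration.
data_path <- file.path("data", "penguins.csv")
data_path

# Check that the file exists before trying to read it
if (file.exists(data_path)) {
  penguins_raw <- read.csv(data_path)
} else {
  message("No file found at ", data_path)
}
```

Relative paths like this keep working when the whole project folder is moved or shared, which an absolute path would not.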
Create R-Foundations project
Create first script
For things that only need doing once, e.g. installing packages
For doing things you don’t need to track, e.g. requesting help files
To quickly explore new ideas before adding them to a script
RStudio provides a few shortcuts to help write code in the R console
Code completion (using Tab) is also provided in the source editor
Text files saved with an .R suffix are recognised as R code
Code can be sent directly from a script to the console as follows:
Writing an R script for an analysis has several advantages over a graphical user interface (GUI)
Organising your R script well will help you and others understand and use it.
Use ### or #--- to separate sections (in RStudio: Code > Insert Section)
In console
Install a package with install.packages("package_name")
Install multiple packages with install.packages(c("package1", "package2"))
The c() function creates a vector
Or use the Install button in the Packages pane:
In script
library(package_name)
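The two steps above (install once in the console, load in each script) can be combined with a small check for what is still missing. This is a minimal sketch in base R; the helper name missing_packages is hypothetical and introduced only for illustration, and the package names are the ones used in this course.

```r
# Hypothetical helper (not part of R or RStudio): return the subset of
# `pkgs` that is not yet installed on this machine.
missing_packages <- function(pkgs) {
  # system.file() returns "" when a package cannot be found
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

# c() combines the package names into a character vector
course_packages <- c("palmerpenguins", "tidyverse")

to_install <- missing_packages(course_packages)
to_install  # an empty character vector if everything is already installed

# Run once, in the console, only if something is missing:
# install.packages(to_install)
```

The install.packages() call is left commented out so that running the script does not trigger a download each time.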
# `help.search`
# when ?function_name fails (e.g. don't have library loaded)
help.search("function_name") # note quotation marks
??function_name
# for when you can't quite remember the function name
??something_like_function_name
# R help start page
help.start()
## note that using the RStudio Help home button gives even more resources
# `help`
help(package = "package_name")
# Help panel in RStudio will give links to all documentation
# and help pages for that package
# find/browse vignettes for installed (or specific) packages
browseVignettes()
browseVignettes("package_name")
# use auto-completion in RStudio to see what functions
# are in a package
?package::
dplyr demo
RStudio Help home demo
Create an R project called “R-foundations”
Install the packages palmerpenguins and tidyverse
Find the help page for the penguins dataset
Find the help page for the filter function in the dplyr package
Experiment typing commands into the console or in an R script.
What can we learn about penguins?
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen             NA            NA         NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g
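library(palmerpenguins)  # provides the penguins data
library(dplyr)           # provides filter(), slice_max(), select()

penguins |>
  filter(sex == "female",
         species == "Gentoo") |>
  slice_max(body_mass_g, n = 5) |>
  select(contains("bill"))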
# A tibble: 5 × 2
  bill_length_mm bill_depth_mm
           <dbl>         <dbl>
1           46.5          14.8
2           45.2          14.8
3           49.1          14.8
4           44.9          13.3
5           45.1          14.5
|> is a pipe. It passes what comes before into the first argument of what comes after.
Pipes are a powerful tool for clearly expressing a sequence of multiple operations.
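To make the pipe rule concrete, here is a small sketch in base R (the native pipe needs R 4.1 or later). The vector masses holds made-up numbers for illustration; it is not taken from the penguins dataset.

```r
# Made-up body masses in grams (illustration only)
masses <- c(3750, 3800, NA, 3250)

# Nested call: read from the inside out
nested <- round(mean(masses, na.rm = TRUE), 1)

# Piped call: read from left to right.
# x |> f(y) is the same as f(x, y)
piped <- masses |>
  mean(na.rm = TRUE) |>
  round(1)

identical(nested, piped)  # TRUE
```

Both versions do the same work; the piped form lists the steps in the order they happen.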
library(ggplot2)  # provides ggplot() and the geom_*() layers

ggplot(data = penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "gray50") +
  labs(title = "Penguin bill dimensions",
       subtitle = "Bill length and depth for Penguins at Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Bill depth (mm)") +
  theme_minimal() +
  theme(plot.title.position = "plot",
        text = element_text(size = 20))
ggplot(data = penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm,
           group = species)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 3,
             alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, aes(color = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin bill dimensions",
       subtitle = "Bill length and depth for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Bill depth (mm)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme_minimal() +
  theme(legend.position = c(0.85, 0.15),
        plot.title.position = "plot",
        text = element_text(size = 20))
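Together, the two plots illustrate Simpson’s paradox: across all penguins, bill depth decreases as bill length increases, but within each species it increases.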
Simpson’s paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
Example adapted from https://allisonhorst.github.io/palmerpenguins/articles/examples.html
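The reversal described by Simpson’s paradox can also be checked numerically. The sketch below uses a tiny made-up data frame (not the penguins measurements) and base R only: it fits one straight line to all points and one line per group, then compares the slopes. The names toy, pooled_slope and group_slopes are illustrative.

```r
# Made-up toy data (not the penguins measurements): two groups,
# each with an upward trend, but group "B" sits lower and further right.
toy <- data.frame(
  group = rep(c("A", "B"), each = 3),
  x     = c(1, 2, 3, 6, 7, 8),
  y     = c(10, 11, 12, 2, 3, 4)
)

# Slope of a straight-line fit that ignores the groups
pooled_slope <- coef(lm(y ~ x, data = toy))[["x"]]

# Slope of a straight-line fit within each group
group_slopes <- sapply(split(toy, toy$group), function(d) {
  coef(lm(y ~ x, data = d))[["x"]]
})

pooled_slope  # negative: the trend reverses when the groups are combined
group_slopes  # positive in both groups
```

This mirrors the difference between fitting a single geom_smooth() line to all penguins and fitting one line per species.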
Material inspired by and remixed from:
Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).