- WEEK 1: Fundamentals
- Introduction to RStudio
- Introduction to R
- Project management with RStudio
- Seeking help
- Data structures
- Exploring data frames
- Subsetting data
- WEEK 2: Building Programs in R
- Control flow
- Vectorization
- Higher-order functions
  - apply(): Apply a function over the margins of an array
  - lapply(): Apply a function over a list, returning a list
  - sapply(): Apply a function over a list, returning a vector, matrix, or array as appropriate
  - Use apply() and friends to extract nested data from a list
  - (Optional) Convert nested list into data frame
- Functions explained
- Reading and writing data
- WEEK 3: Tidyverse
- Data frame manipulation with dplyr
- Data frame manipulation with tidyr
- Additional tidyverse libraries
- (Optional) Database interfaces
- Endnotes
- Credits
- References
- Data Sources
WEEK 1: Fundamentals
Introduction to RStudio
Orientation
- R was created by statisticians for statisticians (and other researchers)
- R contains multitudes; this can be good and bad
RStudio configuration
Configuration menu
- PC/Linux: Tools > Global Options
- MacOS: RStudio > Preferences or Tools > Global Options
Helpful configuration settings
- General > Basic
  - Don't save or restore .RData
- Code > Editing
  - Use native pipe operator
  - Ctrl+Enter executes single line (or multi-line R statement)
- Code > Display
  - Rainbow parentheses
- Appearance: Adjust font and syntax colors
- Pane Layout: Move IDE panes
(Optional) Workstation configuration
By default, your view of your file system will be opaque. We want to make it transparent (e.g. you may have a local Desktop and a cloud Desktop folder).
Mac OS Finder > Preferences
Your local Desktop folder is in your Home directory.
- General
  - New finder window shows: /Users/<home>
- Sidebar
  - Favorites: /Users/<home>
  - iCloud: iCloud Drive
  - Locations: <computer name>, Cloud Storage
- Advanced
  - Show all filename extensions
  - Keep folders on top (all)
Windows System > File Explorer
Your local Desktop folder is in your Home directory or Computer directory.
- File > Change folder and search options > View
  - Files and Folders
    - Show hidden files, folders, and drives
    - Hide protected operating system files
    - Uncheck Hide extensions for known file types
  - Navigation Pane
    - Show all folders
- View
  - File name extensions
Workflow in RStudio
- Set working directory
- Test code snippets in the R console [REPL]
  print("hello")
- Create an .R script in the working directory
  print("hello")
- Run the script
  - Keyboard shortcut
    - Windows/Linux: Control-Enter
    - MacOS: Command-Enter
  - Run button
  - Highlight and run lines
- Source the script to reduce console clutter and make contents available to other scripts
- Insert assignment arrow <-
  - MacOS: Option -
  - Windows/Linux: Alt -
  - Good customization: Control -
- Break execution if console hangs
  - Windows: ESC
  - MacOS/Linux: Control-c
- Clear console
  - RStudio: C-l
  - Emacs: C-c M-o / M-x comint-clear-buffer
- Comment/Uncomment code
  - MacOS: Command-/
Introduction to R
A whirlwind tour of R fundamentals
Mathematical expressions
1 + 100
(3 + 5) * 2 # operator precedence
5 * (3 ^ 2) # powers
2/10000 # outputs 2e-04
2 * 10^(-4) # 2e-04 explicated
Built-in functions
-
Some functions need inputs ("arguments")
getwd() # no argument required
sin(1) # requires arg
log(1) # natural log
-
RStudio has auto-completion
log...
-
Use help() to find out more about a function
help(exp)
exp(0.5) # e^(1/2)
Comparing things
-
Basic comparisons
1 == 1
1 != 2
1 < 2
1 <= 1
-
Use all.equal() for floating point numbers
all.equal(3.0, 3.0) # TRUE
all.equal(2.9999999, 3.0) # 7 places: gives the difference
all.equal(2.99999999, 3.0) # 8 places: TRUE
2.99999999 == 3.0 # 8 places: FALSE
Variables and assignment
-
R uses the assignment arrow (C-c C-= in ESS)
# Assign a value to the variable name
x <- 0.025
-
You can inspect a variable's value in the Environment tab or by evaluating it in the console
# Evaluate the variable and echo its value to the console
x
-
Variables can be re-used and re-assigned
log(x)
x <- 100
x <- x + 1
y <- x * 2
-
Use a standard naming scheme for your variables
r.style.variable <- 10
python_style_variable <- 11
javaStyleVariable <- 12
Vectorization
Vectorize all the things! This makes idiomatic R very different from most programming languages, which use iteration ("for" loops) by default.
# Create a sequence 1 - 5
1:5
# Raise 2 to the Nth power for each element of the sequence
2^(1:5)
# Assign the resulting vector to a variable
v <- 1:5
2^v
Managing your environment
ls() # List the objects in the environment
ls # Echo the contents of ls(), i.e. the code
rm(x) # Remove the x object
rm(list = ls()) # Remove all objects in environment
Note that parameter passing (=) is not the same as assignment (<-) in R!
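A minimal sketch of the difference (the object names here are just for illustration):
# `=` inside a function call matches arguments; it does not create
# objects named `from` or `to` in your environment
seq(from = 1, to = 5)
# `<-` creates (or overwrites) an object in the current environment,
# which then shows up in ls()
my_seq <- seq(from = 1, to = 5)
ls()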
Built-in data sets
data()
R Packages
"Package" and "library" are roughly interchangeable.
-
Install additional packages
install.packages("tidyverse") ## install.packages("rmarkdown")
-
Activate a package for use
library("tidyverse")
Project management with RStudio
General file management
See /scripts/curriculum.Rmd
project_name
├── project_name.Rproj
├── README.md
├── script_1.R
├── script_2.R
├── data
│ ├── processed
│ └── raw
├── results
└── temp
Create projects with RStudio
- File > New Project
- Create in existing Folder
- If you close RStudio and double-click the .Rproj file, RStudio will open to the project location and set the working directory.
Seeking help
Basic help syntax
help(write.csv)
?write.csv
Help file format
- Description
- Usage
- Arguments
- Details
- Examples (highlight and run with C-Enter)
Special operators
help("<-")
Library examples
vignette("dplyr")
What if you don't know where to start?
-
RStudio autocomplete
-
Fuzzy search
??set
-
Browse by topic: https://cran.r-project.org/web/views/
Data structures
R stores "atomic" data as vectors
There are no scalars in R; everything is a vector, even if it's a vector of length 1.
v <- 1:5
length(v)
length(3.14)
Every vector has a type
There are 5 basic (vector) data types: double, integer, complex, logical and character.
typeof(v)
typeof(3.14)
typeof(1L)
typeof(1+1i)
typeof(TRUE)
typeof("banana")
Vectors and type coercion
-
A vector must be all one type. If you mix types, R will perform type coercion. See coercion rules in scripts/curriculum.Rmd
c(2, 6, '3')
c(0, TRUE)
-
You can change vector types
# Create a character vector
chr_vector <- c('0', '2', '4')
str(chr_vector)
# Use it to create a numeric vector
num_vector <- as.numeric(chr_vector)
# Show the structure of the collection
str(num_vector)
-
There are multiple ways to generate vectors
# Two options for generating sequences
1:10
seq(10)
# The seq() function is more flexible
series <- seq(1, 10, by=0.1)
series
-
Get information about a collection
# Don't print everything to the screen
length(series)
head(series)
tail(series, n=2)
# You can add informative labels to most things in R
names(v) <- c("a", "b", "c", "d", "e")
v
str(v)
-
Get an item by its position or label
v[1]
v["a"]
-
Set an item by its position or label
v[1] = 4
v
-
(Optional) New vectors are empty by default
# Vectors are logical by default
vector1 <- vector(length = 3)
vector1
# You can specify the type of an empty vector
vector2 <- vector(mode="character", length = 3)
vector2
str(vector2)
Challenge 1: Generate and label a vector
See /scripts/curriculum.Rmd
Matrices
-
A matrix is a 2-dimensional vector
# Create a matrix of zeros
mat1 <- matrix(0, ncol = 6, nrow = 3)
# Inspect it
class(mat1)
typeof(mat1)
str(mat1)
-
Some operations act as if the matrix is a 1-D wrapped vector
mat2 <- matrix(1:25, nrow = 5, byrow = TRUE)
str(mat2)
length(mat2)
(Optional) Factors
-
Factors represent unique levels (e.g., experimental conditions)
coats <- c("tabby", "tortoise", "tortoise", "black", "tabby") str(coats) # The reprentation has 3 levels, some of which have multiple instances categories <- factor(coats) str(categories)
-
R assumes that the first factor level is the baseline, so you may need to reorder your factor levels so that the ordering makes sense for your variables
## "control" should be the baseline, regardless of trial order
trials <- c("manipulation", "control", "control", "manipulation")
trial_factors <- factor(trials, levels = c("control", "manipulation"))
str(trial_factors)
Data Frames are central to working with tabular data
-
Create a data frame
coat = c("calico", "black", "tabby") weight = c(2.1, 5.0, 3.2) chases_bugs = c(1, 0, 1) cats <- data.frame(coat, weight, chases_bugs) cats # show contents of data frame str(cats) # inspect structure of data frame # Convert chases_bugs to logical vector cats$chases_bugs <- as.logical(cats$chases_bugs) str(cats)
-
Write the data frame to a CSV and re-import it. You can use read.delim() for tab-delimited files, or read.table() for flexible, general-purpose input.
write.csv(x = cats, file = "../data/feline_data.csv", row.names = FALSE)
cats <- read.csv(file = "../data/feline_data.csv", stringsAsFactors = TRUE)
str(cats) # the chr column is now a factor column
-
Access the columns (vectors) of the data frame
cats$weight
cats$coat
-
A vector can only hold one type. Therefore, in a data frame each data column (vector) has to be a single type.
typeof(cats$weight)
-
Use data frame vectors in operations
cats$weight + 2
paste("My cat is", cats$coat)
# Operations have to be legal for the data type
cats$coat + 2
# Operations are ephemeral unless their outputs are reassigned to the variable
cats$weight <- cats$weight + 1
-
Data frames have column names; names() gets or sets a name
names(cats)
names(cats)[2] <- "weight_kg"
cats
Lists
-
Lists can contain anything
list1 <- list(1, "a", TRUE, 1+4i)
# Inspect each element of the list
list1[[1]]
list1[[2]]
list1[[3]]
list1[[4]]
If you use a single bracket [], you get back a shorter section of the list, which is also a list. Use double brackets [[]] to drill down to the actual value.
-
(Optional) This includes complex data structures
list2 <- list(title = "Numbers", numbers = 1:10, data = TRUE)
# Single brackets retrieve a slice of the list, containing the name:value pair
list2[2]
# Double brackets retrieve the value, i.e. the contents of the list item
list2[[2]]
-
Data frames are lists of vectors and factors
typeof(cats)
-
Some operations return lists, others return vectors (basically, are you getting the column with its label, or are you drilling down to the data?)
-
Get list slices
# List slices
cats[1] # list slice by index
cats["coat"] # list slice by name
cats[1, ] # get data frame row by row number
-
Get list contents (in this case, vectors)
# List contents (in this case, vectors)
cats[[1]] # content by index
cats[["coat"]] # content by name
cats$coat # content by name; shorthand for `cats[["coat"]]`
cats[, 1] # content by index, across all rows
cats[1, 1] # content by index, single row
-
You can inspect all of these with
typeof()
-
Note that you can address data frames by row and columns
-
(Optional) Challenge 2: Creating matrices
See /scripts/curriculum.Rmd
Exploring data frames
Adding columns
age <- c(2, 3, 5)
cbind(cats, age)
cats # cats is unchanged
cats <- cbind(cats, age) # overwrite old cats
# Data frames enforce consistency
age <- c(2, 5)
cats <- cbind(cats, age)
Appending rows (remember, rows are lists!)
newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
# Legal values added, illegal values are NA
cats
# Update the factor set so that "tortoiseshell" is a legal value
levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
Removing missing data
cats
is now polluted with missing data
na.omit(cats)
cats
cats <- na.omit(cats)
Working with realistic data
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
# Get an overview of the data frame
str(gapminder)
dim(gapminder)
# It's a list
length(gapminder)
colnames(gapminder)
# Look at the data
summary(gapminder$gdpPercap) # summary varies by data type
head(gapminder)
Challenge 3: New gapminder data frame
See /scripts/curriculum.Rmd
Subsetting data
Subset by index
v <- 1:5
-
Index selection
v[1]
v[1:3] # index range
v[c(1, 3)] # selected indices
-
(Optional) Index exclusion
v[-1]
v[-c(1, 3)]
Subset by name
letters[1:5]
names(v) <- letters[1:5]
-
Character selection
v["a"] v[names(v) %in% c("a", "c")]
-
(Optional) Character exclusion
v[! names(v) %in% c("a", "c")]
Subsetting matrices
m <- matrix(1:28, nrow = 7, byrow = TRUE)
# Matrices are just 2D vectors
m[2:4, 1:3]
m[c(1, 3, 5), c(2, 4)]
(Optional) Extracting list elements
Single brackets get you subsets of the same type (list -> list, vector -> vector, etc.). Double brackets extract the underlying vector from a list or data frame.
# Create a new list and give it names
l <- replicate(5, sample(15), simplify = FALSE)
names(l) <- letters[1:5]
# You can extract one element
l[[1]]
l[["a"]]
# You can't extract multiple elements
l[[1:3]]
l[[names(l) %in% c("a", "c")]]
Subsetting by logical operations
-
Explicitly mask each item using TRUE or FALSE. This returns the reduced vector.
v[c(FALSE, TRUE, TRUE, FALSE, FALSE)]
-
Evaluate the truth of each item, then produce the TRUE ones
# Use a criterion to generate a truth vector
v > 4
# Filter the original vector by the criterion
v[v > 4]
-
Combining logical operations
v[v < 3 | v > 4]
(Optional) Subset by factor
# First three items
gapminder$country[1:3]
# All items in factor set
north_america <- c("Canada", "Mexico", "United States")
gapminder$country[gapminder$country %in% north_america]
Subsetting Data Frames
Data frames have characteristics of both lists and matrices.
-
Get first three rows
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
# Get first three rows
gapminder[1:3,]
-
Rows and columns
gapminder[1:6, 1:3]
gapminder[1:6, c("country", "pop")]
-
Data frames are lists, so one index gets you the columns
gapminder[1:3]
-
Filter by contents
gapminder[gapminder$country == "Mexico",]
north_america <- c("Canada", "Mexico", "United States")
gapminder[gapminder$country %in% north_america,]
gapminder[gapminder$country %in% north_america & gapminder$year > 1999,]
gapminder[gapminder$country %in% north_america & gapminder$year > 1999, c("country", "pop")]
Challenge 4: Extract data by region
See /scripts/curriculum.Rmd
WEEK 2: Building Programs in R
Control flow
Conditionals
-
Look at Conditional template in curriculum.Rmd
-
If
x <- 8
if (x >= 10) {
  print("x is greater than or equal to 10")
}
-
Else
if (x >= 10) {
  print("x is greater than or equal to 10")
} else {
  print("x is less than 10")
}
-
Else If
if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5) {
  print("x is greater than 5, but less than 10")
} else {
  print("x is less than 5")
}
-
Vectorize your tests
x <- 1:4
if (any(x < 2)) {
  print("Some x less than 2")
}
if (all(x < 2)) {
  print("All x less than 2")
}
Review Subsetting section
Subsetting is frequently an alternative to if-else statements in R
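A minimal sketch of that idea, using a made-up vector x:
x <- c(1, 8, 3, 12)
# Loop-and-if version
for (val in x) {
  if (val >= 5) {
    print(val)
  }
}
# Subsetting version: one vectorized expression, no explicit conditional
x[x >= 5]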
Iteration
-
Look at Iteration template in curriculum.Rmd
-
Basic For loop
for (i in 1:10) { print(i) }
-
Nested For loop
for (i in 1:5) { for (j in letters[1:4]) { print(paste(i,j)) } }
-
This is where we skip the example where we append things to the end of a data frame. For loops are slow; vectorized operations are fast (and idiomatic). Use for loops where they're the appropriate tool (e.g., loading files, cycling through whole data sets, etc.). We will see more of this in the section on reading and writing data.
Vectorization
Vector operations are element-wise by default
x <- 1:4
y <- 6:9
x + y
log(x)
# A more realistic example
gapminder$pop_millions <- gapminder$pop / 1e6
head(gapminder)
Vectors of unequal length are recycled
z <- 1:2
x + z
Logical comparisons
-
Do the elements match a criterion?
x > 2
a <- (x > 2) # you can assign the output to a variable
# Evaluate a boolean vector
any(a)
all(a)
-
Can you detect missing data?
nan_vec <- c(1, 3, NaN)
## Which elements are NaN?
is.nan(nan_vec)
## Which elements are not NaN?
!is.nan(nan_vec)
## Are any elements NaN?
any(is.nan(nan_vec))
## Are all elements NaN?
all(is.nan(nan_vec))
Matrix operations are also element-wise by default
m <- matrix(1:12, nrow=3, ncol=4)
# Multiply each item by -1
m * -1
Linear algebra uses matrix multiplication
# Multiply two vectors
1:4 %*% 1:4
# Matrix-wise multiplication
m2 <- matrix(1, nrow = 4, ncol = 1)
m2
m %*% m2
# Most functions operate on the whole vector or matrix
mean(m)
sum(m)
Challenge 5: Sum of squares
See /scripts/curriculum.Rmd
Higher-order functions
apply()
lets you apply an arbitrary function over a collection. This
is an example of a higher-order function (map, apply, filter, reduce,
fold, etc.) that can (and should) replace loops for most purposes. They
are an intermediate case between vectorized operations (very fast) and
for loops (very slow). Use them when you need to build a new collection
and vectorized operations aren't available.
apply(): Apply a function over the margins of an array
m <- matrix(1:28, nrow = 7, byrow = TRUE)
apply(m, 1, mean)
apply(m, 2, mean)
apply(m, 1, sum)
apply(m, 2, sum)
lapply(): Apply a function over a list, returning a list
lst <- list(title = "Numbers", numbers = 1:10, data = TRUE)
## length() returns the length of the whole list
length(lst)
## Use lapply() to get the length of the individual elements
lapply(lst, length)
sapply(): Apply a function over a list, returning a vector, matrix, or array as appropriate
## Simplify and return a vector by default
sapply(lst, length)
## Optionally, return the original data type
sapply(lst, length, simplify = FALSE)
Use apply() and friends to extract nested data from a list
-
Read a JSON file into a nested list
## Read JSON file into nested list
library("jsonlite")
books <- fromJSON("../data/books.json")
## View list structure
str(books)
-
Extract all of the authors with lapply(). This requires us to define an anonymous function.
## Extract a single author
books[["bk110"]]$author
## Use lapply to extract all the authors
authors <- lapply(books, function(x) x$author) ## Returns list
str(authors)
-
Extract all of the authors with sapply()
authors <- sapply(books, function(x) x$author) # Returns vector
str(authors)
(Optional) Convert nested list into data frame
-
Method 1: Create a list of data frames, then bind them together into a single data frame
## This approach omits the top-level book id
df <- do.call(rbind, lapply(books, data.frame))
lapply() applies a given function to each element in a list, so there will be several function calls. do.call() applies a given function to the list as a whole, so there is only one function call.
-
Method 2: Use the rbindlist() function from data.table
## This approach includes the top-level book id
df <- data.table::rbindlist(books, idcol = TRUE)
Functions explained
Functions let you encapsulate and re-use chunks of code. This has several benefits:
- Eliminates repetition in your code. This saves labor, but more importantly it reduces errors, and makes it easier for you to find and correct errors.
- Allows you to write more generic (i.e. flexible) code.
- Reduces cognitive overhead.
Defining a function
-
Look at Function template in data/curriculum.Rmd
-
Define a simple function
# Convert Fahrenheit to Celsius
f_to_celsius <- function(temp) {
  celsius <- (temp - 32) * (5/9)
  return(celsius)
}
-
Call the function
f_to_celsius(32)
boiling <- f_to_celsius(212)
Combining functions
Define a second function and call the first function within the second.
f_to_kelvin <- function(temp) {
  celsius <- f_to_celsius(temp)
  kelvin <- celsius + 273.15
  return(kelvin)
}
f_to_kelvin(212)
Most functions work with collections
## Create a vector of temperatures
temps <- seq(from = 1, to = 101, by = 10)
# Vectorized calculation (fast)
f_to_kelvin(temps)
# Apply
sapply(temps, f_to_kelvin)
Defensive programming
-
Check whether input meets criteria before proceeding (this is `assert` in other languages).
f_to_celsius <- function(temp) {
  ## Check inputs
  stopifnot(is.numeric(temp), temp > -460)
  celsius <- (temp - 32) * (5/9)
  return(celsius)
}
f_to_celsius("a")
f_to_celsius(-470)
-
Fail with a custom error if criterion not met
f_to_celsius <- function(temp) {
  if (!is.numeric(temp)) {
    stop("temp must be a numeric vector")
  }
  celsius <- (temp - 32) * (5/9)
  return(celsius)
}
Working with rich data
## Prerequisites
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
north_america <- c("Canada", "Mexico", "United States")
-
Calculate the total GDP for each entry in the data set
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = TRUE)
gdp <- gapminder$pop * gapminder$gdpPercap
-
Write a function to perform a total GDP calculation on a filtered subset of your data.
calcGDP <- function(df, year = NULL, country = NULL) {
  if (!is.null(year)) {
    df <- df[df$year %in% year, ]
  }
  if (!is.null(country)) {
    df <- df[df$country %in% country, ]
  }
  gdp <- df$pop * df$gdpPercap
  new_df <- cbind(df, gdp = gdp)
  return(new_df)
}
-
Mutating df inside the function doesn't affect the global gapminder data frame (because of pass-by-value and scope).
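A brief check of that claim (assumes calcGDP() has been defined as above):
nrow(gapminder)                               # row count of the full data set
df_mexico <- calcGDP(gapminder, country = "Mexico")
nrow(df_mexico)                               # the function returns a filtered copy
nrow(gapminder)                               # the global data frame is unchanged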
Challenge 6: Testing and debugging your function
See data/curriculum.Rmd
Reading and writing data
Create sample data sets and write them to the `processed` directory
-
Preliminaries
if (!dir.exists("../processed")) {
  dir.create("../processed")
}
north_america <- c("Canada", "Mexico", "United States")
-
Version 1: Use calcGDP function
for (year in unique(gapminder$year)) {
  df <- calcGDP(gapminder, year = year, country = north_america)
  ## Generate a file name. This will fail if "processed" doesn't exist
  fname <- paste("../processed/north_america_", as.character(year), ".csv", sep = "")
  ## Write the file
  write.csv(x = df, file = fname, row.names = FALSE)
}
-
Version 2: Bypass calcGDP function
for (year in unique(gapminder$year)) {
  df <- gapminder[gapminder$year == year, ]
  df <- df[df$country %in% north_america, ]
  fname <- paste("../processed/north_america_", as.character(year), ".csv", sep = "")
  write.csv(x = df, file = fname, row.names = FALSE)
}
How to find files
## Get matching files from the `processed` subdirectory
dir(path = "../processed", pattern = "north_america_[1-9]*.csv")
Read files using a for loop
-
Read each file into a data frame and add it to a list
## Create an empty list
df_list <- list()
## Get the locations of the matching files
file_names <- dir(path = "../processed", pattern = "north_america_[1-9]*.csv")
file_paths <- file.path("../processed", file_names)
## Loop over the file names so the list elements get readable names
for (f in file_names) {
  df_list[[f]] <- read.csv(file.path("../processed", f), stringsAsFactors = TRUE)
}
-
Access the list items to view the individual data frames
length(df_list)
names(df_list)
lapply(df_list, length)
df_list[["north_america_1952.csv"]]
Read files using apply
-
Instead of a for loop that handles each file individually, use a single lapply() call.
df_list <- lapply(file_paths, read.csv, stringsAsFactors = TRUE)
## The resulting list does not have names set by default
names(df_list)
## You can still access by index position
df_list[[2]]
-
Add names manually
names(df_list) <- file_names
df_list$north_america_1952.csv
-
(Optional) Automatically set names for the output list. This example sets each name to the complete path name (e.g., "../processed/north_america_1952.csv").
df_list <- sapply(file_paths, read.csv, simplify = FALSE, USE.NAMES = TRUE)
Concatenate list of data frames into a single data frame
-
Method 1: Create a list of data frames, then bind them together into a single data frame
df <- do.call(rbind, df_list)
lapply() applies a given function to each element in a list, so there will be several function calls. do.call() applies a given function to the list as a whole, so there is only one function call.
-
(Optional) Method 2: Use the rbindlist() function from data.table. This can be faster for large data sets. It also gives you the option of preserving the list names (in this case, the source file names) as a new column in the new data frame.
df_list <- sapply(file.path("../processed", file_names), read.csv, simplify = FALSE, USE.NAMES = TRUE)
df <- data.table::rbindlist(df_list, idcol = TRUE)
WEEK 3: Tidyverse
Data frame manipulation with dplyr
Orientation
library("dplyr")
- Explain Tidyverse briefly: https://www.tidyverse.org/packages/
- (Optional) Demo unix pipes with
history | grep
- Explain tibbles briefly
- dplyr allows you to treat data frames like relational database tables; i.e. as sets
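A quick illustration of the tibble point above (a sketch; assumes gapminder is already loaded and dplyr is attached):
## A tibble is the tidyverse flavor of data frame, with stricter subsetting
## rules and a compact print method
gapminder_tbl <- as_tibble(gapminder)
gapminder_tbl        # prints only the first rows, plus column types
class(gapminder_tbl) # "tbl_df" "tbl" "data.frame"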
Select data frame variables
-
select() provides a mini-language for selecting data frame variables
df <- select(gapminder, year, country, gdpPercap)
str(df)
-
select() understands negation (and many other intuitive operators)
df2 <- select(gapminder, -continent)
str(df2)
-
You can link multiple operations using pipes. This will be more intuitive once we see this combined with filter()
df <- gapminder %>% select(year, country, gdpPercap)
## You can use the native pipe. This has a few limitations:
## df <- gapminder |> select(year, country, gdpPercap)
Filter data frames by content
-
Filter by continent
df_europe <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)
str(df_europe)
-
Filter by continent and year
europe_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)
str(europe_2007)
(Optional) Challenge 7: Filter
See data/curriculum.Rmd
Group rows
-
Group data by a data frame variable
grouped_df <- gapminder %>% group_by(continent)
## This produces a tibble
str(grouped_df)
-
The grouped data frame contains metadata (i.e. bookkeeping) that tracks the group membership of each row. You can inspect this metadata:
grouped_df %>% tally()
grouped_df %>% group_keys()
grouped_df %>% group_vars()
## These produce a lot of output:
grouped_df %>% group_indices()
grouped_df %>% group_rows()
- More information about grouped data frames: https://dplyr.tidyverse.org/articles/grouping.html
Summarize grouped data
-
Calculate mean gdp per capita by continent
grouped_df %>% summarise(mean_gdpPercap = mean(gdpPercap))
-
(Optional) Using pipes allows you to do ad hoc reporting without creating intermediate variables
gapminder %>% group_by(continent) %>% summarise(mean_gdpPercap = mean(gdpPercap))
-
Group data by multiple variables
df <- gapminder %>% group_by(continent, year) %>% summarise(mean_gdpPercap = mean(gdpPercap))
-
Create multiple data summaries
df <- gapminder %>%
  group_by(continent, year) %>%
  summarise(mean_gdp = mean(gdpPercap),
            sd_gdp = sd(gdpPercap),
            mean_pop = mean(pop),
            sd_pop = sd(pop))
Use group counts
-
count() lets you get an ad hoc count of any variable
gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)
-
n() gives the number of observations in a group
## Get the standard error of life expectancy by continent
gapminder %>%
  group_by(continent) %>%
  summarise(se_le = sd(lifeExp)/sqrt(n()))
Mutate the data to create new variables
Mutate creates a new variable within your pipeline
## Total GDP and population by continent and year
df <- gapminder %>%
mutate(gdp_billion = gdpPercap * pop / 10^9) %>%
group_by(continent, year) %>%
summarise(mean_gdp = mean(gdp_billion),
sd_gdp = sd(gdp_billion),
mean_pop = mean(pop),
sd_pop = sd(pop))
Add conditional filtering to a pipeline with ifelse
-
Perform previous calculation, but only in cases in which the life expectancy is over 25
df <- gapminder %>%
  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
  group_by(continent, year) %>%
  summarise(mean_gdp = mean(gdp_billion),
            sd_gdp = sd(gdp_billion),
            mean_pop = mean(pop),
            sd_pop = sd(pop))
-
(Optional) Predict future GDP per capita for countries with higher life expectancies
df <- gapminder %>%
  mutate(gdp_expected = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_gdpPercap_expected = mean(gdp_expected))
Challenge 8: Life expectancy in random countries
gapminder %>%
filter(year == 2002) %>%
group_by(continent) %>%
sample_n(2) %>%
summarize(mean_lifeExp = mean(lifeExp), country = country) %>%
arrange(desc(mean_lifeExp))
Data frame manipulation with tidyr
- Long format: All rows are unique observations (ideally)
  - each column is a variable
  - each row is an observation
- Wide format: Rows contain multiple observations
  - Repeated measures
  - Multiple variables
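A minimal sketch of the two shapes with a made-up two-country table (the values are invented):
## Wide: one row per country, one column per year
wide <- data.frame(country  = c("Canada", "Mexico"),
                   pop_2002 = c(31, 102),
                   pop_2007 = c(33, 109))
## Long: one row per country-year observation
library("tidyr")
pivot_longer(wide, cols = starts_with("pop_"),
             names_to = "year", names_prefix = "pop_", values_to = "pop")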
Gapminder data
library("tidyr")
library("dplyr")
str(gapminder)
- 3 ID variables: continent, country, year
- 3 Observation variables: pop, lifeExp, gdpPercap
Wide to long with pivot_longer()
-
Load wide gapminder data
gap_wide <- read.csv("../data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)
-
Group comparable columns into a single variable. Here we group all of the "pop" columns, all of the "lifeExp" columns, and all of the "gdpPercap" columns.
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
    names_to = "obstype_year",
    values_to = "obs_values"
  )
str(gap_long)
head(gap_long, n = 20)
- Original column headers become keys
- Original column values become values
- This pushes all values into a single column, which is unintuitive. We will generate the intermediate format later.
-
(Optional) Same pivot operation as (2), specifying the columns to be omitted rather than included.
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(-continent, -country),
    names_to = "obstype_year",
    values_to = "obs_values"
  )
str(gap_long)
-
Split compound variables into individual variables
gap_long <- gap_long %>%
  separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
gap_long$year <- as.integer(gap_long$year)
Long to intermediate with pivot_wider()
-
Recreate the original gapminder data frame (as a tibble)
## Read in the original data without factors for comparison purposes
gapminder <- read.csv("../data/gapminder_data.csv", stringsAsFactors = FALSE)
gap_normal <- gap_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)
str(gap_normal)
str(gapminder)
-
Rearrange the column order of gap_normal so that it matches gapminder
gap_normal <- gap_normal[, names(gapminder)]
-
Check whether the data frames are equivalent (they aren't yet)
all.equal(gap_normal, gapminder)
head(gap_normal)
head(gapminder)
-
Change the sort order of gap_normal so that it matches gapminder
gap_normal <- gap_normal %>% arrange(country, year)
all.equal(gap_normal, gapminder)
Long to wide with pivot_wider()
-
Create variable labels for wide columns. In this case, the new variables are all combinations of metric (pop, lifeExp, or gdpPercap) and year. Effectively we are squishing many columns together.
help(unite)
df_temp <- gap_long %>%
  ## unite(ID_var, continent, country, sep = "_") %>%
  unite(var_names, obs_type, year, sep = "_")
str(df_temp)
head(df_temp, n = 20)
-
Pivot to wide format, distributing data into columns for each unique label
gap_wide_new <- gap_long %>%
  ## unite(ID_var, continent, country, sep = "_") %>%
  unite(var_names, obs_type, year, sep = "_") %>%
  pivot_wider(names_from = var_names, values_from = obs_values)
str(gap_wide_new)
-
Sort columns alphabetically by variable name, then check for equality. You can move a single column to a different position with relocate()
gap_wide_new <- gap_wide_new[, order(colnames(gap_wide_new))]
all.equal(gap_wide, gap_wide_new)
Additional tidyverse libraries
Reading data with readr
Fast, user-friendly file imports.
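A minimal sketch, assuming the same gapminder CSV used earlier:
library("readr")
## read_csv() is the readr counterpart of read.csv(): faster, never converts
## strings to factors, and returns a tibble
gapminder_tbl <- read_csv("../data/gapminder_data.csv")
spec(gapminder_tbl)  # the column types readr guessed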
String processing with stringr
Real string processing for R.
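A minimal sketch of a few common stringr verbs (the example strings are made up):
library("stringr")
coats <- c("tabby", "tortoiseshell", "black")
str_detect(coats, "tort")             # which elements match a pattern
str_replace(coats, "black", "calico") # substitute matching text
str_to_upper(coats)                   # case conversion
str_length(coats)                     # string lengths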
Functional programming with purrr
Functional programming for the Tidyverse. The map family of functions replaces the apply family for most use cases. Map functions are strongly typed. For example, you can use purrr::map_chr() to extract nested data from a list:
## View the relevant map function
library("purrr")
library("jsonlite")
help(map_chr)
books <- fromJSON("../data/books.json")
## Returns vector
authors <- map_chr(books, ~.x$author)
- The ~ operation in purrr creates an anonymous function that applies to all the elements in the .x collection.
- Best overview in the as_mapper() documentation: https://purrr.tidyverse.org/reference/as_mapper.html
- Additional references
  - https://stackoverflow.com/a/53160041
  - https://stackoverflow.com/a/62488532
  - https://stackoverflow.com/a/44834671
(Optional) Database interfaces
Data frame joins with dplyr
- https://jozef.io/r006-merge/#alternatives-to-base-r
- https://dplyr.tidyverse.org/reference/mutate-joins.html
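A minimal sketch of a mutating join; the continent-code lookup table is invented for illustration:
library("dplyr")
## Hypothetical lookup table
continent_codes <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  code      = c("AF", "AM", "AS", "EU", "OC"))
## left_join() keeps every gapminder row and adds the matching code column
gapminder_coded <- left_join(gapminder, continent_codes, by = "continent")
head(gapminder_coded)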
Access databases using dplyr
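A minimal sketch using an in-memory SQLite database; it assumes the DBI, RSQLite, and dbplyr packages are installed:
library("dplyr")
library("DBI")
## Copy gapminder into a temporary in-memory database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, gapminder, name = "gapminder")
## dplyr verbs on a tbl() are translated to SQL and run in the database
tbl(con, "gapminder") %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(mean_lifeExp = mean(lifeExp, na.rm = TRUE)) %>%
  collect()  # bring the result back as a local tibble
dbDisconnect(con)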
Endnotes
Credits
- R for Reproducible Scientific Analysis: https://swcarpentry.github.io/r-novice-gapminder/
- Andrea Sánchez-Tapia's workshop: https://github.com/AndreaSanchezTapia/UCMerced_R
- Instructor notes for "R for Reproducible Scientific Analysis": https://swcarpentry.github.io/r-novice-gapminder/guide/
References
-
R Project documentation: https://cran.r-project.org/manuals.html
-
CRAN task views: https://cran.r-project.org/web/views/
-
R Cookbook: http://www.cookbook-r.com
-
RStudio cheat sheets: https://www.rstudio.com/resources/cheatsheets/
-
Matrix algebra operations in R: https://www.statmethods.net/advstats/matrix.html
-
RStudio keyboard shortcuts: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
-
RStudio shortcuts and tips: https://appsilon.com/rstudio-shortcuts-and-tips/
-
Why typeof() and class() give different outputs: https://stackoverflow.com/a/8857411
-
How to get function code from the different object systems: https://stackoverflow.com/questions/19226816/how-can-i-view-the-source-code-for-a-function
-
Various approaches to contrast coding: https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
If you tell R that a factor is ordered, it defaults to orthogonal polynomial contrasts. This means it assumes you want to check for linear, quadratic, and cubic trends. If you tell R that a factor is NOT ordered, it defaults to treatment contrasts: it compares all levels to a reference level. This probably doesn't make sense for lots of psych data. So if I say income is ordered, it calculates linear, quadratic, etc. trends for income, which is not only not what I want, but is inappropriate unless your groups are evenly spaced. Treatment contrasts test whether each level is significantly different from a reference level (i.e. the highest income group).
So if you want first-year stats output in a design with more than 2 levels in the factor, put this at the top of the R code:
options(contrasts = c("contr.sum","contr.poly"))
contr.sum is R's name for deviation contrasts, which you may recall as contrasts like -1, 0, 1.
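A minimal sketch of what those contrast matrices look like for a 3-level factor:
contr.treatment(3) # treatment contrasts: each level vs. the reference level
contr.sum(3)       # deviation contrasts: the -1, 0, 1 pattern
contr.poly(3)      # orthogonal polynomial contrasts: linear and quadratic trends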
Data Sources
- Gapminder data:
- JSON derived from Microsoft sample XML file: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)