Intermediate R

UW Tacoma

Charles Lanfear

May 4, 2019
Updated: May 4, 2019

1 / 63

Overview

  1. Subsetting Data

  2. Creating Variables

  3. Summarizing Data

  4. Tidying Data

  5. Joining Data

  6. Resources for Further Learning

2 / 63

Setup

3 / 63

Packages Used Today

This workshop focuses on using packages from the tidyverse.

The tidyverse is a collection of R packages which share a design philosophy, syntax, and data structures.

The tidyverse includes the most used packages in the R world: dplyr and ggplot2

You can install the entire tidyverse with the following:

install.packages("tidyverse")

We will also use the gapminder and nycflights13 datasets:

install.packages("gapminder")
install.packages("nycflights13")
4 / 63

Subsetting Data with dplyr

5 / 63

But First, Pipes: %>%

dplyr uses the magrittr forward pipe operator, usually called simply a pipe. We write pipes as %>% (Ctrl+Shift+M or ⌘+Shift+M in RStudio).

Pipes take the object on the left and apply the function on the right: x %>% f(y) = f(x, y). Read out loud: "and then..."

library(dplyr)
library(gapminder)
gapminder %>% filter(country == "Canada") %>% head(2)
## # A tibble: 2 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.

Pipes save us typing, make code readable, and allow chaining like above, so we use them all the time when manipulating data frames.

6 / 63

Using Pipes

Pipes are clearest to read when you have each function on a separate line.

take_this_data %>%
  do_first_thing(with = this_value) %>%
  do_next_thing(using = that_value) %>% ...

The object on the left of the pipe is passed as the first argument of the function on the right. Any additional arguments go inside that function call.

If you ever find yourself piping a function where data are not the first argument, use . in the data argument instead.

gapminder %>% lm(pop ~ year, data = .)
7 / 63

Pipe Assignment

When creating a new object from the output of piped functions, you place the assignment operator at the beginning.

lm_pop_year <- gapminder %>%
  lm(pop ~ year, data = .)

No matter how long the chain of functions is, assignment is always done at the top.

8 / 63

filter Data Frames

We saw filter() earlier: it subsets rows of data using logical conditions.

gapminder %>% filter(country == "Oman") %>% head(8)
## # A tibble: 8 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Oman Asia 1952 37.6 507833 1828.
## 2 Oman Asia 1957 40.1 561977 2243.
## 3 Oman Asia 1962 43.2 628164 2925.
## 4 Oman Asia 1967 47.0 714775 4721.
## 5 Oman Asia 1972 52.1 829050 10618.
## 6 Oman Asia 1977 57.4 1004533 11848.
## 7 Oman Asia 1982 62.7 1301048 12955.
## 8 Oman Asia 1987 67.7 1593882 18115.

What is this doing?

9 / 63

Logical Operators

Logical operators test boolean conditions and return TRUE, FALSE, or NA.

filter() returns rows when the condition is TRUE.

We used == for testing "equals": country == "Oman".

There are many other logical operators:

  • !=: not equal to
  • >, >=, <, <=: less than, less than or equal to, etc.
  • %in%: checks whether a value matches any element of the vector on its right

Or we can combine multiple logical conditions:

  • &: both conditions need to hold (AND)
  • |: at least one condition needs to hold (OR)
  • !: inverts a logical condition (TRUE becomes FALSE, FALSE becomes TRUE)
10 / 63
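As a quick runnable illustration (not from the original slides), several of these operators can be combined inside a single filter() call; commas between conditions act like &:

```r
library(dplyr)
library(gapminder)

# 2007 rows that are either very poor OR very long-lived, but NOT in Europe
gapminder %>%
  filter(year == 2007,
         (gdpPercap < 500 | lifeExp > 80),
         continent != "Europe") %>%
  head(3)
```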

Multiple Conditions Example

Let's say we want observations from Oman after 1980 and through 2000.

gapminder %>%
  filter(country == "Oman" &
           year > 1980 &
           year <= 2000)
## # A tibble: 4 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Oman Asia 1982 62.7 1301048 12955.
## 2 Oman Asia 1987 67.7 1593882 18115.
## 3 Oman Asia 1992 71.2 1915208 18617.
## 4 Oman Asia 1997 72.5 2283635 19702.
11 / 63

%in% Operator

Common use case: Filter rows to things in some set.

We can use %in% like == but for matching any element in the vector on its right.[1]

former_yugoslavia <- c("Bosnia and Herzegovina", "Croatia",
                       "Macedonia", "Montenegro", "Serbia", "Slovenia")
yugoslavia <- gapminder %>% filter(country %in% former_yugoslavia)
tail(yugoslavia, 2)
## # A tibble: 2 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Slovenia Europe 2002 76.7 2011497 20660.
## 2 Slovenia Europe 2007 77.9 2009245 25768.

[1] The c() function is how we make vectors in R, which are an important data type.

12 / 63

Sorting: arrange()

Along with filtering the data to see certain rows, we might want to sort it:

yugoslavia %>% arrange(year, desc(pop))
## # A tibble: 60 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Serbia Europe 1952 58.0 6860147 3581.
## 2 Croatia Europe 1952 61.2 3882229 3119.
## 3 Bosnia and Herzegovina Europe 1952 53.8 2791000 974.
## 4 Slovenia Europe 1952 65.6 1489518 4215.
## 5 Montenegro Europe 1952 59.2 413834 2648.
## 6 Serbia Europe 1957 61.7 7271135 4981.
## 7 Croatia Europe 1957 64.8 3991242 4338.
## 8 Bosnia and Herzegovina Europe 1957 58.4 3076000 1354.
## 9 Slovenia Europe 1957 67.8 1533070 5862.
## 10 Montenegro Europe 1957 61.4 442829 3682.
## # ... with 50 more rows

The data are sorted by ascending year and descending pop.

13 / 63

Keeping Columns: select()

Not only can we subset rows, but we can include specific columns (and put them in the order listed) using select().

yugoslavia %>% select(country, year, pop) %>% head(4)
## # A tibble: 4 x 3
## country year pop
## <fct> <int> <int>
## 1 Bosnia and Herzegovina 1952 2791000
## 2 Bosnia and Herzegovina 1957 3076000
## 3 Bosnia and Herzegovina 1962 3349000
## 4 Bosnia and Herzegovina 1967 3585000
14 / 63

Dropping Columns: select()

We can instead drop only specific columns with select() using - signs:

yugoslavia %>% select(-continent, -pop, -lifeExp) %>% head(4)
## # A tibble: 4 x 3
## country year gdpPercap
## <fct> <int> <dbl>
## 1 Bosnia and Herzegovina 1952 974.
## 2 Bosnia and Herzegovina 1957 1354.
## 3 Bosnia and Herzegovina 1962 1710.
## 4 Bosnia and Herzegovina 1967 2172.
15 / 63

Helper Functions for select()

select() has a variety of helper functions like starts_with(), ends_with(), and contains(), or can be given a range of contiguous columns as startvar:endvar. See ?select for details.

These are very useful if you have a "wide" data frame with column names following a pattern or ordering.

DYS Data Example

DYS %>% select(starts_with("married"))
DYS %>% select(ends_with("18"))
16 / 63
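Since the DYS data aren't available here, a runnable sketch of the same idea using gapminder:

```r
library(dplyr)
library(gapminder)

# Keep country plus every column whose name contains a "p"
# (lifeExp, pop, gdpPercap)
gapminder %>%
  select(country, contains("p")) %>%
  head(2)
```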

Renaming Columns with select()

We can rename columns using select(), but that drops everything that isn't mentioned:

yugoslavia %>%
  select(Life_Expectancy = lifeExp) %>%
  head(4)
## # A tibble: 4 x 1
## Life_Expectancy
## <dbl>
## 1 53.8
## 2 58.4
## 3 61.9
## 4 64.8
17 / 63

Safer: Rename Columns with rename()

rename() renames variables using the same syntax as select() without dropping unmentioned variables.

yugoslavia %>%
  select(country, year, lifeExp) %>%
  rename(Life_Expectancy = lifeExp) %>%
  head(4)
## # A tibble: 4 x 3
## country year Life_Expectancy
## <fct> <int> <dbl>
## 1 Bosnia and Herzegovina 1952 53.8
## 2 Bosnia and Herzegovina 1957 58.4
## 3 Bosnia and Herzegovina 1962 61.9
## 4 Bosnia and Herzegovina 1967 64.8
18 / 63

Creating Variables

19 / 63

mutate()

In dplyr, you can add new columns to a data frame using mutate().

yugoslavia %>%
  filter(country == "Serbia") %>%
  select(year, pop, lifeExp) %>%
  mutate(pop_million = pop / 1000000,
         life_exp_past_40 = lifeExp - 40) %>%
  head(5)
## # A tibble: 5 x 5
## year pop lifeExp pop_million life_exp_past_40
## <int> <int> <dbl> <dbl> <dbl>
## 1 1952 6860147 58.0 6.86 18.0
## 2 1957 7271135 61.7 7.27 21.7
## 3 1962 7616060 64.5 7.62 24.5
## 4 1967 7971222 66.9 7.97 26.9
## 5 1972 8313288 68.7 8.31 28.7

Note you can create multiple variables in a single mutate() call by separating the expressions with commas.

20 / 63

ifelse()

A common function used in mutate() (and in general in R programming) is ifelse(). It returns a vector of values depending on a logical test.

ifelse(test = x == y, yes = first_value, no = second_value)

Output from ifelse() if x==y is...

  • TRUE: first_value - the value for yes =

  • FALSE: second_value - the value for no =

  • NA: NA - because you can't test for NA with an equality!

For example:

example <- c(1, 0, NA, -2)
ifelse(example > 0, "Positive", "Not Positive")
## [1] "Positive" "Not Positive" NA "Not Positive"
21 / 63

ifelse() Example

yugoslavia %>%
  mutate(short_country =
           ifelse(country == "Bosnia and Herzegovina",
                  "B and H", as.character(country))) %>%
  select(short_country, year, pop) %>%
  arrange(year, short_country) %>%
  head(3)
## # A tibble: 3 x 3
## short_country year pop
## <chr> <int> <int>
## 1 B and H 1952 2791000
## 2 Croatia 1952 3882229
## 3 Montenegro 1952 413834

Read this as "For each row, if country equals 'Bosnia and Herzegovina', make short_country equal to 'B and H', otherwise make it equal to that row's value of country."

This is a simple way to change some values but not others!

22 / 63

case_when()

case_when() performs multiple ifelse() operations at once: it creates a new variable whose values are determined by multiple logical statements. This is useful for making categorical variables or for building variables from combinations of other variables.

gapminder %>%
  mutate(gdpPercap_ordinal =
           case_when(
             gdpPercap < 700 ~ "low",
             gdpPercap >= 700 & gdpPercap < 800 ~ "moderate",
             TRUE ~ "high")) %>% # value when all other statements are FALSE
  slice(6:9) # get rows 6 through 9
## # A tibble: 4 x 7
## country continent year lifeExp pop gdpPercap gdpPercap_ordinal
## <fct> <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Afghanistan Asia 1977 38.4 14880372 786. moderate
## 2 Afghanistan Asia 1982 39.9 12881816 978. high
## 3 Afghanistan Asia 1987 40.8 13867957 852. high
## 4 Afghanistan Asia 1992 41.7 16317921 649. low
23 / 63

Summarizing Data

24 / 63

General Aggregation: summarize()

summarize() takes your column(s) of data and computes something using every row:

  • Count how many rows there are
  • Calculate the mean
  • Compute the sum
  • Obtain a minimum or maximum value

You can use any function in summarize() that aggregates multiple values into a single value (like sd(), mean(), or max()).

25 / 63

summarize() Example

For the year 1982, let's get the number of observations, total population, mean life expectancy, and range of life expectancy for former Yugoslavian countries.

yugoslavia %>%
  filter(year == 1982) %>%
  summarize(n_obs = n(),
            total_pop = sum(pop),
            mean_life_exp = mean(lifeExp),
            range_life_exp = max(lifeExp) - min(lifeExp))
## # A tibble: 1 x 4
## n_obs total_pop mean_life_exp range_life_exp
## <int> <int> <dbl> <dbl>
## 1 5 20042685 71.3 3.94

These new variables are calculated using all of the rows in yugoslavia.

26 / 63

Avoiding Repetition: summarize_at()

Maybe you need to calculate the mean and standard deviation of a bunch of columns. With summarize_at(), put the variables to compute over inside vars() (using select() syntax), then supply the functions to apply as a list of formulas.

yugoslavia %>%
  filter(year == 1982) %>%
  summarize_at(vars(lifeExp, pop), list(~mean(.), ~sd(.)))
## # A tibble: 1 x 4
## lifeExp_mean pop_mean lifeExp_sd pop_sd
## <dbl> <dbl> <dbl> <dbl>
## 1 71.3 4008537 1.60 3237282.

Note it automatically names the summarized variables based on the functions used to summarize.

27 / 63

Avoiding Repetition

Other functions:

There are additional dplyr functions similar to summarize_at():

  • summarize_all() and mutate_all() summarize / mutate all variables sent to them in the same way. For instance, getting the mean and standard deviation of an entire dataframe:
dataframe %>% summarize_all(list(~mean(.), ~sd(.)))
  • summarize_if() and mutate_if() summarize / mutate all variables that satisfy some logical condition. For instance, summarizing every numeric column in a dataframe at once:
dataframe %>% summarize_if(is.numeric, list(~mean(.), ~sd(.)))

You can use all of these to avoid typing out the same code repeatedly!

28 / 63
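A concrete, runnable version of the second pattern using gapminder (with a named list of formulas, the newer replacement for funs()):

```r
library(dplyr)
library(gapminder)

# Mean and SD of every numeric column at once;
# names like year_mean, year_sd are generated automatically
gapminder %>%
  summarize_if(is.numeric, list(mean = ~mean(.), sd = ~sd(.)))
```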

group_by()

The special function group_by() changes how subsequent functions operate on the data, most importantly summarize().

Functions after group_by() are computed within each group, as defined by the unique values of the grouping variables, rather than over all rows at once.

Typically the variables you group by will be integers, factors, or characters, and not continuous real values.

29 / 63

group_by() example

yugoslavia %>%
  group_by(year) %>%
  summarize(num_countries = n_distinct(country),
            total_pop = sum(pop),
            total_gdp_per_cap = sum(pop * gdpPercap) / total_pop) %>%
  head(5)
## # A tibble: 5 x 4
## year num_countries total_pop total_gdp_per_cap
## <int> <int> <int> <dbl>
## 1 1952 5 15436728 3030.
## 2 1957 5 16314276 4187.
## 3 1962 5 17099107 5257.
## 4 1967 5 17878535 6656.
## 5 1972 5 18579786 8730.

Because we did group_by() with year then used summarize(), we get one row per value of year!

Each value of year is its own group!

30 / 63

Window Functions

Grouping can also be used with mutate() or filter() to give rank orders within a group, lagged values, and cumulative sums. You can read more in dplyr's window functions vignette.

yugoslavia %>%
  select(country, year, pop) %>%
  filter(year >= 2002) %>%
  group_by(country) %>%
  mutate(lag_pop = lag(pop, order_by = year),
         pop_chg = pop - lag_pop) %>%
  head(4)
## # A tibble: 4 x 5
## # Groups: country [2]
## country year pop lag_pop pop_chg
## <fct> <int> <int> <int> <int>
## 1 Bosnia and Herzegovina 2002 4165416 NA NA
## 2 Bosnia and Herzegovina 2007 4552198 4165416 386782
## 3 Croatia 2002 4481020 NA NA
## 4 Croatia 2007 4493312 4481020 12292
31 / 63

Tidying Data

32 / 63

Initial Spot Checks

First things to check after loading new data:

  • Did the last rows/columns from the original file make it in?

    • May need to use different package or manually specify range
  • Are the column names in good shape?

    • Modify a col_names= argument or fix with rename()
  • Are there "decorative" blank rows or columns to remove?

    • filter() or select() out those rows/columns
  • How are missing values represented: NA, " " (blank), . (period), 999?

    • Use mutate() with ifelse() to fix these (perhaps en masse with looping)
  • Are there character data (e.g. ZIP codes with leading zeroes) being incorrectly represented as numeric or vice versa?

    • Modify col_types= argument, or use mutate() and as.numeric()
33 / 63
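For the missing-data bullet, a minimal sketch of recoding a sentinel value like 999 to NA (toy data with hypothetical column names, not from the workshop):

```r
library(dplyr)

# Toy data where 999 stands in for "missing"
survey <- tibble(id = 1:4, income = c(52000, 999, 48000, 999))

# Recode the sentinel value to a true NA
survey %>%
  mutate(income = ifelse(income == 999, NA_real_, income))
```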

Slightly Messy Data

Program          Female  Male
Evans School         10     6
Arts & Sciences       5     6
Public Health         2     3
Other                 5     1
  • What is an observation?
    • A group of students from a program of a given gender
  • What are the variables?
    • Program, Gender, Count
  • What are the values?
    • Program: Evans School, Arts & Sciences, Public Health, Other
    • Gender: Female, Male -- in the column headings, not its own column!
    • Count: spread over two columns!
34 / 63

Tidy Version

Program          Gender  Count
Evans School     Female     10
Evans School     Male        6
Arts & Sciences  Female      5
Arts & Sciences  Male        6
Public Health    Female      2
Public Health    Male        3
Other            Female      5
Other            Male        1

Each variable is a column.

Each observation is a row.

Ready to throw into ggplot() or a model!

35 / 63
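The reshaping that produces this tidy table can be sketched with tidyr::gather() (values typed in from the messy table above):

```r
library(dplyr)
library(tidyr)

# The messy table: one column per gender
enrollment <- tibble(
  Program = c("Evans School", "Arts & Sciences", "Public Health", "Other"),
  Female  = c(10, 5, 2, 5),
  Male    = c(6, 6, 3, 1))

# Rotate the gender columns down into key/value pairs
enrollment %>%
  gather(key = Gender, value = Count, Female, Male)
```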

Billboard Data

We're going to work with some ugly data: The Billboard Hot 100 for the year 2000.

We can load it like so:

library(readr) # Contains read_csv()
billboard_2000_raw <-
  read_csv(file = "https://github.com/clanfear/Intermediate_R_Workshop/raw/master/data/billboard.csv",
           col_types = paste(c("icccD", rep("i", 76)), collapse = ""))

col_types= is used to specify column types. See the readr documentation for details.

36 / 63

Billboard is Just Ugly-Messy

year artist track time date.entered wk1 wk2 wk3 wk4 wk5
2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 87 82 72 77 87
2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87 92 NA NA
2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70 68 67 66
2000 3 Doors Down Loser 4:24 2000-10-21 76 76 72 69 67
2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57 34 25 17 17
2000 98^0 Give Me Just One Nig... 3:24 2000-08-19 51 39 34 26 26
2000 A*Teens Dancing Queen 3:44 2000-07-08 97 97 96 95 100
2000 Aaliyah I Don't Wanna 4:15 2000-01-29 84 62 51 41 38
2000 Aaliyah Try Again 4:03 2000-03-18 59 53 38 28 21
2000 Adams, Yolanda Open My Heart 5:30 2000-08-26 76 76 74 69 68
2000 Adkins, Trace More 3:05 2000-04-29 84 84 75 73 73
2000 Aguilera, Christina Come On Over Baby (A... 3:38 2000-08-05 57 47 45 29 23

Week columns continue up to wk76!

37 / 63

Billboard

  • What are the observations in the data?

    • Week since entering the Billboard Hot 100 per song
  • What are the variables in the data?

    • Year, artist, track, song length, date entered Hot 100, week since first entered Hot 100 (spread over many columns), rank during week (spread over many columns)
  • What are the values in the data?

    • e.g. 2000; 3 Doors Down; Kryptonite; 3 minutes 53 seconds; April 8, 2000; Week 3 (stuck in column headings); rank 68 (spread over many columns)
38 / 63

Tidy Data

Tidy data (aka "long data") are such that:

  1. The values for a single observation are in their own row.
  2. The values for a single variable are in their own column.
  3. The observations are all of the same nature.

Why do we want tidy data?

  • Easier to understand many rows than many columns
  • Required for plotting in ggplot2
  • Required for many types of statistical procedures (e.g. hierarchical or mixed effects models)
  • Fewer confusing variable names
  • Fewer issues with missing values and "imbalanced" repeated measures data
39 / 63

tidyr

The tidyr package provides functions to tidy up data, similar to reshape in Stata or varstocases in SPSS. Key functions:

  • gather(): takes a set of columns and rotates them down to make two new columns (which you can name yourself):
    • A key that stores the original column names
    • A value with the values in those original columns
  • spread(): inverts gather() by taking two columns and rotating them up into multiple columns

  • separate(): pulls apart one column into multiple columns (common with gathered data where values had been embedded in column names)

    • extract_numeric() (superseded by readr::parse_number()) does a simple version of this for the common case when you just want to grab the number part
40 / 63


gather()

Let's use gather() to get the week and rank variables out of their current layout into two columns (big increase in rows, big drop in columns):

library(tidyr)
billboard_2000 <- billboard_2000_raw %>%
  gather(key = week, value = rank, starts_with("wk"))
dim(billboard_2000)
## [1] 24092 7

starts_with() and other helper functions from dplyr::select() work here too.

We could instead use: gather(key = week, value = rank, wk1:wk76) to pull out these contiguous columns.

41 / 63

gathered Weeks

head(billboard_2000)
## # A tibble: 6 x 7
## year artist track time date.entered week rank
## <int> <chr> <chr> <chr> <date> <chr> <int>
## 1 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk1 87
## 2 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 wk1 91
## 3 2000 3 Doors Down Kryptonite 3:53 2000-04-08 wk1 81
## 4 2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76
## 5 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 wk1 57
## 6 2000 98^0 Give Me Just One Nig... 3:24 2000-08-19 wk1 51

Now we have a single week column!

42 / 63

Gathering Better?

summary(billboard_2000$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 26.00 51.00 51.05 76.00 100.00 18785

This is an improvement, but we don't want to keep the 18785 rows with missing ranks (i.e. observations for weeks since entering the Hot 100 that the song was no longer on the Hot 100).

43 / 63

Gathering Better: na.rm

The argument na.rm = TRUE to gather() will remove rows with missing ranks.

billboard_2000 <- billboard_2000_raw %>%
  gather(key = week, value = rank, starts_with("wk"),
         na.rm = TRUE)
summary(billboard_2000$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 26.00 51.00 51.05 76.00 100.00
44 / 63

separate()

The track length column isn't analytically friendly. Let's convert it to a number rather than the character (minutes:seconds) format:

billboard_2000 <- billboard_2000 %>%
  separate(time, into = c("minutes", "seconds"),
           sep = ":", convert = TRUE) %>%
  mutate(length = minutes + seconds / 60) %>%
  select(-minutes, -seconds)
summary(billboard_2000$length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.600 3.667 3.933 4.031 4.283 7.833

sep = ":" tells separate() to split the column into two wherever it finds a colon (:).

Then we add seconds / 60 to minutes to produce a numeric length in minutes.

45 / 63

parse_number()

readr provides a convenience function, parse_number(), to grab just the numeric information from a column that mixes text and numbers:

billboard_2000 <- billboard_2000 %>%
  mutate(week = parse_number(week))
summary(billboard_2000$week)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 10.00 11.47 16.00 65.00

For more sophisticated conversion or pattern checking, you'll need to use string parsing (to be covered in week 8).

46 / 63

spread() Motivation

spread() is the opposite of gather(): use it when data for the same observation take up multiple rows.

Example of data that we probably want to spread (unless we want to plot each statistic in its own facet):

Group Statistic Value
A Mean 1.28
A Median 1.0
A SD 0.72
B Mean 2.81
B Median 2
B SD 1.33

A common cue to use spread() is having measurements of different quantities in the same column.

47 / 63

Before spread()

(too_long_data <- data.frame(Group = c(rep("A", 3), rep("B", 3)),
                             Statistic = rep(c("Mean", "Median", "SD"), 2),
                             Value = c(1.28, 1.0, 0.72, 2.81, 2, 1.33)))
## Group Statistic Value
## 1 A Mean 1.28
## 2 A Median 1.00
## 3 A SD 0.72
## 4 B Mean 2.81
## 5 B Median 2.00
## 6 B SD 1.33
48 / 63

After spread()

(just_right_data <- too_long_data %>%
   spread(key = Statistic, value = Value))
## Group Mean Median SD
## 1 A 1.28 1 0.72
## 2 B 2.81 2 1.33
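In tidyr 1.0 and later, spread() is likewise superseded by pivot_wider(); an equivalent call (a sketch, not from the slides) would be:

```r
library(tidyr)

too_long_data %>%
  pivot_wider(names_from = Statistic, values_from = Value)
```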
49 / 63

Charts of 2000: Data Prep

Let's look at songs that hit #1 at some point and look how they got there versus songs that did not:

# find best rank for each song
best_rank <- billboard_2000 %>%
  group_by(artist, track) %>%
  summarize(min_rank = min(rank),
            weeks_at_1 = sum(rank == 1)) %>%
  mutate(`Peak rank` = ifelse(min_rank == 1,
                              "Hit #1",
                              "Didn't #1"))

# merge onto original data
billboard_2000 <- billboard_2000 %>%
  left_join(best_rank, by = c("artist", "track"))

Note that because the "highest" rank is numerically lowest (1), we are summarizing with min().
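The weeks_at_1 calculation works because R coerces logical values to 0/1 when summing; a minimal illustration (hypothetical ranks for one song):

```r
rank <- c(1, 1, 3, 1, 2)

sum(rank == 1)  # counts the TRUEs: 3 weeks at #1
min(rank)       # "highest" (best) rank: 1
```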

50 / 63

Which Were #1 the Most Weeks?

billboard_2000 %>%
  select(artist, track, weeks_at_1) %>%
  distinct(artist, track, weeks_at_1) %>%
  arrange(desc(weeks_at_1)) %>%
  head(7)
## # A tibble: 7 x 3
## artist track weeks_at_1
## <chr> <chr> <int>
## 1 Destiny's Child Independent Women Pa... 11
## 2 Santana Maria, Maria 10
## 3 Aguilera, Christina Come On Over Baby (A... 4
## 4 Madonna Music 4
## 5 Savage Garden I Knew I Loved You 4
## 6 Destiny's Child Say My Name 3
## 7 Iglesias, Enrique Be With You 3
51 / 63

Getting Usable Dates

We have the date the songs first charted, but not the dates for later weeks. We can calculate these now that the data are tidy:

billboard_2000 <- billboard_2000 %>%
  mutate(date = date.entered + (week - 1) * 7)
billboard_2000 %>%
  arrange(artist, track, week) %>%
  select(artist, date.entered, week, date, rank) %>%
  head(4)
## # A tibble: 4 x 5
## artist date.entered week date rank
## <chr> <date> <dbl> <date> <int>
## 1 2 Pac 2000-02-26 1 2000-02-26 87
## 2 2 Pac 2000-02-26 2 2000-03-04 82
## 3 2 Pac 2000-02-26 3 2000-03-11 72
## 4 2 Pac 2000-02-26 4 2000-03-18 77

This works because date objects are in units of days—we just add 7 days per week to the start date.
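A minimal check of that date arithmetic, verifiable against the output above:

```r
as.Date("2000-02-26") + 7            # "2000-03-04" (week 2)
as.Date("2000-02-26") + (3 - 1) * 7  # "2000-03-11" (week 3)
```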

52 / 63

Joining Data

53 / 63

When Do We Need to Join Data?

  • Want to make columns using criteria too complicated for ifelse() or case_when()

    • We can work with small sets of variables then combine them back together.
  • Combine data stored in separate data sets: e.g. UW registrar information with police stop records.

    • Often large surveys are broken into different data sets for each level (e.g. household, individual, neighborhood)
54 / 63

Joining in Concept

We need to think about the following when we want to merge data frames A and B:

  • Which rows are we keeping from each data frame?

  • Which columns are we keeping from each data frame?

  • Which variables determine whether rows match?

55 / 63

Join Types: Rows and columns kept

There are many types of joins[1]...

  • A %>% left_join(B): keep all rows from A, matched with B wherever possible (NA when not), keep columns from both A and B

  • A %>% right_join(B): keep all rows from B, matched with A wherever possible (NA when not), keep columns from both A and B

  • A %>% inner_join(B): keep only rows from A and B that match, keep columns from both A and B

  • A %>% full_join(B): keep all rows from both A and B, matched wherever possible (NA when not), keep columns from both A and B

  • A %>% semi_join(B): keep rows from A that match rows in B, keep columns from only A

  • A %>% anti_join(B): keep rows from A that don't match a row in B, keep columns from only A

[1] Usually left_join() does the job.
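The row-keeping rules above can be seen on tiny toy data frames (made up for illustration):

```r
library(dplyr)

A <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
B <- data.frame(id = c(2, 3, 4), y = c("q", "r", "s"))

A %>% left_join(B, by = "id")   # 3 rows; y is NA where id == 1
A %>% inner_join(B, by = "id")  # 2 rows: ids 2 and 3
A %>% anti_join(B, by = "id")   # 1 row: id 1, columns from A only
```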

56 / 63

Matching Criteria

We say rows should match because they have some columns containing the same value. We list these in a by = argument to the join.

Matching Behavior:

  • No by: Match using all variables in A and B that have identical names

  • by = c("var1", "var2", "var3"): Match on identical values of var1, var2, and var3 in both A and B

  • by = c("Avar1" = "Bvar1", "Avar2" = "Bvar2"): Match identical values of Avar1 variable in A to Bvar1 variable in B, and Avar2 variable in A to Bvar2 variable in B

Note: If there are multiple matches, you'll get one row for each possible combination (except with semi_join() and anti_join()).

Need to get more complicated? Break it into multiple operations.
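A sketch of the named-by form, using hypothetical data frames where the key columns have different names:

```r
library(dplyr)

students <- data.frame(student_id = c(1, 2), name = c("Ana", "Bo"))
stops    <- data.frame(person_id  = c(2, 2),
                       date = c("2013-01-01", "2013-02-01"))

# Match student_id in students to person_id in stops.
# Ana has no match (NA date); Bo matches twice, producing two rows.
students %>% left_join(stops, by = c("student_id" = "person_id"))
```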

57 / 63

nycflights13 Data

We'll use data in the nycflights13 package.

library(nycflights13)

It includes five data frames, some of which contain missing data (NA):

  • flights: flights leaving JFK, LGA, or EWR in 2013
  • airlines: airline abbreviations
  • airports: airport metadata
  • planes: airplane metadata
  • weather: hourly weather data for JFK, LGA, and EWR

Note that these are separate data frames, each of which must be loaded separately:

data(flights)
data(airlines)
data(airports)
# and so on...
58 / 63

Join Example

Who manufactures the planes that flew to SeaTac?

flights %>%
  filter(dest == "SEA") %>%
  select(tailnum) %>%
  left_join(planes %>% select(tailnum, manufacturer),
            by = "tailnum") %>%
  count(manufacturer) %>%  # Count observations by manufacturer
  arrange(desc(n))         # Arrange data descending by count
## # A tibble: 6 x 2
## manufacturer n
## <chr> <int>
## 1 BOEING 2659
## 2 AIRBUS 475
## 3 AIRBUS INDUSTRIE 394
## 4 <NA> 391
## 5 BARKER JACK L 2
## 6 CIRRUS DESIGN CORP 2

Note that you can perform operations on the data inside functions such as left_join(); the result of those operations is what gets joined.

59 / 63

Visualization Preview

The next workshop will focus on visualization using ggplot2.

We could visualize the data we worked with today to understand it better.

60 / 63

Charts of 2000: ggplot2

library(ggplot2)
billboard_trajectories <-
  ggplot(data = billboard_2000,
         aes(x = week, y = rank, group = track,
             color = `Peak rank`)) +
  geom_line(aes(size = `Peak rank`), alpha = 0.4) +
  # rescale time: early weeks more important
  scale_x_log10(breaks = seq(0, 70, 10)) +
  scale_y_reverse() +  # want rank 1 on top, not bottom
  theme_classic() +
  xlab("Week") + ylab("Rank") +
  scale_color_manual("Peak Rank", values = c("black", "red")) +
  scale_size_manual("Peak Rank", values = c(0.25, 1)) +
  theme(legend.position = c(0.90, 0.25),
        legend.background = element_rect(fill = "transparent"))
61 / 63

Charts of 2000: Beauty!

Observation: There appears to be censoring around week 20 for songs falling out of the top 50 that I'd want to follow up on.

62 / 63

Resources

  • UW CSSS508: My University of Washington Introduction to R course which forms the basis for this workshop. All content including lecture videos is freely available.
  • R for Data Science online textbook by Garrett Grolemund and Hadley Wickham. One of many good R texts available, but importantly it is free and focuses on the tidyverse collection of R packages which are the modern standard for data manipulation and visualization in R.
  • Advanced R online textbook by Hadley Wickham. A great source for more in-depth and advanced R programming.
  • DataCamp: A source for interactive R tutorials (some free of charge).
  • swirl: Interactive tutorials inside R.
  • Useful RStudio cheatsheets on R Markdown, RStudio shortcuts, etc.
63 / 63
