Let's return to the (real!) 2021 crime data from London boroughs...
library(tidyverse)
library(broom)

metro_wide <- read_csv("https://clanfear.github.io/ioc_iqa/_data/metro_2021_violence_wide.csv")
head(metro_wide, 4)
## # A tibble: 4 × 17
##   borough deprivation subregion    pop  area month_1 month_2 month_3
##   <chr>   <chr>       <chr>      <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 Barkin… High        East      221495  36.1     469     449     561
## 2 Barnet  Medium      North     411275  86.7     530     580     772
## 3 Bexley  Low         East      256845  60.6     370     343     507
## 4 Brent   Medium      West      346437  43.2     643     669     792
## # ℹ 9 more variables: month_4 <dbl>, month_5 <dbl>, month_6 <dbl>,
## #   month_7 <dbl>, month_8 <dbl>, month_9 <dbl>, month_10 <dbl>,
## #   month_11 <dbl>, month_12 <dbl>
But in wide format this time
Wide data are data where at least one variable occupies multiple columns
dim(metro_wide) # Few rows, more columns
## [1] 32 17
names(metro_wide) # A column for each month!
## [1] "borough" "deprivation" "subregion" "pop" ## [5] "area" "month_1" "month_2" "month_3" ## [9] "month_4" "month_5" "month_6" "month_7" ## [13] "month_8" "month_9" "month_10" "month_11" ## [17] "month_12"
Here the month variable is in the column names and counts of violent crime are under each month
How could we plot crime by month with these data?
It'd be pretty hard with ggplot2!
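To see why, here's a sketch of what the wide-format approach would look like: every month column needs its own hand-written layer (illustrative only, not something you'd want to finish):

metro_wide |>
  ggplot() +
  geom_point(aes(x = 1, y = month_1)) + # one layer per month column...
  geom_point(aes(x = 2, y = month_2)) +
  geom_point(aes(x = 3, y = month_3))   # ...and nine more for months 4 to 12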
What we want is data in tidy format
Tidy data (aka "long data") are such that:

1. Each variable has its own column
2. Each observation has its own row
3. Each value¹ has its own cell
[1] What one value means is subjective—it could be an entire dataset
Why do we want tidy data?
Because {ggplot2} and most modelling functions in R expect one row per observation, so tidy data drop straight into them
So how do we tidy things up?
The {tidyr} package in the {tidyverse} was built to get data into tidy format
metro_long <- metro_wide |>
  pivot_longer(starts_with("month"),   # Take each month column
               names_to  = "month",    # Put col names in month column
               values_to = "violence") # Put values in violence column
metro_long |> select(borough, month, violence) |> head(3)
## # A tibble: 3 × 3
##   borough              month   violence
##   <chr>                <chr>      <dbl>
## 1 Barking and Dagenham month_1      469
## 2 Barking and Dagenham month_2      449
## 3 Barking and Dagenham month_3      561
dim(metro_long)
## [1] 384 7
Way fewer columns and way more rows: 32 boroughs × 12 months = 384 rows!
parse_number()
To make month analytically friendly, we can convert it from strings like "month_1" to numbers using parse_number()
metro_long <- metro_long |>
  mutate(month = parse_number(month))
head(metro_long)
## # A tibble: 6 × 7
##   borough          deprivation subregion    pop  area month violence
##   <chr>            <chr>       <chr>      <dbl> <dbl> <dbl>    <dbl>
## 1 Barking and Dag… High        East      221495  36.1     1      469
## 2 Barking and Dag… High        East      221495  36.1     2      449
## 3 Barking and Dag… High        East      221495  36.1     3      561
## 4 Barking and Dag… High        East      221495  36.1     4      564
## 5 Barking and Dag… High        East      221495  36.1     5      589
## 6 Barking and Dag… High        East      221495  36.1     6      617
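With month now numeric, the plot we wanted earlier takes only a few lines (a sketch; the aesthetic choices here are just one option):

metro_long |>
  ggplot(aes(x = month, y = violence, group = borough)) +
  geom_line(alpha = 0.4) +          # one line per borough
  scale_x_continuous(breaks = 1:12) # label every month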
Now we can run some models again—but first, let's see how to pivot back
pivot_wider()
Maybe you want wider data because your model needs it (e.g., SEM) or to make a data table
metro_wide_again <- metro_long |>
  pivot_wider(names_from   = month,     # Turn months into col names
              values_from  = violence,  # Turn violence into col values
              names_prefix = "month_")  # Start col names with month_
metro_wide_again |> head(3)
## # A tibble: 3 × 17
##   borough deprivation subregion    pop  area month_1 month_2 month_3
##   <chr>   <chr>       <chr>      <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 Barkin… High        East      221495  36.1     469     449     561
## 2 Barnet  Medium      North     411275  86.7     530     580     772
## 3 Bexley  Low         East      256845  60.6     370     343     507
## # ℹ 9 more variables: month_4 <dbl>, month_5 <dbl>, month_6 <dbl>,
## #   month_7 <dbl>, month_8 <dbl>, month_9 <dbl>, month_10 <dbl>,
## #   month_11 <dbl>, month_12 <dbl>
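As a quick sanity check that the round trip recovered the original layout (a sketch; all.equal() can flag cosmetic attribute differences on some tibbles):

identical(names(metro_wide), names(metro_wide_again)) # same columns, same order?
all.equal(metro_wide, metro_wide_again)               # same values?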
lm()
Let's fit a simple model predicting crime by month
lm_viol <- lm(violence ~ month, data = metro_long)
lm_viol |> tidy() |> select(term, estimate, std.error)
## # A tibble: 2 × 3
##   term        estimate std.error
##   <chr>          <dbl>     <dbl>
## 1 (Intercept)    553.      20.5
## 2 month           10.4      2.78
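In equation form (coefficients rounded as printed):

$$\widehat{\text{violence}} = 553 + 10.4 \times \text{month}$$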
But would we expect crime to increase (or decrease) from January to December?
Or is it more likely to range up and down through the year?
Looking at residuals can help us figure this out
lm_viol |>
  augment() |>
  ggplot(aes(x = month, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")
The residuals should have a mean of zero at every value of month
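We can also check this numerically by averaging the residuals within each month (a quick sketch):

lm_viol |>
  augment() |>
  group_by(month) |>
  summarize(mean_resid = mean(.resid)) # systematic departures from 0 suggest misfit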
With many data points it can be hard to tell if there is anything going on!
geom_smooth() can help us diagnose problems
By default, it fits a LOESS smoother or a spline—a flexible curved line
lm_viol |>
  augment() |>
  ggplot(aes(x = month, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth()
It looks like we're failing to account for a curved relationship
So why not make a curved line?
Recall that a squared term (e.g., $x^2$) in an equation creates a parabola:

$$y = 2 + 0.5x + 0.25x^2$$
curve(2 + 0.5*x + 0.25*x^2, from = -3, to = 1, ylab = "y")
We can create these in our regression model!
We can create the quadratic term directly in the formula using I(x^2) for any x¹
lm_viol_2 <- lm(violence ~ month + I(month^2), data = metro_long)
lm_viol_2 |> tidy() |> select(term, estimate, std.error, p.value)
## # A tibble: 3 × 4
##   term        estimate std.error  p.value
##   <chr>          <dbl>     <dbl>    <dbl>
## 1 (Intercept)   431.      33.6   1.24e-31
## 2 month          62.5     11.9   2.38e- 7
## 3 I(month^2)     -4.01     0.889 8.77e- 6
[1] We could also create a squared term in the data using mutate()
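That alternative would look something like this (a sketch; month_sq is a name we're inventing here):

metro_long_sq <- metro_long |>
  mutate(month_sq = month^2)                            # create the squared term in the data
lm(violence ~ month + month_sq, data = metro_long_sq)   # equivalent model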
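Written out (coefficients rounded as printed):

$$\widehat{\text{violence}} = 431 + 62.5 \times \text{month} - 4.01 \times \text{month}^2$$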
The statistical significance of the new term suggests there is indeed a non-linear association between month and violence
geom_smooth() can produce polynomials using the same formula
metro_long |>
  ggplot(aes(x = month, y = violence)) +
  geom_point() +
  geom_smooth(method  = "lm",
              formula = y ~ x + I(x^2))
That's a pretty strong curve!
Looks like violence peaks in the summer—something we'd expect theoretically!
We can use the residuals from the new model for a diagnostic plot
lm_viol_2 |>
  augment() |>
  ggplot(aes(x = month, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth()
Looks like the quadratic term fixed the problem!
There's still a lot of noise, but we don't appear to be systematically wrong anymore
lm_viol_2 |> tidy() |> select(term, estimate, std.error, p.value)
## # A tibble: 3 × 4
##   term        estimate std.error  p.value
##   <chr>          <dbl>     <dbl>    <dbl>
## 1 (Intercept)   431.      33.6   1.24e-31
## 2 month          62.5     11.9   2.38e- 7
## 3 I(month^2)     -4.01     0.889 8.77e- 6
How do we interpret this model?
Does it make sense to interpret the month
terms separately?
month squared can't increase without month increasing—we can't hold one constant
Remember: Coefficients are the slopes of lines, and curves have different slopes in different places!
In week one we used derivatives to calculate the slope of a curve
Given $y = a + x^n$, $\frac{dy}{dx} = nx^{n-1}$
So for our quadratic, $\hat{y} = b_0 + b_1x + b_2x^2$, the slope at any point is $\frac{d\hat{y}}{dx} = b_1 + 2b_2x$
For polynomials, the effect of a one unit change depends where you look:
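Plugging in the estimates from our quadratic model ($b_1 \approx 62.5$, $b_2 \approx -4.01$; rounded, so treat these as approximate):

$$\frac{d\widehat{\text{violence}}}{d\,\text{month}} = b_1 + 2b_2 \times \text{month} \approx 62.5 - 8.02 \times \text{month}$$

At month 3 the slope is about $62.5 - 8.02 \times 3 \approx 38.4$ (violence rising); at month 10 it is about $62.5 - 8.02 \times 10 \approx -17.7$ (falling); and it crosses zero near month $62.5 / 8.02 \approx 7.8$, the summer peak we saw in the plot.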
We'll spend more time with this later!
Let's go a step further with a cubic model and compare it to the other models
Polynomials obey a hierarchy rule: always include the lower-order terms
lm_viol_3 <- lm(violence ~ month + I(month^2) + I(month^3), data = metro_long)
rbind(glance(lm_viol), glance(lm_viol_2), glance(lm_viol_3)) |>
  select(r.squared, adj.r.squared)
## # A tibble: 3 × 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1    0.0352        0.0326
## 2    0.0840        0.0792
## 3    0.0845        0.0773
How do they compare?
Looks like the cubic model isn't helping much: R-squared barely moves and adjusted R-squared actually falls!
map()
You may find yourself running the same function repeatedly:
rbind(glance(lm_viol), glance(lm_viol_2), glance(lm_viol_3))
map() or lapply() can iterate across elements of any object:
list(lm_viol, lm_viol_2, lm_viol_3) |> # combine models into a list
  map(glance) |>                       # run glance() on each element
  list_rbind()                         # bind results into rows
## # A tibble: 3 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl>
## 1    0.0352        0.0326  188.      13.9 2.20e-4     1 -2555. 5117.
## 2    0.0840        0.0792  184.      17.5 5.51e-8     2 -2545. 5099.
## 3    0.0845        0.0773  184.      11.7 2.40e-7     3 -2545. 5100.
## # ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>
We won't use this much, but it is very useful!
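For reference, the base-R equivalent with lapply() is nearly identical (a sketch):

models <- list(lm_viol, lm_viol_2, lm_viol_3)
do.call(rbind, lapply(models, glance)) # lapply() then bind the rows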
anova()
We can test the models against each other formally with anova(), as before
anova(lm_viol, lm_viol_2, lm_viol_3)
## Analysis of Variance Table
##
## Model 1: violence ~ month
## Model 2: violence ~ month + I(month^2)
## Model 3: violence ~ month + I(month^2) + I(month^3)
##   Res.Df      RSS Df Sum of Sq       F    Pr(>F)
## 1    382 13547649
## 2    381 12861959  1    685690 20.2701 8.956e-06 ***
## 3    380 12854521  1      7438  0.2199    0.6394
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How do they compare?
No support for the cubic model here either (p = 0.64)!
Before, we plotted residuals against month—but a more general diagnostic is plotting residuals against the fitted values
lm_viol |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth()
This looks very familiar—it still suggests there's a curve!
This one looks a little different with our quadratic term included
lm_viol_2 |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth()
It still looks like we're doing a good job, though maybe there's a bit of poor prediction for very high values
Basic guidelines:

- Plot residuals against predictors and against the fitted values to look for systematic patterns
- If the residuals show a curve, consider adding a polynomial term
- Always include the lower-order terms (the hierarchy rule)
- Stop adding terms when they no longer improve the model (e.g., adjusted R-squared, anova())
Reading:
Huntington-Klein, N. (2022) The Effect: An Introduction to Research Design and Causality, New York, NY: Chapman and Hall/CRC Press.
Kaplan, J. (2022) Crime by the Numbers: A Criminologist's Guide to R