class: center, top, title-slide .title[ # Linear Models IV ] .subtitle[ ## IQA Lecture 7 ] .author[ ### Charles Lanfear ] .date[ ### 27 Nov 2024
Updated: 27 Nov 2024 ] --- # Today ### Joining Data ### Logarithms ### Moderation --- # Loading Data ``` r library(tidyverse) library(broom) library(janitor) ``` Often you have multiple sources of data that you want to use together .text-85[ ``` r metro_2021_crime <- read_csv("https://clanfear.github.io/ioc_iqa/_data/metro_2021_crime.csv") borough_deprivation <- read_csv("https://clanfear.github.io/ioc_iqa/_data/borough_deprivation.csv") borough_pop_density <- read_csv("https://clanfear.github.io/ioc_iqa/_data/borough_pop_density.csv") london_subregion <- read_csv("https://clanfear.github.io/ioc_iqa/_data/london_subregion.csv") ``` ] --- # Taking a Look .text-85[ .pull-left[ ``` r names(metro_2021_crime) ``` ``` ## [1] "borough" ## [2] "month" ## [3] "antisocial_behaviour" ## [4] "burglary" ## [5] "robbery" ## [6] "violence_and_sexual_offences" ``` ``` r names(london_subregion) ``` ``` ## [1] "subregion" "borough" ``` ``` r names(borough_pop_density) ``` ``` ## [1] "Borough" "Pop" "Area" ``` ``` r names(borough_deprivation) ``` ``` ## [1] "borough" "deprivation" ## [3] "dep_score" ``` ] .pull-right[ ``` r dim(metro_2021_crime) ``` ``` ## [1] 384 6 ``` <p style="margin-bottom:2.25cm;"> </p> ``` r dim(london_subregion) ``` ``` ## [1] 32 2 ``` ``` r dim(borough_pop_density) ``` ``` ## [1] 32 3 ``` ``` r dim(borough_deprivation) ``` ``` ## [1] 32 3 ``` ] ] --- # Joining Concepts We can **join** data frames using functions from `{dplyr}` (part of `{tidyverse}`) -- Before we use them, we need to think about how we want to join the data frames: -- * Which rows are we keeping from each data frame? -- * Which columns are we keeping from each data frame? -- * Which variables determine whether rows match? --- # Types of Join Given two dataframes, `df_a` and `df_b`, there are *many* ways to join them -- The most commonly used: * **Left Join**: `df_a |> left_join(df_b)` * Keep all columns and rows from `df_a` * Keep all columns but only rows that match from `df_b` -- * **Inner Join**: `df_a |> inner_join(df_b)` * Keep all columns but only rows that match from `df_a` * Keep all columns but only rows that match from `df_b` -- * **Full Join**: `df_a |> full_join(df_b)` * Keep all columns and rows from `df_a` * Keep all columns and rows from `df_b` -- `NA` values will be created when joined columns have no data for a given row (e.g., in full or left joins) --- # Same Names If the joined data frames share some column names, they will automatically match on those columns ``` r metro_2021_crime |> left_join(london_subregion) |> glimpse() ``` ``` ## Rows: 384 ## Columns: 7 ## $ borough <chr> "Barking and Dagenham", "Bark… ## $ month <date> 2021-01-01, 2021-02-01, 2021… ## $ antisocial_behaviour <dbl> 688, 585, 600, 531, 469, 507,… ## $ burglary <dbl> 87, 105, 81, 85, 106, 90, 75,… ## $ robbery <dbl> 30, 50, 49, 40, 47, 46, 42, 3… ## $ violence_and_sexual_offences <dbl> 469, 449, 561, 564, 589, 617,… ## $ subregion <chr> "East", "East", "East", "East… ``` --- # Differing Names `borough_pop_density` has a different naming style (title case) If matching variable names differ, you can specify what left side variable (`"borough"`) matches the right side variable (`"Borough"`) ``` r metro_2021_crime |> left_join(borough_pop_density, by = c("borough"="Borough")) |> glimpse() ``` ``` ## Rows: 384 ## Columns: 8 ## $ borough <chr> "Barking and Dagenham", "Bark… ## $ month <date> 2021-01-01, 2021-02-01, 2021… ## $ antisocial_behaviour <dbl> 688, 585, 600, 531, 469, 507,… ## $ burglary <dbl> 87, 105, 81, 85, 106, 90, 75,… ## $ robbery <dbl> 30, 50, 49, 40, 47, 46, 42, 3… ## $ violence_and_sexual_offences <dbl> 469, 449, 561, 564, 589, 617,… ## $ Pop <dbl> 221495, 221495, 221495, 22149… ## $ Area <dbl> 36.1, 36.1, 36.1, 36.1, 36.1,… ``` --- # Many Joins, Handle It We can chain many joins in order—and here we `clean_names()` to fix the capitalized names ahead of time ``` r metro_2021 <- metro_2021_crime |> left_join(london_subregion) |> left_join(borough_deprivation) |> left_join(borough_pop_density |> clean_names()) head(metro_2021, 3) ``` ``` ## # A tibble: 3 × 11 ## borough month antisocial_behaviour burglary robbery ## <chr> <date> <dbl> <dbl> <dbl> ## 1 Barking and Dage… 2021-01-01 688 87 30 ## 2 Barking and Dage… 2021-02-01 585 105 50 ## 3 Barking and Dage… 2021-03-01 600 81 49 ## # ℹ 6 more variables: violence_and_sexual_offences <dbl>, ## # subregion <chr>, deprivation <chr>, dep_score <dbl>, pop <dbl>, ## # area <dbl> ``` --- # Some Clean-Up Let's do a little bit of data management while we're at it ``` r metro_2021 <- metro_2021 |> rename(violence = violence_and_sexual_offences, asb = antisocial_behaviour) |> * mutate(month = lubridate::month(month), pop_den = (pop/1000)/area) ``` Things done: * Shorten a couple names for ease of use * Convert `month` to an integer using `{lubridate}`<sup>1</sup> * Create a population density (`pop_den`) in thousands per `\(km^2\)` .footnote[[1] `{lubridate}` is a *great* package for working with dates and times] --- ## Aside: Scientific notation When a number is very close to zero, R will display it in scientific notation: ``` r 0.0000000032 ``` ``` ## [1] 3.2e-09 ``` This is because it is much more compact! -- You can adjust it using `scipen` in **options()** ``` r options(scipen = 10) 0.0000000032 ``` ``` ## [1] 0.0000000032 ``` R will "prefer" to display standard notation until the output exceeds the width of the `scipen` value --- # Model from Last Time We created a curve with a **quadratic** functional form of `month` .pull-left[ .text-85[ ``` r lm_sq <- lm(violence ~ month + I(month^2), data = metro_2021) lm_sq |> tidy() |> select(term, estimate, std.error) ``` ``` ## # A tibble: 3 × 3 ## term estimate std.error ## <chr> <dbl> <dbl> ## 1 (Intercept) 431. 33.6 ## 2 month 62.5 11.9 ## 3 I(month^2) -4.01 0.889 ``` ] ] .pull-right[ ![](slides_linear-models-iv_files/figure-html/unnamed-chunk-15-1.svg)<!-- --> ] -- This fits our data well, but has drawbacks: * Uses two parameters * Complicated interpretation -- .text-center[ *There are simpler—more restrictive—curves we can fit* ] --- class: inverse # Logarithms ![:width 55%](img/logs.png) --- # One Parameter Curve `log()` provides an alternative single parameter curve .pull-left[ .text-85[ ``` r lm_log <- lm(violence ~ log(month), data = metro_2021) lm_log |> tidy() |> select(term, estimate, std.error) ``` ``` ## # A tibble: 2 × 3 ## term estimate std.error ## <chr> <dbl> <dbl> ## 1 (Intercept) 513. 23.8 ## 2 log(month) 64.3 13.1 ``` ] ] .pull-right[ ![](slides_linear-models-iv_files/figure-html/unnamed-chunk-17-1.svg)<!-- --> ] -- In this case, it doesn't look like a great fit * These models are not **nested** so they cannot be compared with `anova()` * *We'll see a tool to compare these shortly* -- .text-center[ *But first, what is `log()` anyway?* ] --- # Quick Maths Exponentiation, `\(b^{n} = x\)`, multiplies a number (the **base**, `\(b\)`) by itself `\(n\)` times: `$$5^2 = 5*5 = 25$$` -- Roots, `\(\sqrt[n]{x} = b\)`, give the base you'd need to raise to power `\(n\)` to get `\(x\)`: `$$\sqrt[2]{25} = 25 / 5 = 5$$` -- Logarithms, `\(log_{b}(x) = n\)`, give the `\(n\)` to which you raise `\(b\)` to get `\(x\)`: `$$log_{5}(25) = 2$$` .text-center[ *To get 25, you raise 5 (the base) to the power of 2* ] -- In statistics and maths, we often use *Euler's number* (~2.718), `\(e\)`, as a base * `\(log_{e}(x)\)` is the **natural logarithm** and `\(e^n\)` is the **exponential function** -- * `log(x)` is `\(log_{e}(x)\)` or `\(ln(x)\)` and `exp(n)` is `\(e^n\)` -- * `log(exp(x))` and `exp(log(x))` are both equal to `x`! --- # Natural Logs .pull-left[ ``` r exp(1:3) ``` ``` ## [1] 2.718282 7.389056 20.085537 ``` ``` r curve(exp(x), from = 1, to = 3) ``` ![](slides_linear-models-iv_files/figure-html/unnamed-chunk-18-1.svg)<!-- --> ] -- .pull-right[ ``` r log(1:3) ``` ``` ## [1] 0.0000000 0.6931472 1.0986123 ``` ``` r curve(log(x), from = 1, to = 3) ``` ![](slides_linear-models-iv_files/figure-html/unnamed-chunk-19-1.svg)<!-- --> ] -- These are useful because they produce *curves* with simple derivatives: `$$\frac{d}{dx}e^x = e^x\quad\quad\quad\quad \frac{d}{dx}log_e(x) = \frac{1}{x}$$` --- # `log(x)` Interpretation ``` r lm_log |> coef() ``` ``` ## (Intercept) log(month) ## 513.19271 64.31455 ``` Natural logs allow for a special interpretation called an **elasticity** or **partial elasticity** -- When only an independent variable is logged, we call it a **level-log** (*partial elasticity*) model: * Divide the `\(x\)` coefficient by 100 and it is the amount `\(y\)` differs due to a `\(1\%\)` difference in `\(x\)`<sup>1</sup> * A `\(1\%\)` increase in `month`<sup>2</sup> is associated with .64 more `violence` .footnote[ [1] More precisely: `\(p\%\)` higher `\(x\)` is associated with `\(b * log(1 + \frac{p}{100})\)` higher `\(y\)` ] --- # `log(y)` Interpretation ``` r lm(log(violence) ~ month, data = metro_2021) |> coef() ``` ``` ## (Intercept) month ## 6.26043831 0.01777349 ``` You can also log the *outcome* -- When only the dependent variable is logged, we call it a **log-level** (*partial elasticity*) model: * Multiply `\(x\)` coefficient by 100 and it is the percent `\(y\)` differs due to a difference in `\(x\)`<sup>1</sup> * An increase of 1 in `month` is associated with `\(1.7\%\)` higher `violence` .footnote[ [1] This is approximate; the exact version is a `\((e^{b}-1)*100\%\)` difference in `\(y\)` ] --- # Log-log Models ``` r lm(log(violence) ~ log(month), data = metro_2021) |> coef() ``` ``` ## (Intercept) log(month) ## 6.1909002 0.1111105 ``` If we log both `\(x\)` and `\(y\)` we get a **log-log** (*elasticity*) model -- When both `\(x\)` and `\(y\)` are logged: * The `\(x\)` coefficient is the percent `\(y\)` differs due to a percent difference in `\(x\)` * A `\(1\%\)` increase in `month` is associated with a `\(0.11\%\)` increase in `violence` --- # Working with Logs By compressing higher values, they reduce *skew* * This tends to reduce **heteroskedasticity** * **Heteroskedasticity** is variance that differs across model predictions * It only impacts standard errors in standard linear models<sup>1</sup> * Correct with robust errors, functional form, or alternate estimators * Makes some models easier to estimate .footnote[ [1] It causes serious problems in many other types of models ] -- Some warnings: * You can only take a log of a positive (non-zero) number! * *Don't just add 1 then log it!* * The curve is *inflexible* * The percentage interpretation is an *approximation* * Inaccurate for changes in `\(x\)` much smaller or greater than `\(1\)` --- class: inverse # Moderation ## Using Interaction Terms .text-150[ > Moderation in all things is best, but it's pretty hard to get excited about it. — Mason Cooley ] --- # Moderation **Moderation** occurs when the association between `\(x\)` and `\(y\)` is different across values of another variable `\(z\)` -- This implies a **multiplicative** relationship rather than just an **additive** one * **Additive**: `\(y = b_1x + b_2z\)` * **Multiplicative**: `\(y = b_1x + b_2z + b_3\color{#003C71}{xz}\)` -- Note that we obey **hierarchy**: Include `\(x\)` and `\(z\)` alone when using `\(x*z\)`<sup>1</sup> .footnote[ [1] There are *rare* cases when you drop one or both first order terms ] -- .pull-left[ Note that moderation is not indicated by DAGs<sup>2</sup> Moderation is a type of *functional form* like polynomials or logarithms ] .pull-right[ ![](slides_linear-models-iv_files/figure-html/mod-dag-1.svg)<!-- --> ] .footnote[ [2] Arrows pointing to other arrows are used in path diagrams but *not* DAGs ] --- # Visual Moderation Moderation produces *different slopes* for `\(x\)` at each value of `\(z\)` -- For categorical `\(z\)`, this may be represented with a regression line for each level of `\(z\)` .pull-left[ ``` r ggplot(metro_2021 , aes(x = pop_den, y = violence, color = deprivation)) + geom_point() + geom_smooth(method = "lm") ``` ] .pull-right[ ![](slides_linear-models-iv_files/figure-html/int-plot-1.svg) ] -- `pop_den` has a negative slope only for "medium" `deprivation` boroughs * *Crime is lower when population density is higher in boroughs with moderate deprivation* * *Crime is higher when population density is higher in boroughs with low or high deprivation* --- # Moderation Formula We can fit a model with moderation by *multiplying variables* in the formula ``` r lm_int <- lm(violence ~ deprivation*pop_den, data = metro_2021) lm_int |> tidy() |> select(term, estimate, statistic) ``` ``` ## # A tibble: 6 × 3 ## term estimate statistic ## <chr> <dbl> <dbl> ## 1 (Intercept) 632. 15.0 ## 2 deprivationLow -205. -3.87 ## 3 deprivationMedium 318. 5.83 ## 4 pop_den 6.38 1.80 ## 5 deprivationLow:pop_den 3.96 0.546 ## 6 deprivationMedium:pop_den -37.5 -7.16 ``` -- By default, it will obey **hierarchy** and include the first-order terms --- # Interpretation Interactions need to be interpreted using (partial) derivatives like polynomials Our equation: `$$y = 632 -205*depLow + 318*depMed + 6.4*popDen + \\ 4.0*depLow*popDen - 37*depMed*popDen$$` -- The estimated effect of `pop_den` is the *partial derivative*: `$$\beta_{dpopDen} = \frac{dviolence}{dpopDen} = 6.4 + 4*depLow + -37*depMed$$` -- We can calculate the effect of `pop_den` when `deprivation` is... `\begin{align} \text{Low: } & \beta_{dpopDen} = 6.4 + 4*1 + -37*0 =\hphantom{-} 10.4\\ \text{Moderate: } & \beta_{dpopDen} = 6.4 + 4*0 + -37*1 = -30.6\\ \text{High: } & \beta_{dpopDen} = 6.4 + 4*0 + -37*0 =\hphantom{-} 6.4 \end{align}` --- # Specification Test You can test if an interaction is warranted, because the **additive** model is nested inside the interaction model: .text-85[ ``` r lm_noint <- lm(violence ~ deprivation + pop_den, data = metro_2021) anova(lm_int, lm_noint) ``` ``` ## Analysis of Variance Table ## ## Model 1: violence ~ deprivation * pop_den ## Model 2: violence ~ deprivation + pop_den ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 378 8227307 ## 2 380 9551609 -2 -1324301 30.422 0.0000000000005614 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] --- # Fitted Values .pull-left[ ``` r lm_int |> augment() |> ggplot( aes(x = pop_den, y = .fitted, color = deprivation)) + geom_point() + geom_line() ``` ] .pull-right[ ![](slides_linear-models-iv_files/figure-html/int-fitted-1.svg) ] The fitted values are still all along lines... ... but there's a different line for each level of `deprivation`! --- class: inverse # Wrap-Up ## No reading ## Last assignment * Sent out by Friday