class: center, top, title-slide .title[ # Linear Models II ] .subtitle[ ## IQA Lecture 5 ] .author[ ### Charles Lanfear ] .date[ ### 20 Nov 2024
Updated: 15 Nov 2024 ] --- # Today ## Making Predictions ## Comparing Models --- # Setup As usual, let's start by loading the communities data and converting our categorical variables to factors with appropriate levels: .text-85[ ``` r library(tidyverse) library(broom) communities <- read_csv("https://clanfear.github.io/ioc_iqa/_data/communities.csv") |> mutate(across(c(incarceration, disadvantage), ~ factor(., levels = c("Low", "Medium", "High")))) ``` ] Again, we'll use `tidyverse` and `broom` today. --- class: inverse # Two small asides .pull-left[ ![:width 50%](img/soundgarden.jpg) ] .pull-right[ ![:width 50%](img/soundgarden.jpg) ] --- # Help Files and Methods Remember you can get help with a function using `?`: ``` r ?tidy ``` -- Some functions, like `tidy()`, work differently when you give them different types of **arguments**. -- These different ways of working are called **methods**—and they're secretly *different functions* -- If you give `tidy()` output from `lm()`, it actually runs `tidy.lm()`! ``` r ?tidy.lm # tidy() method for lm objects ``` -- `summary()` works this way too: ``` r ?summary.lm # summary method for lm objects ``` -- This is good to know when looking for help --- # DAGitty Let's say we have a DAG and we're not sure if we can identify the effect of `\(Z\)` on `\(Y\)` ![](slides_linear-models-ii_files/figure-html/sim-dag-1.svg)<!-- --> -- We could do it all by hand, writing down the paths and thinking it out. -- But thinking hurts brain, so we could instead use [`dagitty` (http://www.dagitty.net/dags.html)](http://www.dagitty.net/dags.html) to determine if we can identify an effect. It can be a little finicky, but it works *really* well—and is a great learning tool! --- class: inverse # Prediction ![:width 40%](img/nick-cage.PNG) --- # Fitted Values A regression model's predictions for each unit are called **fitted values** -- These are the `\(Y\)` values on the regression line given each unit's predictors -- .pull-left[ ``` r lm(crime_rate ~ pop_density, data = communities) |> * augment() |> # add .fitted * slice(1) |> # first row select(.fitted, pop_density) ``` ``` ## # A tibble: 1 × 2 ## .fitted pop_density ## <dbl> <dbl> ## 1 35.4 18.2 ``` ] .pull-right[ ![](slides_linear-models-ii_files/figure-html/unnamed-chunk-6-1.svg)<!-- --> ] Two tricks here: * `augment()` returns the original data with a new `.fitted` column * `slice(1)` is `filter()` using *row numbers* (e.g., give me only the *first row*) --- # Getting Fitted Values We can get **fitted values** using `fitted()` or `broom::augment()` ``` r ex_lm <- lm(crime_rate ~ pop_density + disadvantage, data = communities) fitted(ex_lm) |> head(4) # Returns them as a vector ``` ``` ## 1 2 3 4 ## 32.873785 43.959257 8.250988 24.512137 ``` -- ``` r augment(ex_lm) |> head(4) # Returns them added to original data ``` ``` ## # A tibble: 4 × 9 ## crime_rate pop_density disadvantage .fitted .resid .hat .sigma ## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> ## 1 28.6 18.2 Low 32.9 -4.24 0.0122 12.8 ## 2 58.6 21.4 Medium 44.0 14.6 0.0145 12.7 ## 3 14.2 10.0 Low 8.25 5.95 0.0111 12.7 ## 4 22.3 14.9 Medium 24.5 -2.22 0.0103 12.8 ## # ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl> ``` --- # Fitted Values ``` r lm(crime_rate ~ pop_density, data = communities) |> augment() |> ggplot(aes(x = pop_density, y = .fitted)) + geom_point() + geom_smooth(method = "lm") + theme_minimal() ``` ![](slides_linear-models-ii_files/figure-html/unnamed-chunk-9-1.svg)<!-- --> --- # Where They Come From .pull-left[ ``` r ex_lm |> #
getting coefficients tidy() |> select(term, estimate) ``` ``` ## # A tibble: 4 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -22.1 ## 2 pop_density 3.02 ## 3 disadvantageMedium 1.44 ## 4 disadvantageHigh 5.52 ``` ] .pull-right[ ``` r communities[1, c(2, 4)] ``` ``` ## # A tibble: 1 × 2 ## pop_density disadvantage ## <dbl> <fct> ## 1 18.2 Low ``` ``` r fitted(ex_lm)[1] ``` ``` ## 1 ## 32.87378 ``` ] -- Our regression formula: `$$\hat{y} = -22.1 + \color{#003C71}{3.02*PopDen} + \color{#8A1538}{1.44*DisMed} + \color{#880088}{5.52*DisHigh}$$` -- Multiplying through: `$$32.87 = -22.1 + \color{#003C71}{3.02*18.2} + \color{#8A1538}{1.44*0} + \color{#880088}{5.52*0}$$` --- # Counterfactual Values Sometimes we want to make a prediction using a set of **counterfactual** values—values we choose rather than take from a real observation -- We can do this using the equation: `$$\hat{y} = -22.1 + \color{#003C71}{3.02*PopDen} + \color{#8A1538}{1.44*DisMed} + \color{#880088}{5.52*DisHigh}$$` -- What is our prediction for the average crime rate in a high disadvantage place with a population density of 10? -- Plug in, multiply through: `$$13.62 = -22.1 + \color{#003C71}{3.02*10} + \color{#8A1538}{1.44*0} + \color{#880088}{5.52*1}$$` -- We can do this with `predict()` and the `newdata =` argument: ``` r predict(ex_lm, newdata = data.frame(pop_density = 10, disadvantage = "High")) ``` ``` ## 1 ## 13.69857 ``` .pull-right[ .footnote[ Only different due to rounding! ] ] --- # Interpreting Intercepts This highlights what the intercept means: `$$-22.1 = -22.1 + 3.02*0 + 1.44*0 + 5.52*0$$` The expected crime rate for an area with *0 population density* and *low disadvantage* (i.e., not medium and not high) -- ``` r predict(ex_lm, newdata = data.frame(pop_density = 0, disadvantage = "Low")) ``` ``` ## 1 ## -22.0631 ``` -- The intercept alone is often not a useful parameter to interpret. Crime rates can't be *negative*! -- ``` r communities |> pull(pop_density) |> min() ``` ``` ## [1] 5.207884 ``` .pull-right[ .footnote[ There aren't even any areas near 0! ] ] --- class: inverse # Comparing Models ![:width 80%](img/comparing-models.jpg) --- # Specification How you set a model up is called the model's **specification** -- It includes: * The variables you include in your statistical model * DAGs (and theory) help us with this! * We'll mess with this today -- * The functional forms of those variables * DAGs don't help us with this (but theory can) * We'll mess with functional forms *next week* -- There are two tools to help us with this: -- * **Fit statistics**: Numbers that indicate how well a model fits the data -- * **Specification tests**: Hypothesis tests to check if our model gets "better" (or "worse") when we make changes -- .text-center[ *So what does it mean to be better or worse? Or to fit well?* ] --- # `\(R^2\)` The `\(R^2\)` value is a **fit statistic** that indicates the proportion of variation in the outcome explained by the model -- * Values closer to 1 mean the model explains more -- * Values closer to 0 mean it explains less -- `broom::glance()` is a quick way to get `\(R^2\)` (and some other fit statistics) ``` r lm_1 <- lm(crime_rate ~ disadvantage, data = communities) glance(lm_1) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.0619 0.0555 21.2 9.79 7.61e-5 2 -1340. 2688. ## # ℹ 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>, ## # nobs <int> ``` `disadvantage` alone is only explaining 6.2% of the variation
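--

To see where this number comes from, here is a quick sketch (not in the original slides) computing `\(R^2\)` by hand as one minus the residual sum of squares over the total sum of squares:

``` r
# R-squared by hand: 1 - (residual variation / total variation);
# this should reproduce the r.squared column from glance(lm_1)
1 - sum(residuals(lm_1)^2) /
  sum((communities$crime_rate - mean(communities$crime_rate))^2)
```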
--- # Comparing `\(R^2\)` ``` r lm_2 <- lm(crime_rate ~ disadvantage + pop_density, data = communities) lm_3 <- lm(crime_rate ~ disadvantage + pop_density + area, data = communities) *rbind(glance(lm_1), # rbind() stacks output as rows glance(lm_2), glance(lm_3)) |> select(r.squared) ``` ``` ## # A tibble: 3 × 1 ## r.squared ## <dbl> ## 1 0.0619 ## 2 0.662 ## 3 0.663 ``` -- Adding `pop_density` explained a ton of the variation—but `area` didn't add much to that. * We went from 6.2% to 66.2% to 66.3% explained variation * This suggests `pop_density` is a strong predictor! --- # Limitations of `\(R^2\)` `\(R^2\)` has drawbacks: * Adding variables *always* increases it, even if they're irrelevant * You can't *reduce* the explained variation by adding variables! -- * Some things are harder or easier to predict * `\(R^2 = 0.2\)` is great in some applications * `\(R^2 = 0.8\)` is mediocre in others -- * It doesn't give us a *decision rule* for whether one specification is better than another * i.e., how much it must increase to make a new variable worth including -- * A true `\(R^2\)` doesn't *exist* for some types of models * We won't see these until the end of the course! --- # Why not include everything? Theory should be the first guide on whether to include (or omit) variables -- But, all else equal, simple models have advantages: -- * More statistical power / smaller standard errors<sup>1</sup> .footnote[[1] Including variables that are completely *independent* of treatment can *increase* statistical power, however!] -- * Easier to interpret * What does an effect of `\(X\)` on `\(Y\)` conditional on `\(Z, W, K, J,\)` and `\(P\)` *mean*? -- * Simple models are often more robust * Less likely to accidentally control for a front door or collider -- **Parsimony** is generally considered a good goal for inference. **Parsimonious** models are ones that use fewer (or as few as possible) variables to accomplish their goal --- # (Multi)collinearity **Collinearity** is one practical reason to prefer parsimony -- **(Multi)collinearity** occurs when two or more independent variables (predictors) are *highly correlated* with each other -- Consequences: * The model has difficulty separating their effects * They're explaining the same variation in the outcome! * You get imprecise estimates with large standard errors -- For example, if `\(y \approx b_1x + b_2z\)` but `\(r(x,z) = 0.9\)` * `\(b_1\)` and `\(b_2\)` won't be biased * On *average* they'll be correct * `\(b_1\)` and `\(b_2\)` will be *unstable* * Standard errors will be large * Estimate magnitudes will be erratic --- # Collinearity Example ``` r cor_mat <- matrix(c(1, 0.9, 0.3, 0.9, 1, 0.3, 0.3, 0.3, 1), nrow = 3) ex_data <- MASS::mvrnorm(n = 250, rep(0, 3), cor_mat) |> data.frame() |> setNames(c("x", "z", "y")) # add column names cor(ex_data) # correlation matrix ``` ``` ## x z y ## x 1.0000000 0.8933791 0.4221414 ## z 0.8933791 1.0000000 0.3680623 ## y 0.4221414 0.3680623 1.0000000 ``` -- ``` r lm(y ~ x + z, data = ex_data) |> tidy() ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -0.0841 0.0623 -1.35 0.179 ## 2 x 0.502 0.139 3.60 0.000382 ## 3 z -0.0520 0.149 -0.350 0.727 ``` This instability can even *flip signs*!
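--

One way to see the instability for yourself (a sketch using the simulated `ex_data` from above, not from the original slides): refit the same model on bootstrap resamples and watch the coefficient on `x` bounce around.

``` r
# Refit y ~ x + z on five bootstrap resamples of ex_data and keep the
# estimate for x each time; with r(x, z) = 0.9 it will swing widely
set.seed(343)
map_dfr(1:5, \(i) {
  boot_data <- slice_sample(ex_data, n = 250, replace = TRUE)
  lm(y ~ x + z, data = boot_data) |> tidy() |> filter(term == "x")
})
```

Try the same thing with `y ~ x` alone and the estimates will be far more stable—collinearity, not sample size, is what makes them erratic here.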
--- # Adjusted `\(R^2\)` The **adjusted `\(R^2\)`** favours parsimony by *penalizing* the `\(R^2\)` for every additional parameter (coefficient): `$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1}$$` where `\(n\)` is the number of observations and `\(k\)` the number of predictors -- ``` r rbind(glance(lm_1), glance(lm_2), glance(lm_3)) |> select(r.squared, adj.r.squared) ``` ``` ## # A tibble: 3 × 2 ## r.squared adj.r.squared ## <dbl> <dbl> ## 1 0.0619 0.0555 ## 2 0.662 0.658 ## 3 0.663 0.659 ``` The difference for the first model is because `disadvantage` in `lm_1` adds *two* parameters ("Medium" and "High"), which causes a penalty -- Some drawbacks: * Still doesn't give us a hard rule on whether to use a variable * It can no longer be interpreted as the proportion of variance explained --- # Information Criteria Another set of fit statistics that encourage parsimony are the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) ``` r rbind(glance(lm_1), glance(lm_2), glance(lm_3)) |> select(AIC, BIC) ``` ``` ## # A tibble: 3 × 2 ## AIC BIC ## <dbl> <dbl> ## 1 2688. 2703. ## 2 2384. 2402. ## 3 2384. 2407. ``` * *Lower* values of information criteria are better * BIC is more commonly used and penalizes extra parameters more than AIC * Rule of thumb: a difference of ~4 or more suggests notably better fit --- # Specification Tests **Specification tests** are hypothesis tests for evaluating how well a model fits -- They can be used to test if... * One model fits better than another -- * Coefficients are equal to a value (e.g., 0) or to each other * The t-test in regression output is doing this! * Useful if you're considering *combining* two variables -- * Many other assumptions are violated -- .text-center[ *We'll do all of these things by the end of the term, but for now, let's focus on **comparing models** * ] --- # `anova()` for Comparisons The `anova()` function can be used to compare linear models ``` r anova(lm_2, lm_3) ``` ``` ## Analysis of Variance Table ## ## Model 1: crime_rate ~ disadvantage + pop_density ## Model 2: crime_rate ~ disadvantage + pop_density + area ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 296 47986 ## 2 295 47748 1 238.59 1.4741 0.2257 ``` This produces an F-test of whether a given specification significantly explains more variance (sum of squares)—that is, whether it *fits better* -- Here `area` does *not* statistically significantly improve the model -- This statistical test requires one model be **nested** in the other --- # Nested Models Models are nested if every variable in the *less complex* model is included in the *more complex* model -- When models are nested, we say... * The less complex one is the **restricted model** * The more complex one is the **unrestricted model** -- *What exactly is **restricted**?* -- Excluding a variable from a model is equivalent to setting its *coefficient to zero* * i.e., we *restrict the coefficient to zero* when we omit that variable -- When we exclude a path from a DAG, we *assume a coefficient of zero for that path* * We can test this assumption using `anova()` and other specification tests * e.g., if our DAG indicated no direct effect of `area` on `crime_rate`, the prior test would support that assumption --- # Another Example ``` r anova(lm_1, lm_2, lm_3) ``` ``` ## Analysis of Variance Table ## ## Model 1: crime_rate ~ disadvantage ## Model 2: crime_rate ~ disadvantage + pop_density ## Model 3: crime_rate ~ disadvantage + pop_density + area ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 297 133081 ## 2 296 47986 1 85095 525.7421 <2e-16 *** ## 3 295 47748 1 239 1.4741 0.2257 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ```
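--

A check you can run yourself (a sketch, not from the original slides): when two nested models differ by a *single* coefficient, the `anova()` F-test is equivalent to the t-test on that coefficient (`\(F = t^2\)`), so `tidy(lm_3)` gives the same p-value for `area` as the last row above:

``` r
# The t-test on area in lm_3 is equivalent to the F-test comparing
# lm_2 and lm_3 (F = t^2 for a single restriction), so the p-value
# should match the 0.2257 in the table above
tidy(lm_3) |> filter(term == "area")
```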
We can give `anova()` any number of models * Best to start with the *least complex* or *most restricted* * Each model is compared to the one directly above it --- # Warning Adding a collider to a model will almost always improve fit * `anova()` only tests if the model fits better * `anova()` doesn't know about your DAG -- .text-85[ ``` r lm_4 <- lm(crime_rate ~ disadvantage + incarceration, data = communities) anova(lm_1, lm_4) ``` ``` ## Analysis of Variance Table ## ## Model 1: crime_rate ~ disadvantage ## Model 2: crime_rate ~ disadvantage + incarceration ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 297 133081 ## 2 295 117321 2 15760 19.814 8.427e-09 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] -- **Theory** is your first guide for what to include or not. This test is mainly to see if it is okay to exclude a *potential confounder* --- class: inverse # Wrap-Up Next week: * **Make-up lecture on Monday at 2-4 PM** Reading: * Huntington-Klein, N. (2022) *The Effect: An Introduction to Research Design and Causality*, New York, NY: Chapman and Hall/CRC Press. * Read Chapter 12: Opening the Toolbox and Chapter 13: Regression only up to (but not including) 13.2.2 Polynomials