class: center, top, title-slide .title[ # Linear Models I ] .subtitle[ ## IQA Lecture 4 ] .author[ ### Charles Lanfear ] .date[ ### 6 Nov 2024
Updated: 6 Nov 2024 ] --- # Today * Programming: * `across()` * Linear models: * Categorical predictors * Binary: t-test * 3+ category: ANOVA * Controlling for variables * Demeaning with dummies * Residuals * Partial correlations * How linear models work * Simulating data * Back doors * Colliders --- # Setup Like usual, let's start by loading the communities data. First, let's convert our categorical variables to factors with appropriate levels .text-85[ ``` r library(tidyverse) library(broom) # Going to use this a bunch today communities <- read_csv("https://clanfear.github.io/ioc_iqa/_data/communities.csv") |> mutate(across(c(incarceration, disadvantage), ~ factor(., levels = c("Low", "Medium", "High"))), area = factor(area, levels = c("Rural", "Urban"))) ``` ] We can use `across()` as a shortcut to perform the same operation on multiple variables -- This is the same as writing each variable out by hand: .text-85[ ``` r mutate(incarceration = factor(incarceration, levels = c("Low", "Medium", "High")), disadvantage = factor(disadvantage, levels = c("Low", "Medium", "High"))) ``` ] --- # Another `across()` You can also give `across()` a list of functions and it will apply them to each variable ``` r communities |> group_by(incarceration) |> summarize(across(c(crime_rate, pop_density), list(mu = ~mean(.), sd = ~sd(.)))) ``` ``` ## # A tibble: 3 × 5 ## incarceration crime_rate_mu crime_rate_sd pop_density_mu ## <fct> <dbl> <dbl> <dbl> ## 1 Low 17.7 16.1 12.8 ## 2 Medium 20.9 18.0 14.2 ## 3 High 37.0 25.3 17.5 ## # ℹ 1 more variable: pop_density_sd <dbl> ``` This calculated a `mean()` and `sd()` for each of `crime_rate` and `pop_density` within each level of `incarceration` with very little code! --- class: inverse # Linear Models ### Different predictors --- # Continuous .pull-left[ ``` r lm(crime_rate ~ pop_density, data = communities) |> coef() # extract coefficients ``` ``` ## (Intercept) pop_density ## -20.776945 3.093886 ``` We've seen a lot of linear models with continuous predictors. It is easy to visualize what the model is doing with a line laid over a scatterplot ] .pull-right[ ``` r ggplot(communities, aes(x = pop_density, y = crime_rate)) + geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-5-1.svg)<!-- --> ] --- # Dummy Variables What is a regression model doing when you have a binary predictor? -- .pull-left[ ``` r lm(crime_rate ~ area, data = communities) |> coef() ``` ``` ## (Intercept) areaUrban ## 10.43129 30.18226 ``` The data look weird and `geom_smooth()` doesn't even do anything! Why do the data look like this? ] .pull-right[ ``` r ggplot(communities, aes(x = area, y = crime_rate)) + geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-7-1.svg)<!-- --> ] --- # A Slight Modification .pull-left[ ``` r lm(crime_rate ~ area, data = communities) |> coef() ``` ``` ## (Intercept) areaUrban ## 10.43129 30.18226 ``` It turns out the linear model actually treats a binary variable as a numeric one that takes values of 0 or 1.
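You can see this coding for yourself with base R's `model.matrix()`—a quick check (a sketch, assuming the `communities` data loaded earlier):

``` r
# Peek at the design matrix R builds behind the scenes:
# the areaUrban column is 0 for "Rural" and 1 for "Urban"
model.matrix(~ area, data = communities) |> head()
```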
When `areaUrban` is 0 (`area` is `"Rural"`) the mean `crime_rate` is `\(10.43\)`. When `areaUrban` is 1 (`area` is `"Urban"`) the mean is `\(10.43 + 30.18 = 40.61\)` ] .pull-right[ ``` r communities |> mutate(urban = as.numeric(area=="Urban")) |> ggplot(aes(x = urban, y = crime_rate)) + geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-9-1.svg)<!-- --> ] --- # Mean Differences .pull-left[ Linear regression is just estimating means by group! ``` r dummy_model <- lm(crime_rate ~ area, data = communities) dummy_model |> coef() ``` ``` ## (Intercept) areaUrban ## 10.43129 30.18226 ``` The intercept is the average crime rate for rural places. The value for `areaUrban` is the **difference** between rural and urban places ] .pull-right[ Compare the model to group-specific average crime rates: ``` r communities |> group_by(area) |> summarize(crime_rate = mean(crime_rate)) ``` ``` ## # A tibble: 2 × 2 ## area crime_rate ## <fct> <dbl> ## 1 Rural 10.4 ## 2 Urban 40.6 ``` `\(\text{Urban} = 40.61 = 10.43 + 30.18\)` ] These are saying the exact same thing! --- # Hypothesis Testing Significance testing with dummy variables works just like with any other predictor .text-85[ ``` r summary(dummy_model)$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.43129 1.270202 8.21231 6.660396e-15 ## areaUrban 30.18226 1.814574 16.63325 2.123665e-44 ``` ] -- This is *identical* to a two-sample t-test! .text-85[ ``` r t.test(crime_rate ~ area, data = communities, var.equal=TRUE) ``` ``` ## ## Two Sample t-test ## ## data: crime_rate by area ## t = -16.633, df = 298, p-value < 2.2e-16 ## alternative hypothesis: true difference in means between group Rural and group Urban is not equal to 0 ## 95 percent confidence interval: ## -33.75326 -26.61125 ## sample estimates: ## mean in group Rural mean in group Urban ## 10.43129 40.61355 ``` ] .pull-right-40[ .footnote[ This is a significance test for a difference in group means—the same as our dummy variable model! ] ] --- # More Categories When a variable has more than two categories, all but one category are converted to mutually exclusive binary variables .pull-left[ ``` r lm(crime_rate ~ incarceration, data = communities) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 17.7 ## 2 incarcerationMedium 3.22 ## 3 incarcerationHigh 19.3 ``` ] .pull-right[ ``` r communities |> group_by(incarceration) |> summarize(crime_rate = mean(crime_rate)) ``` ``` ## # A tibble: 3 × 2 ## incarceration crime_rate ## <fct> <dbl> ## 1 Low 17.7 ## 2 Medium 20.9 ## 3 High 37.0 ``` ] Here our intercept represents the mean when `incarceration` is neither `"Medium"` nor `"High"`. Our **reference category** is `"Low"` --- # ANOVA Some of you may be familiar with ANOVA for analysing how 3+ category variables relate to a continuous outcome .text-85[ ``` r aov(crime_rate ~ incarceration, data = communities) |> summary() ``` ``` ## Df Sum Sq Mean Sq F value Pr(>F) ## incarceration 2 20991 10496 25.79 4.7e-11 *** ## Residuals 297 120866 407 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 ``` ] -- Turns out ANOVA is just a standard linear model with a categorical variable ``` r aov(crime_rate ~ incarceration, data = communities) |> coef() ``` ``` ## (Intercept) incarcerationMedium incarcerationHigh ## 17.726374 3.217159 19.265125 ``` -- Until you get to very advanced methods (e.g., mixed effects models), ANOVA is just a less flexible linear model with more complicated terminology --- # As an `lm()` .text-85[ ``` r summary(lm(crime_rate ~ incarceration, data = communities)) ``` ``` ## ## Call: ## lm(formula = crime_rate ~ incarceration, data = communities) ## ## Residuals: ## Min 1Q Median 3Q Max ## -34.553 -14.044 -5.796 11.041 73.811 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 17.726 2.070 8.565 5.97e-16 *** ## incarcerationMedium 3.217 2.850 1.129 0.26 ## incarcerationHigh 19.265 2.897 6.649 1.41e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 20.17 on 297 degrees of freedom ## Multiple R-squared: 0.148, Adjusted R-squared: 0.1422 ## F-statistic: 25.79 on 2 and 297 DF, p-value: 4.7e-11 ``` ] .footnote[ Note the F-statistic for the regression is the F value for the ANOVA ] --- class: inverse # Controls ![:width 50%](img/yoda.gif) --- # Controlling for Variables Remember that we control for variables for the purpose of **identifying** something we're interested in -- In causal research, this is some **treatment effect**. * In the following examples, we'll assume we want to identify `\(X \rightarrow Y\)` -- In other cases, it may just be some association we're interested in. -- To know if a control does what you want (e.g., closes a back door), it is important to understand: * What including that control implies for your model * Exactly what including it really does numerically --- # Categorical Controls We include a categorical control `\(Z\)` if we believe some of `\(X\)` and `\(Y\)`'s apparent relationship is due to *some `\(Z\)` groups having different average levels of `\(X\)` and `\(Y\)`* -- If we remove group-specific averages of both `\(X\)` and `\(Y\)`, we can get rid of that .text-85[ ``` r residualized_data <- communities |> group_by(area) |> mutate(pop_density_res = pop_density - mean(pop_density), #<< # Subtract mean! * crime_rate_res = crime_rate - mean(crime_rate)) ``` ] -- .pull-left[ .text-85[ ``` r lm(crime_rate ~ pop_density + area, data = communities) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -23.0 ## 2 pop_density 3.36 ## 3 areaUrban -3.39 ``` ] ] .pull-right[ .text-85[ ``` r lm(crime_rate_res ~ pop_density_res, data = residualized_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.03e-15 ## 2 pop_density_res 3.36e+ 0 ``` ] ] Note the identical `pop_density` slopes!
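If you want to check this numerically rather than by eye, here is a quick comparison (a sketch, reusing the `communities` and `residualized_data` objects above):

``` r
# The pop_density slope from the dummy-variable model should equal the
# slope from regressing the demeaned outcome on the demeaned predictor
b_dummy    <- coef(lm(crime_rate ~ pop_density + area, data = communities))["pop_density"]
b_demeaned <- coef(lm(crime_rate_res ~ pop_density_res, data = residualized_data))["pop_density_res"]
all.equal(unname(b_dummy), unname(b_demeaned))  # TRUE, up to floating-point error
```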
--- ## Visualising It .pull-left[ ``` r ggplot(communities, aes(x = pop_density, y = crime_rate, color = area)) + geom_point() + geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-22-1.svg)<!-- --> ] .pull-right[ ``` r ggplot(residualized_data, * aes(x = pop_density_res, * y = crime_rate_res, color = area))+ geom_point() + geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-23-1.svg)<!-- --> ] Rural and urban data points now occupy a similar area due to subtracting their average `\(X\)` and `\(Y\)` values—this gives us their **residuals** --- # Residuals Residuals are the difference between model predictions and the actual data .text-85[ ``` r lm_res <- lm(crime_rate ~ pop_density, data = communities) |> augment() ``` ] .pull-left[ .text-85[ ``` r ggplot(lm_res, aes(x = pop_density, y = crime_rate)) + geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-25-1.svg)<!-- --> ] ] .pull-right[ .text-85[ ``` r ggplot(lm_res, # augment() aes(x = pop_density, # gives us * y = .resid)) + # residuals! geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-26-1.svg)<!-- --> ] ] -- .text-center[ The residuals are what our model *doesn't explain* ] --- # `car::avPlot()` This type of plot is sometimes called an **added variable plot** .pull-left[ ``` r ggplot(residualized_data, aes(x = pop_density_res, y = crime_rate_res))+ geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-27-1.svg)<!-- --> ] -- .pull-right[ ``` r lm(crime_rate ~ pop_density + area, data = communities) |> car::avPlot( variable = "pop_density") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-28-1.svg)<!-- --> ] These show the regression line after *controlling for other variables*—and unlike our manual `ggplot2` approach, `avPlot()` can adjust for any number of other variables at once --- # Continuous Variables We include continuous controls if some of `\(X\)` and `\(Y\)`'s relationship is because `\(Z\)` makes `\(X\)` and `\(Y\)` higher/lower—sometimes called **spuriousness** -- If we remove variation in `\(X\)` and `\(Y\)` due to `\(Z\)`, we can fix this -- Here we use bivariate models predicting what we want to residualise .text-85[ ``` r residualized_data <- communities %>% mutate(area_num = as.numeric(area == "Urban"), * crime_rate_res = residuals(lm(crime_rate ~ pop_density, data = .)), * area_res = residuals(lm(area_num ~ pop_density, data = .))) ``` ] -- .pull-left[ .text-85[ ``` r lm(crime_rate ~ pop_density + area, data = communities) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -23.0 ## 2 pop_density 3.36 ## 3 areaUrban -3.39 ``` ] ] .pull-right[ .text-85[ ``` r *lm(crime_rate_res ~ area_res, * data = residualized_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 7.77e-17 ## 2 area_res -3.39e+ 0 ``` ] ] --- ## Visualising It .pull-left[ Original data: ``` r ggplot(residualized_data, aes(x = area_num, y = crime_rate, color = area)) + geom_point() + geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-32-1.svg)<!-- --> ] .pull-right[ Residuals of `\(X\)` and `\(Y\)`: ``` r ggplot(residualized_data, * aes(x = area_res, * y = crime_rate_res, color = area))+ geom_point() +
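  # overlay the linear fit (drawn in black) on the residualized points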
geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-33-1.svg)<!-- --> ] -- .text-center[ *Wait, wat? I thought `area` was binary?* ] --- # wat Yep, that's not a mistake—our model predicted *how urban* each point was (on a 0 to 1 scale) and the residual is the *unexpected level of urban*—weird! .pull-left[ ``` r ggplot(residualized_data, * aes(x = area_res, * y = crime_rate_res, color = area))+ geom_point() + geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-34-1.svg)<!-- --> ] .pull-right[ ``` r lm(crime_rate ~ pop_density + area, data = communities) |> car::avPlot( variable = "areaUrban") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-35-1.svg)<!-- --> ] --- class: inverse # How It Works ![:width 80%](img/trebuchet.jpg) --- # Generating Data If you don't like doing mathematical proofs to figure out if something works, you can often test things with simulation -- `rnorm()` and similar functions randomly generate data *drawn* from a particular distribution .pull-left[ ``` r sample_size <- 10000 sim_data <- tibble(bd = * rnorm(sample_size, * mean = 3, sd = 1)) sim_data |> summarize(bd_mean = mean(bd), bd_sd = sd(bd)) ``` ``` ## # A tibble: 1 × 2 ## bd_mean bd_sd ## <dbl> <dbl> ## 1 2.98 1.01 ``` ] -- .pull-right[ ``` r ggplot(sim_data, aes(x = bd)) + geom_density() ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-37-1.svg)<!-- --> ] --- # Simulation Now we can add other variables to our data to generate a dataset that obeys the DAG on the right .pull-left-60[ .text-85[ ``` r sim_data <- sim_data |> mutate( x = rnorm(n(), 0.3*bd, 1), * y = rnorm(n(), 3 + 0.3*x + 0.3*bd, 1), coll = rnorm(n(), -0.2*x + -0.2*y, 1)) ``` ] ] .pull-right-40[ ![](slides_linear-models-i_files/figure-html/sim-dag-1.svg)<!-- --> ] -- "Y is equal to 3 plus `\(0.3*X\)` plus `\(0.3*BD\)`, with about 1 SD of error on average" -- Note that we generate the variables in causal order of the diagram * We started with `bd` (on prior slide) * We finished with `coll` --- # Estimates with Back Doors Because we know the **data generating process**—we created it—we know what our estimates should be .pull-left[ ``` r *lm(y ~ x + bd, data = sim_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 3.03 ## 2 x 0.299 ## 3 bd 0.290 ``` ] -- .pull-right[ ``` r *lm(y ~ x , # Leaving out bd data = sim_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 3.83 ## 2 x 0.379 ``` ] Excluding `bd` makes the effect of `x` larger than it should be—it is **magnified** -- This occurs whenever a back door has effects with the *same sign* on both `\(X\)` and `\(Y\)` (e.g., both positive)—opposite signs would **attenuate** the estimate --- # Estimates with Colliders Again, we know what effects should be identified and how to identify them. What if we add the collider? .pull-left[ ``` r *lm(y ~ x + bd, data = sim_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 3.03 ## 2 x 0.299 ## 3 bd 0.290 ``` ] -- .pull-right[ ``` r *lm(y ~ x + bd + coll, data = sim_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 4 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 2.91 ## 2 x 0.250 ## 3 bd 0.282 ## 4 coll -0.184 ``` ] This **attenuates** `\(X\)` but does nothing to `\(BD\)`—it isn't a collider for `\(BD\)`!
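Because we wrote the data generating process ourselves, we can also confirm the attenuation isn't a quirk of one random draw—a sketch that regenerates the same DAG with a fresh (arbitrary) seed and refits:

``` r
set.seed(123)  # arbitrary seed; any value works
sim_check <- tibble(bd = rnorm(10000, mean = 3, sd = 1)) |>
  mutate(x    = rnorm(n(), 0.3*bd, 1),
         y    = rnorm(n(), 3 + 0.3*x + 0.3*bd, 1),
         coll = rnorm(n(), -0.2*x + -0.2*y, 1))
# Conditioning on the collider again pulls the x estimate below the true 0.3
coef(lm(y ~ x + bd + coll, data = sim_check))["x"]
```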
.text-center[ *Let's look at a more severe example* ] --- # Selection (Collider) Bias Let's create the data from CRM yesterday. ``` r coll_data <- tibble( Skill = runif(500, 0, 10), Frequency = runif(500, 0, 10), Arrested = Frequency > Skill) arrested <- coll_data |> filter(Arrested) |> mutate(Who = "Arrested") everyone <- coll_data |> mutate(Who = "Everyone") ``` What's going on here? * *No relationship* between `Skill` at crime and `Frequency` of crime * Just two uniformly distributed variables * Anyone whose `Frequency` of crime exceeds their `Skill` gets arrested * Only these people are in `arrested` -- If we only see arrested people, we've *selected* on a collider—this is just like including one in a model --- # Collider Model .pull-left[ ``` r *lm(Frequency ~ Skill, data = everyone) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.45 ## 2 Skill 0.0845 ``` ] -- .pull-right[ ``` r *lm(Frequency ~ Skill, data = arrested) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.40 ## 2 Skill 0.627 ``` ] -- We know there is no real relationship, but because we **selected** data based on an outcome of both `Frequency` and `Skill`, it creates a **spurious** relationship -- .text-center[ *The collider is still producing bias without being explicitly in the model!* ] --- ## What about controlling? .pull-left[ Controlling for arrest: ``` r *lm(Frequency ~ Skill + Arrested, data = everyone) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -0.382 ## 2 Skill 0.567 ## 3 ArrestedTRUE 4.98 ``` ] .pull-right[ Only arrested people: ``` r *lm(Frequency ~ Skill, data = arrested) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.40 ## 2 Skill 0.627 ``` ] .text-center[ *We get bias either way!* ] --- # Visualizing Collider Bias ``` r bind_rows(arrested, everyone) |> ggplot(aes(x = Skill, y = Frequency, color = Who)) + facet_wrap(~Who) + geom_point() + geom_smooth(method = "lm", color = "black") + labs(x = "Skill at Crime", y = "Frequency of Crime") + theme_minimal(base_size = 16) + theme(legend.position = "none") ``` ![](slides_linear-models-i_files/figure-html/coll-plot-1.svg)<!-- --> --- class: inverse # Randomization --- # Non-Randomized Treatment Let's create example data with actual **potential outcomes** ``` r n <- 10000 po_data <- tibble( bd = runif( n, 0, 1), # Random uniform variable x = rbinom(n, 1, bd), # Random binary variable y0 = rnorm( n, 2*bd, 1), # untreated outcome * y1 = rnorm( n, 2*bd + 1, 1)) |> # treated outcome mutate(y = ifelse(x==1, y1, y0)) # Treatment just selects outcome ``` * `x` is a binary **treatment** with effect size of **1** * `y0` is the outcome if the unit is **untreated** * `y1` is the outcome if the unit is **treated** * `bd` is a backdoor predicting treatment *and* `y` * `bd` improves both potential outcomes *the same* amount * `bd` also makes treatment more likely .text-center[ *Units that get treated tend to have higher potential outcomes* ] --- # Estimation Like usual, if we can *see* the back door, we can identify the effect of `\(X\)` on `\(Y\)` .pull-left[ ``` r lm(y ~ x + bd, data = po_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 0.00277 ## 2 x 1.02 ## 3 bd 1.97 ``` ] .pull-right[ ``` r lm(y ~ x, data = po_data) |> tidy() |> select(term, estimate) ```
``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 0.663 ## 2 x 1.68 ``` ] -- .text-center[ *But even if we can't, we can randomize it away* ] -- `sample()` is an easy way to randomly select units ``` r sample(0:1, 20, replace = TRUE) # randomly select 0 or 1, 20 times ``` ``` ## [1] 1 1 0 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1 0 ``` --- # Random Assignment Let's select random units to receive the treatment ``` r po_data <- po_data |> mutate(treat = sample(0:1, n(), replace=TRUE), yt = ifelse(treat==1, y1, y0)) ``` All the treatment does is determine which potential outcome we see -- .pull-left[ ``` r lm(y ~ x + bd, data = po_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 0.00277 ## 2 x 1.02 ## 3 bd 1.97 ``` ] .pull-right[ ``` r lm(yt ~ treat, data = po_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 0.993 ## 2 treat 1.02 ``` ] .text-center[ *Randomization closes back doors without adjustment!* ] --- # Randomization DAGs Randomization breaks back doors—even when we can't observe the responsible variables! .pull-left[ Confounded by an unobservable `\(BD\)` ![](slides_linear-models-i_files/figure-html/bd-dag-1.svg)<!-- --> ] .pull-right[ `\(BD\)` is irrelevant if we can randomize! ![](slides_linear-models-i_files/figure-html/ran-dag-1.svg)<!-- --> ] -- If we're interested in the effect of `\(X\)` here, we have to assume `\(T\)` is roughly equivalent to `\(X\)` (it is **consistent** with `\(X\)`) --- class: inverse # Wrap-Up Reminder: Class on the 13th **rescheduled** to Monday the 24th Assignment: * Posted tomorrow morning: * On website in units (both week 4 and 5) * Read instructions carefully * No trick questions intended * If you're writing a ton of code, you may be overthinking it * Doc types: * Word or equivalent + .R script file for code * Use comments to indicate which question your code applies to * Quarto / Rmarkdown if you're feeling fancy * Due 11:59 PM Friday 15 November * **Do not spend that entire time working on it** * **Do ask questions** No reading until next week