class: center, top, title-slide .title[ # Linear Models I ] .subtitle[ ## IQA Lecture 4 ] .author[ ### Charles Lanfear ] .date[ ### 6 Nov 2024
Updated: 6 Nov 2024 ] --- # Today * Programming: * `across()` * Linear models: * Categorical predictors * Binary: t-test * 3+ category: ANOVA * Controlling for variables * Demeaning with dummies * Residuals * Partial correlations * How linear models work * Simulating data * Back doors * Colliders --- # Setup Like usual, let's start by loading the communities data. First, let's convert our categorical variables to factors with appropriate levels .text-85[ ``` r library(tidyverse) library(broom) # Going to use this a bunch today communities <- read_csv("https://clanfear.github.io/ioc_iqa/_data/communities.csv") |> mutate(across(c(incarceration, disadvantage), ~ factor(., levels = c("Low", "Medium", "High"))), area = factor(area, levels = c("Rural", "Urban"))) ``` ] We can use `across()` as a shortcut to perform the same operation on multiple variables -- This is the same as writing each variable out by hand: .text-85[ ``` r mutate(incarceration = factor(incarceration, levels = c("Low", "Medium", "High")), disadvantage = factor(disadvantage, levels = c("Low", "Medium", "High"))) ``` ] --- # Another `across()` You can also give `across()` a list of functions and it will apply them to each variable ``` r communities |> group_by(incarceration) |> summarize(across(c(crime_rate, pop_density), list(mu = ~mean(.), sd = ~sd(.)))) ``` ``` ## # A tibble: 3 × 5 ## incarceration crime_rate_mu crime_rate_sd pop_density_mu ## <fct> <dbl> <dbl> <dbl> ## 1 Low 17.7 16.1 12.8 ## 2 Medium 20.9 18.0 14.2 ## 3 High 37.0 25.3 17.5 ## # ℹ 1 more variable: pop_density_sd <dbl> ``` This calculated a `mean()` and `sd()` for each of `crime_rate` and `pop_density` within each level of `incarceration` with very little code! --- class: inverse # Linear Models ### Different predictors --- # Continuous .pull-left[ ``` r lm(crime_rate ~ pop_density, data = communities) |> coef() # extract coefficients ``` ``` ## (Intercept) pop_density ## -20.776945 3.093886 ``` We've seen a lot of linear models with continuous predictors. It is easy to visualize what the model is doing with a line laid over a scatterplot ] .pull-right[ ``` r ggplot(communities, aes(x = pop_density, y = crime_rate)) + geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-5-1.svg)<!-- --> ] --- # Dummy Variables What is a regression model doing when you have a binary predictor? -- .pull-left[ ``` r lm(crime_rate ~ area, data = communities) |> coef() ``` ``` ## (Intercept) areaUrban ## 10.43129 30.18226 ``` The data look weird and `geom_smooth()` doesn't even do anything! Why do the data look like this? ] .pull-right[ ``` r ggplot(communities, aes(x = area, y = crime_rate)) + geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-7-1.svg)<!-- --> ] --- # A Slight Modification .pull-left[ ``` r lm(crime_rate ~ area, data = communities) |> coef() ``` ``` ## (Intercept) areaUrban ## 10.43129 30.18226 ``` It turns out the linear model actually treats a binary variable as a numeric one that takes values of 0 or 1.
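You can see this coding for yourself with base R's `model.matrix()`—a quick check (a sketch, assuming the `communities` data loaded earlier):

``` r
# Peek at the design matrix R builds behind the scenes:
# the areaUrban column is 0 for "Rural" and 1 for "Urban"
model.matrix(~ area, data = communities) |> head()
```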
When `areaUrban` is 0 (`area` is `"Rural"`) the mean `crime_rate` is `\(10.43\)`. When `areaUrban` is 1 (`area` is `"Urban"`) the mean is `\(10.43 + 30.18 = 40.61\)` ] .pull-right[ ``` r communities |> mutate(urban = as.numeric(area=="Urban")) |> ggplot(aes(x = urban, y = crime_rate)) + geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-9-1.svg)<!-- --> ] --- # Mean Differences .pull-left[ Linear regression is just estimating means by group! ``` r dummy_model <- lm(crime_rate ~ area, data = communities) dummy_model |> coef() ``` ``` ## (Intercept) areaUrban ## 10.43129 30.18226 ``` The intercept is the average crime rate for rural places. The value for `areaUrban` is the **difference** between rural and urban places ] .pull-right[ Compare the model to group-specific average crime rates: ``` r communities |> group_by(area) |> summarize(crime_rate = mean(crime_rate)) ``` ``` ## # A tibble: 2 × 2 ## area crime_rate ## <fct> <dbl> ## 1 Rural 10.4 ## 2 Urban 40.6 ``` `\(\text{Urban} = 40.61 = 10.43 + 30.18\)` ] These are saying the exact same thing! --- # Hypothesis Testing Significance testing with dummy variables works just like with any other predictor .text-85[ ``` r summary(dummy_model)$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.43129 1.270202 8.21231 6.660396e-15 ## areaUrban 30.18226 1.814574 16.63325 2.123665e-44 ``` ] -- This is *identical* to a two-sample t-test! .text-85[ ``` r t.test(crime_rate ~ area, data = communities, var.equal=TRUE) ``` ``` ## ## Two Sample t-test ## ## data: crime_rate by area ## t = -16.633, df = 298, p-value < 2.2e-16 ## alternative hypothesis: true difference in means between group Rural and group Urban is not equal to 0 ## 95 percent confidence interval: ## -33.75326 -26.61125 ## sample estimates: ## mean in group Rural mean in group Urban ## 10.43129 40.61355 ``` ] .pull-right-40[ .footnote[ This is a significance test for a difference in group means—the same as our dummy variable model! ] ] --- # More Categories When a variable has more than two categories, all but one category are converted to mutually exclusive binary variables .pull-left[ ``` r lm(crime_rate ~ incarceration, data = communities) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 17.7 ## 2 incarcerationMedium 3.22 ## 3 incarcerationHigh 19.3 ``` ] .pull-right[ ``` r communities |> group_by(incarceration) |> summarize(crime_rate = mean(crime_rate)) ``` ``` ## # A tibble: 3 × 2 ## incarceration crime_rate ## <fct> <dbl> ## 1 Low 17.7 ## 2 Medium 20.9 ## 3 High 37.0 ``` ] Here our intercept represents the mean when `incarceration` is neither `"Medium"` nor `"High"`. Our **reference category** is `"Low"` --- # ANOVA Some of you may be familiar with ANOVA for analysing how 3+ category variables relate to a continuous outcome .text-85[ ``` r aov(crime_rate ~ incarceration, data = communities) |> summary() ``` ``` ## Df Sum Sq Mean Sq F value Pr(>F) ## incarceration 2 20991 10496 25.79 4.7e-11 *** ## Residuals 297 120866 407 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 ``` ] -- Turns out ANOVA is just a standard linear model with a categorical variable ``` r aov(crime_rate ~ incarceration, data = communities) |> coef() ``` ``` ## (Intercept) incarcerationMedium incarcerationHigh ## 17.726374 3.217159 19.265125 ``` -- Until you get to very advanced methods (e.g., mixed effects models), ANOVA is just a less flexible linear model with more complicated terminology --- # As an `lm()` .text-85[ ``` r summary(lm(crime_rate ~ incarceration, data = communities)) ``` ``` ## ## Call: ## lm(formula = crime_rate ~ incarceration, data = communities) ## ## Residuals: ## Min 1Q Median 3Q Max ## -34.553 -14.044 -5.796 11.041 73.811 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 17.726 2.070 8.565 5.97e-16 *** ## incarcerationMedium 3.217 2.850 1.129 0.26 ## incarcerationHigh 19.265 2.897 6.649 1.41e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 20.17 on 297 degrees of freedom ## Multiple R-squared: 0.148, Adjusted R-squared: 0.1422 ## F-statistic: 25.79 on 2 and 297 DF, p-value: 4.7e-11 ``` ] .footnote[ Note the F-statistic for the regression is the F value for the ANOVA ] --- class: inverse # Controls ![:width 50%](img/yoda.gif) --- # Controlling for Variables Remember that we control for variables for the purpose of **identifying** something we're interested in -- In causal research, this is some **treatment effect**. * In the following examples, we'll assume we want to identify `\(X \rightarrow Y\)` -- In other cases, it may just be some association we're interested in. -- To know if a control does what you want (e.g., closes a back door), it is important to understand: * What including that control implies for your model * Exactly what including it really does numerically --- # Categorical Controls We include a categorical control `\(Z\)` if we believe some of `\(X\)` and `\(Y\)`'s apparent relationship is due to *some `\(Z\)` groups having different average levels of `\(X\)` and `\(Y\)`* -- If we remove group-specific averages of both `\(X\)` and `\(Y\)`, we can get rid of that .text-85[ ``` r residualized_data <- communities |> group_by(area) |> mutate(pop_density_res = pop_density - mean(pop_density), #<< # Subtract mean! * crime_rate_res = crime_rate - mean(crime_rate)) ``` ] -- .pull-left[ .text-85[ ``` r lm(crime_rate ~ pop_density + area, data = communities) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -23.0 ## 2 pop_density 3.36 ## 3 areaUrban -3.39 ``` ] ] .pull-right[ .text-85[ ``` r lm(crime_rate_res ~ pop_density_res, data = residualized_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.03e-15 ## 2 pop_density_res 3.36e+ 0 ``` ] ] Note the identical `pop_density` slopes!
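If you want to check this numerically rather than by eye, here is a quick comparison (a sketch, reusing the `communities` and `residualized_data` objects above):

``` r
# The pop_density slope from the dummy-variable model should equal the
# slope from regressing the demeaned outcome on the demeaned predictor
b_dummy    <- coef(lm(crime_rate ~ pop_density + area, data = communities))["pop_density"]
b_demeaned <- coef(lm(crime_rate_res ~ pop_density_res, data = residualized_data))["pop_density_res"]
all.equal(unname(b_dummy), unname(b_demeaned))  # TRUE, up to floating-point error
```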
--- ## Visualising It .pull-left[ ``` r ggplot(communities, aes(x = pop_density, y = crime_rate, color = area)) + geom_point() + geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-22-1.svg)<!-- --> ] .pull-right[ ``` r ggplot(residualized_data, * aes(x = pop_density_res, * y = crime_rate_res, color = area))+ geom_point() + geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-23-1.svg)<!-- --> ] Rural and urban data points now occupy a similar area due to subtracting their average `\(X\)` and `\(Y\)` values—this gives us their **residuals** --- # Residuals Residuals are the difference between model predictions and the actual data .text-85[ ``` r lm_res <- lm(crime_rate ~ pop_density, data = communities) |> augment() ``` ] .pull-left[ .text-85[ ``` r ggplot(lm_res, aes(x = pop_density, y = crime_rate)) + geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-25-1.svg)<!-- --> ] ] .pull-right[ .text-85[ ``` r ggplot(lm_res, # augment() aes(x = pop_density, # gives us * y = .resid)) + # residuals! geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-26-1.svg)<!-- --> ] ] -- .text-center[ The residuals are what our model *doesn't explain* ] --- # `car::avPlot()` This type of plot is sometimes called an **added variable plot** .pull-left[ ``` r ggplot(residualized_data, aes(x = pop_density_res, y = crime_rate_res))+ geom_point() + geom_smooth(method = "lm") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-27-1.svg)<!-- --> ] -- .pull-right[ ``` r lm(crime_rate ~ pop_density + area, data = communities) |> car::avPlot( variable = "pop_density") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-28-1.svg)<!-- --> ] These show the regression line after *controlling for other variables*—and unlike our manual `ggplot2` approach, `avPlot()` can adjust for any number of other variables at once --- # Continuous Variables We include continuous controls if some of `\(X\)` and `\(Y\)`'s relationship is because `\(Z\)` makes `\(X\)` and `\(Y\)` higher/lower—sometimes called **spuriousness** -- If we remove variation in `\(X\)` and `\(Y\)` due to `\(Z\)`, we can fix this -- Here we use bivariate models predicting what we want to residualise .text-85[ ``` r residualized_data <- communities %>% mutate(area_num = as.numeric(area == "Urban"), * crime_rate_res = residuals(lm(crime_rate ~ pop_density, data = .)), * area_res = residuals(lm(area_num ~ pop_density, data = .))) ``` ] -- .pull-left[ .text-85[ ``` r lm(crime_rate ~ pop_density + area, data = communities) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -23.0 ## 2 pop_density 3.36 ## 3 areaUrban -3.39 ``` ] ] .pull-right[ .text-85[ ``` r *lm(crime_rate_res ~ area_res, * data = residualized_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 7.77e-17 ## 2 area_res -3.39e+ 0 ``` ] ] --- ## Visualising It .pull-left[ Original data: ``` r ggplot(residualized_data, aes(x = area_num, y = crime_rate, color = area)) + geom_point() + geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-32-1.svg)<!-- --> ] .pull-right[ Residuals of `\(X\)` and `\(Y\)`: ``` r ggplot(residualized_data, * aes(x = area_res, * y = crime_rate_res, color = area))+ geom_point() +
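  # overlay the linear fit (drawn in black) on the residualized points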
geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-33-1.svg)<!-- --> ] -- .text-center[ *Wait, wat? I thought `area` was binary?* ] --- # wat Yep, that's not a mistake—our model predicted *how urban* each point was (on a 0 to 1 scale) and the residual is the *unexpected level of urban*—weird! .pull-left[ ``` r ggplot(residualized_data, * aes(x = area_res, * y = crime_rate_res, color = area))+ geom_point() + geom_smooth(method = "lm", color = "black") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-34-1.svg)<!-- --> ] .pull-right[ ``` r lm(crime_rate ~ pop_density + area, data = communities) |> car::avPlot( variable = "areaUrban") ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-35-1.svg)<!-- --> ] --- class: inverse # How It Works ![:width 80%](img/trebuchet.jpg) --- # Generating Data If you don't like doing mathematical proofs to figure out if something works, you can often test things with simulation -- `rnorm()` and similar functions randomly generate data *drawn* from a particular distribution .pull-left[ ``` r sample_size <- 10000 sim_data <- tibble(bd = * rnorm(sample_size, * mean = 3, sd = 1)) sim_data |> summarize(bd_mean = mean(bd), bd_sd = sd(bd)) ``` ``` ## # A tibble: 1 × 2 ## bd_mean bd_sd ## <dbl> <dbl> ## 1 2.98 1.01 ``` ] -- .pull-right[ ``` r ggplot(sim_data, aes(x = bd)) + geom_density() ``` ![](slides_linear-models-i_files/figure-html/unnamed-chunk-37-1.svg)<!-- --> ] --- # Simulation Now we can add other variables to our data to generate a dataset that obeys the DAG on the right .pull-left-60[ .text-85[ ``` r sim_data <- sim_data |> mutate( x = rnorm(n(), 0.3*bd, 1), * y = rnorm(n(), 3 + 0.3*x + 0.3*bd, 1), coll = rnorm(n(), -0.2*x + -0.2*y, 1)) ``` ] ] .pull-right-40[ ![](slides_linear-models-i_files/figure-html/sim-dag-1.svg)<!-- --> ] -- "Y is equal to 3 plus `\(0.3*X\)` plus `\(0.3*BD\)`, with about 1 SD of error on average" -- Note that we generate the variables in causal order of the diagram * We started with `bd` (on prior slide) * We finished with `coll` --- # Estimates with Back Doors Because we know the **data generating process**—we created it—we know what our estimates should be .pull-left[ ``` r *lm(y ~ x + bd, data = sim_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 3.03 ## 2 x 0.299 ## 3 bd 0.290 ``` ] -- .pull-right[ ``` r *lm(y ~ x , # Leaving out bd data = sim_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 3.83 ## 2 x 0.379 ``` ] Excluding `bd` makes the effect of `x` larger than it should be—it is **magnified** -- This occurs whenever a back door has effects with the *same sign* on both `\(X\)` and `\(Y\)` (e.g., both positive)—opposite signs would **attenuate** the estimate --- # Estimates with Colliders Again, we know what effects should be identified and how to identify them. What if we add the collider? .pull-left[ ``` r *lm(y ~ x + bd, data = sim_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 3.03 ## 2 x 0.299 ## 3 bd 0.290 ``` ] -- .pull-right[ ``` r *lm(y ~ x + bd + coll, data = sim_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 4 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 2.91 ## 2 x 0.250 ## 3 bd 0.282 ## 4 coll -0.184 ``` ] This **attenuates** `\(X\)` but does nothing to `\(BD\)`—it isn't a collider for `\(BD\)`!
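Because we wrote the data generating process ourselves, we can also confirm the attenuation isn't a quirk of one random draw—a sketch that regenerates the same DAG with a fresh (arbitrary) seed and refits:

``` r
set.seed(123)  # arbitrary seed; any value works
sim_check <- tibble(bd = rnorm(10000, mean = 3, sd = 1)) |>
  mutate(x    = rnorm(n(), 0.3*bd, 1),
         y    = rnorm(n(), 3 + 0.3*x + 0.3*bd, 1),
         coll = rnorm(n(), -0.2*x + -0.2*y, 1))
# Conditioning on the collider again pulls the x estimate below the true 0.3
coef(lm(y ~ x + bd + coll, data = sim_check))["x"]
```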
.text-center[ *Let's look at a more severe example* ] --- # Selection (Collider) Bias Let's create the data from CRM yesterday. ``` r coll_data <- tibble( Skill = runif(500, 0, 10), Frequency = runif(500, 0, 10), Arrested = Frequency > Skill) arrested <- coll_data |> filter(Arrested) |> mutate(Who = "Arrested") everyone <- coll_data |> mutate(Who = "Everyone") ``` What's going on here? * *No relationship* between `Skill` at crime and `Frequency` of crime * Just two uniformly distributed variables * Anyone whose `Frequency` of crime exceeds their `Skill` gets arrested * Only these people are in `arrested` -- If we only see arrested people, we've *selected* on a collider—this is just like including one in a model --- # Collider Model .pull-left[ ``` r *lm(Frequency ~ Skill, data = everyone) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.45 ## 2 Skill 0.0845 ``` ] -- .pull-right[ ``` r *lm(Frequency ~ Skill, data = arrested) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.40 ## 2 Skill 0.627 ``` ] -- We know there is no real relationship, but because we **selected** data based on an outcome of both `Frequency` and `Skill`, it creates a **spurious** relationship -- .text-center[ *The collider is still producing bias without being explicitly in the model!* ] --- ## What about controlling? .pull-left[ Controlling for arrest: ``` r *lm(Frequency ~ Skill + Arrested, data = everyone) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -0.382 ## 2 Skill 0.567 ## 3 ArrestedTRUE 4.98 ``` ] .pull-right[ Only arrested people: ``` r *lm(Frequency ~ Skill, data = arrested) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.40 ## 2 Skill 0.627 ``` ] .text-center[ *We get bias either way!* ] --- # Visualizing Collider Bias ``` r bind_rows(arrested, everyone) |> ggplot(aes(x = Skill, y = Frequency, color = Who)) + facet_wrap(~Who) + geom_point() + geom_smooth(method = "lm", color = "black") + labs(x = "Skill at Crime", y = "Frequency of Crime") + theme_minimal(base_size = 16) + theme(legend.position = "none") ``` ![](slides_linear-models-i_files/figure-html/coll-plot-1.svg)<!-- --> --- class: inverse # Randomization --- # Non-Randomized Treatment Let's create example data with actual **potential outcomes** ``` r n <- 10000 po_data <- tibble( bd = runif( n, 0, 1), # Random uniform variable x = rbinom(n, 1, bd), # Random binary variable y0 = rnorm( n, 2*bd, 1), # untreated outcome * y1 = rnorm( n, 2*bd + 1, 1)) |> # treated outcome mutate(y = ifelse(x==1, y1, y0)) # Treatment just selects outcome ``` * `x` is a binary **treatment** with effect size of **1** * `y0` is the outcome if the unit is **untreated** * `y1` is the outcome if the unit is **treated** * `bd` is a backdoor predicting treatment *and* `y` * `bd` improves both potential outcomes *the same* amount * `bd` also makes treatment more likely .text-center[ *Units that get treated tend to have higher potential outcomes* ] --- # Estimation Like usual, if we can *see* the back door, we can identify the effect of `\(X\)` on `\(Y\)` .pull-left[ ``` r lm(y ~ x + bd, data = po_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 0.00277 ## 2 x 1.02 ## 3 bd 1.97 ``` ] .pull-right[ ``` r lm(y ~ x, data = po_data) |> tidy() |> select(term, estimate) ```
``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 0.663 ## 2 x 1.68 ``` ] -- .text-center[ *But even if we can't, we can randomize it away* ] -- `sample()` is an easy way to randomly select units ``` r sample(0:1, 20, replace = TRUE) # randomly select 0 or 1, 20 times ``` ``` ## [1] 1 1 0 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1 0 ``` --- # Random Assignment Let's select random units to receive the treatment ``` r po_data <- po_data |> mutate(treat = sample(0:1, n(), replace=TRUE), yt = ifelse(treat==1, y1, y0)) ``` All the treatment does is determine which potential outcome we see -- .pull-left[ ``` r lm(y ~ x + bd, data = po_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 3 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 0.00277 ## 2 x 1.02 ## 3 bd 1.97 ``` ] .pull-right[ ``` r lm(yt ~ treat, data = po_data) |> tidy() |> select(term, estimate) ``` ``` ## # A tibble: 2 × 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 0.993 ## 2 treat 1.02 ``` ] .text-center[ *Randomization closes back doors without adjustment!* ] --- # Randomization DAGs Randomization breaks back doors—even when we can't observe the responsible variables! .pull-left[ Confounded by an unobservable `\(BD\)` ![](slides_linear-models-i_files/figure-html/bd-dag-1.svg)<!-- --> ] .pull-right[ `\(BD\)` is irrelevant if we can randomize! ![](slides_linear-models-i_files/figure-html/ran-dag-1.svg)<!-- --> ] -- If we're interested in the effect of `\(X\)` here, we have to assume `\(T\)` is roughly equivalent to `\(X\)` (it is **consistent** with `\(X\)`) --- class: inverse # Wrap-Up Reminder: Class on the 13th **rescheduled** to Monday the 24th Assignment: * Posted tomorrow morning: * On website in units (both week 4 and 5) * Read instructions carefully * No trick questions intended * If you're writing a ton of code, you may be overthinking it * Doc types: * Word or equivalent + .R script file for code * Use comments to indicate which question your code applies to * Quarto / Rmarkdown if you're feeling fancy * Due 11:59 PM Friday 15 November * **Do not spend that entire time working on it** * **Do ask questions** No reading until next week