Distributions and Relationships

class: center, top, title-slide

.title[
# Distributions and Relationships
]
.subtitle[
## IQA Lecture 2
]
.author[
### Charles Lanfear
]
.date[
### 23 Oct 2024<br>Updated: 23 Oct 2024
]

---

# Today

* News items:
   * Two assessments
      * *Light* data analysis and interpretation
      * Due Friday 11:59 PM the following week
   * Natalia's office hours: 
      * Mon 12:00-13:00 & Weds 16:00-17:00
      * Room 1.8

* Topics for today:

* Pipes 
   * Subsetting with `{dplyr}`
   * Creating and modifying variables
   * Distributions
      * Tabulations and Cross-Tabulations
      * Summarizing data
      * Correlations

---
class: inverse

# `{tidyverse}`

&nbsp;

![:width 40%](img/tidyverse.svg)

---

# Installing `{tidyverse}`

We're going to practice loading files and manipulating data.

We will use a packages called `{readr}` and `{dplyr}` to do this neatly.

These packages are part of the [Tidyverse](http://tidyverse.org/) family of R packages

* These packages make using R *much easier*

If you have not already installed the tidyverse, type, in the console: `install.packages("tidyverse")`

This will install a *large* number of R packages we will use throughout the term, including `{readr}`, `{ggplot2}`, and `{dplyr}`.

---

# Loading Packages

``` r
library(readr)
library(ggplot2)
library(dplyr)
```

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

---

# Wait, was that an error?

When you load packages in R that have functions sharing the same name as functions you already have, the more recently loaded functions overwrite the previous ones ("masks them").

This **message** is just letting you know that.

Sometimes you may get a **warning message** when loading packages—usually because you aren't running the latest version of R:

```
Warning message:
package `dplyr' was built under R version 4.4.1
```

*Update R* to get rid of these!

---
class: inverse
# Importing and Exporting Data

&nbsp;

![:width 40%](img/readr.svg)

---

# Delimited Text Files

One of the most common ways for data to be stored is in a *delimited* text file, e.g. comma-separated values (**.csv**) or tab-separated values (**.tsv**). Here is **.csv** data:

```
"Id","Offense","Sex","Month"
101,"Battery","Male",1,
101,"Battery","Male",1,
101,"Robbery","Male",1,
101,"Battery","Male",2,
101,"Robbery","Male",2,
101,"Homicide","Male",3,
103,"Robbery","Female",1,
103,"Robbery","Female",3,
103,"Battery","Female",4,
```

Values are *comma-separated* and observations are separated by line breaks

---
# `{readr}`

R has a variety of built-in functions for importing delimited text, like `read.table()` and `read.csv()`.

I recommend using the versions in the `{readr}` package instead: `read_csv()`, `read_tsv()`, and `read_delim()`:

`{readr}` function features:

* Faster!
* A *little* smarter about dates and times
* Handy function `problems()` you can run if there are errors
* Loading bars for large files

---

# `{readr}` Importing Example

Let's use `read_csv()` from `{readr}` to import some community crime data based on those in yesterday's CRM lecture

.small[

``` r
communities <- 
  read_csv(
    "https://clanfear.github.io/ioc_iqa/_data/communities.csv"
    )
```

```
## Rows: 300 Columns: 5
## ── Column specification ────────────────────────────────────────────
## Delimiter: ","
## chr (3): area, disadvantage, incarceration
## dbl (2): pop_density, crime_rate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
]

---
class: inverse

# `{dplyr}`

&nbsp;

![:width 60%](img/dplyr.svg)

---

# Check Out `communities`

`{dplyr}` gives us access to the handy `glimpse()` for inspecting dataframes.

.text-62[

``` r
glimpse(communities)
```

```
## Rows: 300
## Columns: 5
## $ area          <chr> "Urban", "Urban", "Rural", "Rural", "Urban",…
## $ pop_density   <dbl> 18.166008, 21.356727, 10.023975, 14.926138, …
## $ crime_rate    <dbl> 28.63123, 58.60800, 14.19840, 22.29075, 79.7…
## $ disadvantage  <chr> "Low", "Medium", "Low", "Medium", "High", "H…
## $ incarceration <chr> "High", "Medium", "High", "Medium", "High", …
```
]

---

# Pipes!

`{dplyr}` and rest of the Tidyverse are built around using pipe operators (`|>`)

Instead of nesting functions like this:

``` r
proportions(table(communities$disadvantage))
```

```
## 
##      High       Low    Medium 
## 0.3366667 0.3400000 0.3233333
```

We can pipe them like this:

``` r
communities |> pull(disadvantage) |> table() |> proportions()
```

```
## 
##      High       Low    Medium 
## 0.3366667 0.3400000 0.3233333
```

Read this as, "take `communities`, and then pull out the `incarceration` column, and then make a `table()`, and then calculate `proportions()`."

---

# `filter()` Data Frames

``` r
communities |> filter(incarceration == "High") |> head()
```

```
## # A tibble: 6 × 5
##   area  pop_density crime_rate disadvantage incarceration
##   <chr>       <dbl>      <dbl> <chr>        <chr>        
## 1 Urban       18.2       28.6  Low          High         
## 2 Rural       10.0       14.2  Low          High         
## 3 Urban       21.9       79.8  High         High         
## 4 Urban       18.5       31.1  High         High         
## 5 Urban       22.5       28.6  Low          High         
## 6 Rural        8.95       6.53 Medium       High
```

.text-center[
*What is this doing?*
]

`filter()` is a `{dplyr}` function for indexing dataframe **rows**

It takes *only* logical vectors (the result of **expressions**) as an argument

---
# Multiple Conditions

.pull-left[

### And: `&`

``` r
communities |>
  filter(disadvantage == "Low" & 
         crime_rate > 40)
```

![:width 100%](img/disadvantage_and_crime.svg)

]

.pull-right[

### Or: `|`

``` r
communities |>
  filter(disadvantage == "Low" | 
         crime_rate > 40)
```

![:width 100%](img/disadvantage_or_crime.svg)
]

---
# `%in%` Operator

Common use case: Filter rows to things in some set.

We can use `%in%` like `==` but for matching any element in the vector on its right

``` r
communities |>
  filter(disadvantage %in% c("High", "Low")) |>
  tail()
```

```
## # A tibble: 6 × 5
##   area  pop_density crime_rate disadvantage incarceration
##   <chr>       <dbl>      <dbl> <chr>        <chr>        
## 1 Urban       13.2       10.2  Low          Medium       
## 2 Urban       18.5       43.9  Low          Medium       
## 3 Rural       11.0       10.2  Low          Medium       
## 4 Rural        6.59       2.27 Low          Medium       
## 5 Urban       16.3       29.5  High         High         
## 6 Rural       10.7       12.0  Low          Medium
```

Read as: "`filter()` to rows where `disadvantage` is `"High"` or `"Low"`"

---
#Sorting: `arrange()`

Along with filtering the data to see certain rows, we might want to sort it:

``` r
communities |>
  arrange(disadvantage, desc(crime_rate)) |> head()
```

```
## # A tibble: 6 × 5
##   area  pop_density crime_rate disadvantage incarceration
##   <chr>       <dbl>      <dbl> <chr>        <chr>        
## 1 Urban        27.9       98.2 High         High         
## 2 Urban        22.4       98.1 High         High         
## 3 Urban        22.9       88.7 High         High         
## 4 Urban        21.1       85.6 High         High         
## 5 Urban        21.9       79.8 High         High         
## 6 Urban        22.9       77.7 High         High
```

The data are sorted by ascending `disadvantage` and descending `crime_rate`.

---
# Keeping Columns: `select()`

Not only can we subset rows, but we can include specific columns (and put them in the order listed) using `select()`

``` r
communities |> select(area, pop_density, crime_rate) |> head()
```

```
## # A tibble: 6 × 3
##   area  pop_density crime_rate
##   <chr>       <dbl>      <dbl>
## 1 Urban        18.2       28.6
## 2 Urban        21.4       58.6
## 3 Rural        10.0       14.2
## 4 Rural        14.9       22.3
## 5 Urban        21.9       79.8
## 6 Urban        18.5       31.1
```

---
# Dropping Columns: `select()`

We can instead drop only specific columns with select() using - signs:

``` r
communities |> select(-area, -pop_density, -crime_rate) |> head()
```

```
## # A tibble: 6 × 2
##   disadvantage incarceration
##   <chr>        <chr>        
## 1 Low          High         
## 2 Medium       Medium       
## 3 Low          High         
## 4 Medium       Medium       
## 5 High         High         
## 6 High         High
```

---
# Renaming with `select()`

We can rename columns using `select()`, but that drops everything that isn't mentioned:

``` r
communities |> select(Area = area) |> head()
```

```
## # A tibble: 6 × 1
##   Area 
##   <chr>
## 1 Urban
## 2 Urban
## 3 Rural
## 4 Rural
## 5 Urban
## 6 Urban
```

---
# Safer: Rename with `rename()`

`rename()` renames variables using the same syntax as `select()` without dropping unmentioned variables

``` r
communities |> rename(Area = area) |> head()
```

```
## # A tibble: 6 × 5
##   Area  pop_density crime_rate disadvantage incarceration
##   <chr>       <dbl>      <dbl> <chr>        <chr>        
## 1 Urban        18.2       28.6 Low          High         
## 2 Urban        21.4       58.6 Medium       Medium       
## 3 Rural        10.0       14.2 Low          High         
## 4 Rural        14.9       22.3 Medium       Medium       
## 5 Urban        21.9       79.8 High         High         
## 6 Urban        18.5       31.1 High         High
```

---
# Creating Columns

`disadvantage` looks like an ordinal variable (`incarceration` too) but R doesn't know this—it just puts them in alphabetical order

To fix this, we need to know how to create or modify variables in our data

`dplyr` uses the `mutate()` function to create or modify variables:

``` r
communities |>
  mutate(high_crime = crime_rate > mean(crime_rate)) |>
  head(4)
```

```
## # A tibble: 4 × 6
##   area  pop_density crime_rate disadvantage incarceration high_crime
##   <chr>       <dbl>      <dbl> <chr>        <chr>         <lgl>     
## 1 Urban        18.2       28.6 Low          High          TRUE      
## 2 Urban        21.4       58.6 Medium       Medium        TRUE      
## 3 Rural        10.0       14.2 Low          High          FALSE     
## 4 Rural        14.9       22.3 Medium       Medium        FALSE
```

This created a **logical** (`TRUE`/`FALSE`) variable because we used a logical expression

---
# Modifying Columns

In R, we "modify" objects—including columns in our data—by replacing them with new versions

We saw before that our disadvantage and incarceration variables are ordinal but not being recognized that way

We can give them a proper order by making them **factors** and specifying their **levels**

``` r
*communities <- communities |>  # Assigning back to overwrite!
  mutate(disadvantage = 
*          factor(disadvantage, levels = c("Low", "Medium", "High")),
         incarceration = 
           factor(incarceration, levels = c("Low", "Medium", "High"))
         )
```

To modify the original dataset, we just assign back to it—overwriting it with our changes!

---
# Fixed!

``` r
communities |> pull(disadvantage) |> table()
```

```
## 
##    Low Medium   High 
##    102     97    101
```

``` r
communities |> pull(incarceration) |> table()
```

```
## 
##    Low Medium   High 
##     95    106     99
```

---
class: inverse

# Distributions

&nbsp;

### Numbers today (boo!)

&nbsp;

### Pictures next week (fun!)

---

# Tabulations

Let's look at tabulations first. They're useful for summarizing categorical data.

`count()` is a `{dplyr}` function for tabulating one or more columns

``` r
communities |> count(incarceration)
```

```
## # A tibble: 3 × 2
##   incarceration     n
##   <fct>         <int>
## 1 Low              95
## 2 Medium          106
## 3 High             99
```

``` r
communities |> count(incarceration) |> mutate(proportion = n/sum(n))
```

```
## # A tibble: 3 × 3
##   incarceration     n proportion
##   <fct>         <int>      <dbl>
## 1 Low              95      0.317
## 2 Medium          106      0.353
## 3 High             99      0.33
```

---

# `{janitor}`

`{janitor}` is a data cleaning package, but it makes tabulations easier too

``` r
install.packages("janitor")
```

``` r
library(janitor)
```

```
## 
## Attaching package: 'janitor'
```

```
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
```

``` r
communities |> tabyl(incarceration)
```

```
##  incarceration   n   percent
##            Low  95 0.3166667
##         Medium 106 0.3533333
##           High  99 0.3300000
```

.pull-right[
.footnote[
It comes in *really* handy for **cross-tabs**—we'll see them soon!
]
]

---
# `summarize()`

`{dplyr}`'s **`summarize()`** takes your column(s) of data and computes something using *every row*:

* Calculate the mean (`mean()`)
* Calculate the standard deviation (`sd()`)
* Obtain a sample size (`n()`)
--

``` r
communities |>
  summarize(mean_crime_rate = mean(crime_rate),
            sd_crime_rate   = sd(crime_rate), 
            n               = n())
```

```
## # A tibble: 1 × 3
##   mean_crime_rate sd_crime_rate     n
##             <dbl>         <dbl> <int>
## 1            25.2          21.8   300
```

You can use any function in `summarize()` that aggregates *multiple values* into a *single value* (like `sum()`, `median()`, or `max()`).

---
class: inverse

# Associations

&nbsp;

### Which are really about **joint distributions**

---

# Cross-Tabs

Let's look at cross-tabs first. They're used for associations between categorical variables.

This is where `{janitor}`'s `tabyl()` begins to shine:

``` r
communities |> tabyl(disadvantage, incarceration)
```

```
##  disadvantage Low Medium High
##           Low  40     45   17
##        Medium  31     30   36
##          High  24     31   46
```

``` r
communities |> tabyl(disadvantage, incarceration) |>
* adorn_percentages() # converts to cell percentages
```

```
##  disadvantage       Low    Medium      High
##           Low 0.3921569 0.4411765 0.1666667
##        Medium 0.3195876 0.3092784 0.3711340
##          High 0.2376238 0.3069307 0.4554455
```

---

# Fancy Cross-Tabs 1

We can assemble *fancy* tables bit-by-bit with `{janitor}`

``` r
communities |> 
  tabyl(disadvantage, incarceration) # make table
```

```
##  disadvantage Low Medium High
##           Low  40     45   17
##        Medium  31     30   36
##          High  24     31   46
```

---
count: false

# Fancy Cross-Tabs 2

Add row and column totals!

``` r
communities |> 
  tabyl(disadvantage, incarceration) |> # make table
  adorn_totals(c("row", "col")) # add row/col totals
```

```
##  disadvantage Low Medium High Total
##           Low  40     45   17   102
##        Medium  31     30   36    97
##          High  24     31   46   101
##         Total  95    106   99   300
```

---
count: false

# Fancy Cross-Tabs 3

Turn cells into (row) percentages instead of counts

``` r
communities |> 
  tabyl(disadvantage, incarceration) |> # make table
  adorn_totals(c("row", "col")) |> # add row/col totals
  adorn_percentages()# make cells proportions
```

```
##  disadvantage       Low    Medium      High Total
##           Low 0.3921569 0.4411765 0.1666667     1
##        Medium 0.3195876 0.3092784 0.3711340     1
##          High 0.2376238 0.3069307 0.4554455     1
##         Total 0.3166667 0.3533333 0.3300000     1
```

---
count: false

# Fancy Cross-Tabs 4

Round those percentages to two decimal places!

``` r
communities |> 
  tabyl(disadvantage, incarceration) |> # make table
  adorn_totals(c("row", "col")) |> # add row/col totals
  adorn_percentages() |> # make cells percentages
  adorn_pct_formatting(digits = 1) # percents with 1 digit
```

```
##  disadvantage   Low Medium  High  Total
##           Low 39.2%  44.1% 16.7% 100.0%
##        Medium 32.0%  30.9% 37.1% 100.0%
##          High 23.8%  30.7% 45.5% 100.0%
##         Total 31.7%  35.3% 33.0% 100.0%
```

---
count: false

# Fancy Cross-Tabs 5

Add counts back in parentheses!

```
##  disadvantage        Low      Medium       High        Total
##           Low 39.2% (40) 44.1%  (45) 16.7% (17) 100.0% (102)
##        Medium 32.0% (31) 30.9%  (30) 37.1% (36) 100.0%  (97)
##          High 23.8% (24) 30.7%  (31) 45.5% (46) 100.0% (101)
##         Total 31.7% (95) 35.3% (106) 33.0% (99) 100.0% (300)
```

---
count: false

# Fancy Cross-Tabs 6

Add column variable name!

```
##               incarceration                                    
##  disadvantage           Low      Medium       High        Total
##           Low    39.2% (40) 44.1%  (45) 16.7% (17) 100.0% (102)
##        Medium    32.0% (31) 30.9%  (30) 37.1% (36) 100.0%  (97)
##          High    23.8% (24) 30.7%  (31) 45.5% (46) 100.0% (101)
##         Total    31.7% (95) 35.3% (106) 33.0% (99) 100.0% (300)
```

Not bad! That's not far from paper ready!

---

# Grouped Measures

If we wanted to calculated means for different groups, one easy way is to just subset the data

.pull-left[

``` r
communities |>
* filter(disadvantage=="High") |>
  summarize(
   mean_crime = mean(crime_rate),
   sd_crime   = sd(crime_rate), 
   n          = n())
```

```
## # A tibble: 1 × 3
##   mean_crime sd_crime     n
##        <dbl>    <dbl> <int>
## 1       31.9     24.4   101
```
]

.pull-right[

``` r
communities |>
* filter(disadvantage=="Low") |>
  summarize(
   mean_crime = mean(crime_rate),
   sd_crime   = sd(crime_rate), 
   n          = n())
```

```
## # A tibble: 1 × 3
##   mean_crime sd_crime     n
##        <dbl>    <dbl> <int>
## 1       18.7     17.4   102
```
]

Imagine if you had many groups, though! There's a better way.

---
  
# `group_by()`

The special function `group_by()` changes how functions operate on the data, most importantly `summarize()`.

Functions after `group_by()` are computed *within each group* as defined by variables given, rather than over all rows at once.

Excel analogue: pivot tables

.image-50[![Pivot table](http://www.excel-easy.com/data-analysis/images/pivot-tables/two-dimensional-pivot-table.png)]

---
# `group_by()` example

``` r
communities |>
* group_by(disadvantage) |>
  summarize(mean_crime = mean(crime_rate),
            sd_crime   = sd(crime_rate), 
            n          = n())
```

```
## # A tibble: 3 × 4
##   disadvantage mean_crime sd_crime     n
##   <fct>             <dbl>    <dbl> <int>
## 1 Low                18.7     17.4   102
## 2 Medium             25.2     21.2    97
## 3 High               31.9     24.4   101
```

Because we did `group_by()` with `disadvantage` then used `summarize()`, we get *one row per value of `disadvantage`*!

Each value of disadvantage is its own **group**!

---
# `.by = ` example

You can also do grouping within `summarize()` or `mutate()` using `.by = `

``` r
communities |>
  summarize(mean_crime = mean(crime_rate),
            sd_crime   = sd(crime_rate), 
            n          = n(),
*           .by = disadvantage)
```

I recommend using this method when possible as it automatically ungroups afterward

---

# Correlations

Correlations are mathematically complicated but simple to code

``` r
cor(communities$pop_density, communities$crime_rate)
```

```
## [1] 0.8064513
```

Alternatively, you can use the `with()` command for a bit less typing

``` r
with(communities, cor(pop_density, crime_rate))
```

```
## [1] 0.8064513
```

These are a bit easier than `dplyr` if you just want one correlation

``` r
communities |> summarize(R = cor(pop_density, crime_rate))
```

```
## # A tibble: 1 × 1
##       R
##   <dbl>
## 1 0.806
```

---

# Multiple Correlations

If you want to do correlations *within groups*, `{dplyr}` is king again

``` r
communities |>
  group_by(disadvantage) |>
  summarize(R = cor(pop_density, crime_rate))
```

```
## # A tibble: 3 × 2
##   disadvantage     R
##   <fct>        <dbl>
## 1 Low          0.743
## 2 Medium       0.822
## 3 High         0.826
```

In this case, the correlation between `pop_density` and `crime_rate` is similar at all levels of disadvantage

---
# Correlation Matrices

`cor()` produces correlation matrices when given multiple continuous variables

``` r
anscombe |> # Anscombe's quartet from yesterday!
  select(x1, y1, y2, y3, y4) |> 
  cor()
```

```
##            x1         y1         y2         y3         y4
## x1  1.0000000  0.8164205  0.8162365  0.8162867 -0.3140467
## y1  0.8164205  1.0000000  0.7500054  0.4687167 -0.4891162
## y2  0.8162365  0.7500054  1.0000000  0.5879193 -0.4780949
## y3  0.8162867  0.4687167  0.5879193  1.0000000 -0.1554718
## y4 -0.3140467 -0.4891162 -0.4780949 -0.1554718  1.0000000
```

---
class: inverse

# Wrap-Up
  
* Recommended reading for next week

* Kaplan, chapter 10 is review of today's content
      * Chapter 14 (graphing) is what we'll get into next week
   * You're pretty busy, so make it low priority!

* `{swirl}` units if you want practice

* 5 covers missing values
   * 6 covers subsetting

* Next time:

* More on relationships
   * Distributions
   * Maybe a start on inference