class: center, top, title-slide .title[ # Distributions and Relationships ] .subtitle[ ## IQA Lecture 2 ] .author[ ### Charles Lanfear ] .date[ ### 23 Oct 2024
Updated: 23 Oct 2024 ] --- # Today * News items: * Two assessments * *Light* data analysis and interpretation * Due Friday 11:59 PM the following week * Natalia's office hours: * Mon 12:00-13:00 & Weds 16:00-17:00 * Room 1.8 * Topics for today: * Pipes * Subsetting with `{dplyr}` * Creating and modifying variables * Distributions * Tabulations and Cross-Tabulations * Summarizing data * Correlations --- class: inverse # `{tidyverse}` ![:width 40%](img/tidyverse.svg) --- # Installing `{tidyverse}` We're going to practice loading files and manipulating data. -- We will use two packages called `{readr}` and `{dplyr}` to do this neatly. These packages are part of the [Tidyverse](http://tidyverse.org/) family of R packages * These packages make using R *much easier* -- If you have not already installed the tidyverse, type this in the console: `install.packages("tidyverse")` -- This will install a *large* number of R packages we will use throughout the term, including `{readr}`, `{ggplot2}`, and `{dplyr}`. --- # Loading Packages ``` r library(readr) library(ggplot2) library(dplyr) ``` ``` ## ## Attaching package: 'dplyr' ``` ``` ## The following objects are masked from 'package:stats': ## ## filter, lag ``` ``` ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ``` --- # Wait, was that an error? When you load a package that has functions sharing a name with functions you already have, the more recently loaded functions take priority over the previous ones (they "mask" them). The older functions are not deleted. -- This **message** is just letting you know that. -- Sometimes you may get a **warning message** when loading packages, usually because you aren't running the latest version of R: ``` Warning message: package `dplyr' was built under R version 4.4.1 ``` *Update R* to get rid of these! 
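---

# Masking in Practice

The masking message above can be checked directly. This is a minimal sketch, assuming `{dplyr}` is installed; the `toy` data frame is hypothetical and made up purely for illustration.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(x = c(1, 5, 10))

# After library(dplyr), plain filter() refers to dplyr's version
kept <- filter(toy, x > 2)
kept

# The masked function still exists; ask for it by package name
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing `package::function()` is a handy way to be explicit whenever two packages share a function name.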
--- class: inverse # Importing and Exporting Data ![:width 40%](img/readr.svg) --- # Delimited Text Files One of the most common ways for data to be stored is in a *delimited* text file, e.g. comma-separated values (**.csv**) or tab-separated values (**.tsv**). Here is **.csv** data: ``` "Id","Offense","Sex","Month" 101,"Battery","Male",1 101,"Battery","Male",1 101,"Robbery","Male",1 101,"Battery","Male",2 101,"Robbery","Male",2 101,"Homicide","Male",3 103,"Robbery","Female",1 103,"Robbery","Female",3 103,"Battery","Female",4 ``` Values are *comma-separated* and observations are separated by line breaks --- # `{readr}` R has a variety of built-in functions for importing delimited text, like `read.table()` and `read.csv()`. I recommend using the versions in the `{readr}` package instead: `read_csv()`, `read_tsv()`, and `read_delim()`. `{readr}` function features: * Faster! * A *little* smarter about dates and times * Handy function `problems()` you can run if there are errors * Loading bars for large files --- # `{readr}` Importing Example Let's use `read_csv()` from `{readr}` to import some community crime data based on the data in yesterday's CRM lecture .small[ ``` r communities <- read_csv( "https://clanfear.github.io/ioc_iqa/_data/communities.csv" ) ``` ``` ## Rows: 300 Columns: 5 ## ── Column specification ──────────────────────────────────────────── ## Delimiter: "," ## chr (3): area, disadvantage, incarceration ## dbl (2): pop_density, crime_rate ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``` ] --- class: inverse # `{dplyr}` ![:width 60%](img/dplyr.svg) --- # Check Out `communities` `{dplyr}` gives us access to the handy `glimpse()` function for inspecting dataframes. 
.text-62[ ``` r glimpse(communities) ``` ``` ## Rows: 300 ## Columns: 5 ## $ area <chr> "Urban", "Urban", "Rural", "Rural", "Urban",… ## $ pop_density <dbl> 18.166008, 21.356727, 10.023975, 14.926138, … ## $ crime_rate <dbl> 28.63123, 58.60800, 14.19840, 22.29075, 79.7… ## $ disadvantage <chr> "Low", "Medium", "Low", "Medium", "High", "H… ## $ incarceration <chr> "High", "Medium", "High", "Medium", "High", … ``` ] --- # Pipes! `{dplyr}` and the rest of the Tidyverse are built around using pipe operators (`|>`) Instead of nesting functions like this: ``` r proportions(table(communities$disadvantage)) ``` ``` ## ## High Low Medium ## 0.3366667 0.3400000 0.3233333 ``` -- We can pipe them like this: ``` r communities |> pull(disadvantage) |> table() |> proportions() ``` ``` ## ## High Low Medium ## 0.3366667 0.3400000 0.3233333 ``` -- Read this as, "take `communities`, and then pull out the `disadvantage` column, and then make a `table()`, and then calculate `proportions()`." --- # `filter()` Data Frames ``` r communities |> filter(incarceration == "High") |> head() ``` ``` ## # A tibble: 6 × 5 ## area pop_density crime_rate disadvantage incarceration ## <chr> <dbl> <dbl> <chr> <chr> ## 1 Urban 18.2 28.6 Low High ## 2 Rural 10.0 14.2 Low High ## 3 Urban 21.9 79.8 High High ## 4 Urban 18.5 31.1 High High ## 5 Urban 22.5 28.6 Low High ## 6 Rural 8.95 6.53 Medium High ``` .text-center[ *What is this doing?* ] -- `filter()` is a `{dplyr}` function for indexing dataframe **rows** -- It takes *only* logical vectors (the result of logical **expressions**) as arguments --- # Multiple Conditions .pull-left[ ### And: `&` ``` r communities |> filter(disadvantage == "Low" & crime_rate > 40) ``` ![:width 100%](img/disadvantage_and_crime.svg) ] -- .pull-right[ ### Or: `|` ``` r communities |> filter(disadvantage == "Low" | crime_rate > 40) ``` ![:width 100%](img/disadvantage_or_crime.svg) ] --- # `%in%` Operator Common use case: Filter rows to things in some set. 
We can use `%in%` like `==`, but it matches any element of the vector on its right ``` r communities |> filter(disadvantage %in% c("High", "Low")) |> tail() ``` ``` ## # A tibble: 6 × 5 ## area pop_density crime_rate disadvantage incarceration ## <chr> <dbl> <dbl> <chr> <chr> ## 1 Urban 13.2 10.2 Low Medium ## 2 Urban 18.5 43.9 Low Medium ## 3 Rural 11.0 10.2 Low Medium ## 4 Rural 6.59 2.27 Low Medium ## 5 Urban 16.3 29.5 High High ## 6 Rural 10.7 12.0 Low Medium ``` Read as: "`filter()` to rows where `disadvantage` is `"High"` or `"Low"`" --- # Sorting: `arrange()` Along with filtering the data to see certain rows, we might want to sort them: ``` r communities |> arrange(disadvantage, desc(crime_rate)) |> head() ``` ``` ## # A tibble: 6 × 5 ## area pop_density crime_rate disadvantage incarceration ## <chr> <dbl> <dbl> <chr> <chr> ## 1 Urban 27.9 98.2 High High ## 2 Urban 22.4 98.1 High High ## 3 Urban 22.9 88.7 High High ## 4 Urban 21.1 85.6 High High ## 5 Urban 21.9 79.8 High High ## 6 Urban 22.9 77.7 High High ``` The data are sorted by ascending `disadvantage` (alphabetical for now, so `"High"` comes first), then by descending `crime_rate` within each level. 
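---

# Filtering and Sorting Together

The filtering and sorting steps above can be chained in one pipeline. This is a small sketch on a hypothetical `toy` tibble (invented values, not the `communities` data), assuming `{dplyr}` is installed.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- tibble(
  disadvantage = c("Low", "High", "Medium", "High", "Low"),
  crime_rate   = c(12, 80, 35, 45, 50)
)

# A filter() condition is just a logical vector, one value per row
keep <- toy$disadvantage %in% c("High", "Low")
keep

# Keep High/Low rows, then sort alphabetically by disadvantage
# and from highest to lowest crime_rate within each level
sorted <- toy |>
  filter(disadvantage %in% c("High", "Low")) |>
  arrange(disadvantage, desc(crime_rate))
sorted
```

Printing `keep` on its own is a quick way to check what a condition does before handing it to `filter()`.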
--- # Keeping Columns: `select()` Not only can we subset rows, but we can include specific columns (and put them in the order listed) using `select()` ``` r communities |> select(area, pop_density, crime_rate) |> head() ``` ``` ## # A tibble: 6 × 3 ## area pop_density crime_rate ## <chr> <dbl> <dbl> ## 1 Urban 18.2 28.6 ## 2 Urban 21.4 58.6 ## 3 Rural 10.0 14.2 ## 4 Rural 14.9 22.3 ## 5 Urban 21.9 79.8 ## 6 Urban 18.5 31.1 ``` --- # Dropping Columns: `select()` We can instead drop only specific columns with select() using - signs: ``` r communities |> select(-area, -pop_density, -crime_rate) |> head() ``` ``` ## # A tibble: 6 × 2 ## disadvantage incarceration ## <chr> <chr> ## 1 Low High ## 2 Medium Medium ## 3 Low High ## 4 Medium Medium ## 5 High High ## 6 High High ``` --- # Renaming with `select()` We can rename columns using `select()`, but that drops everything that isn't mentioned: ``` r communities |> select(Area = area) |> head() ``` ``` ## # A tibble: 6 × 1 ## Area ## <chr> ## 1 Urban ## 2 Urban ## 3 Rural ## 4 Rural ## 5 Urban ## 6 Urban ``` --- # Safer: Rename with `rename()` `rename()` renames variables using the same syntax as `select()` without dropping unmentioned variables ``` r communities |> rename(Area = area) |> head() ``` ``` ## # A tibble: 6 × 5 ## Area pop_density crime_rate disadvantage incarceration ## <chr> <dbl> <dbl> <chr> <chr> ## 1 Urban 18.2 28.6 Low High ## 2 Urban 21.4 58.6 Medium Medium ## 3 Rural 10.0 14.2 Low High ## 4 Rural 14.9 22.3 Medium Medium ## 5 Urban 21.9 79.8 High High ## 6 Urban 18.5 31.1 High High ``` --- # Creating Columns `disadvantage` looks like an ordinal variable (`incarceration` too) but R doesn't know this—it just puts them in alphabetical order -- To fix this, we need to know how to create or modify variables in our data -- `dplyr` uses the `mutate()` function to create or modify variables: ``` r communities |> mutate(high_crime = crime_rate > mean(crime_rate)) |> head(4) ``` ``` ## # A tibble: 4 × 6 ## area 
pop_density crime_rate disadvantage incarceration high_crime ## <chr> <dbl> <dbl> <chr> <chr> <lgl> ## 1 Urban 18.2 28.6 Low High TRUE ## 2 Urban 21.4 58.6 Medium Medium TRUE ## 3 Rural 10.0 14.2 Low High FALSE ## 4 Rural 14.9 22.3 Medium Medium FALSE ``` This created a **logical** (`TRUE`/`FALSE`) variable because we used a logical expression --- # Modifying Columns In R, we "modify" objects—including columns in our data—by replacing them with new versions -- We saw before that our disadvantage and incarceration variables are ordinal but not being recognized that way -- We can give them a proper order by making them **factors** and specifying their **levels** ``` r *communities <- communities |> # Assigning back to overwrite! mutate(disadvantage = * factor(disadvantage, levels = c("Low", "Medium", "High")), incarceration = factor(incarceration, levels = c("Low", "Medium", "High")) ) ``` To modify the original dataset, we just assign back to it—overwriting it with our changes! --- # Fixed! ``` r communities |> pull(disadvantage) |> table() ``` ``` ## ## Low Medium High ## 102 97 101 ``` ``` r communities |> pull(incarceration) |> table() ``` ``` ## ## Low Medium High ## 95 106 99 ``` --- class: inverse # Distributions ### Numbers today (boo!) ### Pictures next week (fun!) --- # Tabulations Let's look at tabulations first. They're useful for summarizing categorical data. 
-- `count()` is a `{dplyr}` function for tabulating one or more columns ``` r communities |> count(incarceration) ``` ``` ## # A tibble: 3 × 2 ## incarceration n ## <fct> <int> ## 1 Low 95 ## 2 Medium 106 ## 3 High 99 ``` -- ``` r communities |> count(incarceration) |> mutate(proportion = n/sum(n)) ``` ``` ## # A tibble: 3 × 3 ## incarceration n proportion ## <fct> <int> <dbl> ## 1 Low 95 0.317 ## 2 Medium 106 0.353 ## 3 High 99 0.33 ``` --- # `{janitor}` `{janitor}` is a data cleaning package, but it makes tabulations easier too ``` r install.packages("janitor") ``` ``` r library(janitor) ``` ``` ## ## Attaching package: 'janitor' ``` ``` ## The following objects are masked from 'package:stats': ## ## chisq.test, fisher.test ``` ``` r communities |> tabyl(incarceration) ``` ``` ## incarceration n percent ## Low 95 0.3166667 ## Medium 106 0.3533333 ## High 99 0.3300000 ``` -- .pull-right[ .footnote[ It comes in *really* handy for **cross-tabs**—we'll see them soon! ] ] --- # `summarize()` `{dplyr}`'s **`summarize()`** takes your column(s) of data and computes something using *every row*: * Calculate the mean (`mean()`) * Calculate the standard deviation (`sd()`) * Obtain a sample size (`n()`) -- ``` r communities |> summarize(mean_crime_rate = mean(crime_rate), sd_crime_rate = sd(crime_rate), n = n()) ``` ``` ## # A tibble: 1 × 3 ## mean_crime_rate sd_crime_rate n ## <dbl> <dbl> <int> ## 1 25.2 21.8 300 ``` -- You can use any function in `summarize()` that aggregates *multiple values* into a *single value* (like `sum()`, `median()`, or `max()`). --- class: inverse # Associations ### Which are really about **joint distributions** --- # Cross-Tabs Let's look at cross-tabs first. They're used for associations between categorical variables. 
-- This is where `{janitor}`'s `tabyl()` begins to shine: -- ``` r communities |> tabyl(disadvantage, incarceration) ``` ``` ## disadvantage Low Medium High ## Low 40 45 17 ## Medium 31 30 36 ## High 24 31 46 ``` -- ``` r communities |> tabyl(disadvantage, incarceration) |> * adorn_percentages() # converts counts to row proportions ``` ``` ## disadvantage Low Medium High ## Low 0.3921569 0.4411765 0.1666667 ## Medium 0.3195876 0.3092784 0.3711340 ## High 0.2376238 0.3069307 0.4554455 ``` --- # Fancy Cross-Tabs 1 We can assemble *fancy* tables bit-by-bit with `{janitor}` ``` r communities |> tabyl(disadvantage, incarceration) # make table ``` ``` ## disadvantage Low Medium High ## Low 40 45 17 ## Medium 31 30 36 ## High 24 31 46 ``` --- count: false # Fancy Cross-Tabs 2 Add row and column totals! ``` r communities |> tabyl(disadvantage, incarceration) |> # make table adorn_totals(c("row", "col")) # add row/col totals ``` ``` ## disadvantage Low Medium High Total ## Low 40 45 17 102 ## Medium 31 30 36 97 ## High 24 31 46 101 ## Total 95 106 99 300 ``` --- count: false # Fancy Cross-Tabs 3 Turn cells into (row) proportions instead of counts ``` r communities |> tabyl(disadvantage, incarceration) |> # make table adorn_totals(c("row", "col")) |> # add row/col totals adorn_percentages() # make cells row proportions ``` ``` ## disadvantage Low Medium High Total ## Low 0.3921569 0.4411765 0.1666667 1 ## Medium 0.3195876 0.3092784 0.3711340 1 ## High 0.2376238 0.3069307 0.4554455 1 ## Total 0.3166667 0.3533333 0.3300000 1 ``` --- count: false # Fancy Cross-Tabs 4 Format those proportions as percentages with one decimal place! 
``` r communities |> tabyl(disadvantage, incarceration) |> # make table adorn_totals(c("row", "col")) |> # add row/col totals adorn_percentages() |> # make cells row proportions adorn_pct_formatting(digits = 1) # percents with 1 digit ``` ``` ## disadvantage Low Medium High Total ## Low 39.2% 44.1% 16.7% 100.0% ## Medium 32.0% 30.9% 37.1% 100.0% ## High 23.8% 30.7% 45.5% 100.0% ## Total 31.7% 35.3% 33.0% 100.0% ``` --- count: false # Fancy Cross-Tabs 5 Add counts back in parentheses! ``` r communities |> tabyl(disadvantage, incarceration) |> # make table adorn_totals(c("row", "col")) |> # add row/col totals adorn_percentages() |> # make cells row proportions adorn_pct_formatting(digits = 1) |> # percents with 1 digit adorn_ns() # add counts in parentheses ``` ``` ## disadvantage Low Medium High Total ## Low 39.2% (40) 44.1% (45) 16.7% (17) 100.0% (102) ## Medium 32.0% (31) 30.9% (30) 37.1% (36) 100.0% (97) ## High 23.8% (24) 30.7% (31) 45.5% (46) 100.0% (101) ## Total 31.7% (95) 35.3% (106) 33.0% (99) 100.0% (300) ``` --- count: false # Fancy Cross-Tabs 6 Add column variable name! ``` r communities |> tabyl(disadvantage, incarceration) |> # make table adorn_totals(c("row", "col")) |> # add row/col totals adorn_percentages() |> # make cells row proportions adorn_pct_formatting(digits = 1) |> # percents with 1 digit adorn_ns() |> # add counts in parentheses adorn_title() # add col variable name ``` ``` ## incarceration ## disadvantage Low Medium High Total ## Low 39.2% (40) 44.1% (45) 16.7% (17) 100.0% (102) ## Medium 32.0% (31) 30.9% (30) 37.1% (36) 100.0% (97) ## High 23.8% (24) 30.7% (31) 45.5% (46) 100.0% (101) ## Total 31.7% (95) 35.3% (106) 33.0% (99) 100.0% (300) ``` Not bad! That's not far from paper-ready! 
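---

# Row Percentages by Hand

To see what the row percentages in the fancy cross-tab mean, here is a base R sketch that recomputes them from the counts shown on the earlier slides. The `counts` matrix is typed in by hand for illustration; `proportions()` needs a reasonably recent version of R (4.0.1 or later).

```r
# Counts copied by hand from the disadvantage x incarceration cross-tab
counts <- matrix(
  c(40, 45, 17,
    31, 30, 36,
    24, 31, 46),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    disadvantage  = c("Low", "Medium", "High"),
    incarceration = c("Low", "Medium", "High")
  )
)

# Row proportions: each cell divided by its row total (margin = 1)
row_props <- proportions(counts, margin = 1)
round(100 * row_props, 1)

# Column proportions instead: each cell divided by its column total
col_props <- proportions(counts, margin = 2)
round(100 * col_props, 1)
```

Each row of `row_props` sums to 1, which is why the `Total` column reads 100.0%. In `{janitor}`, `adorn_percentages("col")` switches the denominator to column totals.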
--- # Grouped Measures If we wanted to calculate means for different groups, one easy way is to just subset the data -- .pull-left[ ``` r communities |> * filter(disadvantage == "High") |> summarize( mean_crime = mean(crime_rate), sd_crime = sd(crime_rate), n = n()) ``` ``` ## # A tibble: 1 × 3 ## mean_crime sd_crime n ## <dbl> <dbl> <int> ## 1 31.9 24.4 101 ``` ] .pull-right[ ``` r communities |> * filter(disadvantage == "Low") |> summarize( mean_crime = mean(crime_rate), sd_crime = sd(crime_rate), n = n()) ``` ``` ## # A tibble: 1 × 3 ## mean_crime sd_crime n ## <dbl> <dbl> <int> ## 1 18.7 17.4 102 ``` ] -- Imagine if you had many groups, though! There's a better way. --- # `group_by()` The special function `group_by()` changes how functions operate on the data, most importantly `summarize()`. Functions after `group_by()` are computed *within each group* as defined by the variables given, rather than over all rows at once. Excel analogue: pivot tables .image-50[![Pivot table](http://www.excel-easy.com/data-analysis/images/pivot-tables/two-dimensional-pivot-table.png)] --- # `group_by()` example ``` r communities |> * group_by(disadvantage) |> summarize(mean_crime = mean(crime_rate), sd_crime = sd(crime_rate), n = n()) ``` ``` ## # A tibble: 3 × 4 ## disadvantage mean_crime sd_crime n ## <fct> <dbl> <dbl> <int> ## 1 Low 18.7 17.4 102 ## 2 Medium 25.2 21.2 97 ## 3 High 31.9 24.4 101 ``` Because we grouped by `disadvantage` with `group_by()` and then used `summarize()`, we get *one row per value of `disadvantage`*! Each value of `disadvantage` is its own **group**! 
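---

# `group_by()` with `mutate()`

Grouping changes `mutate()` as well as `summarize()`. This sketch uses a hypothetical `toy` tibble (invented values) and assumes `{dplyr}` is installed. It computes each row's deviation from its own group mean, then removes the grouping with `ungroup()`.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- tibble(
  disadvantage = c("Low", "Low", "High", "High"),
  crime_rate   = c(10, 20, 40, 60)
)

centered <- toy |>
  group_by(disadvantage) |>                        # define the groups
  mutate(group_mean = mean(crime_rate),            # mean within each group
         deviation  = crime_rate - group_mean) |>  # distance from group mean
  ungroup()                                        # drop grouping when done
centered
```

Unlike `summarize()`, `mutate()` keeps every row, so the group mean is repeated for each row in the group. Without `ungroup()`, later steps in the pipeline would still be computed within groups.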
--- # `.by = ` example You can also do grouping within `summarize()` or `mutate()` using `.by = ` ``` r communities |> summarize(mean_crime = mean(crime_rate), sd_crime = sd(crime_rate), n = n(), * .by = disadvantage) ``` ``` ## # A tibble: 3 × 4 ## disadvantage mean_crime sd_crime n ## <fct> <dbl> <dbl> <int> ## 1 Low 18.7 17.4 102 ## 2 Medium 25.2 21.2 97 ## 3 High 31.9 24.4 101 ``` I recommend using this method when possible as it automatically ungroups afterward --- # Correlations Correlations are mathematically complicated but simple to code ``` r cor(communities$pop_density, communities$crime_rate) ``` ``` ## [1] 0.8064513 ``` -- Alternatively, you can use the `with()` command for a bit less typing ``` r with(communities, cor(pop_density, crime_rate)) ``` ``` ## [1] 0.8064513 ``` -- These are a bit easier than `dplyr` if you just want one correlation ``` r communities |> summarize(R = cor(pop_density, crime_rate)) ``` ``` ## # A tibble: 1 × 1 ## R ## <dbl> ## 1 0.806 ``` --- # Multiple Correlations If you want to do correlations *within groups*, `{dplyr}` is king again ``` r communities |> group_by(disadvantage) |> summarize(R = cor(pop_density, crime_rate)) ``` ``` ## # A tibble: 3 × 2 ## disadvantage R ## <fct> <dbl> ## 1 Low 0.743 ## 2 Medium 0.822 ## 3 High 0.826 ``` In this case, the correlation between `pop_density` and `crime_rate` is similar at all levels of disadvantage --- # Correlation Matrices `cor()` produces correlation matrices when given multiple continuous variables ``` r anscombe |> # Anscombe's quartet from yesterday! 
select(x1, y1, y2, y3, y4) |> cor() ``` ``` ## x1 y1 y2 y3 y4 ## x1 1.0000000 0.8164205 0.8162365 0.8162867 -0.3140467 ## y1 0.8164205 1.0000000 0.7500054 0.4687167 -0.4891162 ## y2 0.8162365 0.7500054 1.0000000 0.5879193 -0.4780949 ## y3 0.8162867 0.4687167 0.5879193 1.0000000 -0.1554718 ## y4 -0.3140467 -0.4891162 -0.4780949 -0.1554718 1.0000000 ``` --- class: inverse # Wrap-Up * Recommended reading for next week * Kaplan, chapter 10 is review of today's content * Chapter 14 (graphing) is what we'll get into next week * You're pretty busy, so make it low priority! * `{swirl}` units if you want practice * 5 covers missing values * 6 covers subsetting * Next time: * More on relationships * Distributions * Maybe a start on inference