R Exposure 2

class: center, top, title-slide

.title[
# R Exposure 2
]
.subtitle[
## Data Visualization and Management
]
.author[
### Charles Lanfear
]
.date[
### Aug 22, 2023 Updated: Aug 21, 2023
]

---

# Overview

1. Visualizing Data

2. Summarizing Data

3. Tidying Data

4. Joining Data

5. Resources for Further Learning

---
# Setup

To follow along, you can either:

1. [Download and work from the R script](https://clanfear.github.io/r_exposure_workshop/lectures/r2/r_exposure_2_intermediate.R)
   * *Recommended for today*

2. [Download the entire series as a Zip](https://github.com/clanfear/r_exposure_workshop/zipball/master)<sup>1</sup>
   * Unzip to a folder
   * Open the RStudio project file

.footnote[[1] If you downloaded it before today, it may be a bit out of date!]

---
class: inverse

# `ggplot2`

.center[
<img src="img/ggplot2_logo.png" style="width: 40%;"/>
]

---
# Setup

To give us something to visualize, we'll load the `gapminder` data from the last unit. We'll also load `dplyr`, which gives us tools to manipulate the data for visualization and, later, to summarize, tidy, and join data.

```r
library(gapminder)
library(dplyr)
China <- gapminder |>
 filter(country == "China")
head(China, 4)
```

```
## # A tibble: 4 × 6
##   country continent  year lifeExp       pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>     <int>     <dbl>
## 1 China   Asia       1952    44   556263527      400.
## 2 China   Asia       1957    50.5 637408000      576.
## 3 China   Asia       1962    44.5 665770000      488.
## 4 China   Asia       1967    58.4 754550000      613.
```

---

## Base R Plots

.pull-left[
 .small[

```r
plot(lifeExp ~ year, 
     data = China, 
     xlab = "Year", 
     ylab = "Life expectancy",
     main = "Life expectancy in China", 
     col = "red", 
     cex.lab = 1.5,
     cex.main = 1.5,
     pch = 16)
```
 ]
]

.pull-right[
![](r_exposure_2_intermediate_files/figure-html/unnamed-chunk-77-1.png)
]

---
# `ggplot2`

Many people, myself included,<sup>1</sup> prefer an alternative way of plotting: the `ggplot2` package, which is part of the `tidyverse`.

.footnote[[1] [Though this is not without debate](http://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/)]

```r
library(ggplot2)
```

The core idea underlying this package is the [**layered grammar of graphics**](https://doi.org/10.1198/jcgs.2009.07098): we can break up elements of a plot into pieces and combine them.
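As a rough sketch of that idea, the pieces of a plot can be stored as ordinary R objects and then combined with `+`. The object names `base`, `points_layer`, and `china_plot` are made up for illustration; the data are the `China` subset created on the setup slide.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Same subset as on the setup slide
China <- gapminder |>
  filter(country == "China")

# Piece 1: the data and the x/y mappings (on its own, this draws an empty panel)
base <- ggplot(data = China, aes(x = year, y = lifeExp))

# Piece 2: a layer of points, not yet attached to any plot
points_layer <- geom_point()

# Combine the pieces, then add axis labels as further pieces
china_plot <- base + points_layer +
  xlab("Year") + ylab("Life expectancy")
china_plot
```

Because each piece is a separate object, the same `base` could be reused with a different layer (for example `geom_line()`) without repeating the data and mappings.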

---
## Chinese Life Expectancy in `ggplot`

.pull-left[
 .small[

```r
ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
    geom_point()
```
]
]

.pull-right[
![](r_exposure_2_intermediate_files/figure-html/unnamed-chunk-80-1.svg)
]

---
# Structure of a ggplot

`ggplot2` graphics objects consist of two primary components:

1. **Layers**, the components of a graph.
   * We *add* layers to a `ggplot2` object using `+`.
   * This includes lines, shapes, and text.

2. **Aesthetics**, which determine how the layers appear.
   * We *set* aesthetics using *arguments* (e.g. `color = "red"`) inside layer functions.
   * This includes locations, colors, and sizes.
   * Aesthetics also determine how data *map* to appearances.

---

# Layers

**Layers** are the components of the graph, such as:

* `ggplot()`: initializes `ggplot2` object, specifies input data
* `geom_point()`: layer of scatterplot points
* `geom_line()`: layer of lines
* `ggtitle()`, `xlab()`, `ylab()`: layers of labels
* `facet_wrap()`: layer that splits the plot into separate panels, one per level of some factor, wrapping the panels onto new rows as needed
* `facet_grid()`: same idea, but can split by two variables along rows and columns (e.g. `facet_grid(gender ~ age_group)`)
* `theme_bw()`: replace default gray background with black-and-white

Layers are separated by a `+` sign. For clarity, I usually put each layer on a new line, unless it takes few or no arguments (e.g. `xlab()`, `ylab()`, `theme_bw()`).
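A short sketch of several of these layers stacked on one plot, with each layer on its own line apart from the short label layers. The choice of three countries and the names `three_countries` and `facet_plot` are assumptions for illustration only.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Illustrative subset: three countries from the gapminder data
three_countries <- gapminder |>
  filter(country %in% c("China", "India", "Japan"))

facet_plot <- ggplot(data = three_countries, aes(x = year, y = lifeExp)) +
  geom_line() +             # a line through each country's values
  geom_point() +            # a point for each country-year
  facet_wrap(~ country) +   # one panel per country
  ggtitle("Life expectancy over time") +
  xlab("Year") + ylab("Life expectancy") +
  theme_bw()                # black-and-white theme instead of the gray default
facet_plot
```

Removing any single line (and its `+`) still leaves a valid plot, which makes it easy to see what each layer contributes.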

---

# Aesthetics

**Aesthetics** control the appearance of the layers:

* `x`, `y`: $x$ and $y$ coordinate values to use
* `color`: set color of elements based on some data value
* `group`: describe which points are conceptually grouped together for the plot (often used with lines)
* `size`: set size of points/lines based on some data value
* `alpha`: set transparency based on some data value
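For instance, a sketch using the full `gapminder` data shows `group` and `color` working together: `group` tells `geom_line()` which observations belong on the same line, and `color` ties each line's color to a variable. The name `country_lines` is hypothetical.

```r
library(gapminder)
library(ggplot2)

country_lines <- ggplot(data = gapminder,
                        aes(x = year, y = lifeExp,
                            group = country,      # one line per country
                            color = continent)) + # line color depends on continent
  geom_line(alpha = 0.5)  # the same constant transparency for every line
country_lines
```

Without `group = country`, `geom_line()` would connect all countries within a continent into a single zig-zagging line.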

---

## Aesthetics: Setting vs. mapping

Layers take arguments to control their appearance, such as point/line colors or transparency (`alpha` between 0 and 1).

* Arguments like `color`, `size`, `linetype`, `shape`, `fill`, and `alpha` can be used directly on the layers (**setting aesthetics**), e.g. `geom_point(color = "red")`. See the [`ggplot2` documentation](http://docs.ggplot2.org/current/vignettes/ggplot2-specs.html) for options. These *don't depend on the data*.

* Arguments inside `aes()` (**mapping aesthetics**) will *depend on the data*, e.g. `geom_point(aes(color = continent))`.

* `aes()` in the `ggplot()` layer sets default aesthetics that the other layers inherit; individual layers can override them (including switching `x` or `y` to different variables).

This may seem pedantic, but precise language makes searching for help easier.
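A minimal sketch of the setting versus mapping distinction, using the full `gapminder` data. The names `base`, `set_plot`, and `mapped_plot` are made up for illustration, and `layer_data()` is used here only to peek at the values `ggplot2` computed for each layer.

```r
library(gapminder)
library(ggplot2)

base <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

# Setting: the color is a constant and does not depend on the data
set_plot <- base + geom_point(color = "red")

# Mapping: the color depends on each row's continent, and a legend is drawn
mapped_plot <- base + geom_point(aes(color = continent))

# Inspect the colors actually assigned to the points in each plot
unique(layer_data(set_plot, 1)$colour)             # a single constant color
length(unique(layer_data(mapped_plot, 1)$colour))  # one color per continent
```

A common slip is writing `aes(color = "red")`, which *maps* color to a constant string rather than setting it, so the points are drawn in a default palette color with a legend labeled "red".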

Now let's see all this jargon in action.

---
## Axis Labels, Points, No Background

### 1: Base Plot

.pull-left[
 .small[

```r
ggplot(data = China,
       aes(x = year, y = lifeExp))
```
]
]
.pull-right[
![](r_exposure_2_intermediate_files/figure-html/unnamed-chunk-82-1.svg)
]

.footnote[Initialize the plot with `ggplot()` and `x` and `y` aesthetics **mapped** to variables.]

---