
class: center, top, title-slide

.title[
# R Exposure 1
]
.subtitle[
## RStudio and Basic R
]
.author[
### Charles Lanfear
]
.date[
### Jul 12, 2023 Updated: Jul 11, 2023
]

---

# Overview

1. R and RStudio Orientation

2. Packages

3. Creating and Using Objects

4. Dataframes and Indexing

5. Basic Analyses

6. Resources for Further Learning

---
class: inverse

# R and RStudio
## A quick orientation

---
# Why R?

R is a programming language built for statistical computing.

If one already knows Excel or Stata, why use R?

* R is *free*, so you don't need a terminal server or license.

* R has a *very* large community for support and packages.

* R can handle virtually any data format.

* R makes replication *easy*.

* R is a *language* so it can do *everything*.<sup>1</sup>

.footnote[[1] Including generating these slides (using RMarkdown)!]
--

* R is similar to other programming languages.

---

# RStudio

RStudio is a "front-end" or integrated development environment (IDE) for R that can make your life *easier*.

RStudio can:
--

* Organize your code, output, and plots.

* Auto-complete code and highlight syntax.

* Help view data and objects.

* [Enable easy integration of R code into documents.](https://rmarkdown.rstudio.com/)

---
# Getting Started

Open up RStudio now and choose *File > New File > R Script*.

Then, let's get oriented with the interface:

* *Top Left*: Code **editor** pane, data viewer (browse with tabs)

* *Bottom Left*: **Console** for running code (`>` prompt)

* *Top Right*: List of objects in **environment**, code **history** tab.

* *Bottom Right*: Tabs for browsing files, viewing plots, managing packages, and viewing help files.

You can change the layout in *Preferences > Pane Layout*

---

# Editing and Running Code

There are several ways to run R code in RStudio:
--

* Highlight lines in the **editor** window and click *Run* at the top or hit `Ctrl+Enter` or `⌘+Enter` to run them all.

* With your **caret** on a line you want to run, hit `Ctrl+Enter` or `⌘+Enter`. Note your caret moves to the next line, so you can run code sequentially with repeated presses.

* Type individual lines in the **console** and press `Enter`.

The console will show the lines you ran followed by any printed output.

---

# Incomplete Code

If you mess up (e.g. leave off a parenthesis), R might show a `+` sign prompting you to finish the command:

```r
> (11-2
+
```

Finish the command or hit `Esc` to get out of this.

---

# R as a Calculator

In the **console**, type `123 + 456 + 789` and hit `Enter`.
--

```r
123 + 456 + 789
```

```
## [1] 1368
```

The `[1]` in the output indicates the numeric **index** of the first element on that line.

Now in your blank R document in the **editor**, try typing the line `sqrt(400)` and either
clicking *Run* or hitting `Ctrl+Enter` or `⌘+Enter`.

```r
sqrt(400)
```

```
## [1] 20
```

---

# Functions and Help

`sqrt()` is an example of a **function** in R.

If we didn't have a good guess as to what `sqrt()` will do, we can type `?sqrt` in the console
and look at the **Help** panel on the right.

```r
?sqrt
```

**Arguments** are the *inputs* to a function. In this case, the only argument to `sqrt()`
is `x` which can be a number or a vector of numbers.

Help files provide documentation on how to use functions and what functions produce.
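
As a small sketch of how arguments work, the calls below use `round()`, a base R function not otherwise covered in these slides. Arguments can be matched by position or by name; the object names `by_position` and `by_name` are made up for the example.

```r
# Arguments can be supplied by position...
by_position <- round(3.14159, 2)

# ...or by name, which is easier to read later
by_name <- round(x = 3.14159, digits = 2)

by_position
by_name

# args() lists the arguments a function accepts
args(round)
```

Naming arguments is optional here, but it tends to make longer function calls easier to follow.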

---

# Creating Objects

R stores *everything* as an **object**, including data, functions, models, and output.

Creating an object can be done using the **assignment operator**: `<-` 
--

```r
new.object <- 144
```

**Operators** like `<-` are functions that look like symbols but typically sit between their arguments 
(e.g. numbers or objects) instead of having them inside `()` like in `sqrt(x)`.<sup>1</sup>

.footnote[[1] We can actually call operators like other functions by stuffing them between backticks: <code>\`+\`(x,y)</code>]

We do math with operators, e.g., `x + y`. `+` is the addition operator!
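
To illustrate the footnote on this slide, here is a minimal sketch showing that an operator can be called like any other function. The objects `x`, `y`, `infix_result`, and `prefix_result` are made up for the example.

```r
x <- 5
y <- 3

infix_result  <- x + y       # usual operator form
prefix_result <- `+`(x, y)   # same operator called like a function

identical(infix_result, prefix_result)
```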

---

# Calling Objects

You can display or "call" an object simply by using its name.

```r
new.object
```

```
## [1] 144
```

Object names can contain `_` and `.`, but cannot *begin* with a number or `_`. Try
to be consistent in naming objects. Because RStudio auto-completes names, *long descriptive names are better 
than short vague ones*!

*Good names<sup>1</sup> save confusion later.*

.footnote[[1] "There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton]

---

# Using Objects

An object's **name** represents the information stored in that **object**, so you can treat the object's name
as if it were the values stored inside.
--

```r
new.object + 10
```

```
## [1] 154
```

```r
new.object + new.object
```

```
## [1] 288
```

```r
sqrt(new.object)
```

```
## [1] 12
```

---

# Creating Vectors

A **vector** is a series of **elements**, such as numbers.

You can create a vector and store it as an object in the same way. To do this, use the
function `c()` which stands for "combine" or "concatenate".
--

```r
new.object <- c(4, 9, 16, 25, 36)
new.object
```

```
## [1]  4  9 16 25 36
```

If you name an object the same name as an existing object, *it will overwrite it*.

You can provide a vector as an argument for many functions.
--

```r
sqrt(new.object)
```

```
## [1] 2 3 4 5 6
```
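
Some functions work element by element, like `sqrt()`, while others summarize the whole vector. A brief sketch using a hypothetical vector named `squares`:

```r
squares <- c(4, 9, 16, 25, 36)

squares * 2      # arithmetic is applied to each element
length(squares)  # number of elements
sum(squares)     # adds all elements together
mean(squares)    # average of the elements
```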

---

# Character Vectors

We often work with data that are categorical. To create a vector of text elements—**strings** in programming terms—we must place the text in quotes:

```r
string.vector <- c("Atlantic", "Pacific", "Arctic")
string.vector
```

```
## [1] "Atlantic" "Pacific"  "Arctic"
```

Categorical data can also be stored as a **factor**, which has an underlying numeric representation. Models will convert factors to dummies.<sup>1</sup>

```r
factor.vector <- factor(string.vector)
factor.vector
```

```
## [1] Atlantic Pacific  Arctic  
## Levels: Arctic Atlantic Pacific
```

.footnote[[1] Factors have **levels** which you can use to set a reference category in models using `relevel()`.]
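
A short sketch of levels and `relevel()`, using a made-up factor named `oceans`. It shows the default alphabetical ordering of levels, the integer codes stored underneath, and how to move one level to the front so it acts as the reference category.

```r
oceans <- factor(c("Atlantic", "Pacific", "Arctic"))

levels(oceans)       # levels are sorted alphabetically by default
as.integer(oceans)   # the underlying integer code for each element

# Make "Pacific" the first (reference) level
oceans_releveled <- relevel(oceans, ref = "Pacific")
levels(oceans_releveled)
```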

---
# Saving and Loading Objects

You can save an R object on your computer as a file to open later:

```r
save(new.object, file="new_object.RData")
```

You can open saved files in R as well:

```r
load("new_object.RData")
```
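
A minimal round-trip sketch, assuming a temporary file created with `tempfile()` instead of a named file. The object name `scores` and the path `rdata_path` are made up for the example.

```r
scores <- c(4, 9, 16)
rdata_path <- tempfile(fileext = ".RData")

save(scores, file = rdata_path)  # write the object to disk
rm(scores)                       # remove it from the environment

load(rdata_path)                 # restores the object under its original name
scores
```

Note that `load()` brings objects back under the names they were saved with, so it can overwrite an existing object of the same name.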

But where are these files being saved and loaded from?

---

# Working Directories

R saves files and looks for files to open in your current **working directory**.<sup>1</sup> You
can ask R what this is:

.footnote[[1] For a simple R function to open an Explorer / Finder window at your working directory, [see this StackOverflow response](https://stackoverflow.com/a/12135823/10277284).]

```r
getwd()
```

```
## [1] "C:/Users/cclan/OneDrive/GitHub/r_exposure_workshop/lectures/r1"
```

Similarly, we can set a working directory like so:

```r
setwd("C:/Users/")
```

---

# More Complex Objects

The same principles shown with vectors can be used with more complex objects like **matrices**, **arrays**, **lists**, and **dataframes** (lists which look like matrices but can hold multiple data types at once).

Most data sets you will work with will be read into R and stored as a **dataframe**, so the remainder of this workshop will mainly focus on using these objects.

---
class: inverse

# Loading Dataframes

---

# Delimited Text Files

The easiest way to work with external data—that isn't in R format—is for it to be stored in a *delimited* text file, e.g. comma-separated values (**.csv**) or tab-separated values (**.tsv**).

R has a variety of built-in functions for importing data stored in text files, like `read.table()` and `read.csv()`.<sup>1</sup>

```r
new_df <- read.csv("some_spreadsheet.csv")
```

.footnote[
[1] Use "write" versions (e.g. `write.csv()`) to create these files from R objects.
]
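
A hedged sketch of writing and then reading a `.csv` file. The data frame `oceans_df` and the temporary path `csv_path` are invented for illustration; in practice you would use your own file name.

```r
oceans_df <- data.frame(
  id    = 1:3,
  ocean = c("Atlantic", "Pacific", "Arctic")
)

csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(oceans_df, csv_path, row.names = FALSE)

oceans_from_csv <- read.csv(csv_path)
str(oceans_from_csv)
```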

---
# Data from Other Software

Working with **Stata**, **SPSS**, or **SAS** users? You can use a **package** to bring in their saved data files:
 
* `foreign`
    + Installed with R as a recommended package
    + Functions: `read.spss()`, `read.dta()`, `read.xport()`
    + Less complex but sometimes loses some metadata
* `haven`
    + Part of the `tidyverse` family
    + Functions: `read_spss()`, `read_dta()`, `read_sas()`
    + Keeps metadata like variable labels

For less common formats, Google it. I rarely encounter a data format without an 
R package to handle it (or at least a clever hack).

If you have an ambiguous file extension (e.g. `.dat`), try opening it with
a good text editor first (e.g. Atom, Sublime); there's a good chance it is actually raw text
with a delimiter or fixed format that R can handle!

---

# Installing Packages

Packages contain functions (and sometimes data) created by the community. The real power of R is found in add-on packages!

This workshop focuses on using packages from the [`tidyverse`](https://www.tidyverse.org/).

The `tidyverse` is a collection of R packages which share a design philosophy, syntax, and data structures.

The `tidyverse` includes the most used packages in the R world: [`dplyr`](https://dplyr.tidyverse.org/) and [`ggplot2`](https://ggplot2.tidyverse.org/)

You can install the *entire* `tidyverse` with the following:

```r
install.packages("tidyverse")
```

We will also use the `gapminder` and `nycflights13` datasets:

```r
install.packages("gapminder")
install.packages("nycflights13")
```

---

# Loading Packages

To load a package, use `library()`:

```r
library(gapminder)
```

Once a package is loaded, you can call on functions or data inside it.

```r
data(gapminder) # Places data in your global environment
head(gapminder) # Displays first six elements of an object
```

```
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
```

---
class: inverse
# Indexing and Subsetting
## Base R

---

# Indices and Dimensions

In base R, there are two main ways to access object elements: square brackets (`[]` or `[[]]`) and `$`. How you access an object depends on its *dimensions*.

Dataframes have *2* dimensions: **rows** and **columns**. Square brackets allow us to numerically **subset** in the format of `object[row, column]`. Leaving the row or column place empty selects *all* elements of that dimension.

.small[

```r
gapminder[1,] # First row
```

```
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
```
]
--
.small[

```r
*gapminder[1:3, 3:4] # First three rows, third and fourth column
```

```
## # A tibble: 3 × 2
## year lifeExp
## <int> <dbl>
## 1 1952 28.8
## 2 1957 30.3
## 3 1962 32.0
```
]

.pull-right[
.footnote[
The **colon operator** (`:`) generates a vector using the sequence of integers from its first argument to its second. `1:3` is equivalent to `c(1,2,3)`.
]
]
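
Brackets also accept column *names*. The sketch below assumes the built-in `mtcars` data (so it runs without any packages) and compares `[[ ]]` with `$`; the object names are made up for the example.

```r
mtcars[1:3, c("mpg", "cyl")]   # rows 1 to 3, columns selected by name

first_mpg_brackets <- mtcars[["mpg"]][1:3]  # [[ ]] extracts one column as a vector
first_mpg_dollar   <- mtcars$mpg[1:3]       # $ extracts the same column by name

identical(first_mpg_brackets, first_mpg_dollar)
```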
---

# Dataframes and Names

Columns in dataframes can also be accessed using their names with the `$` extract operator. This will return the column as a vector:

```r
gapminder$gdpPercap[1:10]
```

```
##  [1] 779.4453 820.8530 853.1007 836.1971 739.9811 786.1134 978.0114
##  [8] 852.3959 649.3414 635.3414
```

Note here I *also* used brackets to select just the first 10 elements of that column.

You can mix subsetting formats! In this case I provided only a single value (no column index) because **vectors** have *only one dimension* (length).

If you try to subset something and get an error about "incorrect number of dimensions", check your subsetting!

---

# Indexing by Expression

We can also index using expressions—logical *tests*.

```r
gapminder[gapminder$year==1952, ]
```

```
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Albania Europe 1952 55.2 1282697 1601.
## 3 Algeria Africa 1952 43.1 9279525 2449.
## 4 Angola Africa 1952 30.0 4232095 3521.
## 5 Argentina Americas 1952 62.5 17876956 5911.
## 6 Australia Oceania 1952 69.1 8691212 10040.
## 7 Austria Europe 1952 66.8 6927772 6137.
## 8 Bahrain Asia 1952 50.9 120447 9867.
## 9 Bangladesh Asia 1952 37.5 46886859 684.
## 10 Belgium Europe 1952 68 8730405 8343.
## # ℹ 132 more rows
```

---

# How Expressions Work

What does `gapminder$year==1952` actually do?

```r
head(gapminder$year==1952, 50) # display first 50 elements
```

```
##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE  TRUE FALSE
```

It returns a vector of `TRUE` or `FALSE` values.

When used with the subset operator (`[]`), elements for which a `TRUE` is given are returned while those corresponding to `FALSE` are dropped.
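
The same idea on a small scale, using a hypothetical vector `years` so every step is visible:

```r
years <- c(1952, 1957, 1962, 1952)

is_1952 <- years == 1952  # logical vector: TRUE FALSE FALSE TRUE
years[is_1952]            # keeps only elements where is_1952 is TRUE
sum(is_1952)              # TRUE counts as 1, so this counts the matches
```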

---

# Logical Operators

We used `==` for testing "equals": `gapminder$year==1952`.

There are many other [logical operators](http://www.statmethods.net/management/operators.html):

* `!=`: not equal to
--

* `>`, `>=`, `<`, `<=`: greater than, greater than or equal to, less than, less than or equal to
--

* `%in%`: checks whether a value is equal to one of several values

Or we can combine multiple logical conditions:

* `&`: both conditions need to hold (AND)
--

* `|`: at least one condition needs to hold (OR)
--

* `!`: inverts a logical condition (`TRUE` becomes `FALSE`, `FALSE` becomes `TRUE`)

Logical operators are one of the foundations of programming. You should experiment with these to become familiar with how they work!
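
One way to experiment is with a small made-up vector; `x` here is purely illustrative.

```r
x <- c(1, 5, 10, 15)

x != 5               # TRUE FALSE  TRUE  TRUE
x >= 5 & x < 15      # FALSE TRUE  TRUE FALSE
x < 5 | x > 10       # TRUE FALSE FALSE  TRUE
x %in% c(5, 15)      # FALSE TRUE FALSE  TRUE
!(x %in% c(5, 15))   # TRUE FALSE  TRUE FALSE
```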

---

# Sidenote: Missing Values

Missing values are coded as `NA` entries without quotes:

```r
vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA)
```

Even one `NA` "poisons the well": You'll get `NA` out of your calculations unless you remove them manually or use the extra argument `na.rm = TRUE` in some functions:

```r
mean(vector_w_missing)
```

```
## [1] NA
```

```r
mean(vector_w_missing, na.rm=TRUE)
```

```
## [1] 3.6
```

---
# Finding Missing Values

**WARNING:** You can't test for missing values by seeing if they "equal" (`==`) `NA`:

```r
vector_w_missing == NA
```

```
## [1] NA NA NA NA NA NA NA
```

But you can use the `is.na()` function:

```r
is.na(vector_w_missing)
```

```
## [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
```

We can use subsetting to get the equivalent of `na.rm=TRUE`:

```r
*mean(vector_w_missing[!is.na(vector_w_missing)])
```

```
## [1] 3.6
```

.pull-right[
.footnote[
`!` *reverses* a logical condition. Read the above as "subset to *not* `NA`"
]
]

---
class: inverse

# Subsetting Data with `dplyr`

.image-full[
![](img/dplyr.svg)
]

---

# `dplyr`

`dplyr` is a Tidyverse package for working with data frames.

It provides an intuitive, powerful, and consistent alternative to base R for subsetting data.

It also provides functions for summarizing and joining data which are more straightforward than base R.

While I recommend all users be familiar with base R methods I've just covered, `dplyr` is the dominant platform for data manipulation in R, so we will focus on it for the remainder of this unit.

---
# But First, Pipes: `|>`

Tidyverse approaches to writing code use forward pipe operators, usually called simply a **pipe**. We write pipes like **`|>`**.

Pipes take the object on the *left* and pass it to the function on the *right*: `x |> f(y)` is equivalent to `f(x, y)`. Read it out loud as "and then..."

```r
library(dplyr)
gapminder |> filter(country == "Canada") |> head(2)
```

```
## # A tibble: 2 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
```

Pipes save us typing, make code readable, and allow chaining like above, so we use them *all the time* when manipulating data frames.
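
To see the equivalence without any packages, here is a sketch using only base R functions. It assumes R 4.1 or later (when the native pipe was added), and the object names are made up for the example.

```r
values <- c(4, 9, 16)

nested_result <- sum(sqrt(values))           # read inside-out
piped_result  <- values |> sqrt() |> sum()   # read left to right

identical(nested_result, piped_result)
```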

---

# Using Pipes

Pipes are clearest to read when you have each function on a separate line.

```r
take_this_data |>
    do_first_thing(with = this_value) |>
    do_next_thing(using = that_value) |> ...
```

Stuff to the left of the pipe is passed to the *first argument* of the function on the right. Other arguments go on the right in the function.

If you ever find yourself piping a function where data are not the first argument, use the `_` placeholder in a named argument instead (this requires R 4.2 or later).

```r
gapminder |> lm(pop ~ year, data = _)
```

---
# Pipe Assignment

When creating a new object from the output of piped functions, place the assignment operator at the beginning.

```r
lm_pop_year <- gapminder |> 
 filter(continent == "Americas") |>
 lm(pop ~ year, data = _)
```

No matter how long the chain of functions is, assignment is done at the top.<sup>1</sup>

.footnote[[1] Note this is just a stylistic convention: If you prefer, you *can* do assignment at the end of the chain.]

---

# `filter()` Data Frames

I used **`filter()`** earlier. We subset *rows* of data using logical conditions with `filter()`!

```r
gapminder |> filter(country == "Oman") |> head(8)
```

```
## # A tibble: 8 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Oman Asia 1952 37.6 507833 1828.
## 2 Oman Asia 1957 40.1 561977 2243.
## 3 Oman Asia 1962 43.2 628164 2925.
## 4 Oman Asia 1967 47.0 714775 4721.
## 5 Oman Asia 1972 52.1 829050 10618.
## 6 Oman Asia 1977 57.4 1004533 11848.
## 7 Oman Asia 1982 62.7 1301048 12955.
## 8 Oman Asia 1987 67.7 1593882 18115.
```

What is this doing?

---

# Multiple Conditions Example

Let's say we want observations from Oman after 1980 and through 2000.

```r
gapminder |>
 filter(country == "Oman" &
 year > 1980 &
 year <= 2000 )
```

```
## # A tibble: 4 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Oman Asia 1982 62.7 1301048 12955.
## 2 Oman Asia 1987 67.7 1593882 18115.
## 3 Oman Asia 1992 71.2 1915208 18617.
## 4 Oman Asia 1997 72.5 2283635 19702.
```
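
`filter()` also accepts several conditions separated by commas, which it combines with AND. A short sketch, assuming `dplyr` is installed and using the built-in `mtcars` data instead of `gapminder`; the object names are made up for the example.

```r
library(dplyr)

with_ampersand <- mtcars |> filter(cyl == 4 & mpg > 30)
with_commas    <- mtcars |> filter(cyl == 4, mpg > 30)

# Both forms keep the same rows
all.equal(with_ampersand, with_commas)
```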

---
# `%in%` Operator

Common use case: Filter rows to things in some *set*.

We can use `%in%` like `==` but for matching *any element* in the vector on its right.<sup>1</sup>

```r
*former_yugoslavia <- c("Bosnia and Herzegovina", "Croatia",
* "Montenegro", "Serbia", "Slovenia")
yugoslavia <- gapminder |> filter(country %in% former_yugoslavia)
tail(yugoslavia, 2)
```

```
## # A tibble: 2 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Slovenia Europe 2002 76.7 2011497 20660.
## 2 Slovenia Europe 2007 77.9 2009245 25768.
```

.footnote[[1] The `c()` function is how we make **vectors** in R, which are an important data type.]

---
## Sorting: `arrange()`

Along with filtering the data to see certain rows, we might want to sort it:

```r
yugoslavia |> arrange(year, desc(pop))
```

```
## # A tibble: 60 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Serbia Europe 1952 58.0 6860147 3581.
## 2 Croatia Europe 1952 61.2 3882229 3119.
## 3 Bosnia and Herzegovina Europe 1952 53.8 2791000 974.
## 4 Slovenia Europe 1952 65.6 1489518 4215.
## 5 Montenegro Europe 1952 59.2 413834 2648.
## 6 Serbia Europe 1957 61.7 7271135 4981.
## 7 Croatia Europe 1957 64.8 3991242 4338.
## 8 Bosnia and Herzegovina Europe 1957 58.4 3076000 1354.
## 9 Slovenia Europe 1957 67.8 1533070 5862.
## 10 Montenegro Europe 1957 61.4 442829 3682.
## # ℹ 50 more rows
```

The data are sorted by ascending `year` and descending `pop`.

---
## Keeping Columns: `select()`

Not only can we subset rows, but we can include specific columns (and put them in the order listed) using **`select()`**.

```r
yugoslavia |> select(country, year, pop) |> head(4)
```

```
## # A tibble: 4 × 3
## country year pop
## <fct> <int> <int>
## 1 Bosnia and Herzegovina 1952 2791000
## 2 Bosnia and Herzegovina 1957 3076000
## 3 Bosnia and Herzegovina 1962 3349000
## 4 Bosnia and Herzegovina 1967 3585000
```

---
## Dropping Columns: `select()`

We can instead drop only specific columns with `select()` using `-` signs:

```r
yugoslavia |> select(-continent, -pop, -lifeExp) |> head(4)
```

```
## # A tibble: 4 × 3
## country year gdpPercap
## <fct> <int> <dbl>
## 1 Bosnia and Herzegovina 1952 974.
## 2 Bosnia and Herzegovina 1957 1354.
## 3 Bosnia and Herzegovina 1962 1710.
## 4 Bosnia and Herzegovina 1967 2172.
```

---
## Helper Functions for `select()`

`select()` has a variety of helper functions like `starts_with()`, `ends_with()`, and `matches()`, or can be given a range of contiguous columns `startvar:endvar`. See `?select` for details.

These are very useful if you have a "wide" data frame with column names following a pattern or ordering.

![DYS Data Example](http://clanfear.github.io/CSSS508/Lectures/Week3/img/dys_vars.PNG)

```r
DYS |> select(starts_with("married"))
DYS |> select(ends_with("18"))
```

---
## `select(where())`

An especially useful helper for select is `where()` which can be used for selecting columns based on functions that check column types.

```r
gapminder |> select(where(is.numeric)) |> head(3)
```

```
## # A tibble: 3 × 4
## year lifeExp pop gdpPercap
## <int> <dbl> <int> <dbl>
## 1 1952 28.8 8425333 779.
## 2 1957 30.3 9240934 821.
## 3 1962 32.0 10267083 853.
```

.pull-right[.footnote[`int` (integer) and `dbl` (double) are both types of `numeric` data.]]

```r
gapminder |> select(where(is.factor)) |> head(3)
```

```
## # A tibble: 3 × 2
## country continent
## <fct> <fct> 
## 1 Afghanistan Asia 
## 2 Afghanistan Asia 
## 3 Afghanistan Asia
```

---
## Renaming Columns with `select()`

We can rename columns using `select()`, but that drops everything that isn't mentioned:

```r
yugoslavia |>
    select(Life_Expectancy = lifeExp) |>
    head(4)
```

```
## # A tibble: 4 × 1
## Life_Expectancy
## <dbl>
## 1 53.8
## 2 58.4
## 3 61.9
## 4 64.8
```

---
### Safer: Rename Columns with `rename()`

**`rename()`** renames variables using the same syntax as `select()` without dropping unmentioned variables.

```r
yugoslavia |>
    select(country, year, lifeExp) |>
    rename(Life_Expectancy = lifeExp) |>
    head(4)
```

```
## # A tibble: 4 × 3
## country year Life_Expectancy
## <fct> <int> <dbl>
## 1 Bosnia and Herzegovina 1952 53.8
## 2 Bosnia and Herzegovina 1957 58.4
## 3 Bosnia and Herzegovina 1962 61.9
## 4 Bosnia and Herzegovina 1967 64.8
```

---
class: inverse
# Creating Variables

---
## `mutate()`

In `dplyr`, you can add new columns to a data frame using **`mutate()`**.

```r
yugoslavia |> filter(country == "Serbia") |>
    select(year, pop, lifeExp) |>
*   mutate(pop_million = pop / 1000000,
*          life_exp_past_40 = lifeExp - 40) |>
    head(5)
```

```
## # A tibble: 5 × 5
## year pop lifeExp pop_million life_exp_past_40
## <int> <int> <dbl> <dbl> <dbl>
## 1 1952 6860147 58.0 6.86 18.0
## 2 1957 7271135 61.7 7.27 21.7
## 3 1962 7616060 64.5 7.62 24.5
## 4 1967 7971222 66.9 7.97 26.9
## 5 1972 8313288 68.7 8.31 28.7
```

.footnote[Note you can create multiple variables in a single `mutate()` call by separating the expressions with commas.]

---
# `ifelse()`

A common function used in `mutate()` (and in general in R programming) is **`ifelse()`**. It returns a vector of values depending on a logical test.

```r
ifelse(test = x==y, yes = first_value , no = second_value)
```

Output from `ifelse()` if `x==y` is...
* `TRUE`: `first_value` - the value for `yes =`

* `FALSE`: `second_value` - the value for `no = `

* `NA`: `NA` - because you can't test for NA with an equality!

For example:

```r
example <- c(1, 0, NA, -2)
ifelse(example > 0, "Positive", "Not Positive")
```

```
## [1] "Positive"     "Not Positive" NA             "Not Positive"
```
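
If you want missing values to get their own label, one option is to test for them first with `is.na()` and nest a second `ifelse()`. This is a sketch of that pattern, and the label `"Missing"` is just an illustrative choice.

```r
example <- c(1, 0, NA, -2)

labeled <- ifelse(is.na(example), "Missing",
                  ifelse(example > 0, "Positive", "Not Positive"))
labeled
```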

---
# `ifelse()` Example

.smallish[

```r
yugoslavia |> mutate(short_country = 
                 ifelse(country == "Bosnia and Herzegovina", 
*                       "B and H", as.character(country))) |>
    select(country, short_country, year, pop) |>
    arrange(year, short_country) |> head(3)
```

```
## # A tibble: 3 × 4
## country short_country year pop
## <fct> <chr> <int> <int>
## 1 Bosnia and Herzegovina B and H 1952 2791000
## 2 Croatia Croatia 1952 3882229
## 3 Montenegro Montenegro 1952 413834
```
]

Read this as "For each row, if `country` equals 'Bosnia and Herzegovina', make `short_country` equal to 'B and H', otherwise make it equal to that row's value of `country`."

This is a simple way to change some values but not others!

Note: `country` is a factor--use `as.character()` to convert to character.

---
## `case_when()`

**`case_when()`** performs multiple `ifelse()` operations at the same time. `case_when()` allows you to create a new variable with values based on multiple logical statements. This is useful for making categorical variables or variables from combinations of other variables.
.smallish[

```r
gapminder |> 
 mutate(gdpPercap_ordinal = 
 case_when(
 gdpPercap < 700 ~ "low",
 gdpPercap >= 700 & gdpPercap < 800 ~ "moderate",
 TRUE ~ "high" )) |> # Value when all other statements are FALSE
 slice(6:9) # get rows 6 through 9
```

```
## # A tibble: 4 × 7
## country continent year lifeExp pop gdpPercap gdpPercap_ordinal
## <fct> <fct> <int> <dbl> <int> <dbl> <chr> 
## 1 Afghanis… Asia 1977 38.4 1.49e7 786. moderate 
## 2 Afghanis… Asia 1982 39.9 1.29e7 978. high 
## 3 Afghanis… Asia 1987 40.8 1.39e7 852. high 
## 4 Afghanis… Asia 1992 41.7 1.63e7 649. low
```
]

.footnote[Consider `case_match()` when only working with one variable]
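
For comparison, base R's `cut()` can bin a single numeric variable in a similar way. This is a different technique from `case_when()`, sketched here with made-up values in `gdp_values`, and it returns a factor instead of a character vector.

```r
gdp_values <- c(650, 700, 750, 900)  # made-up values for illustration

gdp_ordinal <- cut(
  gdp_values,
  breaks = c(-Inf, 700, 800, Inf),
  labels = c("low", "moderate", "high"),
  right  = FALSE   # intervals are [lower, upper), matching `< 700` and `< 800`
)
gdp_ordinal
```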

---
class: inverse

# Analyses
## Basic Graphics and Models

---

# Histograms

We can use the `hist()` function to generate a histogram of a vector:

```r
hist(gapminder$lifeExp,
*    xlab = "Life Expectancy (years)",
*    main = "Observed Life Expectancies of Countries")
```

.pull-right[
.footnote[
`xlab =` is used to set the label of the x-axis of a plot.

`main = ` is used to set the title of a plot.

Use `?hist` to see additional options available for customizing a histogram.
]
]
---
# Scatter Plots

.small[

```r
*plot(lifeExp ~ gdpPercap, data = gapminder,
     xlab = "ln(GDP per Capita)",
     ylab = "Life Expectancy (years)",
     main = "Life Expectancy and log GDP per Capita",
     pch = 16, log="x") # log="x" sets x axis to log scale!
*abline(h = mean(gapminder$lifeExp), col = "firebrick")
*abline(v = mean(gapminder$gdpPercap), col = "cornflowerblue")
```

<img src="r_exposure_1_introduction_files/figure-html/unnamed-chunk-43-1.png" height="320px" />
]

.pull-right[
.footnote[Note that `lifeExp ~ gdpPercap` is a **formula** of the type `y ~ x`. The first element (`lifeExp`) gets plotted on the y-axis and the second (`gdpPercap`) goes on the x-axis.

The `abline()` calls place horizontal (`h =`) or vertical (`v =`) lines at the means of the variables used in the plot. 
]
]

---
# Formulae

Most modeling functions in R use a common formula format—the same seen with the previous plot:

```r
new_formula <- y ~ x1 + x2 + x3
new_formula
```

```
## y ~ x1 + x2 + x3
## <environment: 0x0000028e739631a8>
```

```r
class(new_formula)
```

```
## [1] "formula"
```

The dependent variable goes on the left side of `~` and independent variables go on the right.

See here for more on [formulae](https://www.datacamp.com/community/tutorials/r-formula-tutorial).
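
Because a formula is an object, it can be stored once and reused. A minimal sketch, assuming the built-in `mtcars` data; the names `mpg_formula` and `mpg_fit` are made up for the example.

```r
mpg_formula <- mpg ~ wt + hp

all.vars(mpg_formula)   # variable names used in the formula

# The stored formula can be passed to a modeling function
mpg_fit <- lm(mpg_formula, data = mtcars)
coef(mpg_fit)
```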

---

# Simple Tables

`table()` creates basic cross-tabulations of vectors.

```r
table(mtcars$cyl, mtcars$am)
```

```
##    
##      0  1
##   4  3  8
##   6  4  3
##   8 12  2
```

.footnote[Look at `tabyl()` in the `{janitor}` package for easy fancy tables]
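
Counts are often easier to read with labeled dimensions and as proportions. A sketch using base R's `prop.table()`; the dimension labels `cylinders` and `manual` are chosen here for illustration.

```r
cyl_by_am <- table(cylinders = mtcars$cyl, manual = mtcars$am)

cyl_by_am                          # counts, now with labeled dimensions
prop.table(cyl_by_am, margin = 1)  # proportions within each row
```

Using `margin = 2` would give proportions within each column instead.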

---

# Chi-Square

We can give the output from `table()` to `chisq.test()` to perform a Chi-Square test of association.

```r
chisq.test(table(mtcars$cyl, mtcars$am))
```

```
## Warning in chisq.test(table(mtcars$cyl, mtcars$am)): Chi-squared
## approximation may be incorrect
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  table(mtcars$cyl, mtcars$am)
## X-squared = 8.7407, df = 2, p-value = 0.01265
```

Note the warning here: it appears because some expected cell counts are small. You can use simulated p-values (`simulate.p.value=TRUE`) if desired.
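
Like other test functions, `chisq.test()` returns an object whose parts can be extracted with `$`. A sketch of inspecting the expected counts and requesting a simulated p-value; `suppressWarnings()` and `set.seed(1)` are choices made for this example, not part of the original analysis.

```r
cyl_am_table <- table(mtcars$cyl, mtcars$am)

chisq_out <- suppressWarnings(chisq.test(cyl_am_table))
chisq_out$expected   # expected counts under independence
chisq_out$p.value    # p-value from the test

# Simulated p-value; set.seed() makes the simulation reproducible
set.seed(1)
chisq_sim <- chisq.test(cyl_am_table, simulate.p.value = TRUE)
chisq_sim$p.value
```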

---

# T Tests

T tests for mean comparisons are simple to do.

```r
gapminder$post_1980 <- ifelse(gapminder$year > 1980, 1, 2)
t.test(lifeExp ~ post_1980, data=gapminder)
```

```
## 
## 	Welch Two Sample t-test
## 
## data: lifeExp by post_1980
## t = 17.174, df = 1694.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## 8.791953 11.059068
## sample estimates:
## mean in group 1 mean in group 2 
## 64.43719 54.51168
```

---

# Linear Models

We can run an ordinary least squares linear regression using `lm()`:

```r
lm(lifeExp~pop + gdpPercap + year + continent, data=gapminder)
```

```
## 
## Call:
## lm(formula = lifeExp ~ pop + gdpPercap + year + continent, data = gapminder)
## 
## Coefficients:
##       (Intercept)                pop          gdpPercap  
##        -5.185e+02          1.791e-09          2.985e-04  
##              year  continentAmericas      continentAsia  
##         2.863e-01          1.429e+01          9.375e+00  
##   continentEurope   continentOceania  
##         1.936e+01          2.056e+01
```

Note we get a lot less output here than you may have expected! This is because we're only viewing a tiny bit of the information produced by `lm()`. We need to explore the object `lm()` creates!

---

# Model Summaries

The `summary()` function provides Stata-like regression output:

.smaller[

```r
lm_out <- lm(lifeExp~pop + gdpPercap + year + continent, data=gapminder)
summary(lm_out)
```

```
## 
## Call:
## lm(formula = lifeExp ~ pop + gdpPercap + year + continent, data = gapminder)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -28.4051 -4.0550 0.2317 4.5073 20.0217 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) -5.185e+02 1.989e+01 -26.062 <2e-16 ***
## pop 1.791e-09 1.634e-09 1.096 0.273 
## gdpPercap 2.985e-04 2.002e-05 14.908 <2e-16 ***
## year 2.863e-01 1.006e-02 28.469 <2e-16 ***
## continentAmericas 1.429e+01 4.946e-01 28.898 <2e-16 ***
## continentAsia 9.375e+00 4.719e-01 19.869 <2e-16 ***
## continentEurope 1.936e+01 5.182e-01 37.361 <2e-16 ***
## continentOceania 2.056e+01 1.469e+00 13.995 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.883 on 1696 degrees of freedom
## Multiple R-squared: 0.7172,	Adjusted R-squared: 0.716 
## F-statistic: 614.5 on 7 and 1696 DF, p-value: < 2.2e-16
```
]

---

## Model Objects

However, `lm()` produces a lot more information than `summary()` shows. We can see the **str**ucture of `lm()` output using `str()`:

.smaller[

```r
str(lm_out)
```

```
## List of 13
##  $ coefficients : Named num [1:8] -5.18e+02 1.79e-09 2.98e-04 2.86e-01 1.43e+01 ...
##   ..- attr(*, "names")= chr [1:8] "(Intercept)" "pop" "gdpPercap" "year" ...
##  $ residuals    : Named num [1:1704] -21.1 -21.1 -20.8 -20.2 -19.6 ...
##   ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:1704] -2455.1 34.6 312.1 162.6 100.6 ...
##   ..- attr(*, "names")= chr [1:1704] "(Intercept)" "pop" "gdpPercap" "year" ...
##  $ rank         : int 8
##  $ fitted.values: Named num [1:1704] 49.9 51.4 52.8 54.3 55.7 ...
##   ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
##  $ assign       : int [1:8] 0 1 2 3 4 4 4 4
##  $ qr           :List of 5
##   ..$ qr   : num [1:1704, 1:8] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. ..- attr(*, "assign")= int [1:8] 0 1 2 3 4 4 4 4
##   .. ..- attr(*, "contrasts")=List of 1
##   ..$ qraux: num [1:8] 1.02 1 1.02 1.01 1.01 ...
##   ..$ pivot: int [1:8] 1 2 3 4 5 6 7 8
##   ..$ tol  : num 1e-07
##   ..$ rank : int 8
##   ..- attr(*, "class")= chr "qr"
##   [list output truncated]
##  - attr(*, "class")= chr "lm"
```
]

.pull-right30[
.footnote[
`lm()` actually has an enormous quantity of output! This is a type of object called a **list**.
]
]
---

# Model Objects

We can access parts of `lm()` output using `$` like with dataframe names:

.small[

```r
lm_out$coefficients
```

```
##       (Intercept)               pop         gdpPercap 
##     -5.184555e+02      1.790640e-09      2.984892e-04 
##              year continentAmericas     continentAsia 
##      2.862583e-01      1.429204e+01      9.375486e+00 
##   continentEurope  continentOceania 
##      1.936120e+01      2.055921e+01
```
]

We can also do this with `summary()`, which provides additional statistics:

.small[

```r
summary(lm_out)$coefficients
```

```
##                        Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)       -5.184555e+02 1.989299e+01 -26.062215 3.248472e-126
## pop                1.790640e-09 1.634107e-09   1.095791  2.733256e-01
## gdpPercap          2.984892e-04 2.002178e-05  14.908225  2.522143e-47
## year               2.862583e-01 1.005523e-02  28.468586 4.800797e-146
## continentAmericas  1.429204e+01 4.945645e-01  28.898241 1.183161e-149
## continentAsia      9.375486e+00 4.718629e-01  19.869087  3.798275e-79
## continentEurope    1.936120e+01 5.182170e-01  37.361177 2.025551e-223
## continentOceania   2.055921e+01 1.469070e+00  13.994707  3.390781e-42
```
]
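
The same extraction pattern works on any fitted model. A small sketch with the built-in `mtcars` data; the model and the name `wt_fit` are made up for the example. `coef()` is an extractor function that returns the same values as `$coefficients`.

```r
wt_fit <- lm(mpg ~ wt, data = mtcars)

coef(wt_fit)                 # same values as wt_fit$coefficients
summary(wt_fit)$r.squared    # R-squared from the summary object

# One cell of the coefficient table, selected by row and column name
summary(wt_fit)$coefficients["wt", "Std. Error"]
```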

---

# ANOVA

ANOVAs can be fit and summarized just like `lm()` models:

```r
summary(aov(lifeExp ~ continent, data=gapminder))
```

```
## Df Sum Sq Mean Sq F value Pr(>F) 
## continent 4 139343 34836 408.7 <2e-16 ***
## Residuals 1699 144805 85 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# More Complex Models

R supports many more complex models, for example:

* `glm()` has syntax similar to `lm()` but adds a `family =` argument to specify model families and link functions like logistic regression
   + ex: `glm(y ~ x, family = binomial(link = "logit"))`

* The `lme4` package adds hierarchical (multilevel) GLM models.

* `lavaan` fits structural equation models with intuitive syntax.

* `fixest` fits fixed-effects (panel) models and `tseries` fits time series models.

Most of these other packages support model summaries with `summary()`, and many create output objects whose parts can be accessed using `$`.

Because R is the dominant environment for statisticians, the universe of modeling tools in R is *enormous*. If you need to do it, it is probably in a package somewhere.
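
As a sketch of the `glm()` syntax listed on this slide, here is a logistic regression on the built-in `mtcars` data. The choice of outcome (`am`, manual transmission) and predictor (`wt`, weight) is an assumption made purely for illustration.

```r
# Illustrative model: probability of a manual transmission (am = 1) by car weight
logit_fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

summary(logit_fit)$coefficients   # estimates are on the log-odds scale

# Fitted probabilities for the first few cars
fitted_probs <- predict(logit_fit, type = "response")
head(fitted_probs)
```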

---
class: inverse
# End of Unit 1