Fundamentals

IQA Lecture 1

Charles Lanfear

16 Oct 2024
Updated: 14 Oct 2024


Today

  • Math Review

    • Variables, vectors, and matrices

    • Lines and curves

    • Derivatives

  • Programming

    • Indexing and Subsetting

    • Logical Expressions


Math Review


Variables

Variables are symbols representing sets of one or more elements which might take any number of values.

  • Letters like x, y, and z are commonly used to indicate variables.

    • e.g., \(x = 3\)
  • Capital letters (\(X\)) or letters with an index (\(x_i\)) refer to variables with multiple values, that is, with dimensions.

    • e.g., \(X = x_i = (3, 4, 5)\)
    • Variables with one dimension (length) are sometimes called vectors
  • Subscripts like the \(i\) in \(x_i\) are used to index elements of vectors.

    • \(i\) here is itself a variable that indicates the position indexed

    • \(x_1 = 3\), \(x_2 = 4\), \(x_3 = 5\)


Vectors in R

x <- c(3, 4, 5) # Create x, a vector of length 3
x
## [1] 3 4 5

Indexing \(x_3\):

x[3] # Get the third element of x
## [1] 5

We can index multiple elements.

Index \(x_2\) and \(x_3\):

x[c(2,3)] # Get the second and third elements of x
## [1] 4 5

Matrices

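Matrices are rectangular tables of numbers. They're typically indicated by a capital letter, e.g., \(X\).

\[ X = \begin{bmatrix} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{bmatrix} \]

Matrices are indexed with subscripts for rows, then columns (e.g., \(x_{i,j}\)).

\[ X = \begin{bmatrix} 3 & 5 \\ 4 & 6 \end{bmatrix}, \quad x_{2,1} = 4 \]

\[ X = \begin{bmatrix} 6 & 2 & 3 \\ 1 & 9 & 5 \\ 4 & 8 & 0 \end{bmatrix} \]

What is \(x_{3,2}\)?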


Matrices in R

(X <- matrix(c(6,1,4,2,9,8,3,5,0), nrow = 3))
##      [,1] [,2] [,3]
## [1,]    6    2    3
## [2,]    1    9    5
## [3,]    4    8    0

Note R shows indices on the margins to tell you how to subset.

X[3,2] # Third row, second column
## [1] 8

We can take multiple elements of a matrix too (and shake up the order):

X[3,c(2,1)] # Third row, second and first column
## [1] 8 4
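
One detail worth a quick sketch: matrix() fills columns first by default, which is why the first three values supplied (6, 1, 4) ended up as the first column. The example below contrasts that with the byrow argument of matrix(); the names vals, X_col, and X_row are made up for illustration.

```r
vals <- c(6, 1, 4, 2, 9, 8, 3, 5, 0)

# Default: values fill down each column in turn
X_col <- matrix(vals, nrow = 3)
X_col[3, 2] # Third row, second column
## [1] 8

# byrow = TRUE: values fill across each row in turn
X_row <- matrix(vals, nrow = 3, byrow = TRUE)
X_row[3, 2]
## [1] 5

# Leaving an index empty selects the whole row or column
X_col[, 1] # First column
## [1] 6 1 4
```

The same numbers give a different matrix depending on the fill order, so it is worth printing a matrix after creating it.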

Summation

\[ \sum_{i=1}^{n} x_i \]

"Sum all values of x from the first (\(i = 1\)) until the last (\(n\))"

Given \(x = [7, 11, 11, 13, 26]\):

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 7 + 11 + 11 + 13 + 26 = 68 \]

x <- c(7, 11, 11, 13, 26)
sum(x)
## [1] 68

Often when summing all elements of a vector, the subscripts and superscripts are omitted:

  • e.g., \(\sum x_i = \sum_{i=1}^{n} x_i\)
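
To connect the notation to code, here is a rough sketch of what the summation symbol asks for, written as a loop. The names n and total are illustrative; in practice you would just call sum().

```r
x <- c(7, 11, 11, 13, 26)
n <- length(x) # The last index

total <- 0
for (i in 1:n) {
  total <- total + x[i] # Add the i-th element of x
}
total
## [1] 68
```

Here i plays the same role as the subscript: it steps through each position from 1 to n.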

Measures of Central Tendency

Otherwise known as averages

 


Mean

The (arithmetic) mean is the expected value of a variable.

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

If you draw randomly from that variable, the mean would be the least wrong single guess you could make about that value.[1]

Put another way, the positive and negative differences between all values and the mean balance out.

[1] Technically the mean minimizes the squared error.

(1/length(x)) * sum(x) # Mean formula
## [1] 13.6
mean(x) # Mean function is just a shortcut
## [1] 13.6
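
Both claims about the mean can be checked numerically. This is only a sketch: sse is a made-up helper that scores a single guess g by its sum of squared errors, and optimize() is the base R one-dimensional optimizer.

```r
x <- c(7, 11, 11, 13, 26)

# Differences from the mean cancel out (up to floating-point rounding)
sum(x - mean(x))

# Sum of squared errors for any single guess g
sse <- function(g) sum((x - g)^2)
sse(mean(x)) < sse(median(x)) # The mean beats the median on squared error
## [1] TRUE

# optimize() searches an interval for the guess with the smallest error
best <- optimize(sse, interval = range(x))
best$minimum # Approximately 13.6, the mean
```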

Median

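The median is the value for which no more than half of the values are higher and no more than half are lower.[1]

[1] Technically the median minimizes the absolute error.

The median has this heinous formula, where the values of \(x\) have first been sorted in ascending order:

\[ m(x_i) = \begin{cases} x_{\frac{n+1}{2}}, & \text{if } n \text{ odd} \\ \frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right), & \text{if } n \text{ even} \end{cases} \]

"If x has an odd number of elements, when you put them in order, the median is the middle value. If x has an even number of elements, the median is the mean of the middle two."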

sort(x)[(length(x) + 1) / 2]
## [1] 11
median(x)
## [1] 11
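
The one-liner above only covers the odd case. A minimal sketch of both branches of the formula might look like this; my_median is a hypothetical helper written for illustration, not a base R function, and median() remains the thing to use in practice.

```r
# my_median is a hypothetical helper, not part of base R
my_median <- function(x) {
  x <- sort(x) # Put the values in order
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2] # Odd n: the middle value
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2 # Even n: mean of the middle two
  }
}

my_median(c(7, 11, 11, 13, 26)) # Five elements (odd)
## [1] 11
my_median(c(2, 5, 3, 8)) # Four elements (even)
## [1] 4
```

The %% operator gives the remainder after division, so n %% 2 == 1 is TRUE when n is odd.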

Mode

The mode is the most frequent value in the variable.

There are formulas for the mode, but they aren't very intuitive, despite it being the most intuitive measure of central tendency.

You can use table() to see frequencies of values:

table(x)
## x
##  7 11 13 26
##  1  2  1  1

And you can find the mode directly (we'll learn this later today!):

table(x)[table(x) == max(table(x))] # Subsetting!
## 11
## 2

"Subset the table to when the count is equal to the highest value."


Extreme Values

The mean is sensitive to extreme values:

z <- c(2, 5, 3, 5, 95)
mean(z)
## [1] 22

The median is not:

median(z)
## [1] 5

This means the median may be a more useful "average" when your data have extreme values.

This is common with variables like income or self-reported number of crimes committed. These often also have clumping (e.g., many zeroes) that makes the mode misleading!


Measures of Dispersion

How spread out something is

[Figure: two example panels, one labeled "High Dispersion" and one labeled "Low Dispersion"]


Variance

The variance (\(s^2\)) measures how dispersed data are around the mean. Typically we use the sample variance:

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

We'll see this in action next week when we look at distributions.

(s2 <- sum((x - mean(x))^2) / (length(x) -1))
## [1] 52.8
var(x)
## [1] 52.8

If every value is the same, the variance is zero—the data would be invariant.
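
To see where 52.8 comes from, the calculation can be written out by hand for \(x = [7, 11, 11, 13, 26]\), which has mean 13.6 and \(n = 5\):

```latex
\begin{aligned}
s^2 &= \frac{(7 - 13.6)^2 + (11 - 13.6)^2 + (11 - 13.6)^2 + (13 - 13.6)^2 + (26 - 13.6)^2}{5 - 1} \\
    &= \frac{43.56 + 6.76 + 6.76 + 0.36 + 153.76}{4} \\
    &= \frac{211.2}{4} = 52.8
\end{aligned}
```

Note how the single value far from the mean (26) contributes most of the total.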


Standard Deviation

The standard deviation (\(s\) or \(sd\)) is just the square root of the variance:

\[ s = sd = \sqrt{s^2} \]

You can interpret it as the "typical" distance of values in the data from the mean.

sqrt(var(x))
## [1] 7.266361
sd(x)
## [1] 7.266361

Values are typically about 7.3 away from the mean.


Lines

 

 


Cartesian Plane


Slope-Intercept

Mathematically, lines can be defined by a slope and an intercept.

You've seen this before, perhaps many moons ago:

\[ y = mx + b \]

We'll restate it this way:

\[ y = a + bx \]

\(a\) is the intercept

  • The value of y when \(x = 0\)

\(b\) is the slope

  • The number of units y rises for every one-unit increase in x
  • You can restate this as the ratio of the change in y to the change in x
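
As a quick sketch of both definitions, the line can be written as an R function and the slope recovered as "rise over run" between two points. The values of a and b and the helper name line_y are chosen only for illustration.

```r
a <- 2   # Intercept (illustrative value)
b <- 0.5 # Slope (illustrative value)
line_y <- function(x) a + b * x # Hypothetical helper: y for a given x

line_y(0) # The intercept: y when x = 0
## [1] 2

# Slope as the change in y divided by the change in x
(line_y(4) - line_y(1)) / (4 - 1)
## [1] 0.5
```

Any two points on the line give the same ratio, which is what makes it a straight line.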

Intercept

y=1+0.5x

plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 1, b = 0.5)

The line intercepts the y-axis at 1.


Intercept

y=3+0.5x

plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 3, b = 0.5)

The line intercepts the y-axis at 3.


Intercept

y=2+0.5x

plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 2, b = 0.5)

The line intercepts the y-axis at 2.


Slope

y=2+0.5x

plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 2, b = 0.5)

Starting from 2, the line rises by 0.5 for every one-unit increase in x.

Slope

y=2+0x

plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 2, b = 0)

Starting from 2, the line rises by 0 for every one-unit increase in x: it is flat.

Slope

y=2+2x

plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 2, b = 2)

Starting from 2, the line rises by 2 for every one-unit increase in x.

A little bit of calculus

 


Derivatives

  • The derivative (e.g., \(\frac{dy}{dx}\)) is a function giving the rate of change (the slope) at a given point of another function (like a line or curve)

  • Interpret \(d\) as "a little bit of"

  • \(\frac{dy}{dx}\) is the little increase in y given a little increase in x at any given point of the function that generates y (e.g., \(y = 2x\))
  • For a straight line, this is the same everywhere: it has a constant slope.

  • For curves, the slope is different depending on where on the curve you're looking.

  • A derivative lets us find exactly what that slope is wherever we want to look
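
The "little bit of" idea can be sketched numerically: nudge x by a tiny amount and see how much y moves. The helper slope_at and the step size h are assumptions made for this illustration, and the result is an approximation rather than the exact derivative.

```r
# slope_at is a hypothetical helper: rise over run across a tiny step h
slope_at <- function(f, x, h = 1e-6) (f(x + h) - f(x)) / h

f_line  <- function(x) 2 * x # A straight line
f_curve <- function(x) x^2   # A curve

slope_at(f_line, c(-1, 0, 3))  # Approximately 2 everywhere
slope_at(f_curve, c(-1, 0, 3)) # Approximately -2, 0, and 6
```

The line has the same slope wherever we look, while the slope of the curve depends on x.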


Polynomial curves

You can define a curve by adding terms to the same line formula:

\[ y = 2 + 0.5x + 0.25x^2 \]

A squared or quadratic term (e.g., \(x^2\)) creates a parabola.

curve(2 + 0.5*x + 0.25*x^2, from = -2, to = 2, ylab = "y")

What is the slope of this curve?


Taking the Derivative

While a curve has many different slopes, all those slopes can be defined by a single derivative.[1]

[1] At least for any curves we're going to talk about!

Given \(y = a + x^n\), \(\frac{dy}{dx} = n x^{n-1}\)

Basic rules:

  • Delete any terms without x (e.g., \(a\) gets dropped)
  • Premultiply by exponents, then divide by x (e.g., \(x^3\) becomes \(3x^2\))

So...

  • \(y = 2 + 0.5x + 0.25x^2\)
  • \(\frac{dy}{dx} = 0.5 + 0.5x\)
  • When \(x = 3\) the slope is \(0.5 + 0.5 \times 3 = 2\).
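
If you want to check a derivative like this one, R can do the symbolic work with D() from the stats package (loaded by default). This is just a sketch; the name dydx is illustrative, and D() does not simplify its result, so it prints an equivalent but messier form.

```r
# D() differentiates an expression with respect to the named variable
dydx <- D(expression(2 + 0.5 * x + 0.25 * x^2), "x")
dydx # An unsimplified form equivalent to 0.5 + 0.5x

# Evaluate the derivative where x = 3
eval(dydx, list(x = 3))
## [1] 2
```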

Cubic Derivative

These rules work for polynomials with more terms.

\[ y = 35 + 3x + 0.5x^2 + 0.25x^3 \]

  1. Drop the constant (35)
  2. Multiply the coefficients by the exponents
  3. Divide by x (i.e., reduce the exponents by 1)

\[ \frac{dy}{dx} = 3 + x + 0.75x^2 \]

When \(x = 2\) the slope is...

x <- 2
3 + x + 0.75*x^2
## [1] 8
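
Written out term by term, the three steps applied to this cubic look like this:

```latex
\begin{aligned}
y &= 35 + 3x + 0.5x^2 + 0.25x^3 \\
\frac{dy}{dx} &= 0 + (1)(3)x^{0} + (2)(0.5)x^{1} + (3)(0.25)x^{2} \\
              &= 3 + x + 0.75x^2 \\
\left.\frac{dy}{dx}\right|_{x=2} &= 3 + 2 + 0.75(2)^2 = 3 + 2 + 3 = 8
\end{aligned}
```

Each term is handled on its own: the constant vanishes, and every other term has its coefficient multiplied by its exponent before the exponent drops by one.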

Why am I learning this?


Reasons

It will be clear soon, but for now:

  • Most statistical models are estimators of conditional means, medians, or modes

    • e.g., the mean of y when x takes some value
  • Model uncertainty is estimated using variances

  • Models are estimated using matrices and calculus

    • Which I won't make you do manually
  • The most used model parameters tell us how y changes when x changes

    • e.g., coefficients or marginal effects
  • Those are derivatives


Okay, that's enough maths


Oh, thank god


Time for more code


 

BOO


Indexing and Subsetting

Base R


Indices and Dimensions

There are two main ways to index objects: square brackets ([] or [[]]) and $. How you access an object depends on its dimensions.

Dataframes have 2 dimensions: rows and columns. Brackets subset using object[row, column]. Leaving the row or column place empty selects all elements of that dimension.

USArrests[1,] # First row
##         Murder Assault UrbanPop Rape
## Alabama   13.2     236       58 21.2
USArrests[1:3, 3:4] # First three rows, third and fourth column
##         UrbanPop Rape
## Alabama       58 21.2
## Alaska        48 44.5
## Arizona       80 31.0

The colon operator (:) generates a vector using the sequence of integers from its first argument to its second. 1:3 is equivalent to c(1,2,3).


Using Names

We can also subset using the names of rows or columns:

USArrests["California",]
##            Murder Assault UrbanPop Rape
## California      9     276       91 40.6
head(USArrests[, c("Murder", "UrbanPop")])
##            Murder UrbanPop
## Alabama      13.2       58
## Alaska       10.0       48
## Arizona       8.1       80
## Arkansas      8.8       50
## California    9.0       91
## Colorado      7.9       78

Single columns

If you subset a dataframe to a single column using brackets, R returns a vector instead of a dataframe:

USArrests[, "Murder"]
## [1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4 5.3 2.6
## [13] 10.4 7.2 2.2 6.0 9.7 15.4 2.1 11.3 4.4 12.1 2.7 16.1
## [25] 9.0 6.0 4.3 12.2 2.1 7.4 11.4 11.1 13.0 0.8 7.3 6.6
## [37] 4.9 6.3 3.4 14.4 3.8 13.2 12.7 3.2 2.2 8.5 4.0 5.7
## [49] 2.6 6.8

Columns in dataframes can also be accessed using names with the $ extract operator:

USArrests$Murder
## [1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4 5.3 2.6
## [13] 10.4 7.2 2.2 6.0 9.7 15.4 2.1 11.3 4.4 12.1 2.7 16.1
## [25] 9.0 6.0 4.3 12.2 2.1 7.4 11.4 11.1 13.0 0.8 7.3 6.6
## [37] 4.9 6.3 3.4 14.4 3.8 13.2 12.7 3.2 2.2 8.5 4.0 5.7
## [49] 2.6 6.8
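
Two related options are worth a short sketch: the drop = FALSE argument of the bracket operator, which keeps the result as a dataframe, and double brackets, which extract one column much like $. The name murder_df is just for illustration.

```r
# Default: a single column comes back as a vector
is.vector(USArrests[, "Murder"])
## [1] TRUE

# drop = FALSE keeps the result as a one-column dataframe
murder_df <- USArrests[, "Murder", drop = FALSE]
is.data.frame(murder_df)
## [1] TRUE
dim(murder_df) # 50 rows, 1 column
## [1] 50  1

# Double brackets extract a single column, like $
identical(USArrests[["Murder"]], USArrests$Murder)
## [1] TRUE
```

Keeping the dataframe is handy when later code expects row names or a column name to still be attached.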

Extract: $

You may have noticed $ before when we used str():

str(USArrests)
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

Like the indices R prints on the margins of a matrix, the $ in this output is a hint that you can select columns that way.


Mix and Match

USArrests$Murder[1:10]
## [1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4

Note here I also used brackets to select just the first 10 elements of that column.

You can mix subsetting formats! In this case I provided only a single value (no column index) because vectors have only one dimension (length).

  • R first processes the $Murder
  • Then it processes the [1:10]

If you try to subset something and get an error about "incorrect number of dimensions", check your subsetting!
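
Here is a small sketch of how that message arises: giving two indices to something that has only one dimension. tryCatch() and conditionMessage() are base R tools used here only so the example keeps running after the error; the names murder and msg are illustrative, and the message text assumes an English-language R session.

```r
murder <- USArrests$Murder # A vector: one dimension (length)

murder[1:3] # One index works
## [1] 13.2 10.0  8.1

# Two indices fail; tryCatch() captures the error message instead of stopping
msg <- tryCatch(murder[1, 2], error = function(e) conditionMessage(e))
msg
## [1] "incorrect number of dimensions"
```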


Logical Expressions

 

 


Indexing by Expression

We can also index using expressions—logical tests.

USArrests[USArrests$Murder > 15, ]
##             Murder Assault UrbanPop Rape
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8
## Louisiana     15.4     249       66 22.2
## Mississippi   16.1     259       44 17.1

What does this give us?


How Expressions Work

What does USArrests$Murder > 15 actually do?

USArrests$Murder > 15
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [11] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [21] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [31] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

It returns a vector of TRUE or FALSE values.

When used with the subset operator ([]), elements for which a TRUE is given are returned while those corresponding to FALSE are dropped.

c(1,2,3,4)[c(TRUE, FALSE, TRUE, FALSE)]
## [1] 1 3
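
Because the test result is just a vector, you can also store it and inspect it before subsetting. A brief sketch, where high_murder is an illustrative name:

```r
high_murder <- USArrests$Murder > 15 # Logical vector, one entry per state

sum(high_murder) # TRUE counts as 1, so this counts the matching rows
## [1] 4
which(high_murder) # Positions of the TRUE values
## [1]  9 10 18 24
rownames(USArrests)[high_murder] # Subset the row names with the same vector
## [1] "Florida"     "Georgia"     "Louisiana"   "Mississippi"
```

Counting the TRUE values first is a cheap way to confirm a subset will contain what you expect.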

Logical Operators

We used > for testing "greater than": USArrests$Murder > 15.

There are many other logical operators:

  • ==: equal to
  • !=: not equal to
  • >, >=, <, <=: greater than, greater than or equal to, less than, less than or equal to
  • %in%: tests whether a value is equal to one of several values

Or we can combine multiple logical conditions:

  • &: both conditions need to hold (AND)
  • |: at least one condition needs to hold (OR)
  • !: inverts a logical condition (TRUE becomes FALSE, FALSE becomes TRUE)

Logical operators are one of the foundations of programming. You should experiment with these to become familiar with how they work!
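
As a starting point for experimenting, here is each operator applied to a small made-up vector v:

```r
v <- c(3, 4, 5, 11) # Illustrative values

v == 4
## [1] FALSE  TRUE FALSE FALSE
v != 4
## [1]  TRUE FALSE  TRUE  TRUE
v %in% c(3, 11) # Is each element one of these values?
## [1]  TRUE FALSE FALSE  TRUE
v > 3 & v < 11 # Both conditions hold
## [1] FALSE  TRUE  TRUE FALSE
v < 4 | v > 10 # At least one condition holds
## [1]  TRUE FALSE FALSE  TRUE
!(v > 4) # Invert: same as v <= 4
## [1]  TRUE  TRUE FALSE FALSE
```

Every test returns one TRUE or FALSE per element, which is what makes these usable inside brackets.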


And: &

USArrests[USArrests$Murder > 15 & USArrests$Assault > 300, ]
##         Murder Assault UrbanPop Rape
## Florida   15.4     335       80 31.9

Or: |

USArrests[USArrests$Murder > 15 | USArrests$Assault > 300, ]
##                Murder Assault UrbanPop Rape
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Louisiana        15.4     249       66 22.2
## Mississippi      16.1     259       44 17.1
## North Carolina   13.0     337       45 16.1


Sidenote: Missing Values

Missing values are coded as NA entries without quotes:

vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA)

Even one NA "poisons the well": You'll get NA out of your calculations unless you remove them manually or use the extra argument na.rm = TRUE in some functions:

mean(vector_w_missing)
## [1] NA

We can take missings (NA) and remove (rm) them:

mean(vector_w_missing, na.rm=TRUE)
## [1] 3.6

Finding Missing Values

WARNING: You can't test for missing values by seeing if they "equal" (==) NA:

vector_w_missing == NA
## [1] NA NA NA NA NA NA NA

But you can use the is.na() function:

is.na(vector_w_missing)
## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE

We can use subsetting to get the equivalent of na.rm=TRUE:

mean(vector_w_missing[!is.na(vector_w_missing)])
## [1] 3.6

! reverses a logical condition. Read the above as "subset to not NA"
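
A few more things you can do with is.y(), sketched on the same vector. The intuition for why == fails is that NA means "unknown", so R cannot say whether an unknown value equals anything.

```r
vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA)

# Comparing with an unknown value gives an unknown answer
NA == NA
## [1] NA

sum(is.na(vector_w_missing)) # How many values are missing?
## [1] 2
which(is.na(vector_w_missing)) # Where are they?
## [1] 3 7

# Other logical tests also return NA in the missing positions
vector_w_missing > 3
## [1] FALSE FALSE    NA  TRUE  TRUE  TRUE    NA
```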


For Next Time

  • Read Kaplan chapters 3 and 4

  • Try a bit more swirl
