Math Review
Variables, vectors, and matrices
Lines and curves
Derivatives
Programming
Indexing and Subsetting
Logical Expressions
Variables are symbols representing sets of one or more elements which might take any number of values.
Letters like $x$, $y$, and $z$ are commonly used to indicate variables.
Capital letters ($X$) or letters with an index ($x_i$) refer to variables with multiple values, that is, with dimensions.
Subscripts like $x_i$ are used to index elements of vectors.
$i$ here is itself a variable that indicates the position indexed.
$x_1 = 3$, $x_2 = 4$, $x_3 = 5$
x <- c(3, 4, 5) # Create x, a vector of length 3
x
## [1] 3 4 5
Indexing $x_3$:
x[3] # Get the third element of x
## [1] 5
We can index multiple elements.
Index $x_2$ and $x_3$:
x[c(2,3)] # Get the second and third elements of x
## [1] 4 5
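Matrices are rectangular tables of numbers. They're typically indicated by a capital letter, e.g., $X$.
$$X_{i,j} = \begin{bmatrix} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{bmatrix}$$
Matrices are indexed with subscripts for rows, then columns (e.g., $x_{ij}$).
$$X = \begin{bmatrix} 3 & 5 \\ 4 & 6 \end{bmatrix}, \quad x_{2,1} = 4$$
$$X = \begin{bmatrix} 6 & 2 & 3 \\ 1 & 9 & 5 \\ 4 & 8 & 0 \end{bmatrix}$$
What is $x_{3,2}$?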
(X <- matrix(c(6,1,4,2,9,8,3,5,0), nrow = 3))
##      [,1] [,2] [,3]
## [1,]    6    2    3
## [2,]    1    9    5
## [3,]    4    8    0
Note R shows indices on the margins to tell you how to subset.
X[3,2] # Third row, second column
## [1] 8
We can take multiple elements of a matrix too (and shake up the order):
X[3,c(2,1)] # Third row, second and first columns (in that order)
## [1] 8 4
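Leaving one of the two indices empty selects an entire row or column. A minimal sketch, redefining the same matrix X so the block runs on its own:

```r
# Same 3 x 3 matrix as above; matrix() fills column by column
X <- matrix(c(6, 1, 4, 2, 9, 8, 3, 5, 0), nrow = 3)

dim(X)    # Number of rows, then number of columns
X[3, ]    # Entire third row
X[, 2]    # Entire second column
```

The result in each case is a plain vector, so it can be indexed again with a single index.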
$$\sum_{i=1}^{n} x_i$$
"Sum all values of x from the first ( i=1 ) until the last ( n )"
Given $x = [7, 11, 11, 13, 26]$:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 7 + 11 + 11 + 13 + 26 = 68$$
x <- c(7, 11, 11, 13, 26)
sum(x)
## [1] 68
Often, when summing all elements of a vector, the subscripts and superscripts are omitted.
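To see what the summation notation is doing, here is a rough sketch that spells the sum out as a loop over the index i. The name `total` is just an illustrative choice; in practice you would call sum() directly.

```r
x <- c(7, 11, 11, 13, 26)

total <- 0
for (i in seq_along(x)) {   # i runs from 1 to n
  total <- total + x[i]     # Add the i-th element of x
}
total
sum(x)                      # Same result, without the loop
```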
The (arithmetic) mean is the expected value of a variable.
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
If you draw randomly from that variable, the mean would be the least wrong single guess you could make about that value. [1]
Put another way, the positive and negative differences between all values and the mean balance out.
[1] Technically, the mean minimizes the squared error.
(1/length(x)) * sum(x) # Mean formula
## [1] 13.6
mean(x) # Mean function is just a shortcut
## [1] 13.6
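Both claims about the mean can be checked numerically. This is only a sketch: `sq_error` is a hypothetical helper written for this example, and the alternative guesses 12 and 15 are arbitrary.

```r
x <- c(7, 11, 11, 13, 26)

deviations <- x - mean(x)     # Difference between each value and the mean
round(sum(deviations), 10)    # Positive and negative differences cancel out

# Total squared error of a single guess g (hypothetical helper)
sq_error <- function(g) sum((x - g)^2)
sq_error(mean(x))             # Error when guessing the mean
sq_error(12)                  # Larger error for another guess
sq_error(15)                  # Also larger
```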
The median is the value for which no more than half of the values are higher and no more than half are lower. [1]
[1] Technically, the median minimizes the absolute error.
The median has this heinous formula:
$$m(x_i) = \begin{cases} x_{\frac{n+1}{2}}, & \text{if } n \text{ odd} \\ \frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right), & \text{if } n \text{ even} \end{cases}$$
"If x has an odd number of elements, when you put them in order, the median is the middle value. If x has an even number of elements, the median is the mean of the middle two."
sort(x)[(length(x) + 1) / 2]
## [1] 11
median(x)
## [1] 11
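The example above covers the odd case. For the even case, a sketch along the same lines might look like this, where `y` is a made-up vector with four elements:

```r
y <- c(26, 7, 13, 11)   # Hypothetical vector with an even number of elements
n <- length(y)
y_sorted <- sort(y)     # Put the values in order first

0.5 * (y_sorted[n / 2] + y_sorted[n / 2 + 1])   # Mean of the middle two values
median(y)                                       # Same result
```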
The mode is the most frequent value in the variable.
There are formulas for the mode, but they aren't very intuitive, despite it being the most intuitive measure of central tendency.
You can use table() to see frequencies of values:
table(x)
## x
##  7 11 13 26 
##  1  2  1  1 
And you can find the mode directly (we'll learn this later today!):
table(x)[table(x) == max(table(x))] # Subsetting!
## 11 
##  2 
"Subset the table to when the count is equal to the highest value."
The mean is sensitive to extreme values:
z <- c(2, 5, 3, 5, 95)
mean(z)
## [1] 22
The median is not:
median(z)
## [1] 5
This means the median may be a more useful "average" when your data have extreme values.
This is common with things like income or self-reported number of crimes committed—these always have clumping that makes the mode misleading (e.g., many zeroes)!
[Figure: two panels, "High Dispersion" and "Low Dispersion"]
The variance ($s^2$) measures how dispersed data are around the mean. Typically we use the sample variance:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
We'll see this in action next week when we look at distributions.
(s2 <- sum((x - mean(x))^2) / (length(x) - 1))
## [1] 52.8
var(x)
## [1] 52.8
If every value is the same, the variance is zero—the data would be invariant.
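A quick way to convince yourself of this is to compare a few made-up vectors. The names and values below are purely illustrative:

```r
same  <- c(4, 4, 4, 4)      # Every value identical
var(same)                   # No dispersion at all

tight <- c(9, 10, 10, 11)   # Values close to their mean of 10
wide  <- c(0, 5, 15, 20)    # Values far from their mean of 10
var(tight)
var(wide)                   # Much larger than var(tight)
```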
The standard deviation ($s$ or $sd$) is just the square root of the variance:
$$s = sd = \sqrt{s^2}$$
You can interpret it as the "typical" distance of values in the data from the mean.
sqrt(var(x))
## [1] 7.266361
sd(x)
## [1] 7.266361
Values are about 7.3 away from the mean on average
Mathematically, lines can be defined by a slope and an intercept.
You've seen this before, perhaps many moons ago:
$$y = mx + b$$
We'll restate it this way:
$$y = a + bx$$
$a$ is the intercept
$b$ is the slope
$y = 1 + 0.5x$
plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 1, b = 0.5)
The line intercepts the y-axis at 1.
$y = 3 + 0.5x$
plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 3, b = 0.5)
The line intercepts the y-axis at 3.
$y = 2 + 0.5x$
plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 2, b = 0.5)
The line intercepts the y-axis at 2. From 2, the line increases by 0.5 for every one-unit increase in x.
$y = 2 + 0x$
plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 2, b = 0)
From 2, the line increases by 0 for every one-unit increase in x: it is flat.
$y = 2 + 2x$
plot(c(0,5), c(0,5), type = "n", xlab = "x", ylab = "y")
abline(a = 2, b = 2)
From 2, the line increases by 2 for every one-unit increase in x.
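The same idea can be checked without a plot by computing y directly from the intercept and slope. A small sketch, with `x_vals` and `y_vals` as illustrative names:

```r
a <- 2                     # Intercept
b <- 0.5                   # Slope
x_vals <- 0:5
y_vals <- a + b * x_vals   # y = a + bx for each x

y_vals         # The first value (where x is 0) is the intercept
diff(y_vals)   # Each one-unit step in x changes y by the slope
```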
The derivative (e.g., $\frac{dy}{dx}$) is a function giving the rate of change (the slope) at a given point of another function (like a line or curve).
Interpret $d$ as "a little bit of".
For a straight line, this is the same everywhere—it has a constant slope.
For curves, the slope is different depending on where on the curve you're looking.
A derivative lets us find exactly what that slope is wherever we want to look
You can define a curve using the same kind of formula as a line:
$$y = 2 + 0.5x + 0.25x^2$$
A squared or quadratic term (e.g., $x^2$) creates a parabola.
curve(2 + 0.5*x + 0.25*x^2, from = -2, to = 2, ylab = "y")
What is the slope of this curve?
While a curve has many different slopes, all those slopes can be defined by a single derivative. [1]
[1] At least for any curves we're going to talk about!
Given $y = a + x^n$, $\frac{dy}{dx} = n x^{n-1}$
Basic rules:
So...
These rules work for polynomials with more terms.
$$y = 35 + 3x + 0.5x^2 + 0.25x^3$$
$$\frac{dy}{dx} = 3 + x + 0.75x^2$$
When $x = 2$ the slope is...
x <- 2
3 + x + 0.75*x^2
## [1] 8
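If you want to double-check a derivative you worked out by hand, base R's D() function can differentiate simple expressions symbolically. The sketch below is one possible check, not something this course requires; `f_at` and the step size `h` are illustrative choices.

```r
f <- expression(35 + 3*x + 0.5*x^2 + 0.25*x^3)
dfdx <- D(f, "x")   # Symbolic derivative with respect to x
dfdx                # R's form looks different but is algebraically the same

x <- 2
eval(dfdx)          # Slope at x = 2

# Numerical check: rise over run across a tiny step h
f_at <- function(x) 35 + 3*x + 0.5*x^2 + 0.25*x^3
h <- 1e-6
(f_at(2 + h) - f_at(2)) / h   # Approximately the same slope
```

The numerical version is "a little bit of y over a little bit of x" taken literally, which is why it only approximates the exact slope.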
It will be clear soon, but for now:
Most statistical models are estimators of conditional means, medians, or modes
Model uncertainty is estimated using variances
Models are estimated using matrices and calculus
The most used model parameters tell us how y changes when x changes
Those are derivatives
There are two main ways to index objects: square brackets ([] or [[]]) and $. How you access an object depends on its dimensions.
Dataframes have 2 dimensions: rows and columns. Brackets subset using object[row, column]. Leaving the row or column place empty selects all elements of that dimension.
USArrests[1,] # First row
##         Murder Assault UrbanPop Rape
## Alabama   13.2     236       58 21.2
USArrests[1:3, 3:4] # First three rows, third and fourth columns
##         UrbanPop Rape
## Alabama       58 21.2
## Alaska        48 44.5
## Arizona       80 31.0
The colon operator (:) generates a vector using the sequence of integers from its first argument to its second. 1:3 is equivalent to c(1,2,3).
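A short sketch of the colon operator on its own and inside brackets. It relies only on the built-in USArrests data used above.

```r
1:3                      # The integers from 1 to 3
all(1:3 == c(1, 2, 3))   # Same values as writing them out
5:2                      # Counts down when the first number is larger

# Either form selects the same rows
rownames(USArrests[1:3, ])
rownames(USArrests[c(1, 2, 3), ])
```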
We can also subset using the names of rows or columns:
USArrests["California",]
##            Murder Assault UrbanPop Rape
## California      9     276       91 40.6
head(USArrests[, c("Murder", "UrbanPop")])
##            Murder UrbanPop
## Alabama      13.2       58
## Alaska       10.0       48
## Arizona       8.1       80
## Arkansas      8.8       50
## California    9.0       91
## Colorado      7.9       78
If you subset to a single column, R returns it as a vector instead of a dataframe:
USArrests[, "Murder"]
##  [1] 13.2 10.0  8.1  8.8  9.0  7.9  3.3  5.9 15.4 17.4  5.3  2.6
## [13] 10.4  7.2  2.2  6.0  9.7 15.4  2.1 11.3  4.4 12.1  2.7 16.1
## [25]  9.0  6.0  4.3 12.2  2.1  7.4 11.4 11.1 13.0  0.8  7.3  6.6
## [37]  4.9  6.3  3.4 14.4  3.8 13.2 12.7  3.2  2.2  8.5  4.0  5.7
## [49]  2.6  6.8
Columns in dataframes can also be accessed using names with the $ extract operator:
USArrests$Murder
##  [1] 13.2 10.0  8.1  8.8  9.0  7.9  3.3  5.9 15.4 17.4  5.3  2.6
## [13] 10.4  7.2  2.2  6.0  9.7 15.4  2.1 11.3  4.4 12.1  2.7 16.1
## [25]  9.0  6.0  4.3 12.2  2.1  7.4 11.4 11.1 13.0  0.8  7.3  6.6
## [37]  4.9  6.3  3.4 14.4  3.8 13.2 12.7  3.2  2.2  8.5  4.0  5.7
## [49]  2.6  6.8
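Double brackets were mentioned earlier but not shown. As a rough sketch, [[ ]] pulls out a single column just like $, and the drop = FALSE argument to single brackets keeps the result as a dataframe. The object names here are illustrative.

```r
murder_vec <- USArrests[["Murder"]]       # Double brackets: one column, as a vector
identical(murder_vec, USArrests$Murder)   # Same thing $ gives you

murder_df <- USArrests[, "Murder", drop = FALSE]   # Keep the dataframe structure
class(murder_df)
dim(murder_df)                                     # Rows, then columns
```

Keeping the dataframe structure is mostly useful when later code expects rows and columns rather than a bare vector.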
You may have noticed $ before when we used str():
str(USArrests)
## 'data.frame': 50 obs. of 4 variables:
##  $ Murder  : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
Like the index labels R prints on the margins of a matrix, the $ in this output is a hint that you can select columns that way.
USArrests$Murder[1:10]
## [1] 13.2 10.0 8.1 8.8 9.0 7.9 3.3 5.9 15.4 17.4
Note here I also used brackets to select just the first 10 elements of that column.
You can mix subsetting formats! In this case I provided only a single index (no column index) because vectors have only one dimension (length).
If you try to subset something and get an error about "incorrect number of dimensions", check your subsetting!
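Here is a sketch of how that message comes about. tryCatch() is used only so the example keeps running after the error; `murder` and `msg` are illustrative names.

```r
murder <- USArrests$Murder   # A vector: one dimension
murder[1]                    # One index works

# Two indices on a one-dimensional vector fail;
# tryCatch() captures the error message instead of stopping
msg <- tryCatch(murder[1, 2], error = function(e) conditionMessage(e))
msg
```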
We can also index using expressions—logical tests.
USArrests[USArrests$Murder > 15, ]
##             Murder Assault UrbanPop Rape
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8
## Louisiana     15.4     249       66 22.2
## Mississippi   16.1     259       44 17.1
What does this give us?
What does USArrests$Murder > 15 actually do?
USArrests$Murder > 15
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [11] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [21] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [31] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
It returns a vector of TRUE or FALSE values.
When used with the subset operator ([]), elements for which a TRUE is given are returned while those corresponding to FALSE are dropped.
c(1,2,3,4)[c(TRUE, FALSE, TRUE, FALSE)]
## [1] 1 3
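Because the test produces an ordinary logical vector, you can store it and reuse it. A small sketch, with `high_murder` as an illustrative name:

```r
high_murder <- USArrests$Murder > 15   # Logical vector, one entry per state

sum(high_murder)                       # TRUE counts as 1, so this counts the matches
which(high_murder)                     # Positions of the TRUE values
rownames(USArrests)[high_murder]       # Names of the matching states
```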
We used > for testing "greater than": USArrests$Murder > 15.
There are many other logical operators:
== : equal to
!= : not equal to
>, >=, <, <= : greater than, greater than or equal to, less than, less than or equal to
%in% : checks whether a value is equal to one of several values
Or we can combine multiple logical conditions:
& : both conditions need to hold (AND)
| : at least one condition needs to hold (OR)
! : inverts a logical condition (TRUE becomes FALSE, FALSE becomes TRUE)
Logical operators are one of the foundations of programming. You should experiment with these to become familiar with how they work!
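One way to experiment is to try each operator on a tiny vector where you can check the answer by eye. The vector `v` below is made up for this purpose.

```r
v <- c(1, 2, 3, 4)   # Hypothetical example vector

v == 2               # FALSE  TRUE FALSE FALSE
v != 2               #  TRUE FALSE  TRUE  TRUE
v %in% c(2, 4)       # FALSE  TRUE FALSE  TRUE
v > 1 & v < 4        # FALSE  TRUE  TRUE FALSE
v < 2 | v > 3        #  TRUE FALSE FALSE  TRUE
!(v > 2)             #  TRUE  TRUE FALSE FALSE
```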
&
USArrests[USArrests$Murder > 15 & USArrests$Assault > 300, ]
##         Murder Assault UrbanPop Rape
## Florida   15.4     335       80 31.9
|
USArrests[USArrests$Murder > 15 | USArrests$Assault > 300, ]
##                Murder Assault UrbanPop Rape
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Louisiana        15.4     249       66 22.2
## Mississippi      16.1     259       44 17.1
## North Carolina   13.0     337       45 16.1
Missing values are coded as NA entries without quotes:
vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA)
Even one NA "poisons the well": You'll get NA out of your calculations unless you remove them manually or use the extra argument na.rm = TRUE in some functions:
mean(vector_w_missing)
## [1] NA
We can take missings (NA) and remove (rm) them:
mean(vector_w_missing, na.rm=TRUE)
## [1] 3.6
WARNING: You can't test for missing values by seeing if they "equal" (==) NA:
vector_w_missing == NA
## [1] NA NA NA NA NA NA NA
But you can use the is.na() function:
is.na(vector_w_missing)
## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE
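Since is.na() returns a logical vector, the counting tricks for logical vectors apply to it too. A brief sketch that redefines the vector so it runs on its own:

```r
vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA)

sum(is.na(vector_w_missing))     # Number of missing values: 2
which(is.na(vector_w_missing))   # Their positions: 3 and 7
anyNA(vector_w_missing)          # TRUE if at least one value is missing
```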
We can use subsetting to get the equivalent of na.rm=TRUE:
mean(vector_w_missing[!is.na(vector_w_missing)])
## [1] 3.6
! reverses a logical condition. Read the above as "subset to not NA".
Read Kaplan chapters 3 and 4
Try a bit more swirl