These documents depict both Stata and R approaches to data management, modeling, and plotting seen in SOC505. They are presented side-by-side to make it easier to move between platforms. It also demonstrates that Stata and R are quite similar in syntax in many cases. R often takes more code to do common tasks than Stata, but can do many things that are not possible in Stata. Both are excellent platforms for quantitative research, though Stata is more common in economics and R is more common in statistics and data science. Note these pages are works-in-progress: If you see mistakes or major omissions, make a comment or pull request on the GitHub repository.
Provided a Stata data set (.dta
file), we can read it into either Stata or R.
Stata can only load one data set at a time, so it does not get assigned to an object.
use example_data.dta
This will load from Stata’s current working directory. You can change the working directory using cd "path"
(e.g. cd "C:\Users\me\documents"
).
R can load an arbitrary number of data sets at once, so they must each be assigned a name. I recommend read_dta()
in the haven
package–part of the tidyverse
–for loading Stata files because it preserves Stata labels.
example_data <- haven::read_dta("example_data.dta")
R will load this file from your current working directory. If using a .Rmd file, this will default to the directory the .Rmd file is in. You can change the working directory with setwd("path")
(e.g. setwd("C:\Users\me\documents")
).
For the first homework, we created a sample of simulated data with a specified covariance matrix and means.
matrix mean_vec=(1.0, 2.0, 3.0)
matrix cov_mat=(1.0, .75, 1.0 \ ///
.75, 1.5, 0.0 \ ///
1.0, 0.0, 2.0)
corr2data x y z, n(300) mean(mean_vec) ///
cov(cov_mat) cstorage(full) seed(341305)
In Stata we create a vector of means (named mean_vec
here) and a matrix of covariances (named cov_mat
). corr2data
then generates data with variables named x
, y
, and z
whose means (mean(mean_vec)
) and covariances (cov(cov_mat)
) match what we provided. We specify the sample size with n()
.
mean_vec <- c("x" = 1.0, "y" = 2.0, "z" = 3.0)
cov_mat <- rbind(c(1.0, .75, 1.0),
c(.75, 1.5, 0.0),
c(1.0, 0.0, 2.0))
example_data <- data.frame(MASS::mvrnorm(300,
mu = mean_vec,
Sigma = cov_mat,
empirical = TRUE))
In R, the operation is similar to Stata. We provide a vector of means (mean_vec
) and a covariance matrix (cov_mat
). Note we create vectors with c()
and combined them into a matrix by rows with rbind()
(row bind). mvrnorm()
in the MASS
has number of observations (300
) as its first argument. ::
allows us to use a function from a package without loading the whole package.
R and Stata are very different with regard to storing, accessing, and modifying data and variables. Stata only loads one data set at a time. R stores all object including data as objects and can have an arbitrary number loaded at once. Below I create an interaction term between x
and z
and a discrete version of x
.
In Stata, we just use generate
and the variable is created in our current data.
generate x_z = x * z
generate x_disc = 0
replace x_disc = 1 if x < .5
replace x_disc = 2 if x > .5 & x < 1.5
replace x_disc = 3 if x > 1.5
In R, we assign variables inside the data we want to use. We can do this with base R:
example_data$x_z <- example_data$x * example_data$z
example_data$x_disc <-
ifelse(example_data$x < .5, 1,
ifelse(example_data$x > .5 & example_data$x < 1.5, 2,
ifelse(example_data$x > 1.5, 3, 0)))
Note here I chained three ifelse()
statements together in one call. This is complex, but powerful.
Or we can use the dplyr
package:
example_data <- example_data %>% mutate(x_z = x * z)
example_data <- example_data %>%
mutate(x_disc =
case_when(
x < .5 ~ 1,
x > .5 & x < 1.5 ~ 2,
x > 1.5 ~ 3
))
case_when()
does seuqential ifelse()
statements but is much more readable.
Dummy variables can be prepared for models in two ways: Creating individual dummy variables or specifying them in the model formula. I prefer to store categorical data as factors or strings (character data). This way you don’t confuse them with numeric variables.
We can specify dummies directly in the formula in either Stat or R.
In Stata, we just use append our categorical variable with i.
:
glm y i.x_disc
In R, variables stored as factors automatically turn into dummies:
glm(y ~ factor(x_disc), data=example_data)
Note if you assigned it as a factor when you created or loaded the data, you don’t need to specify it as a factor in the formula.
We can also manually create dummies in the actual data (Jerry’s preference)
We can either generate each dummy individually or use a tabulate
shortcut.
generate x_d1 = 0
replace x_d1 = 1 if x_disc==1
generate x_d2 = 0
replace x_d2 = 2 if x_disc==2
generate x_d3 = 0
replace x_d3 = 3 if x_disc==3
tabulate x_disc, generate(x_d)
Base R:
example_data$x_d1 <- ifelse(example_data$x_disc==1, 1, 0)
example_data$x_d2 <- ifelse(example_data$x_disc==2, 1, 0)
example_data$x_d3 <- ifelse(example_data$x_disc==3, 1, 0)
dplyr
:
example_data <- example_data %>%
mutate(x_d1 = ifelse(x_disc==1, 1, 0),
x_d2 = ifelse(x_disc==2, 2, 0),
x_d3 = ifelse(x_disc==3, 3, 0))
This will generate a lag of x
and also a difference.
gen lag_x = x[_n-1]
gen diff_x = x - lag_x
Here is code to get a lag and differenced x
for panel data in Stata. Note you must set a grouping variable and a variable indicating the time of the observation.
xtset group_variable time_variable
gen lag_x = L.x
gen diff_x = x - lag_x
Creating a lagged variable or a differenced variable is easy. Note if you are working with panel data or other repeated observations, you must group your data before doing the lag.
One after the other in base R:
example_data$lag_x <- lag(x)
example_data$diff_x <-
example_data$x - example_data$lag_x
Both at once in dplyr
.
example_data <- example_data %>%
mutate(lag_x = lag(x),
diff_x = x - lag(x))
Grouped (for panel data) in dplyr
. This assumes you have data sort()
ed by ascending time already.
example_data <- example_data %>%
group_by(group_variable) %>%
mutate(lag_x = lag(x),
diff_x = x - lag(x))
tab x_disc
tab x_d1 x_d2
Base R:
table(example_data$x_disc)
table(example_data$x_d1, example_data$x_d2)
dplyr
:
example_data %>% count(x_disc)
example_data %>% count(x_d1, x_d2)
You can generate predictions as a new variable in your data:
predict xb pr_vals
Or get predictions at set values with prvalue
. You will need to install prvalue
from a package such as spost9_ado
using net install spost9_ado.pkg
prvalue , x(y=1 z=2)
prvalue , x(y=10 z=-1)
R uses the predict()
function to generate predicted values.
predict(example_model2)
If you want to predict an outcome at specific values of a variable, you need to provide new data. The new data must have a value for every variable on the right hand side of the model. It accepts multiple rows to produce multiple predictions as well.
predict(example_model2,
newdata=data.frame(y=1, z=2),
type="response")
predict(example_model2,
newdata=data.frame(y=rep(1,10),
z=seq(-1,1, length.out=10)),
type="response")
To be added.
There are many ways to remove cases in R. You can remove one or more observations by index:
data_subset <- example_data[-10,]
data_subset <- example_data[-c(10,25),]
The first removes the 10th case, the second the 10th and 25th cases. You can also remove cases by criteria:
data_subset <- example_data[x >= 0, ]
This would drop any cases for which x
is less than zero. You can do this in dplyr
as well:
data_subset <- example_data %>%
filter(x >= 0)
You can use any standard logical operators to subset data this way. See these lecture slides for more detail.