These documents present both Stata and R approaches to the data management, modeling, and plotting covered in SOC505. The two are shown side-by-side to make it easier to move between platforms and to demonstrate that Stata and R syntax is quite similar in many cases. R often takes more code than Stata for common tasks, but it can do many things that are not possible in Stata. Both are excellent platforms for quantitative research, though Stata is more common in economics and R is more common in statistics and data science. Note that these pages are works in progress: if you see mistakes or major omissions, make a comment or pull request on the GitHub repository.

- Continuous Models (Linear Model)
- Binary Models (Logit)
- Multinomial Models (Multinomial Logit)
- Ordinal Models (Ordinal Logit)
- Count Models (Poisson, Negative Binomial)
- Model Tests and Fit
- Plotting

Provided a Stata data set (`.dta` file), we can read it into either Stata or R.

Stata can only load one data set at a time, so it does not get assigned to an object.

`use example_data.dta`

This will load from Stata’s current working directory. You can change the working directory using `cd "path"` (e.g. `cd "C:\Users\me\documents"`).

R can load an arbitrary number of data sets at once, so they must each be assigned a name. I recommend `read_dta()` in the `haven` package (part of the `tidyverse`) for loading Stata files because it preserves Stata labels.

`example_data <- haven::read_dta("example_data.dta")`

R will load this file from your current working directory. If using a .Rmd file, this will default to the directory the .Rmd file is in. You can change the working directory with `setwd("path")` (e.g. `setwd("C:/Users/me/documents")`). R treats `\` as an escape character, so write Windows paths with forward slashes or doubled backslashes.

For the first homework, we created a sample of simulated data with a specified covariance matrix and means.

```
matrix mean_vec=(1.0, 2.0, 3.0)
matrix cov_mat=(1.0, .75, 1.0 \ ///
.75, 1.5, 0.0 \ ///
1.0, 0.0, 2.0)
corr2data x y z, n(300) mean(mean_vec) ///
cov(cov_mat) cstorage(full) seed(341305)
```

In Stata we create a vector of means (named `mean_vec` here) and a matrix of covariances (named `cov_mat`). `corr2data` then generates data with variables named `x`, `y`, and `z` whose means (`mean(mean_vec)`) and covariances (`cov(cov_mat)`) match what we provided. We specify the sample size with `n()`.

```
mean_vec <- c("x" = 1.0, "y" = 2.0, "z" = 3.0)
cov_mat <- rbind(c(1.0, .75, 1.0),
c(.75, 1.5, 0.0),
c(1.0, 0.0, 2.0))
example_data <- data.frame(MASS::mvrnorm(300,
mu = mean_vec,
Sigma = cov_mat,
empirical = TRUE))
```

In R, the operation is similar to Stata. We provide a vector of means (`mean_vec`) and a covariance matrix (`cov_mat`). Note we create vectors with `c()` and combine them into a matrix by rows with `rbind()` (row bind). `mvrnorm()` in the `MASS` package takes the number of observations (`300`) as its first argument. `::` allows us to use a function from a package without loading the whole package.
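Because `empirical = TRUE` is set, the sample means and covariances should match the inputs up to rounding error rather than only in expectation. The following is a minimal check of that, repeating the definitions above so it runs on its own; the names `means_match` and `cov_match` are just for illustration.

```r
# Recreate the simulated data (MASS ships with standard R installations)
mean_vec <- c("x" = 1.0, "y" = 2.0, "z" = 3.0)
cov_mat <- rbind(c(1.0, .75, 1.0),
                 c(.75, 1.5, 0.0),
                 c(1.0, 0.0, 2.0))
example_data <- data.frame(MASS::mvrnorm(300,
                                         mu = mean_vec,
                                         Sigma = cov_mat,
                                         empirical = TRUE))

# Compare sample moments to the inputs, allowing for floating-point error
means_match <- isTRUE(all.equal(colMeans(example_data), mean_vec))
cov_match   <- isTRUE(all.equal(unname(cov(example_data)), cov_mat))
```

With `empirical = FALSE` (the default), the same comparison would generally fail, because the sample moments would only approximate the inputs.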

R and Stata are very different with regard to storing, accessing, and modifying data and variables. Stata only loads one data set at a time. R stores everything, including data, as objects and can have an arbitrary number loaded at once. Below I create an interaction term between `x` and `z` and a discrete version of `x`.

In Stata, we just use `generate` and the variable is created in our current data.

```
generate x_z = x * z
generate x_disc = 0
replace x_disc = 1 if x < .5
replace x_disc = 2 if x > .5 & x < 1.5
replace x_disc = 3 if x > 1.5
```

In R, we assign variables inside the data we want to use. We can do this with base R:

```
example_data$x_z <- example_data$x * example_data$z
example_data$x_disc <-
ifelse(example_data$x < .5, 1,
ifelse(example_data$x > .5 & example_data$x < 1.5, 2,
ifelse(example_data$x > 1.5, 3, 0)))
```

Note here I chained three `ifelse()` statements together in one call. This is complex, but powerful.

Or we can use the `dplyr` package:

```
library(dplyr)  # provides %>%, mutate(), and case_when()

example_data <- example_data %>% mutate(x_z = x * z)
example_data <- example_data %>%
  mutate(x_disc =
           case_when(
             x < .5 ~ 1,
             x > .5 & x < 1.5 ~ 2,
             x > 1.5 ~ 3
           ))
```

`case_when()` does sequential `ifelse()` statements but is much more readable.
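As an alternative that is not part of the original materials, base R's `cut()` can bin a continuous variable in a single call. This sketch uses a small made-up vector. Note that, unlike the `ifelse()` and `case_when()` versions above, values falling exactly on a break point are assigned to a bin (the lower one, by default).

```r
x <- c(-0.2, 0.5, 0.9, 1.5, 2.3)   # made-up values for illustration

# Breaks at .5 and 1.5 give the intervals (-Inf, .5], (.5, 1.5], (1.5, Inf)
# labels = FALSE returns integer bin codes instead of a factor
x_disc <- cut(x, breaks = c(-Inf, 0.5, 1.5, Inf), labels = FALSE)
x_disc
```

Setting `right = FALSE` would instead close the intervals on the left, sending boundary values to the upper bin.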

Dummy variables can be prepared for models in two ways: Creating individual dummy variables or specifying them in the model formula. I prefer to store categorical data as factors or strings (character data). This way you don’t confuse them with numeric variables.

We can specify dummies directly in the formula in either Stata or R.

In Stata, we just prefix our categorical variable with `i.`:

`glm y i.x_disc `

In R, variables stored as factors automatically turn into dummies:

`glm(y ~ factor(x_disc), data=example_data)`

Note if you assigned it as a factor when you created or loaded the data, you don’t need to specify it as a factor in the formula.
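To see which dummies R builds from a factor, `model.matrix()` returns the design matrix a formula would produce. A minimal sketch with a made-up vector standing in for `x_disc`:

```r
x_disc <- c(1, 2, 3, 2, 1)   # made-up categories for illustration

# The first level (1) becomes the reference category,
# so only levels 2 and 3 get their own dummy column
dummies <- model.matrix(~ factor(x_disc))
colnames(dummies)
```

The omitted reference category is absorbed into the intercept, which is why a three-level factor yields only two dummy columns.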

We can also manually create dummies in the actual data (Jerry’s preference).

In Stata, we can either generate each dummy individually or use a `tabulate` shortcut.

```
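* Option 1: generate each dummy individually (each is coded 0/1)
generate x_d1 = 0
replace x_d1 = 1 if x_disc==1
generate x_d2 = 0
replace x_d2 = 1 if x_disc==2
generate x_d3 = 0
replace x_d3 = 1 if x_disc==3

* Option 2: use instead of Option 1; tabulate creates x_d1, x_d2, ...
* and stops with an error if those variables already exist
tabulate x_disc, generate(x_d)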
```

Base R:

```
example_data$x_d1 <- ifelse(example_data$x_disc==1, 1, 0)
example_data$x_d2 <- ifelse(example_data$x_disc==2, 1, 0)
example_data$x_d3 <- ifelse(example_data$x_disc==3, 1, 0)
```

`dplyr`:

```
example_data <- example_data %>%
  mutate(x_d1 = ifelse(x_disc==1, 1, 0),
         x_d2 = ifelse(x_disc==2, 1, 0),
         x_d3 = ifelse(x_disc==3, 1, 0))
```

This will generate a lag of `x` and also a difference.

```
gen lag_x = x[_n-1]
gen diff_x = x - lag_x
```

Here is code to get a lag and differenced `x` for *panel data* in Stata. Note you must set a grouping variable and a variable indicating the time of the observation.

```
xtset group_variable time_variable
gen lag_x = L.x
gen diff_x = x - lag_x
```

Creating a lagged variable or a differenced variable is easy. Note if you are working with *panel data* or other repeated observations, you must *group* your data before doing the lag.

One after the other in base R:

```
# Shift x down one row; stats::lag() does not shift plain vectors
example_data$lag_x <- c(NA, head(example_data$x, -1))
example_data$diff_x <-
  example_data$x - example_data$lag_x
```

Both at once in `dplyr`.

```
example_data <- example_data %>%
mutate(lag_x = lag(x),
diff_x = x - lag(x))
```

Grouped (for panel data) in `dplyr`. This assumes your data are already sorted by ascending time within each group, for example with `arrange()`.

```
example_data <- example_data %>%
group_by(group_variable) %>%
mutate(lag_x = lag(x),
diff_x = x - lag(x))
```
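For comparison, a grouped lag can also be sketched in base R with `ave()`, which applies a function within each group. The data frame `panel` and its values are made up for illustration, and the rows are assumed to be sorted by time within each group.

```r
# Made-up panel: two groups observed over consecutive time points
panel <- data.frame(group_variable = c("a", "a", "a", "b", "b"),
                    time_variable  = c(1, 2, 3, 1, 2),
                    x              = c(10, 12, 15, 7, 4))

# Shift x down one row within each group; each group's first row becomes NA
panel$lag_x <- ave(panel$x, panel$group_variable,
                   FUN = function(v) c(NA, head(v, -1)))
panel$diff_x <- panel$x - panel$lag_x
panel
```

Because the shift happens inside each group, the first observation of group `b` gets `NA` rather than the last value of group `a`.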

```
tab x_disc
tab x_d1 x_d2
```

Base R:

```
table(example_data$x_disc)
table(example_data$x_d1, example_data$x_d2)
```

`dplyr`:

```
example_data %>% count(x_disc)
example_data %>% count(x_d1, x_d2)
```

You can generate predictions as a new variable in your data:

`predict pr_vals, xb`

Or get predictions at set values with `prvalue`. You will need to install `prvalue` from a package such as `spost9_ado` using `net install spost9_ado.pkg`.

```
prvalue , x(y=1 z=2)
prvalue , x(y=10 z=-1)
```

R uses the `predict()` function to generate predicted values.

`predict(example_model2)`

If you want to predict an outcome at specific values of a variable, you need to provide new data. The new data must have a value for every variable on the right-hand side of the model. `predict()` also accepts multiple rows to produce multiple predictions.

```
predict(example_model2,
newdata=data.frame(y=1, z=2),
type="response")
predict(example_model2,
newdata=data.frame(y=rep(1,10),
z=seq(-1,1, length.out=10)),
type="response")
```
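The object `example_model2` is not defined on this page. As a self-contained sketch, the following fits a hypothetical model of `x` on `y` and `z` to made-up data (`toy` and `toy_model` are illustrative names) and predicts at one of the value sets used above.

```r
# Made-up data in which x = 1 + 2*y - z exactly, so predictions are easy to check
toy <- data.frame(y = c(0, 1, 2, 3, 4, 5),
                  z = c(1, 0, 2, 1, 3, 0))
toy$x <- 1 + 2 * toy$y - toy$z

toy_model <- glm(x ~ y + z, data = toy)   # Gaussian family by default

# One row of newdata gives one prediction; here 1 + 2*1 - 2 = 1
pred <- predict(toy_model,
                newdata = data.frame(y = 1, z = 2),
                type = "response")
pred
```

If `newdata` omitted `z`, `predict()` would stop with an error, which is why every right-hand-side variable needs a value.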

To be added.

There are many ways to remove cases in R. You can remove one or more observations by index:

```
data_subset <- example_data[-10,]
data_subset <- example_data[-c(10,25),]
```

The first removes the 10th case, the second the 10th and 25th cases. You can also remove cases by criteria:

`data_subset <- example_data[example_data$x >= 0, ]`

This would drop any cases for which `x` is less than zero. You can do this in `dplyr` as well:

```
data_subset <- example_data %>%
filter(x >= 0)
```

You can use any standard logical operators to subset data this way. See these lecture slides for more detail.
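A few combinations of logical operators, sketched in base R on a made-up data frame (`toy` is not part of the course data):

```r
toy <- data.frame(x = c(-1, 0, 2, 3),
                  z = c(5, 1, 4, 0))

both   <- toy[toy$x >= 0 & toy$z > 0, ]   # AND: both conditions hold
either <- toy[toy$x < 0 | toy$z == 0, ]   # OR: at least one condition holds
negate <- toy[!(toy$x %in% c(0, 3)), ]    # NOT: x is neither 0 nor 3
```

The same conditions can be passed to `filter()` in `dplyr` without the `toy$` prefix.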