Poisson regression is a generalized linear model for count data with an equal mean and variance. Coefficients are reported in log-counts. These examples assume y_count is a count outcome.
There are two options for fitting a Poisson model in Stata: glm and poisson.
glm y_count x z, family(poisson) link(log)
poisson y_count x z
Results are equivalent but poisson offers more postestimation commands.
glm() with family = poisson(link = "log") fits a Poisson regression.
example_pois <- glm(y_count ~ x + z,
family = poisson(link = "log"),
data = example_data)
summary(example_pois)
summary() produces summary output with standard errors.
Negative binomial regression is a generalized linear model for count data with a variance which is a multiplicative factor of the mean (it is overdispersed). Coefficients are reported in log-counts. These examples assume y_count is a count outcome.
We fit a negative binomial regression in Stata using nbreg.
nbreg y_count x z
The most commonly used negative binomial function in R is in the MASS package. I recommend you do not load the MASS library, but rather just use MASS::glm.nb to call it right out of MASS. This is because MASS has a function called select() in it which will mask over the dplyr one.
example_nb <- MASS::glm.nb(y_count ~ x + z,
data = example_data)
summary(example_nb)
summary() produces summary output with standard errors.
Zero-inflated Poisson regression is a generalized linear model for count data with an equal mean and variance but a greater number of zeroes than normal. These zeroes may arise from a different process than the counts: some variables may predict absence of counts while others predict levels if a count is possible. Coefficients are reported in log-counts. These examples assume y_count is a count outcome.
zip is used to fit a Zero-Inflated Poisson in Stata. The variables in inflate() are used to predict zero counts.
zip y_count x z, inflate(z)
The most commonly used Zero-Inflated Poisson model for R is zeroinfl found in the pscl package. Variables after the bar | are used to predict zero counts. Note this command only works if there are some zero counts.
example_zip <-
pscl::zeroinfl(y_count ~ x + z | z,
data = example_data)
summary(example_zip)
summary() produces summary output with standard errors.
Zero-inflated Negative Binomial regression is a generalized linear model for overdispersed count data with a greater number of zeroes than normal. These zeroes may arise from a different process than the counts: some variables may predict absence of counts while others predict levels if a count is possible. Coefficients are reported in log-counts. These examples assume y_count is a count outcome.
zinb is used to fit a Zero-Inflated Negative Binomial in Stata. The variables in inflate() are used to predict zero counts.
zip y_count x z, inflate(z)
You can fit Zero-Inflated Negative Binomial models in R using the same zeroinfl function for ZI Poisson models found in the pscl package. Just add an argument dist = "negbin". Variables after the bar | are used to predict zero counts. Note this command only works if there are some zero counts.
example_zinb <-
pscl::zeroinfl(y_count ~ x + z | z,
data = example_data,
dist = "negbin")
summary(example_zinb)
summary() produces summary output with standard errors.
We might want the estimated probability of observing a given count for particular values of covariates. These example use the poisson output but should work equivalently for the negative binomial (at least in R).
prvalue is a postestimation command, so it must be run immediately after a model. It requires using poisson instead of glm. prvalue will return the predicted rate (mean) count as well as probability of observing particular counts.
poisson y_count x z
prvalue, x(x=2 z=0) maxcnt(8)
x() is used to set values of covariates and maxcnt() sets the maximum count for which a probability is returned.
We could also plot the observed distribution of counts versus the predicted probabilities.
poisson y_count x z
prcounts counts, max(8) plot
graph twoway (connected countobeq countval) || (connected countpreq countval)
This produces a dashed predicted distribution and a solid true distribution from the data.
In R, we can do this with predict() and the Poisson probability density function dpois().
pred_pois <- predict(example_pois, type="response",
newdata=data.frame(x=2, z=2))
count_range <- 0:10
prob_pois <- dpois(count_range, pred_pois)
barplot(prob_pois, names = count_range)
If we want to compare the predicted distribution to the actual distribution, we could use histograms.
library(ggplot2)
ggplot() +
geom_histogram(
aes(x=example_data$y_count, stat(density)),
bins=8, alpha=0.5, fill="blue") +
geom_histogram(
aes(x=fitted(example_pois), stat(density)),
bins=8, alpha=0.5, fill="red")
Or we could use densities, though note that densities around 0 will be lower than the true probability.
plot(density(fitted(example_pois)),
lty="dashed", col="red")
lines(density(example_data$y_count), col = "blue" )
We might be interested in plotting predicted values across ranges of covariates.
We can use prgen to get predicted values given some fixed covariates and multiple values of another covariate.
poisson y_count x z
prgen z, x(x=0) from(-2) to (4) generate(zc) n(20)
graph twoway (connected zcp0 zcx) || (connected zcp1 zcx) || (connected zcp2 zcx)
This produces predicted probabilities of observing a given count (zcp0 is count=0, zcp2 is count=2) across values of z.
Mimicking Stata’s prgen-style plot in Base R is a bit bulky. See in-line comments. First, I’ve generated example data here to help:
mean_vec <- c("x" = 1.0, "y" = 2.0, "z" = 3.0)
cov_mat <- rbind(c(1.0, .75, 1.0),
c(.75, 1.5, 0.0),
c(1.0, 0.0, 2.0))
example_data <- data.frame(
MASS::mvrnorm(300,
mu = mean_vec,
Sigma = cov_mat,
empirical = TRUE))
example_data$y_count <-
with(example_data, rpois(300, 10+x+z))
Setting the values we want to work with and predicting counts:
n_vals <- 20 # number of x-values to plot over
pr_counts <- c(9, 12, 15, 18) # counts to get probs for
# z_range sets the range of a covariate to plot over
# In seq() the first value is a minimum, second is max
z_range <- round(seq(0, 6, length.out=n_vals),2)
example_pois <- glm(y_count ~ x + z,
family = poisson(link = "log"),
data = example_data)
# generate predicted means
lambdas <-
predict(example_pois, type="response",
newdata = data.frame(x = 0, z = z_range))
Then plotting…
First, the ggplot2 and dplyr way to do this:
library(dplyr)
library(ggplot2)
data.frame(Count = rep(pr_counts, each=n_vals),
Z = rep(z_range, length(pr_counts)),
lambda = rep(lambdas, length(pr_counts))) %>%
mutate(Probability = dpois(Count, lambda),
Count = factor(Count)) %>%
ggplot(aes(x = Z, y = Probability, color = Count)) +
geom_line()
Then the hideous base R way to do this:
# generate a matrix of predicted probabilities
# rownames are value of z, colnames # of counts
pr_values <-
matrix(dpois(rep(pr_counts, each=n_vals),
rep(lambdas, length(pr_counts))),
ncol=length(pr_counts), byrow=FALSE,
dimnames = list(z_range, pr_counts))
# plot the matrix
# plotted number is the count
# y-axis is probability of that count
# x-axis is value of z
matplot(z_range, pr_values,
pch = pr_counts)
legend(x = min(z_range), y = max(pr_values),
legend = pr_counts, pch = pr_counts)
Another potentially useful plot is just the predicted counts across levels of covariates:
library(ggeffects)
plot(ggpredict(example_pois, terms="x"))