Poisson regression is a generalized linear model for count data with an equal mean and variance. Coefficients are reported in log-counts. These examples assume y_count
is a count outcome.
There are two options for fitting a Poisson model in Stata: glm
and poisson
.
glm y_count x z, family(poisson) link(log)
poisson y_count x z
Results are equivalent but poisson
offers more postestimation commands.
glm()
with family = poisson(link = "log")
fits a Poisson regression.
example_pois <- glm(y_count ~ x + z,
family = poisson(link = "log"),
data = example_data)
summary(example_pois)
summary()
produces summary output with standard errors.
Negative binomial regression is a generalized linear model for count data with a variance which is a multiplicative factor of the mean (it is overdispersed). Coefficients are reported in log-counts. These examples assume y_count
is a count outcome.
We fit a negative binomial regression in Stata using nbreg
.
nbreg y_count x z
The most commonly used negative binomial function in R is in the MASS
package. I recommend you do not load the MASS library, but rather just use MASS::glm.nb
to call it right out of MASS
. This is because MASS
has a function called select()
in it which will mask over the dplyr
one.
example_nb <- MASS::glm.nb(y_count ~ x + z,
data = example_data)
summary(example_nb)
summary()
produces summary output with standard errors.
Zero-inflated Poisson regression is a generalized linear model for count data with an equal mean and variance but a greater number of zeroes than normal. These zeroes may arise from a different process than the counts: some variables may predict absence of counts while others predict levels if a count is possible. Coefficients are reported in log-counts. These examples assume y_count
is a count outcome.
zip
is used to fit a Zero-Inflated Poisson in Stata. The variables in inflate()
are used to predict zero counts.
zip y_count x z, inflate(z)
The most commonly used Zero-Inflated Poisson model for R is zeroinfl
found in the pscl
package. Variables after the bar |
are used to predict zero counts. Note this command only works if there are some zero counts.
example_zip <-
pscl::zeroinfl(y_count ~ x + z | z,
data = example_data)
summary(example_zip)
summary()
produces summary output with standard errors.
Zero-inflated Negative Binomial regression is a generalized linear model for overdispersed count data with a greater number of zeroes than normal. These zeroes may arise from a different process than the counts: some variables may predict absence of counts while others predict levels if a count is possible. Coefficients are reported in log-counts. These examples assume y_count
is a count outcome.
zinb
is used to fit a Zero-Inflated Negative Binomial in Stata. The variables in inflate()
are used to predict zero counts.
zip y_count x z, inflate(z)
You can fit Zero-Inflated Negative Binomial models in R using the same zeroinfl
function for ZI Poisson models found in the pscl
package. Just add an argument dist = "negbin"
. Variables after the bar |
are used to predict zero counts. Note this command only works if there are some zero counts.
example_zinb <-
pscl::zeroinfl(y_count ~ x + z | z,
data = example_data,
dist = "negbin")
summary(example_zinb)
summary()
produces summary output with standard errors.
We might want the estimated probability of observing a given count for particular values of covariates. These example use the poisson output but should work equivalently for the negative binomial (at least in R).
prvalue
is a postestimation command, so it must be run immediately after a model. It requires using poisson
instead of glm
. prvalue
will return the predicted rate (mean) count as well as probability of observing particular counts.
poisson y_count x z
prvalue, x(x=2 z=0) maxcnt(8)
x()
is used to set values of covariates and maxcnt()
sets the maximum count for which a probability is returned.
We could also plot the observed distribution of counts versus the predicted probabilities.
poisson y_count x z
prcounts counts, max(8) plot
graph twoway (connected countobeq countval) || (connected countpreq countval)
This produces a dashed predicted distribution and a solid true distribution from the data.
In R, we can do this with predict()
and the Poisson probability density function dpois()
.
pred_pois <- predict(example_pois, type="response",
newdata=data.frame(x=2, z=2))
count_range <- 0:10
prob_pois <- dpois(count_range, pred_pois)
barplot(prob_pois, names = count_range)
If we want to compare the predicted distribution to the actual distribution, we could use histograms.
library(ggplot2)
ggplot() +
geom_histogram(
aes(x=example_data$y_count, stat(density)),
bins=8, alpha=0.5, fill="blue") +
geom_histogram(
aes(x=fitted(example_pois), stat(density)),
bins=8, alpha=0.5, fill="red")
Or we could use densities, though note that densities around 0 will be lower than the true probability.
plot(density(fitted(example_pois)),
lty="dashed", col="red")
lines(density(example_data$y_count), col = "blue" )
We might be interested in plotting predicted values across ranges of covariates.
We can use prgen
to get predicted values given some fixed covariates and multiple values of another covariate.
poisson y_count x z
prgen z, x(x=0) from(-2) to (4) generate(zc) n(20)
graph twoway (connected zcp0 zcx) || (connected zcp1 zcx) || (connected zcp2 zcx)
This produces predicted probabilities of observing a given count (zcp0
is count=0, zcp2
is count=2) across values of z
.
Mimicking Stata’s prgen
-style plot in Base R is a bit bulky. See in-line comments. First, I’ve generated example data here to help:
mean_vec <- c("x" = 1.0, "y" = 2.0, "z" = 3.0)
cov_mat <- rbind(c(1.0, .75, 1.0),
c(.75, 1.5, 0.0),
c(1.0, 0.0, 2.0))
example_data <- data.frame(
MASS::mvrnorm(300,
mu = mean_vec,
Sigma = cov_mat,
empirical = TRUE))
example_data$y_count <-
with(example_data, rpois(300, 10+x+z))
Setting the values we want to work with and predicting counts:
n_vals <- 20 # number of x-values to plot over
pr_counts <- c(9, 12, 15, 18) # counts to get probs for
# z_range sets the range of a covariate to plot over
# In seq() the first value is a minimum, second is max
z_range <- round(seq(0, 6, length.out=n_vals),2)
example_pois <- glm(y_count ~ x + z,
family = poisson(link = "log"),
data = example_data)
# generate predicted means
lambdas <-
predict(example_pois, type="response",
newdata = data.frame(x = 0, z = z_range))
Then plotting…
First, the ggplot2
and dplyr
way to do this:
library(dplyr)
library(ggplot2)
data.frame(Count = rep(pr_counts, each=n_vals),
Z = rep(z_range, length(pr_counts)),
lambda = rep(lambdas, length(pr_counts))) %>%
mutate(Probability = dpois(Count, lambda),
Count = factor(Count)) %>%
ggplot(aes(x = Z, y = Probability, color = Count)) +
geom_line()
Then the hideous base R way to do this:
# generate a matrix of predicted probabilities
# rownames are value of z, colnames # of counts
pr_values <-
matrix(dpois(rep(pr_counts, each=n_vals),
rep(lambdas, length(pr_counts))),
ncol=length(pr_counts), byrow=FALSE,
dimnames = list(z_range, pr_counts))
# plot the matrix
# plotted number is the count
# y-axis is probability of that count
# x-axis is value of z
matplot(z_range, pr_values,
pch = pr_counts)
legend(x = min(z_range), y = max(pr_values),
legend = pr_counts, pch = pr_counts)
Another potentially useful plot is just the predicted counts across levels of covariates:
library(ggeffects)
plot(ggpredict(example_pois, terms="x"))