Count Models

Poisson Models

Poisson regression is a generalized linear model for count data with an equal mean and variance. Coefficients are reported in log-counts. These examples assume y_count is a count outcome.

Stata

There are two options for fitting a Poisson model in Stata: glm and poisson.

glm y_count x z, family(poisson) link(log)
poisson y_count x z

Results are equivalent but poisson offers more postestimation commands.

R

glm() with family = poisson(link = "log") fits a Poisson regression.

example_pois <- glm(y_count ~ x + z, 
                    family = poisson(link = "log"), 
                    data = example_data)
summary(example_pois)

summary() produces summary output with standard errors.

Negative Binomial Models

Negative binomial regression is a generalized linear model for count data with a variance which is a multiplicative factor of the mean (it is overdispersed). Coefficients are reported in log-counts. These examples assume y_count is a count outcome.

Stata

We fit a negative binomial regression in Stata using nbreg.

nbreg y_count x z

R

The most commonly used negative binomial function in R is in the MASS package. I recommend you do not load the MASS library, but rather just use MASS::glm.nb to call it right out of MASS. This is because MASS has a function called select() in it which will mask over the dplyr one.

example_nb <- MASS::glm.nb(y_count ~ x + z, 
                    data = example_data)
summary(example_nb)

summary() produces summary output with standard errors.

Zero-Inflated Count Models

Zero-Inflated Poisson

Zero-inflated Poisson regression is a generalized linear model for count data with an equal mean and variance but a greater number of zeroes than normal. These zeroes may arise from a different process than the counts: some variables may predict absence of counts while others predict levels if a count is possible. Coefficients are reported in log-counts. These examples assume y_count is a count outcome.

Stata

zip is used to fit a Zero-Inflated Poisson in Stata. The variables in inflate() are used to predict zero counts.

zip y_count x z, inflate(z)

R

The most commonly used Zero-Inflated Poisson model for R is zeroinfl found in the pscl package. Variables after the bar | are used to predict zero counts. Note this command only works if there are some zero counts.

example_zip <- 
  pscl::zeroinfl(y_count ~ x + z | z, 
                 data = example_data)
summary(example_zip)

summary() produces summary output with standard errors.

Zero-Inflated Negative Binomial

Zero-inflated Negative Binomial regression is a generalized linear model for overdispersed count data with a greater number of zeroes than normal. These zeroes may arise from a different process than the counts: some variables may predict absence of counts while others predict levels if a count is possible. Coefficients are reported in log-counts. These examples assume y_count is a count outcome.

Stata

zinb is used to fit a Zero-Inflated Negative Binomial in Stata. The variables in inflate() are used to predict zero counts.

zip y_count x z, inflate(z)

R

You can fit Zero-Inflated Negative Binomial models in R using the same zeroinfl function for ZI Poisson models found in the pscl package. Just add an argument dist = "negbin". Variables after the bar | are used to predict zero counts. Note this command only works if there are some zero counts.

example_zinb <- 
  pscl::zeroinfl(y_count ~ x + z | z, 
                 data = example_data, 
                 dist = "negbin")
summary(example_zinb)

summary() produces summary output with standard errors.

Predicted Values for Poisson

We might want the estimated probability of observing a given count for particular values of covariates. These example use the poisson output but should work equivalently for the negative binomial (at least in R).

Stata

prvalue is a postestimation command, so it must be run immediately after a model. It requires using poisson instead of glm. prvalue will return the predicted rate (mean) count as well as probability of observing particular counts.

poisson y_count x z
prvalue, x(x=2 z=0) maxcnt(8)

x() is used to set values of covariates and maxcnt() sets the maximum count for which a probability is returned.

We could also plot the observed distribution of counts versus the predicted probabilities.

poisson y_count x z
prcounts counts, max(8) plot
graph twoway (connected countobeq countval) || (connected countpreq countval)

This produces a dashed predicted distribution and a solid true distribution from the data.

R

In R, we can do this with predict() and the Poisson probability density function dpois().

pred_pois <- predict(example_pois, type="response",
                    newdata=data.frame(x=2, z=2))
count_range <- 0:10
prob_pois   <- dpois(count_range, pred_pois)
barplot(prob_pois, names = count_range)

If we want to compare the predicted distribution to the actual distribution, we could use histograms.

library(ggplot2)
ggplot() + 
  geom_histogram(
    aes(x=example_data$y_count, stat(density)),
        bins=8, alpha=0.5, fill="blue") +
  geom_histogram(
    aes(x=fitted(example_pois), stat(density)),
        bins=8, alpha=0.5, fill="red")

Or we could use densities, though note that densities around 0 will be lower than the true probability.

plot(density(fitted(example_pois)), 
     lty="dashed", col="red")
lines(density(example_data$y_count), col = "blue" )

Predicted Ranges for Poisson

We might be interested in plotting predicted values across ranges of covariates.

Stata

We can use prgen to get predicted values given some fixed covariates and multiple values of another covariate.

poisson y_count x z
prgen z, x(x=0) from(-2) to (4) generate(zc) n(20)
graph twoway (connected zcp0 zcx) || (connected zcp1 zcx) || (connected zcp2 zcx)

This produces predicted probabilities of observing a given count (zcp0 is count=0, zcp2 is count=2) across values of z.

R

Mimicking Stata’s prgen-style plot in Base R is a bit bulky. See in-line comments. First, I’ve generated example data here to help:

mean_vec <- c("x" = 1.0, "y" = 2.0, "z" = 3.0)
cov_mat <- rbind(c(1.0, .75, 1.0),
                 c(.75, 1.5, 0.0),
                 c(1.0, 0.0, 2.0))
example_data <- data.frame(
                  MASS::mvrnorm(300,
                                mu = mean_vec, 
                                Sigma = cov_mat,
                                empirical = TRUE))
example_data$y_count <- 
            with(example_data, rpois(300, 10+x+z))

Setting the values we want to work with and predicting counts:

n_vals    <- 20  # number of x-values to plot over
pr_counts <- c(9, 12, 15, 18) # counts to get probs for
# z_range sets the range of a covariate to plot over
# In seq() the first value is a minimum, second is max
z_range   <- round(seq(0, 6, length.out=n_vals),2)
example_pois <- glm(y_count ~ x + z, 
                    family = poisson(link = "log"), 
                    data = example_data)

# generate predicted means
lambdas <- 
  predict(example_pois, type="response",
          newdata = data.frame(x = 0, z = z_range))

Then plotting…

First, the ggplot2 and dplyr way to do this:

library(dplyr)
library(ggplot2)
data.frame(Count  = rep(pr_counts, each=n_vals),
           Z      = rep(z_range, length(pr_counts)),
           lambda = rep(lambdas, length(pr_counts))) %>%
  mutate(Probability = dpois(Count, lambda),
         Count       = factor(Count)) %>%
  ggplot(aes(x = Z, y = Probability, color = Count)) +
  geom_line()

Then the hideous base R way to do this:

# generate a matrix of predicted probabilities
# rownames are value of z, colnames # of counts
pr_values <- 
  matrix(dpois(rep(pr_counts, each=n_vals),
               rep(lambdas, length(pr_counts))),
         ncol=length(pr_counts), byrow=FALSE,
         dimnames = list(z_range, pr_counts))
# plot the matrix
# plotted number is the count
# y-axis is probability of that count
# x-axis is value of z
matplot(z_range, pr_values,
        pch = pr_counts)
legend(x = min(z_range), y = max(pr_values),
       legend = pr_counts, pch = pr_counts)

Another potentially useful plot is just the predicted counts across levels of covariates:

library(ggeffects)
plot(ggpredict(example_pois, terms="x"))