Mixing Poisson models

Poisson regressions are often very good models of social processes, but they suffer from the restrictions imposed by the Poisson distribution. These notes look at two common methods for broadening the applicability of a Poisson model: over-dispersion and zero-inflation.

But we'll start with a regular Poisson regression as a baseline for comparison.

We can immediately tell from these results that boys play more games than girls, white students play fewer games than nonwhite students, and older students play fewer games than younger students. But the estimates are on the log scale, so to get an idea of the magnitude we need to exponentiate them.

These results predict that white, grade-10 girls will play about 1.6 hours of games per week on average, and that boys will play about 3.06 times as much as girls regardless of age and race. White students are predicted to play about 73.3% as much as nonwhite students. Finally, for each one-year increase in grade, students are predicted to play about 13.1% fewer hours.
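The step from log-scale coefficients to these multiplicative statements can be sketched as follows. The coefficient values are back-calculated from the ratios quoted above purely for illustration; they are not the fitted estimates, and the function names are hypothetical.

```python
import math

def rate_ratio(coef):
    """Multiplicative change in the expected count per one-unit increase."""
    return math.exp(coef)

def percent_change(coef):
    """Percent change in the expected count per one-unit increase."""
    return 100.0 * (math.exp(coef) - 1.0)

# Illustrative log-scale coefficients, back-calculated from the ratios in the
# text (assumed values, not the model output).
b_male = math.log(3.06)
b_white = math.log(0.733)
b_grade = math.log(1.0 - 0.131)

print(round(rate_ratio(b_male), 2))        # boys relative to girls
print(round(rate_ratio(b_white), 3))       # white relative to nonwhite students
print(round(percent_change(b_grade), 1))   # change per additional grade
```

Because the model is linear on the log scale, effects combine by multiplication: the predicted rate for a boy is the rate for an otherwise identical girl multiplied by `rate_ratio(b_male)`.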

To get an idea of the adequacy of the model, let's compare a posterior prediction of the outcome variable to the actual data.

Our model is not doing a great job. It is drastically under-predicting the number of people who play zero hours and more than 15 hours per week, and over-predicting the number of people who play 3-10 hours per week.

Over-dispersed Poisson

Clearly our model is putting too much of its posterior prediction close to the center of the distribution. This is very common with a Poisson regression, because the Poisson distribution forces the variance to equal the mean, which is often too "narrow" for real data. In models of social processes, unobserved processes that influence our outcome will lead to more variation in that outcome. One way to try to address this is to allow more random variation in $\lambda_i$, so that students who look identical on paper can still have distinct rates of game playing.
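To see what extra variation in $\lambda_i$ does, here is a minimal sketch that compares a Poisson distribution with a gamma-Poisson mixture of the same mean. It assumes one common parameterization, in which each rate is drawn from a gamma distribution with mean `mu` and scale `theta`, so that the variance of the count is `mu * (1 + theta)`. The model fitted in these notes may define its dispersion parameter differently, and the values used are illustrative only.

```python
import math

def poisson_pmf(y, lam):
    """Probability of count y under a Poisson with rate lam."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def gamma_poisson_pmf(y, mu, theta):
    """Poisson whose rate is gamma-distributed with mean mu and scale theta.

    Assumed parameterization: mean = mu, variance = mu * (1 + theta).
    """
    shape = mu / theta
    p = 1.0 / (1.0 + theta)
    log_pmf = (math.lgamma(y + shape) - math.lgamma(shape) - math.lgamma(y + 1)
               + shape * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)

def mean_and_variance(pmf, upper=500):
    """Mean and variance of a count distribution, summing over 0..upper-1."""
    mean = sum(y * pmf(y) for y in range(upper))
    var = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, var

mu, theta = 4.0, 2.0  # illustrative values, not estimates from the data

print(mean_and_variance(lambda y: poisson_pmf(y, mu)))
print(mean_and_variance(lambda y: gamma_poisson_pmf(y, mu, theta)))

# Probability of a zero count under each distribution
print(round(poisson_pmf(0, mu), 4))               # 0.0183
print(round(gamma_poisson_pmf(0, mu, theta), 4))  # 0.1111
```

With the mean held fixed, the mixture puts more probability on zero and on large counts, which is exactly where the standard Poisson model was under-predicting.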

Comparing the coefficient estimates from this over-dispersed model to the standard Poisson model, it is clear that the original model was over-estimating the magnitude of association for gender and race. Now boys are only predicted to play about 2.34 times as much as girls, and white students are expected to play about 84% as much as nonwhite students.

Moreover, the estimate of theta is 8.19. If theta were zero, that would tell us that there is no difference at all between the over-dispersed and standard Poisson distributions. There are no hard rules about values of theta, but anything over 1 or 2 is good justification for using an over-dispersed model.

More importantly, we can use a posterior predictive plot to see if we are doing a better job fitting our population data.

The over-dispersed model clearly does a much better job fitting our data.

At this point we should be pretty happy with the over-dispersed Poisson model, but for the sake of demonstration, let's try a different modification of the standard Poisson.

Zero-inflated Poisson

The zero-inflated Poisson model postulates a two-stage process for generating data. Very often with count data, there are more zero counts than a Poisson distribution can account for. In many social processes, certain members of the population have a count of zero simply because they never had the opportunity for anything else. In the case of our current model, we might think that only certain students own a gaming console at home or live close to an arcade (remember, this data is from 1990), and the students who do not will always play zero hours per week. Of course, some of the students who do have a console or live near an arcade will still not have played any games last week. In a zero-inflated Poisson regression, the zero outcomes come from two sources: lack of opportunity and the zeros included in a Poisson distribution.

The key point here is that we don't need to observe the opportunity to play games (if we could observe it, we would simply model only the students who had such an opportunity). Instead, we just model the probability of opportunity as an unobserved characteristic of each student.

This means that there are two models embedded in one: a binomial model determining opportunity to play games, and a Poisson model determining hours played for those with the opportunity to play at all.
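The two-part structure can be written down directly as a probability function. This is a minimal sketch with hypothetical names and illustrative values: `p_no_opportunity` stands for the probability from the binomial part that a student has no opportunity to play, and `lam` is the Poisson rate for students who do.

```python
import math

def zip_pmf(y, p_no_opportunity, lam):
    """Probability of observing y hours under a zero-inflated Poisson."""
    poisson = math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))
    if y == 0:
        # Zeros come from two sources: no opportunity, or an ordinary Poisson zero.
        return p_no_opportunity + (1.0 - p_no_opportunity) * poisson
    # Positive counts can only come from students with an opportunity.
    return (1.0 - p_no_opportunity) * poisson

def zip_mean(p_no_opportunity, lam):
    """Expected hours across all students, with and without opportunity."""
    return (1.0 - p_no_opportunity) * lam

p, lam = 0.3, 4.0  # illustrative values, not estimates from the data

print(round(zip_pmf(0, p, lam), 4))   # share of zeros under zero-inflation
print(round(math.exp(-lam), 4))       # share of zeros under a plain Poisson
print(zip_mean(p, lam))               # overall mean, lower than lam
```

The overall mean is smaller than `lam` because the students without an opportunity pull it down, which is why the Poisson part of such a model can have a higher baseline rate than a standard Poisson fitted to the same data.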

The first four coefficients estimate the probability that a student will not have an opportunity to play games (and will therefore have a count of zero deterministically). We see that boys, nonwhite students, and students from wealthier families are less likely to fall into this category (i.e., are more likely to have access to video games).

The last four coefficients estimate the rate of game playing for students who do have an opportunity to play games (remember, they may still play zero hours). Accounting for zero-inflation greatly increased the predicted baseline number of hours, to about exp(1.4)=4.1 hours per week. At the same time, it decreased the magnitude of the other coefficients. Boys are only expected to play about 1.72 times as much as girls in this model. This is because much of the gendered effect in time spent playing has been absorbed into the binomial portion of the model: girls are less likely to have an opportunity to play games than boys.

There is a lot more to unpack in those coefficient estimates, but for now let's just look at the posterior predictive plot to see how our model stacks up.