---
title: "SOCI 620: Lab 11"
author: "Peter McMahan"
date: "2/17/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
# this will tell rethinking to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
# Mixing Poisson models
Poisson regressions are often very good models of social processes, but they suffer from the restrictions imposed by the Poisson distribution. These notes look at two common methods for broadening the applicability of a Poisson model: over-dispersion and zero-inflation.
But we'll start with a regular Poisson regression as a baseline for comparison.
```{r data}
# load data
d <- read.csv("https://soci620.netlify.com/data/media.csv")
# create a grade-10-centered grade variable
d$grade_c10 <- d$grade - 10
head(d)
```
```{r model_poisson, cache=TRUE}
# build the model
m <- alist(
hours_games ~ dpois(lambda),
log(lambda) <- a + b_m*male + b_w*race_white + b_g*grade_c10,
a ~ dnorm(3,1),
b_m ~ dnorm(0,0.5),
b_w ~ dnorm(0,0.5),
b_g ~ dnorm(0,0.5)
)
# keep only the model variables, then drop rows with an NA in any of them
d_slim <- d[c('hours_games','male','race_white','grade_c10')]
d_slim <- d_slim[complete.cases(d_slim),]
# estimate the model using ulam on only complete cases
fit <- ulam(m, data=d_slim,cores=4,chains=4,refresh=0)
```
```{r model_poisson_precis}
precis(fit,prob=0.9)
```
We can immediately tell from these results that boys play more games than girls, white students play fewer games than nonwhite students, and older students play fewer games than younger students. But to get an idea of the magnitude we need to exponentiate the estimates.
```{r model_poisson_exp}
# exponentiate the point estimates
exp(coef(fit))
```
These results predict that white, grade-10 girls will play about 1.6 hours of games per week on average, and that boys will play about 3.06 times as much as girls regardless of grade and race. White students are predicted to play about 73.3% as much as nonwhite students. Finally, each one-year increase in a student's grade is associated with about 13.1% fewer predicted hours of game play.
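Because the model is linear on the log scale, the exponentiated coefficients act as multipliers on predicted hours. The short sketch below only redoes that arithmetic with the rounded figures quoted above; the variable names are made up for illustration, and the numbers are copied from the text rather than pulled from `fit`.

```{r multiplier_arithmetic}
# Rounded values quoted in the paragraph above (not re-estimated here)
base_white_girl_g10 <- 1.6        # predicted weekly hours, white grade-10 girl
mult_male  <- 3.06                # exp(b_m): boys relative to girls
mult_white <- 0.733               # exp(b_w): white relative to nonwhite students
mult_grade <- 1 - 0.131           # exp(b_g): change per additional grade

# Switching a predictor on multiplies the prediction; switching it off divides
white_boy_g10     <- base_white_girl_g10 * mult_male
nonwhite_girl_g10 <- base_white_girl_g10 / mult_white
# Two grades later means applying the grade multiplier twice
white_girl_g12    <- base_white_girl_g10 * mult_grade^2

round(c(white_boy_g10 = white_boy_g10,
        nonwhite_girl_g10 = nonwhite_girl_g10,
        white_girl_g12 = white_girl_g12), 2)
```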
To get an idea of the adequacy of the model, let's compare a posterior prediction of the outcome variable to the actual data.
```{r poisson_posterior_compare}
# actual data:
hour_counts <- table(d$hours_games)
plot(hour_counts,xlim=c(0,100),main="Actual data")
# posterior prediction:
# By default sim() uses the same data we fit the model with.
pp <- sim(fit,n=500) # For each individual, make 500 posterior predictions
dim(pp)
pcounts_poisson <- table(pp)
plot(pcounts_poisson,xlim=c(0,100),main="Standard Poisson posterior pred.")
```
Our model is not doing a *great* job. It is drastically under-predicting the number of people who play zero hours and more than 15 hours per week, and over-predicting the number of people who play 3-10 hours per week.
### Over-dispersed Poisson
Clearly our model is putting too much of its posterior prediction close to the center of the distribution. This is very common with a Poisson regression, because the variance of a Poisson distribution is constrained to equal its mean, which is often too "narrow" for social data. In models of social processes, unobserved processes that influence the outcome will lead to more variation in that outcome. One way to try to address this is to allow more random variation in $\lambda_i$, so that students who look identical on paper can still have distinct rates of game playing.
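To make the idea concrete, here is a small base-R simulation, separate from the lab data. It compares counts drawn with a single shared rate to counts where each individual's rate is itself drawn from a gamma distribution with the same mean. The mean of 4 and the gamma shape of 2 are arbitrary choices for illustration, and this shape/scale parameterization is not necessarily the one `dgampois()` uses internally.

```{r overdispersion_sketch}
set.seed(620)
n_sim   <- 1e5
mu_sim  <- 4    # arbitrary mean rate for the illustration
shape_g <- 2    # arbitrary gamma shape; smaller values mean more spread

# Standard Poisson: everyone shares the same rate
y_pois <- rpois(n_sim, lambda = mu_sim)

# Gamma-Poisson mixture: each individual gets their own rate,
# drawn from a gamma distribution whose mean is still mu_sim
lambda_i <- rgamma(n_sim, shape = shape_g, scale = mu_sim / shape_g)
y_gp     <- rpois(n_sim, lambda = lambda_i)

# Same mean, but the mixture has a much larger variance
rbind(poisson       = c(mean = mean(y_pois), var = var(y_pois)),
      gamma_poisson = c(mean = mean(y_gp),   var = var(y_gp)))
```

The two sets of counts have roughly the same mean, but only the standard Poisson has a variance close to that mean.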
```{r overdisperssed_model, cache=TRUE}
# build an over-dispersed model
# the dgampois() function is a convenient wrapper
# around a Poisson-distributed variable with
# gamma-distributed lambda
m_od <- alist(
hours_games ~ dgampois(lambda,theta), # change the outcome distribution
log(lambda) <- a + b_m*male + b_w*race_white + b_g*grade_c10,
a ~ dnorm(3,1),
b_m ~ dnorm(0,0.5),
b_w ~ dnorm(0,0.5),
b_g ~ dnorm(0,0.5),
theta ~ dexp(2) # one more prior for variability of lambda
)
# keep only the model variables, then drop rows with an NA in any of them
d_slim <- d[c('hours_games','male','race_white','grade_c10')]
d_slim <- d_slim[complete.cases(d_slim),]
# estimate the model using ulam on only complete cases
fit_od <- ulam(m_od, data=d_slim, chains=4, cores=4, refresh=0)
```
```{r overdispresed_poisson}
# see estimates
precis(fit_od,prob=0.9)
```
```{r overdispersed_vs_poisson_precis}
# compare to previous estimates
precis(fit,prob=0.9)
```
Comparing the coefficient estimates from this over-dispersed model to the standard Poisson model, it is clear that the original model was over-estimating the magnitude of association for gender and race. Now boys are only predicted to play about 2.34 times as much as girls, and white students are expected to play about 84% as much as nonwhite students.
Moreover, the estimate of `theta` is 8.19. If `theta` were zero, that would tell us that there is no difference at all between the over-dispersed and standard Poisson distributions. There are no hard rules about values of `theta`, but anything over 1 or 2 is good justification for using an over-dispersed model.
More importantly, we can use a posterior predictive plot to see if we are doing a better job fitting our population data.
```{r posterior_predictive_comparison1}
# actual data:
plot(hour_counts,xlim=c(0,100),main="Actual data")
# over-dispersed
pp_od <- sim(fit_od,n=500)
pcounts_poisson_od <- table(pp_od)
plot(pcounts_poisson_od,xlim=c(0,100),main="Over-dispersed Poisson posterior pred.")
# for comparison, the standard poisson posterior prediction
plot(pcounts_poisson,xlim=c(0,100),main="Standard Poisson posterior pred.")
```
The over-dispersed model clearly does a *much* better job fitting our data.
At this point we should be pretty happy with the over-dispersed Poisson model, but for the sake of demonstration, let's try a different modification of the standard Poisson.
### Zero-inflated Poisson
The zero-inflated Poisson model postulates a two-stage process for generating data. Very often with count data, there are more zero-counts than a Poisson distribution can account for. In many social processes, certain members of the population have a count of zero simply because they never had an opportunity otherwise. In the case of our current model, we might think that only certain students own a gaming console at home or live close to an arcade (remember, this data is from 1990), and the students who do not will always play zero hours per week. Of course, some of the students who do have a console or live near an arcade will still play no games in a given week. In a zero-inflated Poisson regression, the zero outcomes come from two sources: lack of opportunity and the zeros included in a Poisson distribution.
The key point here is that we don't need to observe the opportunity to play games (if we could observe it, we would simply only model students who did have such an opportunity). Instead, we just model the probability of opportunity as an unobserved characteristic of each student.
This means that there are two models embedded in one: a binomial model determining opportunity to play games, and a Poisson model determining hours played for those with the opportunity to play at all.
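The two-stage process can be sketched with a quick base-R simulation that does not use the lab data. The probability of having no opportunity (0.3) and the Poisson rate (4) are arbitrary values picked for illustration, and the variable names are hypothetical. The simulated share of zeros should land close to the mixture formula: the probability of no opportunity, plus the probability of having an opportunity times the Poisson probability of a zero.

```{r zip_mixture_sketch}
set.seed(620)
n_zip    <- 1e5
p_no_opp <- 0.3   # arbitrary probability of having no opportunity to play
rate_zip <- 4     # arbitrary Poisson rate for students who can play

# Stage 1: does each simulated student lack the opportunity to play?
no_opp <- rbinom(n_zip, size = 1, prob = p_no_opp)

# Stage 2: zero hours without opportunity, Poisson hours otherwise
hours_sim <- ifelse(no_opp == 1, 0L, rpois(n_zip, lambda = rate_zip))

# Zeros come from both sources
share_zero_sim     <- mean(hours_sim == 0)
share_zero_formula <- p_no_opp + (1 - p_no_opp) * exp(-rate_zip)
c(simulated = share_zero_sim, formula = share_zero_formula)
```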
```{r model_zeroinflated, cache=TRUE}
# build a zero-inflated model
# dzipois(p,lambda) is a wrapper
# around a mixture model, where the outcome
# is deterministically equal to zero with probability p
# and is Poisson distributed with rate lambda otherwise.
m_zi <- alist(
hours_games ~ dzipois(p,lambda), # change the outcome distribution
logit(p) <- ap + bp_m*male + bp_w*race_white + bp_i*fam_logincome, # binomial parameters
log(lambda) <- al + bl_m*male + bl_w*race_white + bl_g*grade_c10, # poisson parameters
# coefficient priors for binomial portion
ap ~ dnorm(0,1),
bp_m ~ dnorm(0,1),
bp_w ~ dnorm(0,1),
bp_i ~ dnorm(0,1),
# coefficient priors for Poisson portion
al ~ dnorm(3,1),
bl_m ~ dnorm(0,0.5),
bl_w ~ dnorm(0,0.5),
bl_g ~ dnorm(0,0.5)
)
# keep only the model variables, then drop rows with an NA in any of them
d_slim <- d[c('hours_games','male','race_white','grade_c10','fam_logincome')]
d_slim <- d_slim[complete.cases(d_slim),]
# estimate the model using ulam on only complete cases
fit_zi <- ulam(m_zi, data=d_slim, chains = 4, cores = 4)
```
```{r}
precis(fit_zi)
```
The first four coefficients estimate the probability that a student will *not* have an opportunity to play games (will have a count of zero deterministically). We see that boys, nonwhite students, and students from wealthier families are less likely to fall into this category (i.e. are more likely to have access to video games).
The last four coefficients estimate the rate of game playing for students who do have an opportunity to play games (remember, they may still play zero hours). Accounting for zero-inflation greatly increased the predicted baseline to about exp(1.4) = 4.1 hours per week. At the same time, it decreased the magnitude of the other coefficients. Boys are only expected to play about 1.72 times as much as girls in this model. This is because much of the gendered effect in time spent playing has been absorbed into the binomial portion of the model: girls are less likely to have an opportunity to play games than boys.
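Since the two parts of the model live on different scales, it can help to convert each one back before combining them. In the sketch below the logit value is HYPOTHETICAL, chosen only to show the mechanics, and is not an estimate from `fit_zi`. The log-rate of 1.4 is the rounded baseline quoted above. `plogis()` is base R's inverse-logit function.

```{r zi_scale_sketch}
# HYPOTHETICAL logit-scale value, for illustration only (not from fit_zi)
logit_p_hyp    <- -0.5
# Rounded baseline log-rate quoted in the text above
log_lambda_hyp <- 1.4

# Binomial part: inverse logit gives the probability of no opportunity
p_hyp <- plogis(logit_p_hyp)
# Poisson part: exponentiate to get expected hours given opportunity
lambda_hyp <- exp(log_lambda_hyp)

# Overall expected hours average over both kinds of students
expected_hours_hyp <- (1 - p_hyp) * lambda_hyp
c(p = p_hyp, lambda = lambda_hyp, expected_hours = expected_hours_hyp)
```

The overall expectation is always smaller than `lambda_hyp`, because some share of students contribute zero hours no matter what the Poisson rate is.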
There is a lot more to unpack in those coefficient estimates, but for now let's just look at the posterior predictive plot to see how our model stacks up.
```{r}
# actual data:
plot(hour_counts,xlim=c(0,100),main="Actual data")
# zero-inflated
pp_zi <- sim(fit_zi,n=500)
pcounts_poisson_zi <- table(pp_zi)
plot(pcounts_poisson_zi,xlim=c(0,100),main="Zero-inflated Poisson posterior pred.")
# for comparison, over-dispersed
plot(pcounts_poisson_od,xlim=c(0,100),main="Over-dispersed Poisson posterior pred.")
```
Our zero-inflated model still seems to do a better job than the standard Poisson, but it ends up under-estimating the number of students who will play just a couple of hours of games per week. The embedded Poisson distribution still has an overly narrow variance.
The clear winner among these models is the over-dispersed Poisson.
### Formal comparison
We can use WAIC to formally compare the fit of the three models we've looked at:
```{r compareall}
compare(fit,fit_od,fit_zi)
```
What went wrong? The first two models were based on the same data, but the third (zero-inflated) model adds family income as a predictor. When we drop missing values from the data for that estimate, we end up with a smaller data set than for the first two estimates.
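A quick way to spot this kind of mismatch is to count complete cases under each set of variables. The sketch below uses a tiny made-up data frame (`toy`, standing in for `d`; all values are invented) so that it runs on its own. The same `complete.cases()` check could be applied to the real variable lists.

```{r complete_cases_sketch}
# Toy stand-in for `d`; every value here is made up for illustration
toy <- data.frame(
  hours_games   = c(0, 3, NA, 10, 2),
  male          = c(1, 0, 1, 0, 1),
  fam_logincome = c(10.2, NA, 9.8, 11.0, NA)
)

# Rows usable by a model that ignores income
n_without_income <- sum(complete.cases(toy[c('hours_games', 'male')]))
# Rows usable by a model that also needs income
n_with_income    <- sum(complete.cases(toy[c('hours_games', 'male', 'fam_logincome')]))

c(without_income = n_without_income, with_income = n_with_income)
```

If the two counts differ, the models would be fit to different sets of observations.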
***To compare models with WAIC, they must be estimated using the exact same observations,*** even if we do not use all of the variables as predictors in all of the models. To compare the models' fit we need to re-estimate using the same data across all three.
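One way to see why the observations have to match is to write WAIC out in its usual form. The notation here is introduced just for this note: $N$ is the number of observations, $S$ the number of posterior samples, and $\theta_s$ the parameter values in sample $s$.

$$
\text{WAIC} = -2\left(\text{lppd} - p_{\text{WAIC}}\right), \qquad
\text{lppd} = \sum_{i=1}^{N} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta_s)\right), \qquad
p_{\text{WAIC}} = \sum_{i=1}^{N} \operatorname{var}_{s}\, \log p(y_i \mid \theta_s)
$$

Both terms are sums over individual observations, so a model fit to fewer rows adds up fewer terms and its WAIC is not on the same footing as the others.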
(*Note*: we also need to tell the models to save the log likelihood of our posterior samples when estimating by specifying `log_lik=TRUE` in our call to `ulam()`)
```{r compareall_refit, cache=TRUE}
# d_slim contains variables used in any of the three models
d_slim <- d[c('hours_games','male','race_white','grade_c10','fam_logincome')]
d_slim <- d_slim[complete.cases(d_slim),]
# Note: the model stays the same!
# re-estimate the standard poisson
fit <- ulam(m, data=d_slim,cores=4,chains=4, log_lik = TRUE)
# re-estimate the overdispersed poisson
fit_od <- ulam(m_od, data=d_slim,cores=4,chains=4, log_lik = TRUE)
# re-estimate the zero-inflated poisson
fit_zi <- ulam(m_zi, data=d_slim,cores=4,chains=4, log_lik = TRUE)
```
(The code above will take a LONG time to finish)
```{r compareall_2}
compare(fit,fit_od,fit_zi)
```
This confirms our intuition that the standard Poisson model provided a bad fit and that the overdispersed model provides the best fit (by a long shot).