In [1]:

```
#####
# set up the environment
#####
# load packages
library(rethinking)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
# adjust the plot region for viewing on the projector in class
options(repr.plot.width=12, repr.plot.height=8)
```

Poisson regressions are often very good models of social processes, but they suffer from the restrictions imposed by the Poisson distribution. These notes look at two common methods for broadening the applicability of a Poisson model: over-dispersion and zero-inflation.

But we'll start with a regular Poisson regression as a baseline for comparison.

In [2]:

```
# load data
d <- read.csv("https://soci620.netlify.com/data/media.csv")
head(d)
```

male | race_white | race_black | race_american_indian | race_asian_pacificislander | race_other | hispanic | grade | fam_income | fam_logincome | hours_tv | hours_videos | hours_games | hours_radio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

<lgl> | <lgl> | <lgl> | <lgl> | <lgl> | <lgl> | <lgl> | <int> | <dbl> | <dbl> | <int> | <int> | <int> | <int> | |

1 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | NA | 55000 | 10.915107 | 8 | 15 | 0 | 99 |

2 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 11 | NA | NA | 33 | 4 | 0 | 2 |

3 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | 10 | 45000 | 10.714440 | 20 | 0 | 0 | 40 |

4 | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 12 | 9000 | 9.105091 | 24 | 6 | 7 | 7 |

5 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 12 | 2000 | 7.601402 | 5 | 0 | 0 | 2 |

6 | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 7 | 70000 | 11.156265 | 14 | 0 | 0 | 14 |

In [3]:

```
# create a grade-10-centered grade variable
d$grade_c10 <- d$grade - 10
```

In [4]:

```
# build the model
m <- alist(
hours_games ~ dpois(lambda),
log(lambda) <- a + b_m*male + b_w*race_white + b_g*grade_c10,
a ~ dnorm(3,1),
b_m ~ dnorm(0,0.5),
b_w ~ dnorm(0,0.5),
b_g ~ dnorm(0,0.5)
)
# which rows have no NAs for `hours_games`?
d_slim <- d[c('hours_games','male','race_white','grade_c10')]
d_slim <- d_slim[complete.cases(d_slim),]
# estimate the model using ulam on only complete cases
fit <- ulam(m, data=d_slim,cores=4,chains=4)
```

In [5]:

```
precis(fit,prob=0.9)
```

mean | sd | 5% | 95% | n_eff | Rhat4 | |
---|---|---|---|---|---|---|

<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |

a | 0.4698132 | 0.017522684 | 0.4395659 | 0.4979347 | 849.2875 | 1.0032955 |

b_m | 1.1173158 | 0.017062791 | 1.0896011 | 1.1457928 | 882.0447 | 0.9992595 |

b_w | -0.3099865 | 0.014782168 | -0.3332418 | -0.2858497 | 1134.3643 | 1.0054908 |

b_g | -0.1398632 | 0.004513603 | -0.1472876 | -0.1323129 | 1751.3187 | 0.9991426 |

We can immidiately tell from these results that boys play more games than girls, white students play fewer games than non-white students, and older students play fewer games than younger students. But to get an idea of the magnitude we need to exponentiate the estimates.

In [6]:

```
# exponentiate the point estimates
exp(coef(fit))
```

- a
- 1.59969529758459
- b_m
- 3.05663856230714
- b_w
- 0.733456880573755
- b_g
- 0.869477185923716

These results predict that white, grade-10 girls will play about 1.6 hours of games per week on average, and that boys will play about 3.06 times as much as girls regardless of age and race. White students are predicted to play about 73.3% as much as nonwhite students. Finally, for each year increase in students' grade, they are predicted to play games about 13.1% less long.

To get an idea of the adequacy of the model, let's compare a posterior prediction of the outcome variable to the actual data.

In [7]:

```
# actual data:
hour_counts <- table(d$hours_games)
plot(hour_counts,xlim=c(0,100),main="Actual data")
# posterior prediction:
# By default sim() uses the same data we fit the model with.
pp <- sim(fit,n=500) # For each individual, make 500 posterior predictions
dim(pp)
pcounts_poisson <- table(pp)
plot(pcounts_poisson,xlim=c(0,100),main="Standard Poisson posterior pred.")
```

- 1000
- 6318

Our model is not doing a great job. It is drastically under-predicting the number of people who play zero hours and more than 15 hours per week, and over-predicting the number of people who play 3-10 hours per week.

Clearly our model is puttinig too much of its posterior prediction close to the center of the distribution. This is very common with a Poisson regression, as the Poisson distibution is fixed at a relatively "narrow" variance. In models of social processes, unobserved processes that influence our outcome will lead to more variation in that outcome. One way to try to address this is to allow more random variation in $\lambda_i$, so that students who look identical on paper can still have distinct rates of game playing.

In [8]:

```
# build an over-dispersed model
# the dgampois() function is a convenient wrapper
# around a Poisson-distributed variable with
# gamma-distributed lambda
m_od <- alist(
hours_games ~ dgampois(lambda,theta), # change the outcome distribution
log(lambda) <- a + b_m*male + b_w*race_white + b_g*grade_c10,
a ~ dnorm(3,1),
b_m ~ dnorm(0,0.5),
b_w ~ dnorm(0,0.5),
b_g ~ dnorm(0,0.5),
theta ~ dunif(0,40) # one more prior for variability of lambda
)
# which rows have no NAs for `hours_games`?
d_slim <- d[c('hours_games','male','race_white','grade_c10')]
d_slim <- d_slim[complete.cases(d_slim),]
# estimate the model using ulam on only complete cases
fit_od <- ulam(m_od, data=d_slim, chains=4, cores=4)
```

In [9]:

```
# see estimates
precis(fit_od,prob=0.9)
```

mean | sd | 5% | 95% | n_eff | Rhat4 | |
---|---|---|---|---|---|---|

<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |

a | 0.4723538 | 0.042969817 | 0.4035336 | 0.5440153 | 1419.118 | 0.9990836 |

b_m | 1.1510754 | 0.044331316 | 1.0791752 | 1.2221095 | 1675.539 | 0.9988488 |

b_w | -0.3705873 | 0.044317244 | -0.4442663 | -0.2973717 | 1454.244 | 0.9996140 |

b_g | -0.1626505 | 0.013484784 | -0.1838321 | -0.1410307 | 1998.650 | 0.9999026 |

theta | 0.3741737 | 0.009160408 | 0.3593923 | 0.3886204 | 1829.341 | 0.9988236 |

In [10]:

```
# compare to previous estimates
precis(fit,prob=0.9)
```

mean | sd | 5% | 95% | n_eff | Rhat4 | |
---|---|---|---|---|---|---|

<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |

a | 0.4698132 | 0.017522684 | 0.4395659 | 0.4979347 | 849.2875 | 1.0032955 |

b_m | 1.1173158 | 0.017062791 | 1.0896011 | 1.1457928 | 882.0447 | 0.9992595 |

b_w | -0.3099865 | 0.014782168 | -0.3332418 | -0.2858497 | 1134.3643 | 1.0054908 |

b_g | -0.1398632 | 0.004513603 | -0.1472876 | -0.1323129 | 1751.3187 | 0.9991426 |

Comparing the coefficient estimates from this over-dispersed model to the standard Poisson model, it is clear that the original model was over-estimating the magnitude of association for gender and race. Now boys are only predicted to play about 2.34 times as much as girls, and white students are expected to play about 84% as much as nonwhite students.

Moreover, the estimate of `theta`

is 8.19. If `theta`

were zero, that would tell us that there is no difference at all between the over-dispersed and standard Poisson distributions. There are no hard rules about values of `theta`

, but anything over 1 or 2 is good justification for using an over-dispersed model.

More importantly, we can use a posterior predictive plot to see if we are doing a better job fitting our population data.

In [11]:

```
# actual data:
plot(hour_counts,xlim=c(0,100),main="Actual data")
# over-dispersed
pp_od <- sim(fit_od,n=500)
pcounts_poisson_od <- table(pp_od)
plot(pcounts_poisson_od,xlim=c(0,100),main="Over-dispersed Poisson posterior pred.")
# for comparison, the standard poisson posterior prediction
plot(pcounts_poisson,xlim=c(0,100),main="Standard Poisson posterior pred.")
```

The over-dispersed model clearly does a *much* better job fitting our data.

At this point we should be pretty happy with the over-dispersed Poisson model, but for the sake of demonstation, let's try a different modification of the standard Poisson.

The zero-inflated Poisson model postulates a two-stage process for generating data. Very often with count data, there are more zero-counts than a Poisson distribution can account for. In many social processes, certain members of the population have a count of zero simply because they never had an opportunity otherwise. In the case of our current model, we might think that only certain students own a gaming console at home or live close to an arcade (remember, this data is from 1990), and the students who do not will always play zero hours per week. Of course, some of the students who do have a console or live near an arcade will still not play games last week. In a zero-inflated Poisson regression, the zero outcomes come from two sources: lack of opportunity and the zeros included in a Poisson distribution.

The key point here is that we don't need to observe the opportunity to play games (if we could observe it, we would simply only model students who did have such an opportunity). Instead, we just model the probability of opportunity as an unobserved characteristic of each student.

This means that there are two models embedded in one: a binomial model determining opportunity to play games, and a Poisson model determining hours played for those with the opportunity to play at all.

In [12]:

```
# build a zero-inflated model
# dzipois(p,labda) is a wrapper
# around a mixture model, where the outcome
# is deterministically equal to zero with probability p
# and is Poisson distributed with rate lambda otherwise.
m_zi <- alist(
hours_games ~ dzipois(p,lambda), # change the outcome distribution
logit(p) <- ap + bp_m*male + bp_w*race_white + bp_i*fam_logincome, # binomial parameters
log(lambda) <- al + bl_m*male + bl_w*race_white + bl_g*grade_c10, # poisson parameters
# coefficient priors for binomial portion
ap ~ dnorm(0,1),
bp_m ~ dnorm(0,1),
bp_w ~ dnorm(0,1),
bp_i ~ dnorm(0,1),
# coefficient priors for Poisson portion
al ~ dnorm(3,1),
bl_m ~ dnorm(0,0.5),
bl_w ~ dnorm(0,0.5),
bl_g ~ dnorm(0,0.5)
)
# which rows have no NAs for `hours_games`?
d_slim <- d[c('hours_games','male','race_white','grade_c10','fam_logincome')]
d_slim <- d_slim[complete.cases(d_slim),]
# estimate the model using map on only complete cases
fit_zi <- ulam(m_zi, data=d_slim, chains = 4, cores = 4)
```

In [13]:

```
precis(fit_zi)
```

mean | sd | 5.5% | 94.5% | n_eff | Rhat4 | |
---|---|---|---|---|---|---|

<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |

ap | 0.87914280 | 0.23982943 | 0.51233616 | 1.27388671 | 1147.466 | 1.0047907 |

bp_m | -1.17650005 | 0.06014537 | -1.26936078 | -1.07871455 | 1872.737 | 1.0037666 |

bp_w | 0.21487865 | 0.06951429 | 0.10489519 | 0.32706462 | 1736.856 | 1.0003130 |

bp_i | -0.06554100 | 0.02291102 | -0.10190241 | -0.03052308 | 1117.642 | 1.0048185 |

al | 1.40190616 | 0.02128665 | 1.36745556 | 1.43718182 | 1256.020 | 0.9997231 |

bl_m | 0.54392405 | 0.01984296 | 0.51186121 | 0.57508155 | 1465.986 | 1.0004065 |

bl_w | -0.28407437 | 0.01816468 | -0.31314656 | -0.25626393 | 1289.979 | 1.0006715 |

bl_g | -0.06190437 | 0.00566002 | -0.07060229 | -0.05310755 | 2455.924 | 0.9993404 |

The first four coefficients estimate the probability that student will *not* have an opportunity to play games (will have a count of zero deterministically). We see that boys, nonwhite students, and students from wealthier families are less likely to fall into this category (e.g. are more likely to have access to video games).

The last four coefficients estimate the rate of game playing for students who do have an opportunity to play games (remember, they may still play zero hours). Accounting for zero-inflation greatly increased the average baseline number of hours predicted to about exp(1.4)=4.1 hours per week. At the same time, it decreased the magnitude of the other cofficients. Boys are only expected to play about 1.72 times as much as girls in this model. This is because much of the gendered effect in time spent paying has been absorbed into the binomial portion of the model: girls are less likely to have an opportunity to play games than boys.

There is a lot more to unpack in those coefficient estimates, but for now let's just look at the posterior predictive plot to see how our model stacks up.

In [14]:

```
# actual data:
plot(hour_counts,xlim=c(0,100),main="Actual data")
# zero-inflated
pp_zi <- sim(fit_zi,n=500)
pcounts_poisson_zi <- table(pp_zi)
plot(pcounts_poisson_zi,xlim=c(0,100),main="Zero-inflated Poisson posterior pred.")
# for comparison, over-dispersed
plot(pcounts_poisson_od,xlim=c(0,100),main="Over-dispersed Poisson posterior pred.")
```