In [1]:

```
#####
# set up the environment
#####
# load packages
library(rethinking)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
# adjust the plot region for viewing on the projector in class
options(repr.plot.width=12, repr.plot.height=8)
```

We will use another set of variables from the first wave of Add Health (collected in the United States from 1990 to 1991) to look at the time students spent playing video games.

Start by loading the data and looking at the first few rows to get a sense of it.

In [2]:

```
d <- read.csv("https://soci620.netlify.com/data/media.csv")
head(d)
```

male | race_white | race_black | race_american_indian | race_asian_pacificislander | race_other | hispanic | grade | fam_income | fam_logincome | hours_tv | hours_videos | hours_games | hours_radio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

<lgl> | <lgl> | <lgl> | <lgl> | <lgl> | <lgl> | <lgl> | <int> | <dbl> | <dbl> | <int> | <int> | <int> | <int> | |

1 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | NA | 55000 | 10.915107 | 8 | 15 | 0 | 99 |

2 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 11 | NA | NA | 33 | 4 | 0 | 2 |

3 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | 10 | 45000 | 10.714440 | 20 | 0 | 0 | 40 |

4 | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 12 | 9000 | 9.105091 | 24 | 6 | 7 | 7 |

5 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 12 | 2000 | 7.601402 | 5 | 0 | 0 | 2 |

6 | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 7 | 70000 | 11.156265 | 14 | 0 | 0 | 14 |

In additionn to demographic data on gender, grade, race, ethnicity, and family income, this data lists the number of hours per week the students report spending consuming various media. We will be looking at `hours_games`

, which codes students' responses to the question "How many hours a week do you play video or computer games?"

Start by looking at an overall summary variable and seeing how many values are missing.

In [3]:

```
# how many missing (NA)?
sum(is.na(d$hours_games))
# proportion NA?
mean(is.na(d$hours_games))
# less than 0.2% missing
```

12

0.0018450184501845

In [4]:

```
# for discrete data a table can give a quick summary of the values
tg <- table(d$hours_games)
as.matrix(tg)
# What proportion reported playing no games?
mean(d$hours_games==0,na.rm=TRUE)
```

0 | 3076 |
---|---|

1 | 1013 |

2 | 727 |

3 | 362 |

4 | 199 |

5 | 263 |

6 | 103 |

7 | 116 |

8 | 51 |

9 | 18 |

10 | 174 |

11 | 7 |

12 | 46 |

13 | 1 |

14 | 45 |

15 | 66 |

16 | 6 |

17 | 1 |

18 | 9 |

19 | 2 |

20 | 60 |

21 | 15 |

23 | 1 |

24 | 13 |

25 | 15 |

26 | 1 |

28 | 6 |

30 | 24 |

32 | 1 |

34 | 1 |

35 | 16 |

36 | 1 |

37 | 1 |

38 | 1 |

40 | 12 |

42 | 6 |

45 | 2 |

48 | 2 |

49 | 5 |

50 | 9 |

60 | 4 |

64 | 1 |

70 | 2 |

78 | 1 |

80 | 1 |

84 | 1 |

96 | 1 |

99 | 4 |

0.473813924830561

In [5]:

```
# plot the table for a nice visualization
plot(tg)
```

So to summarize, about half of the students say they played no video games. A significant number also reported spending a suspisciously large amount of time playing, at least for the 1990s (8 hours per day and 16 hours per day on the weekend is 72 hours).

Start by building a model with just an intercept to get a feel for the Poisson regression model:

$$ H_i \sim Pois(\lambda)\\ \log(\lambda) = \alpha\\ \alpha \sim Norm(\mbox{??},\mbox{??}) $$Before we estimate this model, we need to figure out a good prior for α by looking at different priors' implications for λ. To get the implied prior for λ, you simply exponentiate the values of the prior for α. Performing this transformation on a normal distribution gives you a new distribution that is already built in to R: the log-normal distribution. This makes it easy to expiriment with.

In [6]:

```
# what is the range of lambda that is reasonable in our situation?
# Since there are only 168 hours in a week, we can look at the
# range [0,170]
x <- seq(0,170,length.out=500)
# Try alpha ~ Norm(3,10)
# the function `dlnorm` is the density of the log-normal distribution
plot(x,dlnorm(x,3,10),type='l')
```

In [7]:

```
# that's sort of bad. Let's try some more on the same plot
plot(NA,xlim=range(x),ylim=c(0,0.045))
lines(x,dlnorm(x,3,0.5),lty=1)
lines(x,dlnorm(x,3,1),lty=2)
lines(x,dlnorm(x,4,0.5),lty=3)
lines(x,dlnorm(x,4,1),lty=4)
legend('topright',lty=1:4,legend=c(
'N(3,0.5)',
'N(3,1)',
'N(4,0.5)',
'N(4,1)'
))
```

Remember, our prior over $\alpha$ should allow for a range of possible values of $\lambda$, and $\lambda$ is the *average* number of hours per week across all of the students.

If we pick the second line from the above ploot ($\mathrm{Norm}(3,1)$), that will give the highest prioor likelihood to an average of 0 to 50 hours per week. But the 'fat' tail on the right side means that our prior isn't too opinionated, and it will not insist on a low value if the data disagrees with it.

Using that prior, we can estimate our model:

In [8]:

```
# Step 1: specify the model
# (NOTE this does not make any reference to our data `d`.)
m1 <- alist(
hours_games ~ dpois(lambda),
log(lambda) <- alpha,
alpha ~ dnorm(3,1)
)
```

In [9]:

```
# make a 'slim' copy of the data
d_slim <- d[,c('hours_games')]
d_slim <- d[complete.cases(d),]
# estimate the model using ulam on only complete cases
fit1 <- ulam(m1,data=d_slim,chains=4,cores=4,iter=4000)
```

In [10]:

```
# examine results
precis(fit1,prob=0.9)
# exponentiate the coefficient(s)
exp(coef(fit1))
```

result | |
---|---|

<dbl> | |

mean | 1.044266e+00 |

sd | 8.672314e-03 |

5% | 1.030004e+00 |

95% | 1.058537e+00 |

n_eff | 2.719386e+03 |

Rhat | 1.001124e+00 |

$\exp(\hat\alpha)$ is the median of the posterior distribution of the average rate of video game consumption by our students. This means that we might predict that an "average" student would play about 2.84 hours of video games per week.

We can see more by extracting samples.

Note, when using `ulam()`

, it extracts samples during estimation. If you try to get more samples than it originally gathered (default 1000 ⨉ number of chains), it will give return `NA`

for any beyond that limit. To get more you have to give it a value of iter that is at least twice the number of samples you want. (We will talk about why this is when we discuss MCMC and other Monte-Carlo methods.)

We fit four chains with 4000 iterations each, so we have 8000 samples to work with.

In [11]:

```
samp1 <- extract.samples(fit1,8000)
plot(density(samp1$alpha))
```

To get this in terms of actual rates, just exponentiate the posterior sample.

In [12]:

```
lambda_post <- exp(samp1$alpha)
plot(density(lambda_post))
```

In [13]:

```
quantile(lambda_post,c(0.05,0.95))
```

- 5%
- 2.8010779620475
- 95%
- 2.88215248106457

This tells us a little bit more: we are 90% confident that the average rate of game use among these students is between about 2.80 and 2.88 hours per week.

We can us a Poisson regression to estimate the difference in weekly rates of video game playing by boys and girls (keeping in mind the problematic way that Add Health collects data on gender).

$$ H_i \sim Pois(\lambda_i)\\ \log(\lambda_i) = \alpha + \beta M_i\\ \alpha \sim Norm(4,1)\\ \beta \sim Norm(0,0.5)\\ $$It is usally a good idea for your non-intercept coefficients to have a mean of zero. This is because $\exp(0)=1$, so a value of zero for $\beta$ would mean being male has no association.

For the standard deviation on a coefficient prior, you almost always want a pretty narrow distribution when using a Poisson regression. We'll use 0.5 here, but you can experiment with others. Prior predictive plots can be useful here.

In [14]:

```
# build the model
m2 <- alist(
hours_games ~ dpois(lambda),
log(lambda) <- alpha + beta*male,
alpha ~ dnorm(4,1),
beta ~ dnorm(0,0.5)
)
```

In [15]:

```
# make a 'slim' dataset
d_slim <- d[,c('hours_games','male')]
d_slim <- d[complete.cases(d),]
# fit
fit2 <- ulam(m2,data=d_slim,chains=4,cores=4,iter=4000)
```

In [16]:

```
# estimates on the scale of the linear model
precis(fit2)
# point estimates in terms of lambda:
exp(coef(fit2))
```

mean | sd | 5.5% | 94.5% | n_eff | Rhat4 | |
---|---|---|---|---|---|---|

<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |

alpha | 0.4036448 | 0.0162388 | 0.378083 | 0.4298521 | 1717.260 | 1.000900 |

beta | 1.0414541 | 0.0187000 | 1.011314 | 1.0712561 | 1805.542 | 1.000869 |

- alpha
- 1.49727204845273
- beta
- 2.83333404338641

This tells a very different story. Girls in the sample play an average of about 1.50 hours per week, while boys play about 2.83 times more than girls:

\begin{aligned} \exp(\alpha + \beta) &= \exp(\alpha)\times\exp(\beta)\\ \exp(0.38 + 1.10) &= \exp(0.38)\times\exp(1.10) \\ 1.46\times2.99&=4.37 \end{aligned}Let's compare the posterior distriibutions of $\lambda$ for boys and girls.

In [17]:

```
samp2 <- extract.samples(fit2,8000)
# get 'density' objects for the two distributions
pdens_girls <- density(exp(samp2$alpha))
pdens_boys <- density(exp(samp2$alpha + samp2$beta))
# figure out proper ranges to fit both densities on the same plot
xlim <- range(pdens_girls$x,pdens_boys$x,0)
ylim <- range(pdens_girls$y,pdens_boys$y,0)
# plot a blank canvas with those axes
plot(NA,xlim=xlim,ylim=ylim,xlab='rate',ylab='density')
# add each density
lines(pdens_girls,lty=1)
lines(pdens_boys,lty=3)
```