---
title: "SOCI 620: Lab 10"
author: "Peter McMahan"
date: "2/15/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
# this will tell rethinking to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
# Predicting video game use
We will use another set of variables from the first wave of Add Health (collected in the United States from 1990 to 1991) to look at the time students spent playing video games.
Start by loading the data and looking at the first few rows to get a sense of it.
```{r load_data}
d <- read.csv("https://soci620.netlify.com/data/media.csv")
head(d)
```
In addition to demographic data on gender, grade, race, ethnicity, and family income, this data lists the number of hours per week the students report spending consuming various media. We will be looking at `hours_games`, which codes students' responses to the question "How many hours a week do you play video or computer games?"
Start by looking at an overall summary variable and seeing how many values are missing.
```{r inspect_data}
summary(d$hours_games)
# how many missing (NA)?
sum( is.na(d$hours_games) )
# proportion NA?
mean( is.na(d$hours_games) )
# less than 0.2% missing
```
```{r inspect_data_detail}
# for discrete data a table can give a quick summary of the values
tg <- table(d$hours_games)
tg
# What proportion reported playing no games?
mean(d$hours_games==0,na.rm=TRUE)
```
```{r inspect_data_plot}
# plot the table for a nice visualization
plot(tg)
```
So to summarize, about half of the students say they played no video games. A substantial number also reported spending a suspiciously large amount of time playing, at least for the 1990s (8 hours per weekday plus 16 hours per day on the weekend is 72 hours).
## Intercept-only model
To build a Poisson model, we will start with just an intercept to get a feel for the approach:
$$
\begin{aligned}
H_i &\sim Pois(\lambda)\\
\log(\lambda) &= \alpha\\
\alpha &\sim Norm(\mbox{??},\mbox{??})
\end{aligned}
$$
Before we estimate this model, we need to figure out a good prior for α by looking at different priors' implications for λ. To get the implied prior for λ, you simply exponentiate the values of the prior for α. Performing this transformation on a normal distribution gives you a new distribution that is already built in to R: the log-normal distribution. This makes it easy to experiment with.
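As a quick sanity check (a sketch, not part of the lab's required code), we can verify this transformation by simulation: exponentiated draws from a normal distribution should match draws from the corresponding log-normal.

```{r prior_transform_check}
set.seed(1)
a <- rnorm(1e5, 3, 1)   # draws from a Norm(3,1) prior on alpha
lambda <- exp(a)        # implied prior draws for lambda
# simulated quantiles should be close to the built-in log-normal quantiles
round(quantile(lambda, c(0.05, 0.5, 0.95)), 2)
round(qlnorm(c(0.05, 0.5, 0.95), 3, 1), 2)
```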
```{r intercept_prior_search1}
# what is the range of lambda that is reasonable in our situation?
# Since there are only 168 hours in a week, we can look at the
# range [0,170]
x <- seq(0,170,length.out=500)
# Try alpha ~ Norm(3,10)
# the function `dlnorm` is the density of the log-normal distribution
plot(x,dlnorm(x,3,10), type='l')
```
That's pretty bad. It puts almost all of the prior weight on average playing times very close to zero. Let's try some more values for the hyperparameters on the same plot.
```{r intercept_prior_search2}
#
plot(NA,xlim=range(x),ylim=c(0,0.045))
lines(x,dlnorm(x,3,0.5), lty=1)
lines(x,dlnorm(x,3,1), lty=2)
lines(x,dlnorm(x,4,0.5), lty=3)
lines(x,dlnorm(x,4,1), lty=4)
legend('topright',lty=1:4,legend=c(
'N(3,0.5)',
'N(3,1)',
'N(4,0.5)',
'N(4,1)'
))
```
Remember, our prior over $\alpha$ should allow for a range of possible values of $\lambda$, and $\lambda$ is the *average* number of hours per week across all of the students.
If we pick the second line from the above plot ($\mathrm{Norm}(3,1)$), that will place the most prior plausibility on an average of roughly 0 to 50 hours per week. But the 'fat' tail on the right side means that our prior isn't too opinionated, and it will not insist on a low value if the data disagree with it.
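A prior predictive simulation can make this concrete. The sketch below (an illustration, assuming the $\mathrm{Norm}(3,1)$ prior) draws values of $\alpha$, converts them to rates, and simulates weekly hour counts:

```{r prior_predictive_sketch}
set.seed(2)
a_prior <- rnorm(1e4, 3, 1)           # prior draws for alpha
h_prior <- rpois(1e4, exp(a_prior))   # implied weekly hour counts
# nearly all simulated weeks should stay below the 168-hour maximum
mean(h_prior <= 168)
quantile(h_prior, c(0.05, 0.5, 0.95))
```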
Using that prior, we can estimate our model:
```{r estimate_intercept_only_model, cache=TRUE, results='hide'}
# Step 1: specify the model
# (NOTE this does not make any reference to our data `d`.)
m1 <- alist(
hours_games ~ dpois(lambda),
log(lambda) <- alpha,
alpha ~ dnorm(3,1)
)
# make a 'slim' copy of the data
d_slim <- d[c('male','hours_games')]
d_slim <- d_slim[complete.cases(d_slim),]
# estimate the model using ulam on only complete cases
fit1 <- ulam(m1,data=d_slim,chains=4,cores=4,iter=4000)
```
```{r intercept_only_results}
# examine results
precis(fit1,prob=0.9)
# exponentiate the coefficient(s)
exp(coef(fit1))
```
$\exp(\hat\alpha)$ transforms the posterior mean of $\alpha$ onto the rate scale, giving a point estimate of the average rate of video game consumption by our students. This means that we might predict that an "average" student would play about 2.84 hours of video games per week.
We can see more by extracting samples.
Note that `ulam()` extracts samples during estimation. If you try to get more samples than it originally gathered, it will return `NA` for any samples beyond that limit. To get more, you have to set `iter` to at least twice the number of samples you want per chain, since half of the iterations are used for warmup by default. (We will talk about why this is when we discuss MCMC and other Monte Carlo methods.)
We fit four chains with 4000 iterations each, so we have 8000 samples to work with.
```{r intercept_only_pplot1}
samp1 <- extract.samples(fit1,8000)
plot(density(samp1$alpha))
```
To get this in terms of actual rates, just exponentiate the posterior sample.
```{r intercept_only_pplot2}
lambda_post <- exp(samp1$alpha)
plot(density(lambda_post))
```
```{r intercept_only_CI}
quantile(lambda_post,c(0.05,0.95))
```
This tells us a little bit more: we are 90% confident that the average rate of game use among these students is between about 2.80 and 2.88 hours per week.
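As a rough check (not a substitute for the model), the raw sample mean should land near this interval, since an intercept-only Poisson model is essentially estimating the mean:

```{r raw_mean_check}
# compare the posterior interval for lambda to the raw sample mean
mean(d_slim$hours_games)
```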
## Adding a covariate: do boys play more?
We can use a Poisson regression to estimate the difference in weekly rates of video game playing by boys and girls (keeping in mind the problematic way that Add Health collects data on gender).
$$
\begin{aligned}
H_i &\sim Pois(\lambda_i)\\
\log(\lambda_i) &= \alpha + \beta M_i\\
\alpha &\sim Norm(4,1)\\
\beta &\sim Norm(0,0.5)\\
\end{aligned}
$$
It is usually a good idea for your non-intercept coefficients to have a prior mean of zero. Because $\exp(0)=1$, a value of zero for $\beta$ multiplies the rate by one, meaning being male has no association with game time.
For the standard deviation on a coefficient prior, you almost always want a pretty narrow distribution when using a Poisson regression. We'll use 0.5 here, but you can experiment with others. Prior predictive plots can be useful here.
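To see what a standard deviation of 0.5 implies, we can look at the rate multipliers this prior places mass on (a quick sketch using the normal quantile function):

```{r beta_prior_multipliers}
# 90% prior interval for exp(beta) under beta ~ Norm(0, 0.5)
round(exp(qnorm(c(0.05, 0.5, 0.95), 0, 0.5)), 2)
# so before seeing the data, this prior expects the 'male'
# multiplier to fall between roughly 0.4 and 2.3
```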
```{r gender_model, cache=TRUE}
# build the model
m2 <- alist(
hours_games ~ dpois(lambda),
log(lambda) <- alpha + beta*male,
alpha ~ dnorm(4,1),
beta ~ dnorm(0,0.5)
)
# fit
fit2 <- ulam(m2,data=d_slim,chains=4,cores=4,iter=4000)
```
```{r gender_model_precis}
# estimates on the scale of the linear model
precis(fit2)
# point estimates in terms of lambda:
exp(coef(fit2))
```
This tells a very different story. Girls in the sample play an average of about 1.46 hours per week, while boys play about 2.99 times as much as girls:
$$
\begin{aligned}
\exp(\alpha + \beta) &= \exp(\alpha)\times\exp(\beta)\\
\exp(0.38 + 1.10) &= \exp(0.38)\times\exp(1.10) \\
1.46\times2.99&=4.37
\end{aligned}
$$
Let's compare the posterior distributions of $\lambda$ for boys and girls.
```{r gender_model_pplots}
samp2 <- extract.samples(fit2,8000)
# get 'density' objects for the two distributions
pdens_girls <- density(exp(samp2$alpha))
pdens_boys <- density(exp(samp2$alpha + samp2$beta))
# figure out proper ranges to fit both densities on the same plot
xlim <- range(pdens_girls$x,pdens_boys$x,0)
ylim <- range(pdens_girls$y,pdens_boys$y,0)
# plot a blank canvas with those axes
plot(NA,xlim=xlim,ylim=ylim,xlab='rate',ylab='density')
# add each density
lines(pdens_girls,lty=1)
lines(pdens_boys,lty=3)
# and a legend
legend('topright',legend=c('boys','girls'),lty=c(3,1))
```
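We can also summarize the gender gap directly from the posterior samples. This sketch computes the rate ratio and the absolute difference in hours per week, along with 90% intervals for each:

```{r gender_contrast}
ratio_post <- exp(samp2$beta)                                   # boys/girls rate ratio
diff_post <- exp(samp2$alpha + samp2$beta) - exp(samp2$alpha)   # difference in hours/week
quantile(ratio_post, c(0.05, 0.95))
quantile(diff_post, c(0.05, 0.95))
```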
```{r gender_model_pplot2}
# plot them separately to see more detail
plot(pdens_girls)
plot(pdens_boys)
```
## Adding grade
We will use students' grade level, centered at 10th grade, as a further predictor.
```{r create_centered_grade}
# center
d$grade_c10 <- d$grade-10
# plot
plot(table(d$grade_c10))
```
```{r gender_grade_model, cache=TRUE}
# build the model
m3 <- alist(
hours_games ~ dpois(lambda),
log(lambda) <- alpha + beta_m*male + beta_g*grade_c10,
alpha ~ dnorm(4,1),
beta_m ~ dnorm(0,0.5),
beta_g ~ dnorm(0,0.3) # grade has a bigger range
)
# slim data
d_slim <- d[,c('hours_games','male','grade_c10')]
d_slim <- d_slim[complete.cases(d_slim),]
# fit
fit3 <- ulam(m3,data=d_slim,chains=4,cores=4,iter=4000)
```
```{r gender_grade_model_precis}
precis(fit3,prob=.9)
exp(coef(fit3))
```
The values for $\alpha$ and $\beta_M$ did not change much, but the estimate for $\beta_G$ (the coefficient on grade level) is negative. Older students tend to report playing games for less time than younger students.
Specifically, our model predicts that for each grade level a student advances, their rate of playing video games is multiplied by about 0.858. This corresponds to a decrease of about 14.2% per grade.
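Multiplicative effects like this compound across grades. Using the rounded point estimate above (an approximation; working with the full posterior samples would give intervals), the implied rate multiplier relative to 10th grade is:

```{r grade_multipliers}
bg <- log(0.858)   # approximate posterior mean of beta_g
grades <- 7:12
round(exp(bg * (grades - 10)), 2)  # multiplier for each grade, relative to grade 10
```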