---
title: "SOCI 620: Lab 21"
author: "Peter McMahan"
date: "2023-03-28"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(brms)
# this will tell brms to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
# Detailed 2-level model with covariance
This lab will walk you through the estimation of a detailed 2-level model using `brms`, including an LKJ prior on a covariance matrix.
## Pre-processing the data
The model we will be using is the same one from class. We will be predicting the scores on a reading exam for students in grade one in Tennessee in the early 1980s.
```{r data}
d <- read.csv("https://soci620.netlify.app/data/t_star_student_grade1_2l.csv")
dim(d)
# for the sake of speed, we will use a 20% subsample
d <- d[runif(nrow(d)) <= 0.2,]
dim(d)
head(d)
```
Since we will only be modeling Black and white students, keep only rows that are one or the other. (***Note**: this is very bad statistical practice. We're doing it here just to keep the model a little bit cleaner*)
```{r data_restrict_race}
d <- d[d$student_re_white | d$student_re_black,]
dim(d)
```
Finally, we will standardize age and center class size and teacher experience to help with estimation and interpretation:
```{r standardize and center}
d$student_age_s <- scale(-d$student_born)
# since we are averaging over classes rather than students,
# we need to make sure to take only one row from each class
data_by_classroom <- unique(
d[c('teacher_id', 'class_size', 'teacher_exper')]
)
meansize <- mean(data_by_classroom$class_size, na.rm=TRUE)
print(meansize)
d$class_size_c <- d$class_size - meansize
meanexp <- mean(data_by_classroom$teacher_exper,na.rm=TRUE)
print(meanexp)
d$teacher_exper_c <- d$teacher_exper - meanexp
# look at the results
hist(d$class_size_c)
hist(d$teacher_exper_c)
```
## Specify the model
The first step in specifying the model in R is to convert from the 'standard' hierarchical notation to the 'expanded' single-equation notation. Recall from the lecture that the model we are estimating looks like this (priors omitted … for now).
$$
\begin{aligned}
S_{ik} &\sim \mathrm{Norm}(\mu_{ik},\sigma)\\
\mu_{ik} &= \beta_{0k} + \beta_{1k}Age_i + \beta_{2k}Female_i + \beta_{3k}Black_i\\
\\
\beta_{0k} &= \gamma_{00} + \gamma_{01}Size_k + \gamma_{02}Exp_k +\\
&\phantom{{}={}}\gamma_{03}TeacherFemale_k+\\
&\phantom{{}={}}\gamma_{04}TeacherBlack_k+\\
&\phantom{{}={}}\gamma_{05}PropBlack_k+\eta_{0k}\\
\beta_{1k} &= \gamma_{10} + \gamma_{11}Size_k + \eta_{1k}\\
\beta_{2k} &= \gamma_{20} + \gamma_{21}TeacherFemale_k + \eta_{2k}\\
\beta_{3k} &= \gamma_{30} + \gamma_{31}TeacherBlack_k + \\
&\phantom{{}={}}\gamma_{32}PropBlack_k+\eta_{3k}
\end{aligned}
$$
When we substitute all the $\beta$ terms into the main equation, and replace $\sigma$ with an $\epsilon_i$ term, we have the following:
$$
\begin{aligned}
S_{ik} &=
\\
&\phantom{{}={}} \gamma_{00} + \gamma_{01}Size_k + \gamma_{02}Exp_k + TeacherFemale_k +\\
&\phantom{{}={}} \gamma_{04}TeacherBlack_k+\gamma_{05}PropBlack_k + \gamma_{10}Age_i\\
&\phantom{{}={}} \gamma_{11}Age_iSize_k+\gamma_{20}Female_i + \gamma_{21}Female_iTeacherFemale_k+\\
&\phantom{{}={}} \gamma_{30}Black_i+\gamma_{31}Black_iTeacherBlack_k + \gamma_{32}Black_iPropBlack_k+\\
\\
&\phantom{{}={}}\eta_{0k} + \eta_{1k}Age_i + \eta_{2k}Female_i + \eta_{3k}Black_i + \epsilon_i
\end{aligned}
$$
The last line (separated slightly from the rest for visual clarity) collects all of the "random-effect" terms with error terms in them.
This representation helps us to specify the model as an R `formula` object. Note that when you specify an interaction term using `*` between two variables, R defaults to including both of the terms on their own as well as the interaction. So `student_re_black*teacher_re_black` adds terms to our model for `student_re_black`, `teacher_re_black`, and their interaction. (If you want to include only the interaction, which happens occassionally, use `:` instead of `*`.)
```{r model1}
model1 = bf(
student_reading_score ~
student_age_s*class_size_c + teacher_exper_c + student_female*teacher_female +
student_re_black*teacher_re_black + student_re_black*class_prop_black +
(1 + student_age_s + student_female + student_re_black | teacher_id)
)
```
But what about the priors? We can use `get_prior()` to have `brms` tell us all the priors it *expects* to see for a model like this:
```{r prior1 prep}
get_prior(model1, data=d)
# we can make the output a little clearer in RStudio if we
# use the View function, but this should be put into the
# console for interactive use, not into your script:
# View(get_prior(model1, data=d))
```
Each row of this output shows either a specific parameter that the model is estimating (e.g. `class_prop_black`) or a class of related parameter (e.g. `b`, which refers to all of the model coefficients).
The first column of the table shows any default priors that `brms` would use if we didn't specify. We can see that it will default to an LKJ prior with a concentration paramter of 1 for the correlation matrix (`cor`) and student-t distribtutions for the intercept (`Intercept`) and the standard deviations (`sd` for within class and `sigma` for between class). Importantly, there is *no prior* set by default for the coefficients (`b`), meaning it will use an improper flat prior over all real numbers unless we tell it otherwise.
We want to specify priors of $\mathrm{Norm}(500,100)$ for $\gamma_{00}$, $\mathrm{Norm}(0,20)$ for the rest of the coefficients, a (half-)Cauchy prior $\mathrm{Cauchy}(0,40)$ for the standard deviation on the $\eta$ terms, and and $\mathrm{LKJ}(2)$ prior for the correlation matrix on the $\eta$ terms.
```{r prior1}
prior1 <- c(
set_prior("normal(500,100)",class="Intercept"),
set_prior("normal(0,20)",class='b'),
set_prior("cauchy(0,40)",class='sd'),
set_prior('lkj(2)',class='cor')
)
```
## Estimating
Estimation is now pretty simple (if slow). We can use the `Sys.time()` function to record the number of seconds that passed while we waited for the estimation to finish.
```{r fit1}
# use Sys.time() to measure how long this takes to run
t0 <- Sys.time() # start time
# fit the model
fit1 <- brm(model1, data=d, prior=prior1)
t1 <- Sys.time() # end time
print(t1-t0)
```
```{r summary1}
summary(fit1)
```
You might have gotten a warning about "divergent transitions after warmup" with the previous output. This is something that *should not be ignored*. The usual way to deal with just a few of these is to follow the instructions in the warning and increase `adapt_delta`. While we are at it, we can increase the number of iterations to get a bigger effective sample size. *Both of these changes will slow down sampling!*
***You only need to run this slower estimate if you got the warnings on the estimate above.***
```{r fit1 adapt_delta}
t0 <- Sys.time()
f <- brm(model1, data=d, prior=prior1, iter=3000, control=list(adapt_delta=.95))
t1 <- Sys.time()
print(t1-t0)
summary(fit1)
```
That looks better!
Now, to interpret the estimates, it will be important to be able to map the output from the `summary()` to the $\gamma$ terms in our model. R formulas use the `:` character to signify interaction between two variables, which will help us link the estimates to our formal model.
As an example, consider the effect of having a Black teacher on the racial gap in test scores for a classroom. This is capture in the last equation of our hierarchical model:
$$
\beta_{3k} = \gamma_{30} + \gamma_{31}TeacherBlack_k + \gamma_{32}PropBlack_k+\eta_{3k}
$$
This means that $\gamma_{31}$ is the coefficient we are interested in. In the 'flattened' model, this is represented by the term $\gamma_{31}Black_iTeacherBlack_k$, which *interacts* $Black_i$ with $TeacherBlack_k$. We find this in the output above labeled `student_re_blackTRUE:teacher_re_blackTRUE`.
# Your turn!
Imagine a policymaker came to you concerned that boys are the only ones who benefit from smaller classes, and that smaller classes might exacerbate gendered divides in math scores. Build and estimate a model to address this concern. (Note: you might not need random slopes on other covariates like race and ethnicity variables here)