---
title: "SOCI 620: Lab 18"
author: "Peter McMahan"
date: "2023-03-16"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(brms)
# this will tell brms to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
## Random intercept models
Random intercept models are the simplest form of random-effects models (aka multilevel models or partial-pooling models). They allow you to say that each group has a different average outcome, but that the relationship between the outcome and the covariates is invariant across groups.
Multilevel (hierarchical) models also provide a framework for interpreting the change in an outcome variable as it relates to covariates at the level of both the individual observation and of the group. We will look at both cases here, in relation to reading test scores among grade-one students.
We will use all of the grade-one data in the Tennessee STAR data set, which has over 6000 students in more than 300 classes.
```{r data}
d <- read.csv("https://soci620.netlify.com/data/t_star_student_grade1.csv")
head(d)
```
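Before fitting anything, it is worth confirming the scale of the data described above (this sketch assumes the `teacher_id` column identifies classrooms, as it does in the models below):

```{r data_dims}
# number of students and number of distinct classrooms
nrow(d)
length(unique(d$teacher_id))
```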
## Random intercept-only model
To begin, we will use the same random intercept model (with *no* covariates) as we used in the previous lab:
$$
\begin{aligned}
S_{ik} &\sim \mathrm{Norm}(\mu_{ik},\sigma)\\
\mu_{ik} &= \alpha_k\\
\\
\alpha_k &\sim \mathrm{Norm}(\gamma,\eta)\\
\\
\sigma &\sim \mathrm{Cauchy}(0,50)\\
\gamma &\sim \mathrm{Norm}(500,100)\\
\eta &\sim \mathrm{Cauchy}(0,50)
\end{aligned}
$$
```{r model_interceptonly}
# model
model_interceptonly <- bf(
reading_score ~ 1 + (1 | teacher_id)
)
# prior
prior_interceptonly <- c(
prior(cauchy(0,50),class='sigma'), # σ
prior(normal(500,100),class="Intercept"), # 𝛾
prior(cauchy(0,50),class='sd') # η
)
```
```{r fit_interceptonly, cache=TRUE}
fit_interceptonly <- brm(model_interceptonly, data=d,prior=prior_interceptonly)
```
```{r fit_interceptonly_summary}
summary(fit_interceptonly)
```
Even without predictor variables, this model shows us that there is a substantial amount of between-class variation, as estimated by the `sd(Intercept)` term in the "Group-Level Effects" section of the summary (this is $\eta$ in the model above).
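If you want to look at that between-class variation more directly, one option is to plot its posterior draws. This is a sketch that assumes `brms`'s default parameter naming, under which the group-level standard deviation is stored as `sd_teacher_id__Intercept`:

```{r post_sd}
# posterior density of the between-class standard deviation (η)
post <- as_draws_df(fit_interceptonly)
plot(density(post$sd_teacher_id__Intercept))
```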
### Including student-level covariates
We will start by seeing how students' age affects their test scores, taking into account inter-class variation.
```{r scale_age}
# construct a standardized age variable from the `born` variable
d$age_s <- scale( -d$born )
plot(density(d$age_s,na.rm=TRUE))
```
```{r model_age}
# model
model_age <- bf(
reading_score ~ 1 + age_s + (1 | teacher_id)
)
get_prior(model_age,data=d)
```
```{r prior_age}
prior_age <- c(
prior(cauchy(0,50),class='sigma'),
prior(normal(500,100),class="Intercept"),
prior(cauchy(0,50),class='sd'),
prior(normal(0,30),class='b') # for age
)
```
```{r fit_age, cache=TRUE}
fit_age <- brm(model_age, data=d,prior=prior_age)
```
```{r fit_age_summary}
summary(fit_age)
```
Age has a negative relationship with test scores: for every increase of one standard deviation in age, the model predicts that a student will score about five points lower.
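We can also ask how confident the model is about the direction of this relationship. A quick sketch (assuming the slope is stored under `brms`'s default name `b_age_s`) computes the share of posterior draws in which the slope is below zero:

```{r age_direction}
post_age <- as_draws_df(fit_age)
# posterior probability that the age coefficient is negative
mean(post_age$b_age_s < 0)
```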
### Including classroom-level covariates
Very often, it is important to be able to look at the relationship between an outcome variable and a covariate that only varies at the group level. This is exactly the case in the STAR data if we are interested in the effect of the experimental treatment. Do smaller classes lead to better test scores? A model that estimates this effect, allowing between-class variation in average scores, can be written mathematically (omitting priors for now) as
$$
\begin{aligned}
S_{ik} &\sim \mathrm{Norm}(\mu_{ik},\sigma)\\
\mu_{ik} &= \beta_{0k} + \beta_{1}Age_i\\
\\
\beta_{0k} &= \gamma_{0} + \gamma_1Small_k+ \eta_{0k}\\
\eta_{0k} &\sim \mathrm{Norm}(0, \phi_0)
\end{aligned}
$$
To determine how to specify this model in `brms`, note that we can substitute $\beta_{0k}$ into the individual-level linear model and rearrange, giving us
$$
\begin{aligned}
S_{ik} &\sim \mathrm{Norm}(\mu_{ik},\sigma)\\
\mu_{ik} &= \gamma_{0} + \gamma_1Small_k +\beta_{1}Age_i+ \eta_{0k}\\
\\
\eta_{0k} &\sim \mathrm{Norm}(0, \phi_0)
\end{aligned}
$$
The classroom-level covariate can be specified as a student-level covariate in this case, so the changes to the formula are quite minimal:
```{r model_age_size}
# create an indicator variable for 'small' classes
d$small <- d$class_type == 1
# specify the model
model_age_size <- bf(
reading_score ~ 1 + age_s + small + (1 | teacher_id)
)
# look at what kind of prior we need
get_prior(model_age_size,data=d)
```
```{r prior_age_size}
prior_age_size <- c(
prior(cauchy(0,50),class='sigma'),
prior(normal(500,100),class="Intercept"),
prior(cauchy(0,50),class='sd'),
prior(normal(0,30),coef='age_s'), # for age
prior(normal(0,50),coef='smallTRUE') # for small classes
)
```
```{r fit_age_size, cache=TRUE}
fit_age_size <- brm(model_age_size, data=d,prior=prior_age_size)
```
```{r fit_age_size_summary}
summary(fit_age_size)
```
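A convenient way to summarize the evidence for the treatment effect is `brms`'s `hypothesis()` function, which reports a posterior estimate and evidence ratio for a directional claim. This sketch asks whether small classes raise reading scores, using the `smallTRUE` coefficient name from the prior specification above:

```{r small_hypothesis}
# test the directional hypothesis that the small-class effect is positive
hypothesis(fit_age_size, "smallTRUE > 0")
```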
The `ranef()` function extracts the group-level estimates from the model (sometimes called the "random effects"). For more complex models -- i.e. those with multiple grouping variables or random slopes -- this can be a lot of information. To deal with this, `ranef()` gives you a list of 3-dimensional arrays (think of a table with rows, columns, and layers, or an Excel workbook with multiple worksheets). This seems like a lot of added complexity, but it lets you access specific results easily.
```{r ranef1}
# first just store the group level effects
# in a variable named `re` for easier typing :)
re <- ranef(fit_age_size)
```
Now, `names()` will give us the grouping variables, which in our case is just `teacher_id`:
```{r ranef2}
names(re)
```
You can use `dim()` to see the size of each dimension of the output for `teacher_id`:
```{r ranef3}
dim(re$teacher_id)
```
Using `dimnames()` can give us a bit more information on what these dimensions are:
```{r ranef_dimnames}
dimnames(re$teacher_id)
```
You can see in the above output that the first dimension indexes the classrooms (or the teacher ids), the second dimension indexes the different summary statistics (estimate, quantiles, etc), and the third dimension indexes the coefficients (here just "Intercept" since we only have group-level effects for the intercept).
So if we want to access, say, the posterior mean ("Estimate") of the intercept for classroom 187, we could use R's indexing:
```{r ranef4}
re$teacher_id["187", "Estimate", "Intercept"]
```
This means that the average score in classroom 187 is expected to differ from the overall (adjusted) average by about `r round(re$teacher_id["187", "Estimate", "Intercept"], 1)` points.
To see the estimate along with all of the other posterior summary statistics for classroom 187, just leave the second part of the bracketed index blank:
```{r ranef5}
re$teacher_id["187", , "Intercept"]
```
Similarly, to see the estimate for all of the classrooms, leave the first part of the index blank:
```{r ranef6}
re$teacher_id[, "Estimate", "Intercept"]
```
This would allow you to summarize the classroom differences, e.g. with a histogram:
```{r ranef_hist}
hist(re$teacher_id[, "Estimate", "Intercept"])
```
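Another common summary of the classroom deviations is to plot them in sorted order (a so-called "caterpillar" plot). A minimal sketch using base graphics:

```{r ranef_sorted}
# classroom intercept deviations, sorted from lowest to highest
est <- re$teacher_id[, "Estimate", "Intercept"]
plot(sort(est), xlab = "classroom (rank)", ylab = "deviation from overall intercept")
abline(h = 0, lty = 2)
```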