---
title: "SOCI 620: Lab 16"
author: "Peter McMahan"
date: "3/9/2023"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(brms)
# this will tell brms to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
# Using sample weights in `brms`
In this lab, we will look a little more at model specification in `brms` and then use it to demonstrate the importance and use of sampling weights.
We will use a slightly different subset of the Tennessee STAR data that includes information on the teachers.
```{r data}
d <- read.csv("https://soci620.netlify.com/data/t_star_student_grade1_2l.csv")
# standardize test scores
# (I'm including this as an example of defining
# your own function. This creates a function `sc`
# that is substantially the same as the `scale`
# function built in to R)
sc <- function(x){
  s <- sd(x, na.rm=TRUE)
  m <- mean(x, na.rm=TRUE)
  (x-m)/s
}
d$class_size_st <- sc(d$class_size)
d$student_reading_score_st <- sc(d$student_reading_score)
d$student_listening_score_st <- sc(d$student_listening_score)
d$student_math_score_st <- sc(d$student_math_score)
head(d)
```
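As a quick sanity check (using a small made-up vector, not the STAR scores), `sc` should agree with base R's built-in `scale`:

```r
# repeat the definition from above so this chunk stands alone
sc <- function(x){
  s <- sd(x, na.rm=TRUE)
  m <- mean(x, na.rm=TRUE)
  (x-m)/s
}
x <- c(2, 4, 6, 8, 10)  # toy data
# scale() returns a 1-column matrix, so coerce it to a plain vector to compare
all.equal(sc(x), as.numeric(scale(x)))  # -> TRUE
```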
## Building a model
To demonstrate how to use these sampling weights, we will specify a simple regression predicting student math test score using data on their class size and their race and ethnicity. The formula part of the model specification is straightforward:
```{r model_1}
# You can wrap the formula across multiple lines.
model_1 <- bf(
  student_math_score_st ~
    class_size_st +
    student_re_black + student_re_asian + student_re_hispanic +
    student_re_native_american + student_re_other,
  family = gaussian()
)
```
Before we try to estimate this, we will use the function `get_prior` to figure out how to indicate priors for different parameters.
```{r prior_1_prep}
get_prior(model_1,data=d)
```
```{r prior_1}
prior_1 <- c(
  prior(normal(0,2), class = Intercept),
  prior(normal(0,2), class = b),
  prior(cauchy(0,3), class = sigma),
  prior(normal(0,1), coef = class_size_st)
)
```
## Sampling weights
The Tennessee STAR data did not use any kind of stratified sampling, but we can simulate stratified sampling by taking a subsample of the students.
We start by calculating a sampling probability for each student. The students in the full sample are overwhelmingly white, and there are almost no students who are neither white nor Black. To make sure our subsample captures the under-represented races and ethnicities, we will sample white students with a probability of 3%, Black students with a probability of 6%, and everyone else with a probability of 100%.
```{r race_ethnicity_freq}
table(d$student_race_ethnicity)
```
```{r generate_sampling_probs}
# by default, get 3% of students
d$psamp <- 0.03
# change that to 6% for Black students
d$psamp[d$student_re_black] <- 0.06
# make sure we sample ALL students whose race/ethnicity is neither Black nor white
d$psamp[!d$student_re_black & !d$student_re_white] <- 1.0
table(d$student_race_ethnicity, d$psamp)
head(d)
```
We can use these sampling probabilities to take a stratified sample of the full population. We also want to create a variable indicating the 'weight' of each student — this can be thought of as the number of students in the full population that each sampled student is supposed to represent.
This is done in R by drawing a random number between 0 and 1 for each student using `runif()`. Then we only keep rows for which that random number is *less than or equal to* the sampling probability.
Remember that the sampling *weight* is the inverse of the sampling *probability* ($w = 1/p$)
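As a quick check on that identity (a toy simulation with made-up numbers, not the lab data), the weights of a Bernoulli sample should sum to approximately the size of the population it was drawn from, since each sampled unit "stands in for" $1/p$ population units:

```r
set.seed(620)  # arbitrary seed, for reproducibility
N <- 10000                                    # toy population size
p <- rep(c(0.03, 0.06, 1.0), length.out = N)  # unequal sampling probabilities
in_samp <- runif(N) <= p                      # Bernoulli sampling, as in the next chunk
w <- 1 / p[in_samp]                           # weight = inverse sampling probability
sum(w)                                        # should be close to N = 10000
```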
```{r take_subsample}
# make a new data frame with the subsamples
in_sample <- runif(nrow(d)) <= d$psamp
d_ss <- d[in_sample,]
# create a weight variable 'w'
d_ss$w <- 1/d_ss$psamp
head(d_ss)
table(d_ss$student_race_ethnicity)
```
For comparison, we will also take a subsample with uniform probability, and therefore no weights.
```{r uniform_subsample}
# take an identically sized subsample with uniform probability
in_sample_noweights <- sample(nrow(d),nrow(d_ss))
d_ss_noweights <- d[ in_sample_noweights ,]
table(d_ss_noweights$student_race_ethnicity)
```
### Estimates on a uniform sample
We will start by estimating the model that predicts student test scores based on race/ethnicity from the unweighted subsample.
```{r fit1_noweights, cache=TRUE}
fit1_noweights <- brm(
  model_1, prior = prior_1,
  data = d_ss_noweights, cores = 4, chains = 4)
```
```{r fit1_noweights_summary}
summary(fit1_noweights)
prior_summary(fit1_noweights)
```
### Estimates on a stratified sample, using sampling weights
We estimate the same model, but this time incorporating sampling weights and fitting to the stratified sample data.
```{r model_1_weighted}
model_1_weighted <- bf(
  student_math_score_st | weights(w) ~
    class_size_st +
    student_re_black + student_re_asian + student_re_hispanic +
    student_re_native_american + student_re_other,
  family = gaussian()
)
```
```{r fit1_weighted, cache=TRUE}
fit1_weighted <- brm(
  model_1_weighted, prior = prior_1,
  data = d_ss, cores = 4, chains = 4)
```
```{r fit1_weighted_summary}
summary(fit1_weighted)
```
### Visualize
We can visualize and compare the estimates from the two models using forest plots. `brms` has a built-in function `mcmc_plot` that makes plotting these easy.
```{r two_forest_plots}
mcmc_plot(fit1_noweights)
mcmc_plot(fit1_weighted)
```
You can also select which parameters you want plotted using the `variable` argument. Here, we also set `regex = TRUE` so we can match any parameters with `b_` in their names -- this should get all of the coefficients.
```{r two_forest_plots_nosigma}
mcmc_plot(fit1_noweights, variable = 'b_', regex = TRUE)
mcmc_plot(fit1_weighted, variable = 'b_', regex = TRUE)
```
Run the help `?'MCMC-intervals'` to see more options for `mcmc_plot`. Try adjusting the size of the credible intervals displayed using the `prob` and `prob_outer` arguments.
Finally, we can get our hands dirty with the underlying `bayesplot` package and `ggplot2` to combine the plots into one for easier comparison.
```{r one_forest_plot}
library(bayesplot)
library(ggplot2)
# extract the appropriate data from the estimates
forest_data_noweights <- mcmc_intervals_data(fit1_noweights, regex_pars = 'b_')
forest_data_weighted <- mcmc_intervals_data(fit1_weighted, regex_pars = 'b_')
# combine them into one
forest_data <- rbind(forest_data_noweights, forest_data_weighted)
forest_data$model <- rep(c("Unweighted","Weighted"), each=nrow(forest_data_noweights))
# use ggplot to offset the vertical position of the two models
pos <- position_nudge(y = ifelse(forest_data$model == "Weighted", -0.05, 0.05))
ggplot(forest_data, aes(x = m, y = parameter, color = model)) +
  geom_point(position = pos) +
  geom_linerange(aes(xmin = ll, xmax = hh), position = pos)
```
### Bonus: shinystan
One way to look at model outcomes is to use the 'shiny' framework in R. Shiny is a framework for making web interfaces to analyses in R, and the "shinystan" package can be really useful for summarizing and diagnosing MCMC output like `brms` produces. To use it, simply call the `launch_shinystan()` function with your model as an argument. You don't want to run this command when you are knitting your document, since it is interactive -- instead, type or paste it into the R console. By leaving off the `{r}` at the start of the following code block, I am telling knitr not to run the code at all, just to display it:
```
launch_shinystan(fit1_weighted)
```