---
title: "SOCI 620: Lab 16"
author: "Peter McMahan"
date: "3/15/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(brms)
library(pander)
# this will tell brms to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
# Using sample weights in `brms`
In this lab, we will look a little more at model specification in `brms` and then use brms to demonstrate the importance and use of sampling weights.
We will use a slightly different subset of the Tennessee STAR data that includes information on the teachers.
```{r data}
d <- read.csv("https://soci620.netlify.com/data/t_star_student_grade1_2l.csv")
# standardize test scores
sc <- function(x){
s <- sd(x,na.rm=TRUE)
m <- mean(x,na.rm=TRUE)
(x-m)/s
}
d$class_size_st <- sc(d$class_size)
d$student_reading_score_st <- sc(d$student_reading_score)
d$student_listening_score_st <- sc(d$student_listening_score)
d$student_math_score_st <- sc(d$student_math_score)
head(d)
```
## Building a model
To demonstrate how this sampling weights, we will specify a simple regression predicting student math test score using data on their class size and their race and ethnicity. The formula part of the model specification is straightforward:
```{r model_1}
# You can wrap the formula agross multiple lines.
model_1 <- bf(
student_math_score_st ~
class_size_st +
student_re_black + student_re_asian + student_re_hispanic +
student_re_native_american + student_re_other,
family = gaussian()
)
```
Before we try to estimate this, we will use the function `get_prior` to figure out how to indicate priors for different parameters.
```{r prior_1_prep}
get_prior(model_1,data=d)
```
```{r prior_1}
prior_1 <- c(
prior(normal(0,2),class=Intercept),
prior(normal(0,2),class=b),
prior(cauchy(0,3),class=sigma),
prior(normal(0,1),coef=class_size_st)
)
```
## Sampling weights
The Tennessee STAR data did not use any kind of stratified sampling, but we can simulate stratified sampling by taking a subsample of the students.
We start by calculating a sampling probability for each student. The students in the full sample are overwhelmingly white, and there are almost no students who are neither white nor Black. To make sure our subsample captures the under-represented races and ethnicities, we will sample white students with a probability of 3% Black students with a probability of 6% and everyone else with a probability of 100%.
```{r race_ethinicity_freq}
table(d$student_race_ethnicity)
```
```{r generate_sampling_probs}
# by default, get 3% of students
d$psamp <- .03
# double that for Black students
d$psamp[d$student_re_black] <- .06
# make sure we sample ALL students whose race/ethnicity is neither Black nor white
d$psamp[!d$student_re_black & !d$student_re_white] <- 1
table(d$student_race_ethnicity, d$psamp)
head(d)
```
We can use these sampling probabilities to take a stratified sample of the full population. We also want to create a variable indicating the 'weight' of each student — this can be thought of as the number of students in the full population that each sampled student is supposed to represent.
This is done in R by drawing a random number between 0 and 1 for each student using `runif()`. Then we only keep rows for which that random number is *less than* the sampling probability.
Remember that the sampling *weight* is the inverse of the sampling *probability* ($w = 1/p$)
```{r take_subsample}
# make a new data frame with the subsamples
in_sample <- runif(nrow(d)) <= d$psamp
d_ss <- d[in_sample,]
# create a weight variable 'w'
d_ss$w <- 1/d_ss$psamp
head(d_ss)
table(d_ss$student_race_ethnicity)
```
For comparison, we will also take a subsample withuniform probability, and therefore no weights.
```{r uniform_subsample}
# take an identically sized subsample with uniform probability
in_sample_noweights <- sample(nrow(d),nrow(d_ss))
d_ss_noweights <- d[ in_sample_noweights ,]
table(d_ss_noweights$student_race_ethnicity)
```
### Estimates on a uniform sample
We will start by estimating the model that predicts student test scores based on race/ethnicity from the unweighted subsample.
```{r fit1_noweights, cache=TRUE}
fit1_noweights <- brm(
model_1,prior=prior_1,
data=d_ss_noweights, cores=4,chains=4)
```
```{r fit1_noweights_summary}
summary(fit1_noweights)
prior_summary(fit1_noweights)
```
### Estimates on a stratified sample, using sampling weights
We estimate the same model, but incorporating sample weights and using the stratified sample data
```{r model_1_weighted}
model_1_weighted <- bf(
student_math_score_st | weights(w) ~
class_size_st +
student_re_black + student_re_asian + student_re_hispanic +
student_re_native_american + student_re_other
)
```
```{r fit1_weighted, cache=TRUE}
fit1_weighted <- brm(
model_1_weighted, prior=prior_1,
data=d_ss, cores=4,chains=4)
```
```{r fit1_weighted_summary}
summary(fit1_weighted)
```
### Bonus: shinystan
One way to look at model outcomes is to use the 'shiny' framework in R. Shiny is a framework for making web interfaces to analyses in R, and there is a package called "shinystan" that can be really useful for summarizing and diagnosing MCMC output like brms produces. To use it, simply call the the `launch_shinystan()` function with your model as an argument:
```{r shinystan_demo}
launch_shinystan(fit1_weighted)
```