---
title: "SOCI 620: Lab 15"
author: "Peter McMahan"
date: "3/7/2023"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(brms)
library(mice)
# this will tell brms to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
#Dealing with missing data
This lab will demonstrate different methods for dealing with missing data in a single variable in your dataset: *casewise deletion*, *multiple imputation*, and *model-based imputation*.
We will use a data set on test scores from the Tennessee Student Teacher Achievement Ratio (Tennessee STAR) study. The subset of the data that we will use records, among other things, the scores of first-grade students on tests of reading, mathematics, and listening ability. Our model will predict listening score using the other two test scores as covariates:
$$
\begin{aligned}
R_i &\sim \mathrm{Norm}(\mu_i,\sigma)\\
\mu_i &= \beta_0 + \beta_1F_i + \beta_2M_i
\end{aligned}
$$
This data is relatively complete, so we will simulate missing data on math score, depending on values of the dependent variable and the size of the students' class.
```{r download_data}
d <- read.csv("https://soci620.netlify.com/data/t_star_student_grade1.csv")
head(d)
dim(d)
summary(d[c('reading_score','math_score','female')])
```
```{r standardize}
# standardize the scores
d$reading_score <- as.numeric( scale(d$reading_score) )
d$math_score <- as.numeric( scale(d$math_score) )
# drop incomplete rows for reading and listening from this 'reference' data
# (this way d only has complete cases)
d <- d[complete.cases(d[c('reading_score','math_score','female')]),]
dim(d)
summary(d[c('reading_score','math_score','female')])
```
```{r simulate_missing}
# create a separate data set with structured missing values of math_score
# (The following command is a cludge that uses the Normal CDF to
# convert real-valued numbers to probabilities.)
missing_prob <- pnorm(-3.5 - 2*scale(d$reading_score) - 2*scale(d$math_score)+ scale(d$class_size) - d$re_white + 0.5*d$female)
# choose missing values randomly according each probability
missing_sim <- runif(length(missing_prob)) < missing_prob
table(missing_sim)
# "missing" around 1200 cases
d_missing <- d
d_missing$math_score[missing_sim] <- NA
summary(d_missing[c('reading_score','math_score','female')])
```
## Full data
To start, estimate the model using the full data set without any artificially missing data.
```{r complete_data, cache=TRUE}
# the structural model
model0 <- bf(reading_score ~ math_score + female)
# priors
prior0 <- c(
prior(normal(0,5),class=b),
prior(cauchy(0,3),class=sigma)
)
fit0 <- brm(model0,data=d,prior=prior0,cores=4,chains=4,refresh=0)
summary(fit0)
```
## Casewise deletion
As a baseline, we can estimate the model using only the complete cases. As we will see, this leads to biased results.
(Note: `brm()` automatically drops rows with missing data and gives a warning. So all we need to do here is substitute the data with the fake missing values)
```{r fit_dropmissing, cache=TRUE}
fit_drop <- brm(model0,data=d_missing,prior=prior0,cores=4,chains=4,refresh=0)
summary(fit_drop)
```
## Multiple imputation
Multiple imputation is a straightforward procedure within `brms`. If you supply a list with multiple data sets in it (or the output from a call to `mice()`, as below), `brm_multiple()` will estimate the model on each set of data. Then, because we are working in a Bayesian context, combining the model results is a simple as pooling the posterior samples.
We will use the `mice` package to impute values for the data. This is a very powerful package with many methods to impute missing data. By default, it predicts missing values in each variable using all other available variables. In our case, it is essentially running a linear regression predicting math scores using listening scores and reading scores as predictors. In most cases, you will want to be more careful about what predictors you use.
```{r multiple_imputation_data, cache=TRUE}
library(mice)
# impute 20 data sets on just the three variables of interest
d_multiple <- mice(d_missing[c('female','math_score','reading_score')],m=20,print=FALSE)
summary(d_multiple)
```
The `brm_multiple()` function will estimate the model on the kind of multiple imputation dataset we just made. It essentially just runs `brm` on each of the 20 datasets separately, then pools the results into a single sample.
You will probably want to set `open_progress=FALSE` to keep it from trying to show you the progress in a web browser.
```{r fit_multiple_imputation, cache=TRUE}
fit_multiple <- brm_multiple(model0,data=d_multiple,prior=prior0,cores=12,chains=4,open_progress=FALSE, refresh=0)
summary(fit_multiple)
```
A high `Rhat` value for these estimates is completely normal. Since each of the imputed datasets will give us somewhat different posterior samples, combining them will look like multiple chains that did not 'mix' well.
## Model-based Bayesian imputation
As a final example, we can 'impute' the missing scores directly as part of our Bayesian model. To do so, we need to model the missing data process. We will use a simple model that says that gender and race are the only factors affecting whether the math score is missing:
$$
\begin{aligned}
M_i &\sim \mathrm{Norm}(m_i,s)\\
m_i &= \alpha_0 + \alpha_1 F_i + \alpha_2 {Bl}_i + \alpha_3 {As}_i + \alpha_4 {Hi}_i + \alpha_5 {Na}_i
\end{aligned}
$$
```{r model_bayesianimputation}
model_imputed <- bf(reading_score ~ mi(math_score) + female) +
bf(math_score | mi() ~ reading_score + female + re_black + re_asian + re_hispanic + re_native_american)
# need to specify priors for both models using the 'resp' argument
# (note: these drop the underscore from the variable names)
prior_imputed <- c(
prior(normal(0,5),class=b, resp=readingscore),
prior(cauchy(0,3),class=sigma, resp=readingscore),
prior(normal(0,5),class=b, resp=mathscore),
prior(cauchy(0,3),class=sigma, resp=mathscore)
)
```
```{r fit_bayesianimputation, cache=TRUE}
fit_imputed <- brm(model_imputed,data=d_missing,prior=prior_imputed,cores=4, chains=4, refresh=0)
#control = list(max_treedepth = 15))
summary(fit_imputed)
```
## Comparing results
We can plot the predicted values from each model to see how they compare.
```{r compare_predictions}
# create the grid and make predictions
grid = seq(from=min(d$math_score),to=max(d$math_score),length.out=500)
# posterior samples from all four models
post_samps <- list(
full = posterior_samples(fit0),
delete = posterior_samples(fit_drop),
multiple = posterior_samples(fit_multiple),
model = posterior_samples(fit_imputed)
)
# rename the variables from the last model so they all have the same structure
names(post_samps$model)[c(1,3,10)] <- c('b_Intercept','b_femaleTRUE','b_math_score')
# posterior quantiles for female=FALSE
post_ci_male <- lapply(post_samps,function(s){
sapply(grid,function(x){
quantile(s$b_Intercept + x*s$b_math_score,prob=c(.05,.95))
})
})
# posterior quantiles for female=TRUE
post_ci_female <- lapply(post_samps,function(s){
sapply(grid,function(x){
quantile(s$b_Intercept + s$b_femaleTRUE + x*s$b_math_score,prob=c(.05,.95))
})
})
```
```{r plot_predictions_femaleFalse}
# prepare the plot area
options(repr.plot.width=12, repr.plot.height=8)
plot(NA,xlim=range(d$math_score),ylim=range(d$reading_score),
xlab='Math score (standardized)',ylab='Reading score (standardized)',main='Predicted test score, boys')
#plot data, indicating missing
points(d$math_score,d$reading_score,pch=ifelse(is.na(d_missing$math_score),1,19),col='#00000033')
cols <- c('#00000044','#88110044','#00881144','#11008844')
for(i in 1:4){
polygon(c(grid,rev(grid)),c(post_ci_male[[i]][1,],rev(post_ci_male[[i]][2,])),border=NA,col=cols[[i]])
}
legend('topleft',legend=c('Full data','Casewise deletion', 'Multiple imputation','Model-based imputation'),fill=cols,border=NA,bty='n')
```
```{r plot_predictions_femaleTrue}
# same for female=TRUE
# prepare the plot area
options(repr.plot.width=12, repr.plot.height=8)
plot(NA,xlim=range(d$math_score),ylim=range(d$reading_score),
xlab='Math score (standardized)',ylab='Reading score (standardized)',main='Predicted test score, girls')
#plot data, indicating missing
points(d$math_score,d$reading_score,pch=ifelse(is.na(d_missing$math_score),1,19),col='#00000033')
cols <- c('#00000044','#88110044','#00881144','#11008844')
for(i in 1:4){
polygon(c(grid,rev(grid)),c(post_ci_female[[i]][1,],rev(post_ci_female[[i]][2,])),border=NA,col=cols[[i]])
}
legend('topleft',legend=c('Full data','Casewise deletion', 'Multiple imputation','Model-based imputation'),fill=cols,border=NA,bty='n')
```