This lab demonstrates three methods for handling missing data in a single variable of your dataset: *casewise deletion*, *multiple imputation*, and *model-based imputation*.

We will use a data set on test scores from the Tennessee Student/Teacher Achievement Ratio (Tennessee STAR) study. The subset of the data that we will use records, among other things, the scores of first-grade students on tests of reading, mathematics, and listening ability. Our model will predict reading score ($R_i$) using math score ($M_i$) and an indicator for being female ($F_i$) as covariates:

$$ \begin{aligned} R_i &\sim \mathrm{Norm}(\mu_i,\sigma)\\ \mu_i &= \beta_0 + \beta_1F_i + \beta_2M_i \end{aligned} $$

This data set is relatively complete, so we will simulate structured missingness in math score, with the probability that a score is missing depending on the student's two test scores, class size, race, and gender.
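The simulation below converts a linear combination of covariates into a probability of being missing by pushing it through the Normal CDF. A minimal sketch of that trick (the numbers here are illustrative, not the lab's actual coefficients):

```r
# pnorm() is the standard Normal CDF: it maps any real number into (0, 1),
# so a real-valued linear predictor becomes a valid probability
eta <- c(-3.5, 0, 3.5)      # illustrative linear-predictor values
p <- pnorm(eta)             # larger eta -> probability closer to 1
# draw a missingness indicator: TRUE with probability p[i]
set.seed(1)
miss <- runif(length(p)) < p
```

The intercept of -3.5 used later in the lab keeps the baseline missingness probability low, so only cases with extreme covariate values are likely to be set to missing.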

In [1]:

```
d <- read.csv("https://soci620.netlify.com/data/t_star_student_grade1.csv")
head(d)
dim(d)
summary(d[c('reading_score','math_score')])
# standardize the scores
d$reading_score <- as.numeric(scale(d$reading_score))
d$math_score <- as.numeric(scale(d$math_score))
# drop rows with missing reading score, math score, or gender from this 'reference' data
# (this way d only has complete cases)
d <- d[complete.cases(d[c('reading_score','math_score','female')]),]
dim(d)
```

| | student_id | teacher_id | class_type | class_size | born | female | race_ethnicity | re_white | re_black | re_asian | re_hispanic | re_native_american | re_other | reading_score | math_score | listening_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | *int* | *int* | *int* | *int* | *dbl* | *lgl* | *chr* | *lgl* | *lgl* | *lgl* | *lgl* | *lgl* | *lgl* | *int* | *int* | *int* |
| 1 | 1 | 1 | 3 | 23 | 1979.058 | FALSE | White | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | 516 | 578 | 601 |
| 2 | 2 | 2 | 2 | 22 | 1980.404 | FALSE | White | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | 451 | 507 | 584 |
| 3 | 3 | 3 | 2 | 21 | 1979.307 | TRUE | Black | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 483 | 526 | 529 |
| 4 | 4 | 4 | 3 | 21 | 1980.322 | FALSE | Black | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 516 | 505 | 556 |
| 5 | 5 | 4 | 3 | 21 | 1978.416 | TRUE | Black | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 433 | 463 | 504 |
| 6 | 6 | 5 | 3 | 29 | 1980.648 | TRUE | Black | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | 444 | 542 | 543 |

- 6684
- 16

| | reading_score | math_score |
|---|---|---|
| Min. | 404.0 | 404.0 |
| 1st Qu. | 478.0 | 500.0 |
| Median | 514.0 | 529.0 |
| Mean | 520.8 | 530.5 |
| 3rd Qu. | 558.0 | 557.0 |
| Max. | 651.0 | 676.0 |
| NA's | 289 | 86 |

- 6377
- 16

In [2]:

```
# create a separate data set with structured missing values of math_score
# (The following command is a kludge that uses the Normal CDF to
# convert a real-valued linear predictor into a probability.)
missing_prob <- pnorm(-3.5 - 2*scale(d$reading_score) - 2*scale(d$math_score) + scale(d$class_size) - d$re_white + 0.5*d$female)
# randomly set cases to missing according to each probability
missing_sim <- runif(length(missing_prob)) < missing_prob
table(missing_sim)
# "missing" around 1200 cases
d_missing <- d
d_missing$math_score[missing_sim] <- NA
summary(d_missing[c('reading_score','math_score','female')])
```

missing_sim:

| FALSE | TRUE |
|---|---|
| 5195 | 1180 |

| | reading_score | math_score | female |
|---|---|---|---|
| Min. | -2.116005 | -2.0768 | Mode: logical |
| 1st Qu. | -0.775240 | -0.3602 | FALSE: 3290 |
| Median | -0.122976 | 0.1733 | TRUE: 3087 |
| Mean | 0.001262 | 0.2849 | |
| 3rd Qu. | 0.674236 | 0.8461 | |
| Max. | 2.359251 | 3.3745 | |
| NA's | | 1180 | |

To start, estimate the model using the full data set without any artificially missing data.

In [3]:

```
library(brms)
# the structural model
m <- bf(reading_score ~ math_score + female)
# priors
pr <- c(
prior(normal(0,5),class=b),
prior(cauchy(0,3),class=sigma)
)
f_full <- brm(m,data=d,prior=pr,cores=4,chains=4)
summary(f_full)
```

Loading required package: Rcpp

Loading 'brms' package (version 2.14.4). Useful instructions can be found by typing help('brms'). A more detailed introduction to the package is available through vignette('brms_overview').

Compiling Stan program... Start sampling

Family: gaussian

Links: mu = identity; sigma = identity

Formula: reading_score ~ math_score + female

Data: d (Number of observations: 6377)

Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup samples = 4000

Population-Level Effects:

| | Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|---|---|---|---|
| Intercept | -0.10 | 0.01 | -0.13 | -0.08 | 1.00 | 4052 | 2925 |
| math_score | 0.73 | 0.01 | 0.71 | 0.74 | 1.00 | 4625 | 3207 |
| femaleTRUE | 0.20 | 0.02 | 0.17 | 0.23 | 1.00 | 4114 | 3112 |

Family Specific Parameters:

| | Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|---|---|---|---|
| sigma | 0.68 | 0.01 | 0.67 | 0.69 | 1.00 | 4644 | 2818 |

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS and Tail_ESS are effective sample size measures, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).

As a baseline, we can estimate the model using only the complete cases. As we will see, this leads to biased results.

(Note: `brm()` automatically drops rows with missing data and gives a warning.)
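What `brm()` does here is ordinary casewise (listwise) deletion, equivalent to filtering the data with base R's `complete.cases()`. A toy illustration:

```r
# a tiny data frame with one NA in each column
toy <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))
complete.cases(toy)          # FALSE for any row containing an NA
toy_complete <- toy[complete.cases(toy), ]   # keeps only rows 1 and 4
```

Note that a row is dropped if *any* modeled variable is missing, which is why the deletion fit below uses fewer observations than the full data.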

In [4]:

```
f_del <- brm(m,data=d_missing,prior=pr,cores=4,chains=4)
summary(f_del)
```

Warning message: “Rows containing NAs were excluded from the model.”

Compiling Stan program... Start sampling

Family: gaussian

Links: mu = identity; sigma = identity

Formula: reading_score ~ math_score + female

Data: d_missing (Number of observations: 5197)

Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup samples = 4000

Population-Level Effects:

| | Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|---|---|---|---|
| Intercept | -0.04 | 0.01 | -0.07 | -0.01 | 1.00 | 4796 | 3194 |
| math_score | 0.65 | 0.01 | 0.63 | 0.67 | 1.00 | 4842 | 3213 |
| femaleTRUE | 0.22 | 0.02 | 0.18 | 0.26 | 1.00 | 4828 | 3416 |

Family Specific Parameters:

| | Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|---|---|---|---|
| sigma | 0.71 | 0.01 | 0.70 | 0.73 | 1.00 | 4137 | 2744 |

Multiple imputation is a straightforward procedure within `brms`. If you supply a list containing multiple data sets (or the output of a call to `mice()`, as below), `brm_multiple()` will estimate the model on each data set. Then, because we are working in a Bayesian context, combining the model results is as simple as pooling the posterior samples.

We will use the `mice` package to impute values for the missing data. This is a very powerful package with many methods for imputing missing data. By default, it predicts the missing values of each variable using all of the other available variables. In our case, since we pass it only three variables, it essentially predicts the missing math scores from reading score and gender. In most cases, you will want to be more careful about which predictors you use.
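The "pooling" step amounts to concatenating the posterior draws from the fits to the separate imputed data sets; the pooled posterior then reflects both within-fit uncertainty and the variability between imputations. A toy sketch with simulated draws (the three "fits" here are stand-ins, not real model output):

```r
set.seed(42)
# pretend these are posterior draws for one coefficient from fits to
# three different imputed data sets (slightly different centers)
draws_per_fit <- lapply(c(0.9, 1.0, 1.1), function(mu) rnorm(1000, mu, 0.1))
pooled <- unlist(draws_per_fit)   # pooling = stacking the draws
length(pooled)                    # 3000 combined draws
# the pooled spread exceeds any single fit's spread (0.1) because the
# disagreement between imputations adds between-imputation variance
sd(pooled)
```

This is the Bayesian analogue of Rubin's rules: the between-imputation variance shows up automatically in the spread of the stacked draws.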

In [5]:

```
library(mice)
# impute 20 data sets on just the three variables of interest
d_multiple <- mice(d_missing[c('female','math_score','reading_score')],m=20,print=FALSE)
```

In [6]:

```
f_multiple <- brm_multiple(reading_score ~ math_score + female,data=d_multiple,prior=pr,cores=12,chains=4)
summary(f_multiple)
```

Compiling the C++ model

Fitting imputed model 1 Start sampling

[... repeated for imputed models 2 through 20 ...]

Warning message: “Parts of the model have not converged (some Rhats are > 1.05). Be careful when analysing the results! We recommend running more iterations and/or setting stronger priors.”

Family: gaussian

Links: mu = identity; sigma = identity

Formula: reading_score ~ math_score + female

Data: d_multiple (Number of observations: 6377)

Samples: 80 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup samples = 80000

Population-Level Effects:

| | Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|---|---|---|---|
| Intercept | -0.20 | 0.01 | -0.23 | -0.17 | 1.12 | 407 | 1958 |
| math_score | 0.76 | 0.01 | 0.74 | 0.78 | 1.07 | 672 | 2625 |
| femaleTRUE | 0.22 | 0.02 | 0.18 | 0.26 | 1.05 | 845 | 3016 |

Family Specific Parameters:

| | Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|---|---|---|---|
| sigma | 0.73 | 0.01 | 0.72 | 0.75 | 1.30 | 200 | 628 |

(The convergence warning above is expected with `brm_multiple()`: it pools chains that were fit to different imputed data sets, so the pooled chains can disagree slightly even when each individual fit has converged.)

As a final example, we can 'impute' the missing scores directly as part of our Bayesian model. To do so, we add a submodel for the partially observed variable itself. We will use a simple model that predicts math score from reading score, gender, and race:

$$ \begin{aligned} M_i &\sim \mathrm{Norm}(m_i,s)\\ m_i &= \alpha_0 + \alpha_1 R_i + \alpha_2 F_i + \alpha_3 {Bl}_i + \alpha_4 {As}_i + \alpha_5 {Hi}_i + \alpha_6 {Na}_i \end{aligned} $$

where $R_i$ is reading score, $F_i$ is the female indicator, and ${Bl}_i$, ${As}_i$, ${Hi}_i$, ${Na}_i$ are race/ethnicity indicators.

In [7]:

```
m_imputed <- bf(reading_score ~ mi(math_score) + female) +
bf(math_score | mi() ~ reading_score + female + re_black + re_asian + re_hispanic + re_native_american)
# need to specify priors for both models using the 'resp' argument
# (note: these drop the underscore from the variable names)
pr_imputed <- c(
prior(normal(0,5),class=b,resp=readingscore),
prior(cauchy(0,3),class=sigma,resp=readingscore),
prior(normal(0,5),class=b,resp=mathscore),
prior(cauchy(0,3),class=sigma,resp=mathscore)
)
f_imputed <- brm(m_imputed,data=d_missing,prior=pr_imputed,cores=4, chains=4,control = list(max_treedepth = 15))
summary(f_imputed)
```

Setting 'rescor' to FALSE by default for this model

Warning message: “Rows containing NAs were excluded from the model.”

Compiling Stan program... Start sampling

Family: MV(gaussian, gaussian)

Links: mu = identity; sigma = identity (both responses)

Formula: reading_score ~ mi(math_score) + female
math_score | mi() ~ reading_score + female + re_black + re_asian + re_hispanic + re_native_american

Data: d_missing (Number of observations: 6375)

Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup samples = 4000

Population-Level Effects:

| | Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|---|---|---|---|
| readingscore_Intercept | -0.14 | 0.01 | -0.16 | -0.11 | 1.00 | 4906 | 2912 |
| mathscore_Intercept | 0.18 | 0.01 | 0.15 | 0.20 | 1.00 | 5058 | 2898 |
| readingscore_femaleTRUE | 0.22 | 0.02 | 0.18 | 0.25 | 1.00 | 5690 | 3142 |
| mathscore_reading_score | 0.68 | 0.01 | 0.66 | 0.70 | 1.00 | 3314 | 3055 |
| mathscore_femaleTRUE | -0.16 | 0.02 | -0.19 | -0.12 | 1.00 | 5319 | 3133 |
| mathscore_re_blackTRUE | -0.17 | 0.02 | -0.21 | -0.13 | 1.00 | 4805 | 3162 |
| mathscore_re_asianTRUE | -0.13 | 0.16 | -0.45 | 0.18 | 1.00 | 5899 | 2855 |
| mathscore_re_hispanicTRUE | 0.06 | 0.24 | -0.42 | 0.52 | 1.00 | 4145 | 2765 |
| mathscore_re_native_americanTRUE | -0.88 | 0.41 | -1.68 | -0.05 | 1.00 | 4100 | 2916 |
| readingscore_mimath_score | 0.74 | 0.01 | 0.72 | 0.75 | 1.00 | 5302 | 2882 |

Family Specific Parameters:

| | Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|---|---|---|---|
| sigma_readingscore | 0.69 | 0.01 | 0.68 | 0.70 | 1.00 | 5838 | 3086 |
| sigma_mathscore | 0.67 | 0.01 | 0.66 | 0.68 | 1.00 | 4276 | 2864 |


We can plot the predicted values from each model to see how they compare.

In [9]:

```
# grid of math-score values at which to compute predictions
grid <- seq(from=min(d$math_score),to=max(d$math_score),length.out=500)
# posterior samples from all four models
post_samps <- list(
full = posterior_samples(f_full),
delete = posterior_samples(f_del),
multiple = posterior_samples(f_multiple),
model = posterior_samples(f_imputed)
)
# rename the variables from the last model so they all have the same structure
names(post_samps$model)[c(1,3,10)] <- c('b_Intercept','b_femaleTRUE','b_math_score')
# 90% posterior interval of the regression line for female=FALSE (boys)
post_ci_male <- lapply(post_samps,function(s){
sapply(grid,function(x){
quantile(s$b_Intercept + x*s$b_math_score,prob=c(.05,.95))
})
})
# 90% posterior interval of the regression line for female=TRUE (girls)
post_ci_female <- lapply(post_samps,function(s){
sapply(grid,function(x){
quantile(s$b_Intercept + s$b_femaleTRUE + x*s$b_math_score,prob=c(.05,.95))
})
})
```

In [10]:

```
# prepare the plot area
options(repr.plot.width=12, repr.plot.height=8)
plot(NA,xlim=range(d$math_score),ylim=range(d$reading_score),
xlab='Math score (standardized)',ylab='Reading score (standardized)',main='Predicted test score, boys')
# plot the data, with hollow points marking cases whose math score was set to missing
points(d$math_score,d$reading_score,pch=ifelse(is.na(d_missing$math_score),1,19),col='#00000033')
cols <- c('#00000044','#88110044','#00881144','#11008844')
for(i in 1:4){
polygon(c(grid,rev(grid)),c(post_ci_male[[i]][1,],rev(post_ci_male[[i]][2,])),border=NA,col=cols[[i]])
}
legend('topleft',legend=c('Full data','Casewise deletion', 'Multiple imputation','Model-based imputation'),fill=cols,border=NA,bty='n')
```
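The `post_ci_female` bands computed earlier are not used in the plot above. A companion plot for girls follows the same pattern (this sketch assumes the objects created in the previous cells are still in the workspace, so it is not run standalone):

```r
# same plot as above, but using the female=TRUE posterior intervals
plot(NA,xlim=range(d$math_score),ylim=range(d$reading_score),
xlab='Math score (standardized)',ylab='Reading score (standardized)',main='Predicted test score, girls')
# plot the data, with hollow points marking cases whose math score was set to missing
points(d$math_score,d$reading_score,pch=ifelse(is.na(d_missing$math_score),1,19),col='#00000033')
for(i in 1:4){
polygon(c(grid,rev(grid)),c(post_ci_female[[i]][1,],rev(post_ci_female[[i]][2,])),border=NA,col=cols[[i]])
}
legend('topleft',legend=c('Full data','Casewise deletion','Multiple imputation','Model-based imputation'),fill=cols,border=NA,bty='n')
```

Comparing the two plots shows how each missing-data strategy shifts the predicted regression line relative to the full-data fit.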