Dealing with missing data

This lab will demonstrate different methods for dealing with missing data in a single variable in your dataset: casewise deletion, multiple imputation, and model-based imputation.

We will use a data set on test scores from the Tennessee Student Teacher Achievement Ratio (Tennessee STAR) study. The subset of the data that we will use records, among other things, the scores of first-grade students on tests of reading, mathematics, and listening ability. Our model will predict listening score using the other two test scores as covariates:

$$ \begin{aligned} L_i &\sim \mathrm{Norm}(\mu_i,\sigma)\\ \mu_i &= \beta_0 + \beta_1 R_i + \beta_2 M_i \end{aligned} $$

where $L_i$, $R_i$, and $M_i$ are student $i$'s listening, reading, and math scores.

This data is relatively complete, so we will simulate missing data on math score, depending on values of the dependent variable and the size of the students' class.
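One way to simulate this is to make the probability of a missing math score depend on the listening score and class size. A minimal sketch, assuming a data frame called `star` with columns `listen`, `math`, and `size` (all placeholder names for whatever your copy of the data uses):

```r
set.seed(2024)

# Missingness depends on the listening score (the dependent variable) and
# on class size, so the mechanism is not missing completely at random.
p_miss <- plogis(-1.5 + 1.0 * scale(star$listen)[, 1] +
                   0.5 * (star$size == "regular"))

# math_obs holds the math score with values artificially set to NA.
star$math_obs <- ifelse(rbinom(nrow(star), 1, p_miss) == 1, NA, star$math)
```

The intercept and slopes in `p_miss` are arbitrary; adjust them to control how much data goes missing and how strongly missingness relates to the outcome.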

Full data

To start, estimate the model using the full data set without any artificially missing data.
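A sketch of the full-data fit, again assuming placeholder column names `listen`, `read`, and `math`:

```r
library(brms)

# Fit the regression on the complete data, before any values are removed.
fit_full <- brm(listen ~ read + math, data = star, family = gaussian())
summary(fit_full)
```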

Casewise deletion

As a baseline, we can estimate the model using only the complete cases. Because the missingness depends on the outcome, we will see that this leads to biased estimates.

(Note: brm() automatically drops rows with missing data and gives a warning.)
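So casewise deletion requires no extra work: fitting the model to the column with simulated missingness (here the placeholder `math_obs`) drops the incomplete rows automatically.

```r
# brm() warns about the NA values and fits the model to the remaining
# complete cases only.
fit_cc <- brm(listen ~ read + math_obs, data = star, family = gaussian())
summary(fit_cc)
```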

Multiple imputation

Multiple imputation is a straightforward procedure within brms. If you supply a list with multiple data sets in it (or the output from a call to mice(), as below), brm_multiple() will estimate the model on each data set. Then, because we are working in a Bayesian context, combining the model results is as simple as pooling the posterior samples.

We will use the mice package to impute values for the data. This is a very powerful package with many methods to impute missing data. By default, it predicts missing values in each variable using all other available variables. In our case, it is essentially running a linear regression predicting math scores using listening scores and reading scores as predictors. In most cases, you will want to be more careful about what predictors you use.
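A sketch of this workflow, using the same placeholder column names as above:

```r
library(mice)

# Create five completed data sets. By default, mice() predicts each
# missing value from all other variables in the data it is given.
imp <- mice(star[, c("listen", "read", "math_obs")], m = 5, print = FALSE)

# brm_multiple() fits the model to each imputed data set and pools the
# posterior draws into a single object.
fit_mi <- brm_multiple(listen ~ read + math_obs, data = imp)
```

Restricting the data frame passed to mice() to the model variables is what keeps the imputation model to the simple regression described above.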

Model-based Bayesian imputation

As a final example, we can 'impute' the missing scores directly as part of our Bayesian model. To do so, we need a model for the missing values themselves. We will use a simple model that predicts the math score from gender and race alone:

$$ \begin{aligned} M_i &\sim \mathrm{Norm}(m_i,s)\\ m_i &= \alpha_0 + \alpha_1 F_i + \alpha_2 {Bl}_i + \alpha_3 {As}_i + \alpha_4 {Hi}_i + \alpha_5 {Na}_i \end{aligned} $$
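In brms, this two-part model is written with mi() terms. A sketch, assuming placeholder columns `gender` and `race` (a factor whose dummy coding corresponds to the $\alpha_2,\dots,\alpha_5$ indicators above):

```r
# The outcome model treats math_obs as partially missing via mi(), and a
# second formula models the math score from gender and race; missing
# values are then imputed jointly with the regression parameters.
bform <- bf(listen ~ read + mi(math_obs)) +
  bf(math_obs | mi() ~ gender + race) +
  set_rescor(FALSE)

fit_mb <- brm(bform, data = star)
```

Unlike multiple imputation, this yields a single posterior in which the uncertainty about the missing math scores propagates directly into the regression coefficients.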

Comparing results

We can plot the predicted values from each model to see how they compare.
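One simple comparison is to overlay the distributions of posterior-mean fitted values from each model (the object names below are the placeholders used in the earlier sketches):

```r
library(ggplot2)

# Collect posterior-mean fitted values from each fit. The casewise model
# contributes fewer rows because incomplete cases were dropped.
preds <- rbind(
  data.frame(model = "full data", fit = fitted(fit_full)[, "Estimate"]),
  data.frame(model = "casewise",  fit = fitted(fit_cc)[, "Estimate"]),
  data.frame(model = "mice",      fit = fitted(fit_mi)[, "Estimate"])
)

ggplot(preds, aes(fit, colour = model)) + geom_density()
```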