Assignment 5, SOCI 620, Winter 2019

Due Wed, Feb 20

This assignment will use Add Health data to examine students’ decisions to skip class. The outcome variable of interest asks “how many times have you skipped school for a full day without an excuse [during this school year]?”

General note: you do not need to go into as much detail justifying your priors for this assignment as you did for previous assignments. Unless your priors are significantly informative, simply describing them will be enough.

  1. Load the data.

    The data is online, and can be loaded using

    d <- read.csv('https://soci620.netlify.com/data/health_attendance.csv')

    The outcome variable is called days_skipped. The data also contains variables on the student’s overall health (gen_health), whether any parent is on social support or welfare (par_welfare), and whether any parent has a physical disability (par_disabled).

  2. Build a Poisson model of days skipped. (6 points)

    First, build a standard Poisson model predicting the number of days skipped by the student.

    1. Construct two new variables:

      One by recentering grade around grade 10, so that a value of 0 corresponds to a student in grade 10.

      One from gen_health that takes a value of 1 if a student rated their health as ‘Excellent’, ‘Very good’, or ‘Good’ and a value of 0 if they rated their health as ‘Fair’ or ‘Poor’. One way to make this variable would look something like:

      d$health <- NA
      d$health[d$gen_health == 'Excellent'] <- 1
      d$health[d$gen_health == 'Very good'] <- 1
      d$health[d$gen_health == 'Good'] <- 1
      d$health[d$gen_health == 'Fair'] <- 0
      d$health[d$gen_health == 'Poor'] <- 0

      Note: When constructing variables, be careful to preserve missing (NA) values! A command like ifelse(d$gen_health %in% c('Excellent','Very good','Good'),1,0) will code NAs to 0.

      You can check whether your new variable is correctly coded by looking at a cross-tabulation of the old and new variable: table(d$health,d$gen_health). You can make sure that the missing values were preserved in the transformation by applying is.na() to both: table(is.na(d$health),is.na(d$gen_health)).
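
      The difference between the two coding approaches can be seen on a small made-up example. This is only a sketch: the toy data frame, its values, and the names grade, grade_c, and bad are assumptions for illustration and are not taken from the assignment data.

      ```r
      # Toy data standing in for the real columns (values are made up).
      toy <- data.frame(
        grade      = c(9, 10, 11, 12, NA),
        gen_health = c('Excellent', 'Fair', NA, 'Good', 'Poor'),
        stringsAsFactors = FALSE
      )

      # Recenter grade so that 0 corresponds to grade 10; NA stays NA.
      toy$grade_c <- toy$grade - 10

      # Pitfall: %in% never returns NA, so the missing response is coded 0.
      bad <- ifelse(toy$gen_health %in% c('Excellent', 'Very good', 'Good'), 1, 0)

      # NA-preserving alternative: start from NA and fill in known categories only.
      toy$health <- NA
      toy$health[toy$gen_health %in% c('Excellent', 'Very good', 'Good')] <- 1
      toy$health[toy$gen_health %in% c('Fair', 'Poor')] <- 0

      # The two versions disagree exactly where gen_health is missing.
      print(data.frame(gen_health = toy$gen_health, bad = bad, health = toy$health))

      # The same checks suggested above, applied to the toy data.
      print(table(toy$health, toy$gen_health))
      print(table(is.na(toy$health), is.na(toy$gen_health)))
      ```

      In the last table, every observation should fall on the diagonal: missing in both variables or missing in neither.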

    2. Build and estimate a standard Poisson regression predicting days skipped, using centered grade level (constructed above), an indicator of good health (constructed above), family log income, and whether any parent receives public assistance (welfare) as predictors. Specify the model in full mathematical form as well as in the compact forms used by brms for the main model and the priors. Use brm() to estimate the model.
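
      As a reminder of what "full mathematical form" can look like, here is one possible layout for a model of this kind. It is a sketch, not the required answer: the subscripted variable names and the symbolic prior parameters (m_0, s_0, s_beta) are placeholders, and the choice of normal priors is an assumption you are free to change.

      ```latex
      \begin{aligned}
      y_i &\sim \mathrm{Poisson}(\lambda_i) \\
      \log \lambda_i &= \beta_0
          + \beta_1\,\mathrm{grade\_c}_i
          + \beta_2\,\mathrm{health}_i
          + \beta_3\,\mathrm{log\_income}_i
          + \beta_4\,\mathrm{welfare}_i \\
      \beta_0 &\sim \mathrm{Normal}(m_0,\, s_0) \\
      \beta_k &\sim \mathrm{Normal}(0,\, s_\beta), \qquad k = 1, \dots, 4
      \end{aligned}
      ```

      Because of the log link, a one-unit increase in predictor k multiplies the expected number of days skipped by exp(β_k).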

    3. Interpret the results of the model using specific terms. Which students are more or less likely to skip? How much of a change is associated with each of the predictor variables?

    4. Construct a posterior predictive plot from this model and compare it to the empirical distribution of days skipped. Does it look like a good fit?

  3. Build an over-dispersed Poisson model. (7 points)

    One reason a standard Poisson regression can go wrong is that the Poisson distribution cannot accommodate variation in the population that the predictors do not account for. In this case, the restriction that the variance must equal the mean for any given type of student may be biasing our results. In this step, you will build a new model that allows more variation in your outcome variable than the standard Poisson by using a gamma-Poisson mixture.
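
    The mean-variance restriction can be illustrated with a short simulation. This is a sketch on simulated counts, not on the assignment data; the seed, sample size, mean (mu), and gamma shape (size) are arbitrary values chosen for illustration.

    ```r
    set.seed(620)    # arbitrary seed, for reproducibility only
    n    <- 100000
    mu   <- 2        # common mean (illustrative value)
    size <- 0.5      # gamma shape; smaller values mean more over-dispersion

    # Standard Poisson: every student shares the same rate.
    y_pois <- rpois(n, lambda = mu)

    # Gamma-Poisson mixture: each student draws a personal rate from a gamma
    # distribution with mean mu, then a Poisson count around that rate.
    lambda_i <- rgamma(n, shape = size, rate = size / mu)
    y_mix    <- rpois(n, lambda = lambda_i)

    # The means are close, but the mixture's variance is much larger:
    # in theory it is mu + mu^2 / size rather than mu.
    print(round(c(mean_pois = mean(y_pois), var_pois = var(y_pois),
                  mean_mix  = mean(y_mix),  var_mix  = var(y_mix)), 2))
    ```

    A quick analogue on the real data is to compare mean() and var() of days_skipped within groups of similar students.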

    1. Create and estimate a gamma-Poisson model of skipped days using the same covariates as above. You will want to use the negbinomial family in your model specification (recall that gamma-Poisson and negative-binomial refer to the same ‘overdispersed’ Poisson model).
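
      For reference, one common way to write the likelihood of this model is shown below. The notation is a sketch: μ_i and φ are symbols chosen here, and the linear predictor is abbreviated as x_i^T β. To my understanding, φ corresponds to the parameter that brms reports as shape for the negbinomial family; check your model summary to confirm.

      ```latex
      \begin{aligned}
      y_i &\sim \mathrm{GammaPoisson}(\mu_i,\, \phi) \\
      \log \mu_i &= \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta} \\
      \mathrm{E}[y_i] &= \mu_i, \qquad
      \mathrm{Var}[y_i] = \mu_i + \frac{\mu_i^{2}}{\phi}
      \end{aligned}
      ```

      As φ grows large, the extra variance term vanishes and the model approaches the standard Poisson.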

    2. Interpret the results of this new model. What is the estimate for the dispersion parameter? How do the coefficient estimates compare with those from the standard Poisson model? Why might they differ?

    3. Make a posterior predictive plot for this new model and compare it to the empirical distribution. Does this look like a better fit than the standard Poisson regression?

  4. Build a zero-inflated Poisson model. (7 points)

    Another possible shortcoming of the standard Poisson regression is the potential for multiple ‘types’ of zero. Some students may not skip class because they never even considered it—they don’t think of themselves as the types of students who do that sort of thing. Other students may be comfortable skipping class, but just happened not to this school year. As a final step, you will model this situation using a zero-inflated Poisson model of days skipped.
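
    The idea of two ‘types’ of zero can be sketched with simulated data. The seed, sample size, share of "would never skip" students (p_never), and skipping rate (lambda) below are arbitrary illustrative values, not estimates from the assignment data.

    ```r
    set.seed(620)     # arbitrary seed, for reproducibility only
    n       <- 100000
    p_never <- 0.3    # share of "would never skip" students (illustrative)
    lambda  <- 2      # skipping rate among everyone else (illustrative)

    # Step 1: decide which students are the "would never skip" type.
    never <- rbinom(n, size = 1, prob = p_never)

    # Step 2: those students always report 0; the rest get a Poisson count,
    # which can also be 0 by chance.
    y <- ifelse(never == 1, 0, rpois(n, lambda))

    # Share of zeros in the simulated data, the share a plain Poisson with the
    # same mean would imply, and the share implied by the zero-inflated model.
    zero_observed <- mean(y == 0)
    zero_poisson  <- exp(-mean(y))
    zero_zip      <- p_never + (1 - p_never) * exp(-lambda)
    print(round(c(observed = zero_observed, poisson = zero_poisson,
                  zip = zero_zip), 3))
    ```

    A plain Poisson with the same mean predicts noticeably fewer zeros than the simulation produces, which is the pattern to look for in a posterior predictive plot.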

    1. Use the zero_inflated_poisson family in brms to specify a zero-inflated model (be sure to present it in R code and in formal mathematical notation). Use the same covariates as the previous two models for the prediction of the rate of days skipped (λ). For the binomial portion of the model that predicts whether or not a student is a “would never skip” type of student, pick your own covariates. These can be the same as, different from, or similar to the ones you have already used. Explain why you modeled this the way that you did.
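
      One possible layout for the formal notation is sketched below. The symbols π_i (probability of being a “would never skip” student), z_i (your chosen covariates for that portion), and γ (their coefficients) are placeholders introduced here, and the logit link is an assumption based on the usual default for zi in brms. Priors are omitted.

      ```latex
      \begin{aligned}
      \Pr(y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\lambda_i} \\
      \Pr(y_i = k) &= (1 - \pi_i)\, \frac{\lambda_i^{k}\, e^{-\lambda_i}}{k!},
          \qquad k = 1, 2, \dots \\
      \operatorname{logit}(\pi_i) &= \gamma_0 + \mathbf{z}_i^{\top}\boldsymbol{\gamma} \\
      \log \lambda_i &= \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}
      \end{aligned}
      ```

      The first line shows the two sources of zeros: students who would never skip (π_i) and students who might skip but happened not to.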

      Note: To fit a zero-inflated model in brms, you need to include the binomial portion of the model within the call to bf() that specifies the model. You do this by adding a second formula, using zi as the predicted value. This will look something like:

      bf(
          days_skipped ~ <count predictors> , 
          zi ~ <binomial predictors> , 
          family = zero_inflated_poisson()
      )

    2. Estimate this model using brm(), and display the results. What type of student is more likely to never skip class? Of the students that might be willing to skip class, which are more likely to skip frequently?

    3. Make a posterior predictive plot for the zero-inflated model and compare it to the empirical distribution. Is it a good fit? Why might you want to use the gamma-Poisson model over the zero-inflated model? Why might you prefer the zero-inflated over the gamma-Poisson?