---
title: "Problem Set 5"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(brms)
library(pander)
```
_Due Tue, Mar 15_
In this (extended) assignment, you will use the Add Health data to examine two aspects of adolescent students' engagement with their schools: class attendance and feelings of belonging.
*General note: you do not need to go into as much detail justifying your priors for this assignment as you did for previous assignments. Unless your priors are significantly informative, simply describing them will be enough.*
The data is online and can be loaded using:
```{r data}
d <- read.csv('https://soci620.netlify.com/data/addhealth_schoolenv.csv')
```
# Part A: Skipping class
One obvious measure of a student's engagement with their school is whether they actually show up. In this first part of the problem set, you will model the frequency of skipping classes. The outcome variable of interest asks "how many times have you skipped school for a full day without an excuse [during this school year]?"
The outcome variable is called `days_skipped`. The data also contains variables on the student's overall health (`gen_health`), whether any parent is on social support or welfare (`par_welfare`), whether any parent has a physical disability (`par_disabled`), and various other measures.
1. **Build a Poisson model of days skipped.**
First, build a standard Poisson model predicting the number of days skipped by the student.
a) Construct two new variables:
The first should be a simple recentering of `grade` around grade 10.
The second constructed variable should be built from `gen_health` and take a value of 1 if a student rated their health as 'Excellent', 'Very good', or 'Good', and a value of 0 if they rated it as 'Fair' or 'Poor'. *Be careful to preserve missing values as `NA`!*
```{r recode}
# your code here
```
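If you get stuck, one possible approach is sketched below (the new variable names are only suggestions, and the sketch assumes `gen_health` stores the response labels as text; adjust if it is coded numerically). Note that `%in%` treats `NA` as a non-match, so a naive one-step 1/0 recode would silently turn missing values into 0:

```{r recode-sketch, eval=FALSE}
# Center grade at 10, so a 10th-grader gets 0
d$grade_c <- d$grade - 10

# Good-health indicator; the second ifelse() maps anything that is neither
# a "good" nor a "bad" response (including NA) to NA
d$good_health <- ifelse(d$gen_health %in% c('Excellent', 'Very good', 'Good'), 1,
                        ifelse(d$gen_health %in% c('Fair', 'Poor'), 0, NA))
```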
b) Build and estimate a standard Poisson regression predicting days skipped, using centered grade level (constructed above), an indicator of good health (constructed above), family log income, and whether any parent receives public assistance (welfare) as predictors. You can use the `brm` function with the `poisson` family to estimate the model.
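A sketch of the kind of call you might use is below; the covariate names (`grade_c`, `good_health`, `log_income`) are placeholders for your own constructed variables and the data's income measure, so check them against the actual column names:

```{r poisson-sketch, eval=FALSE}
m_pois <- brm(
  days_skipped ~ grade_c + good_health + log_income + par_welfare,
  data   = d,
  family = poisson()
)
summary(m_pois)
```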
c) Interpret the results of this Poisson model using specific terms. Which students are more or less likely to skip? How much of a change is associated with each of the predictor variables?
2. **Build an over-dispersed Poisson model.**
One reason a standard Poisson regression can go wrong is that the Poisson distribution is bad at accommodating unmodeled variation in the population. In this case, the restriction that the variance must be the same as the mean for any given type of student may be biasing our results. In this step, you will build a new model that allows more variation in your outcome variable than the standard Poisson by using a gamma-Poisson mixture.
a) Create and estimate a negative-binomial (a.k.a. gamma-Poisson) model of skipped days using the same covariates as above. You will want to use the `negbinomial` family in your model specification.
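The call is nearly identical to the Poisson one; only the family changes (again, the covariate names below are placeholders):

```{r negbin-sketch, eval=FALSE}
m_negbin <- brm(
  days_skipped ~ grade_c + good_health + log_income + par_welfare,
  data   = d,
  family = negbinomial()
)
summary(m_negbin)  # brms reports the dispersion parameter as 'shape'
```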
b) Interpret the results of this new model. What is the estimate for the dispersion parameter? How do these estimates compare with the estimates from the standard Poisson distribution? Why might they differ?
3. **Compare the models.**
Which of the two models you just estimated (the Poisson and the gamma-Poisson) is better? Since they use the same covariates and the same linear model, there is no obvious theoretical reason to choose one over the other. Instead, you can compare the fit of the two models:
a) Create a posterior predictive plot for each of the two models, and describe what you see. Which seems to better fit the empirical distribution of days skipped? (*Note: the `brms` package includes a function `posterior_predict()`, which behaves similarly to `sim()` in the rethinking package.*)
b) Compare the WAIC for each model. Which would you expect to have better predictive power? (*You'll want to run `waic()` on each of the model fits. You can then use `loo_compare()` if you want to compare their difference.*)
c) Given these results, which model would you use to help understand school absences?
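The comparison steps in (a) and (b) might look like the following (the model object names are placeholders for your two fits). `pp_check()` is a convenient `brms` wrapper around posterior predictions for exactly this kind of plot:

```{r compare-sketch, eval=FALSE}
# Posterior predictive plots, one per model
pp_check(m_pois)
pp_check(m_negbin)

# WAIC for each fit, then a side-by-side comparison
w_pois   <- waic(m_pois)
w_negbin <- waic(m_negbin)
loo_compare(w_pois, w_negbin)
```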
# Part B: Feeling connected
In a series of questions relating to the way students felt about their school as an institution and the students and teachers there, the Add Health respondents were asked how much they agreed or disagreed with the statement "*You feel like you are part of your school.*" They were given a five-point Likert scale for the question, coded as *(1) Strongly disagree*, *(2) Disagree*, *(3) Neither agree nor disagree*, *(4) Agree*, and *(5) Strongly agree*. You will be building a series of models that investigate how student feelings of belonging are tied to their race and perceptions of safety at the school.
You will be modeling this outcome using the two categorical linear models we learned about in class.
1. **Build a multinomial logit model of belonging.**
First, build a model that *ignores the ordering of the values in the outcome variable*, treating each response to `part_of_school` (1 through 5) as a distinct category. (You will build an ordered logit in the next question.)
a) Often, feelings of membership in a community are contingent on both individual characteristics and perceptions of the way the community treats others. In this question you will incorporate the variable `school_feels_safe` as a measure of students' perception of a hostile school community. The variable uses the same Likert scale of agreement, now with the prompt "*You feel safe in your school.*"
For the purposes of this assignment, we will treat `school_feels_safe` as a ratio-scale variable. There are ways to be more sophisticated about including ordinal variables as predictors (see the textbook), but for now a simpler modeling strategy will work. However, the variable will be easier to interpret if you center it. Create a new variable that centers `school_feels_safe` at its middle value of 3. In the resulting variable, 0 should represent "neither agree nor disagree", while -2 and 2 represent "strongly disagree" and "strongly agree", respectively.
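The centering itself is one line (the new variable name is only a suggestion):

```{r center-sketch, eval=FALSE}
# 0 = neither agree nor disagree; -2 and 2 = strongly disagree and agree
d$safe_c <- d$school_feels_safe - 3
```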
b) Use the `categorical` family with `brms` to make a model that predicts categories of `part_of_school` based on the degree to which the student feels safe at school. Specify priors for the coefficients as well (you may leave the default prior for the intercepts if you like). How many coefficient parameters does your model have? Which category did `brms` choose as the reference category?
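One way to set this up is sketched below. It assumes the response levels are 1 through 5 with category 1 as the reference, so the category-specific linear predictors are named `mu2` through `mu5`; running `get_prior()` on the formula first will show the exact classes `brms` expects (`safe_c` is a placeholder for your centered safety variable):

```{r categorical-sketch, eval=FALSE}
# Inspect the prior structure before setting anything
get_prior(part_of_school ~ safe_c, data = d, family = categorical())

# Weakly informative priors on each category-specific slope
priors <- c(
  set_prior("normal(0, 2)", class = "b", dpar = "mu2"),
  set_prior("normal(0, 2)", class = "b", dpar = "mu3"),
  set_prior("normal(0, 2)", class = "b", dpar = "mu4"),
  set_prior("normal(0, 2)", class = "b", dpar = "mu5")
)

m_cat <- brm(
  part_of_school ~ safe_c,
  data   = d,
  family = categorical(),
  prior  = priors
)
```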
c) Estimate your model using `brm()`. What do the estimates tell you about the relationship between feelings of safety and feelings of belonging among the students (be specific)?
d) What are the consequences of using this model rather than an ordered logit model? What are the drawbacks (if any) and what are the benefits (if any)?
2. **Build an ordered logit model.**
Survey responses that use a Likert scale like `part_of_school` are very frequently modeled using an ordered logit.
a) Use the `cumulative` family to create a new model predicting `part_of_school` with your centered version of `school_feels_safe` as a covariate. Describe the parameters of the model. What are the coefficients and what role do they play in the model?
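A sketch of the ordered-logit specification (`safe_c` is a placeholder for your centered safety variable):

```{r ordered-sketch, eval=FALSE}
m_ord <- brm(
  part_of_school ~ safe_c,
  data   = d,
  family = cumulative("logit")
)
```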
b) Estimate this new model using `brm()`. What do the estimates tell you about the relationship between feelings of safety and feelings of belonging among the students (be specific)?
3. **Compare the models.**
The two models you just estimated (the categorical and the ordered logit) look similar, but are quite different in terms of the underlying model itself. To compare them, you will need to consider both predictive fit and theoretical concerns.
a) Create a posterior predictive plot for each of the two models, and describe what you see. Which seems to better fit the empirical distribution of responses to `part_of_school`? (*Note: the `brms` package includes a function `posterior_predict()`, which behaves similarly to `sim()` in the rethinking package.*)
b) Compare the WAIC for each model. Which would you expect to have better predictive power? (*You'll want to run `waic()` on each of the model fits. You can then use `loo_compare()` if you want to compare their difference.*)
c) Look at the results from the models. Do they tell substantively similar stories? Why might you prefer the ordered logit over the multinomial logit?