---
title: "SOCI 620: Lab 13"
author: "Peter McMahan"
date: "2/24/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(brms)
# this will tell brms to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
## Ordered logistic regressions
Ordered logistic regressions allow us to model an outcome that is categorical but for which the _order_ of the categories matters. Examples include educational attainment (which we'll look at here), employment status (unemployed, employed part-time, employed full-time), or, very commonly, Likert-scale survey responses.
We will use an indicator of whether respondents identify as Hispanic or Latino to predict educational attainment among adults in the United States. First, load the data and create an ordered categorical variable for educational attainment with levels "None", "High School", and "College".
```{r data}
d <- read.csv("https://soci620.netlify.app/data/USincome_subset_2016.csv")
head(d)
table(d[c('ed_highschool','ed_college')])
```
```{r build_ordinal_education_level}
d$ed_level <- 1 # no degree
d$ed_level[d$ed_highschool==1] <- 2 # high school degree
d$ed_level[d$ed_college==1] <- 3 # college degree
d$ed_level[is.na(d$ed_highschool) | is.na(d$ed_college)] <- NA # missing education info
table(d$ed_level)
# how many NA?
sum(is.na(d$ed_level))
```
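Since `brms`'s `cumulative` family just needs an ordered outcome, the integer coding above works as-is. Equivalently, we could store the variable as an ordered factor, which keeps the category labels attached to the data (a small optional sketch; `ed_factor` is a name we're making up here):
```{r ed_level_factor}
# optional: an ordered factor carries the labels with the variable
d$ed_factor <- factor(
  d$ed_level,
  levels = c(1, 2, 3),
  labels = c("None", "High School", "College"),
  ordered = TRUE
)
table(d$ed_factor)
```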
To formulate our model, we will use the `cumulative` family in `brms`. Using the R formula notation, `brms` hides a *lot* of the model. The deceptively simple specification `ed_level ~ hispanic`, when paired with `family = cumulative()`, corresponds to the model:
$$
\begin{aligned}
\mathrm{ed\_level}_i &\sim \mathrm{Categorical}(p_1,\dots,p_K)\\
p_k &= q_k - q_{k-1}\\
\mathrm{logit}(q_k) &= \alpha_k - \phi_i\\
\phi_i &= \beta \cdot \mathrm{hispanic}_i
\end{aligned}
$$
with the conventions $q_0 = 0$ and $q_K = 1$. That is, it specifies $K-1$ cutpoints $\alpha_1 < \dots < \alpha_{K-1}$, the logit link function, and the cumulative-probability translation that turns the cutpoints into appropriate probabilities for the categorical distribution.
```{r model_1}
model_1 <- bf(
  ed_level ~ hispanic,
  family = cumulative()
)
```
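Before fitting anything, it can help to see how the cutpoints and linear predictor combine. The sketch below computes the implied category probabilities by hand for made-up values of $\alpha$ and $\beta$ (these are illustrative numbers, not estimates from the data):
```{r cumulative_by_hand}
# made-up cutpoints and coefficient (for illustration only)
alpha <- c(-2, 1.5)  # alpha_1, alpha_2
beta <- -0.5         # coefficient on hispanic
phi <- beta * 1      # linear predictor for a Hispanic respondent
# cumulative probabilities q_1, q_2 (q_0 = 0 and q_3 = 1 by definition)
q <- plogis(alpha - phi)
# category probabilities p_k = q_k - q_{k-1}
p <- diff(c(0, q, 1))
p
sum(p)  # the three probabilities sum to 1
```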
Since we are using `brms`, we need to specify the priors separately. It's usually a good idea to first look at the options with `get_prior()`.
```{r prior_1_1}
get_prior(model_1,data=d)
# 'Intercept' is used for the cutoffs alpha_k
```
```{r prior_1_2}
prior_1 <- c(
  prior(normal(0, 1.5), class = "Intercept"),
  prior(normal(0, 2), coef = "hispanic")
)
prior_1
```
The model can be estimated using `brm()`, but we need to be a little careful. Note that `brm()` *automatically drops* rows with missing data on any variable in the model. This is convenient, but it can be dangerous if many rows are silently lost.
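One way to see what will be dropped is to count complete cases on just the model variables before fitting:
```{r check_complete}
# rows with full information on the outcome and predictor
sum(complete.cases(d[c("ed_level", "hispanic")]))
# rows that brm() will silently drop
sum(!complete.cases(d[c("ed_level", "hispanic")]))
```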
```{r fit_1, cache=TRUE}
fit_1 <- brm(model_1, prior = prior_1, data = d, chains = 4, cores = 4, refresh = 0)
summary(fit_1)
```
The estimates for `Intercept[1]` ($\alpha_1$) and `Intercept[2]` ($\alpha_2$) are a little hard to interpret on their own, but we might notice that they are "pretty far" apart from one another. This suggests that the middle category (high school) has a large overall probability in the model.
We can also see that the estimate for `hispanic` is negative, indicating that Hispanic respondents are predicted to have lower educational attainment overall.
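Because the model works on the logit scale, one common way to read the coefficient is as an odds ratio: in a cumulative logit model, $e^\beta$ multiplies the odds of being in a higher (rather than lower) education category. A quick sketch using the fitted model:
```{r odds_ratio}
# posterior summaries of the fixed effects
fixef(fit_1)
# exp(beta) < 1 means lower odds of a higher education category
exp(fixef(fit_1)["hispanic", "Estimate"])
```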
Using the `predict()` function, we can calculate probabilities based on these estimates.
```{r pred_1}
cases <- data.frame(
  hispanic = c(0, 1)
)
cases
# get hispanic and non-hispanic probabilities
pred_1 <- predict(fit_1, newdata = cases)
pred_1
```
According to these results, a non-Hispanic respondent has about a 10% chance of no education, a 78% chance of a high school education, and a 12% chance of a college education. In contrast, a Hispanic respondent has about a 21% chance of no education, a 74% chance of a high school education, and a 5% chance of a college education.
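`predict()` only gives us point summaries. If we want a full posterior distribution for a contrast, such as the gap in the probability of a college degree, we can work with the posterior draws directly via `posterior_epred()` (a sketch; `college_gap` is a name we're introducing here):
```{r college_gap}
# array of draws: (posterior samples) x (2 cases) x (3 categories)
ep <- posterior_epred(fit_1, newdata = cases)
# draws of P(college | non-Hispanic) - P(college | Hispanic)
college_gap <- ep[, 1, 3] - ep[, 2, 3]
mean(college_gap)
quantile(college_gap, probs = c(0.05, 0.95))
```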
### More covariates
We can add a couple more covariates to see how they affect these calculations.
```{r model_2}
model_2 <- bf(
  ed_level ~ hispanic + log_income + female,
  family = cumulative()
)
```
```{r prior_2}
get_prior(model_2, data = d)
prior_2 <- c(
  prior(normal(0, 3), class = "Intercept"),
  prior(normal(0, 2), class = "b"),        # this sets the same prior for all coefficients
  prior(normal(0, 1), coef = "log_income") # this makes an exception for one coef.
)
```
```{r fit_2, cache=TRUE}
fit_2 <- brm(model_2, prior = prior_2, data = d, chains = 4, cores = 4, refresh=0)
summary(fit_2)
```
```{r pred_2}
cases <- data.frame(
  hispanic = c(0, 1, 0, 1),
  log_income = c(log(30000), log(30000), log(260000), log(260000)),
  female = 0 # this value is recycled for all the rows
)
cases
```
```{r pred_2_part2}
# get hispanic and non-hispanic probabilities
pred_2 <- predict(fit_2, newdata = cases)
pred_2
# you can put them together for easier reading with `cbind()` (column bind)
cbind(cases, pred_2)
```
### Visualizations
The `posterior_epred()` function lets us make predictions about education across a range of income levels.
```{r pred_hisp_men}
# income range
income_range <- seq(from = 20000, to = 300000, length.out=100)
# Hispanic men
cases_hisp_men <- data.frame(
  hispanic = 1,
  log_income = log(income_range),
  female = 0
)
head(cases_hisp_men)
# this gives us 4000 posterior draws for each of the 3 education
# categories at each of the 100 incomes in income_range
pred_hisp_men <- posterior_epred(fit_2, newdata = cases_hisp_men)
dim(pred_hisp_men)
```
```{r pred_hisp_men_viz}
# this gets the mean across all 4000 samples for each probability and each
# income (the `c(2,3)` argument tells apply() to take the mean over each pair
# of values from the second and third 'dimensions' of the array)
means_hisp_men <- apply(pred_hisp_men, c(2, 3), mean)
plot(
  NA,
  xlim = range(income_range), ylim = c(0, 1),
  main = "predicted education for Hispanic men",
  xlab = "income", ylab = "probability"
)
lines(income_range, means_hisp_men[,1], lty=1)
lines(income_range, means_hisp_men[,2], lty=2)
lines(income_range, means_hisp_men[,3], lty=3)
legend('topright',lty=c(1,2,3),legend=c('No deg.','HS','College'))
```
We can get the same thing with 90% credible intervals:
```{r pred_hisp_men_viz2}
# this gets the 5% and 95% quantiles across all 4000 samples for each
# outcome probability and each income value
lbound_hisp_men <- apply(pred_hisp_men, c(2, 3), quantile, probs = 0.05)
ubound_hisp_men <- apply(pred_hisp_men, c(2, 3), quantile, probs = 0.95)
# this bit looks the same as before
plot(
  NA,
  xlim = range(income_range), ylim = c(0, 1),
  main = "predicted education for Hispanic men",
  xlab = "income", ylab = "probability"
)
lines(income_range, means_hisp_men[,1], lty=1)
lines(income_range, means_hisp_men[,2], lty=2)
lines(income_range, means_hisp_men[,3], lty=3)
# but now we draw a 'polygon' around each predictive line
# (this is tricky because we need to draw the top and the bottom
# in the same command, hence the calls to 'rev()' to reverse the order)
polygon(
  x = c(income_range, rev(income_range)),
  y = c(lbound_hisp_men[, 1], rev(ubound_hisp_men[, 1])),
  col = "#00000033", border = NA
)
polygon(
  x = c(income_range, rev(income_range)),
  y = c(lbound_hisp_men[, 2], rev(ubound_hisp_men[, 2])),
  col = "#00000033", border = NA
)
polygon(
  x = c(income_range, rev(income_range)),
  y = c(lbound_hisp_men[, 3], rev(ubound_hisp_men[, 3])),
  col = "#00000033", border = NA
)
legend('topright',lty=c(1,2,3),legend=c('No deg.','HS','College'))
```