#####
# set up the environment
#####
# load packages
library(brms)
# adjust the plot region for viewing on the projector in class
options(repr.plot.width=12, repr.plot.height=8)
Loading required package: Rcpp Loading 'brms' package (version 2.14.4). Useful instructions can be found by typing help('brms'). A more detailed introduction to the package is available through vignette('brms_overview'). Attaching package: ‘brms’ The following object is masked from ‘package:stats’: ar
Ordered logistic regressions allow us to model an outcome that is categorical but that for which the order of those categories matters. Examples include educational attainment (which we'll look at here), employment (unemployed, part-time employed, full-time employed), or, very commonly, Likert-scale survey responses.
We will use respondents' ethnicity as hispanic or latino to predict educational attainment among adults in the United States. First, load the data and create a categorical variable for educational attainment with levels of "None", "High School", and "College".
d <- read.csv("https://mcmahanp.github.io/soci620/data/USincome_subset_2016.csv")
head(d)
nrow(d)
id | family_size | children_in_hh | female | age | marital_status | times_married | race | hispanic | hispanic_origin | health_ins | health_ins_priv | health_ins_employer | in_school | ed_highschool | ed_college | income | poverty | occ_income | log_income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <chr> | <int> | <chr> | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <lgl> | <int> | <dbl> | |
1 | 1545 | 2 | 0 | 1 | 93 | Married, spouse present | 1 | White | 0 | Not Hispanic | 1 | 1 | 0 | 0 | 1 | 0 | 22800 | FALSE | NA | 10.034516 |
2 | 2532 | 3 | 0 | 0 | 29 | Never married/single | 0 | White | 0 | Not Hispanic | 1 | 1 | 1 | 0 | 1 | 0 | 22000 | FALSE | 38 | 9.998798 |
3 | 2540 | 1 | 0 | 1 | 87 | Widowed | 1 | White | 0 | Not Hispanic | 1 | 1 | 0 | 0 | 1 | 0 | 16000 | NA | NA | 9.680344 |
4 | 3654 | 6 | 1 | 1 | 21 | Never married/single | 0 | White | 0 | Not Hispanic | 1 | 1 | 1 | 1 | 1 | 0 | 7000 | FALSE | 18 | 8.853665 |
5 | 3825 | 3 | 0 | 1 | 52 | Separated | 1 | White | 0 | Not Hispanic | 0 | 0 | 0 | 0 | 0 | 0 | 20000 | FALSE | 23 | 9.903488 |
6 | 6736 | 3 | 0 | 0 | 16 | Never married/single | 0 | White | 0 | Not Hispanic | 1 | 1 | 1 | 1 | 0 | 0 | 3000 | FALSE | 24 | 8.006368 |
d$ed_level <- 1
d$ed_level[d$ed_highschool==1] <- 2
d$ed_level[d$ed_college==1] <- 3
d$ed_level[is.na(d$ed_highschool) | is.na(d$ed_college)] <- NA
table(d$ed_level)
# how many NA?
sum(is.na(d$ed_level))
1 2 3 342 2545 377
To formulate our model, we will use the cumulative
family in brms
. Using the R formula notation, brms
hides a lot of the model. The deceptively simple specification ed_level ~ hispanic
, when paired with family = cumulative()
, corresponds to the model:
That is, it specifies $k-1$ cutpoints, the logit link function, and the cumulative probability translation for creating appropriate probabilities for the categorical distribution.
model_1 <- bf(
ed_level ~ hispanic,
family = cumulative()
)
m <- alist(
ed_level ~ dordlogit(phi,c(a1,a2)),
phi <- bh*hispanic,
# note the shortcut priors
c(a1,a2) ~ dnorm(0,1.5),
bh ~ dnorm(0,2)
)
Since we are using brms
, we need to specify the priors separately. It's usually a good idea to first look at the options with get_prior()
.
get_prior(model_1,data=d)
# 'Intercept' is used for the cutoffs alpha_k
prior_1 <- c(
prior(normal(0,1.5), class = "Intercept"),
prior(normal(0,2), coef = 'hispanic')
)
prior_1
Warning message: “Rows containing NAs were excluded from the model.”
prior | class | coef | group | resp | dpar | nlpar | bound | source |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
b | default | |||||||
b | hispanic | default | ||||||
student_t(3, 0, 2.5) | Intercept | default | ||||||
Intercept | 1 | default | ||||||
Intercept | 2 | default |
prior | class | coef | group | resp | dpar | nlpar | bound | source |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
normal(0, 1.5) | Intercept | user | ||||||
normal(0, 2) | b | hispanic | user |
The model can be estimated using brm()
, but we need to be a little bit careful. Note that brm()
automatically drops rows that have missing data within our model. This is convenient, but can be dangerous.
fit_1 <- brm(model_1,prior=prior_1,data=d,cores=4,chains=4)
summary(fit_1)
Warning message: “Rows containing NAs were excluded from the model.” Compiling Stan program... Start sampling
Family: cumulative Links: mu = logit; disc = identity Formula: ed_level ~ hispanic Data: d (Number of observations: 3264) Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup samples = 4000 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS Intercept[1] -2.20 0.06 -2.32 -2.09 1.00 2239 2249 Intercept[2] 2.00 0.05 1.90 2.11 1.00 5094 3117 hispanic -0.84 0.18 -1.20 -0.48 1.00 2448 2370 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS disc 1.00 0.00 1.00 1.00 1.00 4000 4000 Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS and Tail_ESS are effective sample size measures, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
The estimates for Intercept[1]
($\alpha_1$) and Intercept[2]
($\alpha_2$) are a little hard to interpret on their own, but we could notice that they are pretty far apart from one another. This suggests that the middle category (high school) has a large overall probability in the model.
We can also see that the estimate for hispanic
is negative, which says that hispanic respondents are predicted to have less education overall.
Using the predict()
function, we can calculate probabilities based on these estimates.
cases <- data.frame(
hispanic = c(0,1)
)
# get hispanic and non-hispanic probabilities
pred_1 <- predict(fit_1, newdata = cases)
pred_1
# a non-Hispanic respondent has about a 10% chance of no education,
# 78% chance of a high school eduction, and a 12% chancce of a college education
# a Hispanic respondent has about a 21% chance of no education,
# 74% chance of a high school eduction, and a 5% chancce of a college education
P(Y = 1) | P(Y = 2) | P(Y = 3) |
---|---|---|
0.09750 | 0.78600 | 0.1165 |
0.21125 | 0.72725 | 0.0615 |
We can add a couple more covariates to see how that affects these calculations
model_2 <- bf(
ed_level ~ hispanic + log_income + female,
family = cumulative()
)
get_prior(model_2,data=d)
prior_2 <- c(
prior(normal(0,3), class = "Intercept"),
prior(normal(0,2), class = "b"), # this sets the same prior for all coefficients
prior(normal(0,1), coef = "log_income") # this makes an exception for one coef.
)
fit_2 <- brm(model_2, prior = prior_2, data = d, chains = 4, cores = 4)
summary(fit_2)
Warning message: “Rows containing NAs were excluded from the model.”
prior | class | coef | group | resp | dpar | nlpar | bound | source |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
b | default | |||||||
b | female | default | ||||||
b | hispanic | default | ||||||
b | log_income | default | ||||||
student_t(3, 0, 2.5) | Intercept | default | ||||||
Intercept | 1 | default | ||||||
Intercept | 2 | default |
Warning message: “Rows containing NAs were excluded from the model.” Compiling Stan program... Start sampling
Family: cumulative Links: mu = logit; disc = identity Formula: ed_level ~ hispanic + log_income + female Data: d (Number of observations: 3264) Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup samples = 4000 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS Intercept[1] 4.81 0.37 4.10 5.55 1.00 3519 2970 Intercept[2] 9.52 0.42 8.71 10.35 1.00 3110 2747 hispanic -0.79 0.19 -1.18 -0.40 1.00 4538 2775 log_income 0.70 0.04 0.63 0.77 1.00 3232 2760 female 0.48 0.09 0.30 0.66 1.00 3865 2861 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS disc 1.00 0.00 1.00 1.00 1.00 4000 4000 Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS and Tail_ESS are effective sample size measures, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
cases <- data.frame(
hispanic = c(0,1,0,1),
log_income = c(log(30000), log(30000), log(260000), log(260000)),
female = 0 # this number will be repeated for all the rows
)
cases
# get hispanic and non-hispanic probabilities
pred_2 <- predict(fit_2, newdata = cases)
pred_2
# you can put them together for easier reading with `cbind()` (column bind)
cbind(cases, pred_2)
hispanic | log_income | female |
---|---|---|
<dbl> | <dbl> | <dbl> |
0 | 10.30895 | 0 |
1 | 10.30895 | 0 |
0 | 12.46844 | 0 |
1 | 12.46844 | 0 |
P(Y = 1) | P(Y = 2) | P(Y = 3) |
---|---|---|
0.08625 | 0.81750 | 0.09625 |
0.17075 | 0.78850 | 0.04075 |
0.01850 | 0.68525 | 0.29625 |
0.04525 | 0.79125 | 0.16350 |
hispanic | log_income | female | P(Y = 1) | P(Y = 2) | P(Y = 3) |
---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
0 | 10.30895 | 0 | 0.08625 | 0.81750 | 0.09625 |
1 | 10.30895 | 0 | 0.17075 | 0.78850 | 0.04075 |
0 | 12.46844 | 0 | 0.01850 | 0.68525 | 0.29625 |
1 | 12.46844 | 0 | 0.04525 | 0.79125 | 0.16350 |
The posterior_epred()
function lets you make prediction about education across different income levels.
# income range
income_range <- seq(from = 20000, to = 300000, length.out=100)
# Hispanic men
cases_hisp_men <- data.frame(
hispanic = 1,
log_income = log(income_range),
female = 0
)
# this gives us 4000 posterior predictions for each of the
# 3 education cateogories for each of the 100 incomes in income_range
pred_hisp_men <- posterior_epred(fit_2, newdata = cases_hisp_men)
dim(pred_hisp_men)
dimnames(pred_hisp_men)
# this gets the mean across all 4000 samples for each probablity and each income
# (the `c(2,3)` argument tells it to apply mean for each pair of values from
# the second and third 'dimensions' of the array)
means_hisp_men <- apply(pred_hisp_men, c(2,3) , mean)
plot(
NA,
xlim=range(income_range), ylim = c(0,1),
main= "predicted education for Hispanic men",
xlab = 'income'
)
lines(income_range, means_hisp_men[,1], lty=1)
lines(income_range, means_hisp_men[,2], lty=2)
lines(income_range, means_hisp_men[,3], lty=3)
legend('topright',lty=c(1,2,3),legend=c('No deg.','HS','College'))