---
title: "SOCI 620: Lab 12"
author: "Peter McMahan"
date: "2003-02-16"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# load brms -- not rethinking
library(brms)
# this will tell brms to use as many cores as your computer has
options(mc.cores = parallel::detectCores())
```
## Multinomial logistic regression
We'll be using the US microdata sample that we have seen before to predict the association between income and marital status.
```{r data}
# download the data
d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2016.csv")
# look at it
head(d)
```
What categories exist for marital status?
```{r inspect_marital_status}
summary(d$marital_status)
table(d$marital_status)
```
That's more fine-grained than we want. We can make four categories from these six by putting "Divorced" and "Separated" in one category and "Married, spouse absent" and "Married, spouse present" in another category.
At the same time, let's code the new variable so that married respondents are the reference category. (Choosing a reference category is arbitrary as far as the model is concerned, so it should be based on your specific research question.)
```{r marriage_recode}
# start by setting all the values to NA
d$ms <- NA
# assign every category manually
d$ms[d$marital_status=="Never married/single"] <- 2
d$ms[d$marital_status=="Married, spouse absent"] <- 1
d$ms[d$marital_status=="Married, spouse present"] <- 1
d$ms[d$marital_status=="Divorced"] <- 3
d$ms[d$marital_status=="Separated"] <- 3
d$ms[d$marital_status=="Widowed"] <- 4
```
When recoding a categorical variable it is *always* a good idea to look at a cross-tabulation to make sure you made no mistakes.
```{r verify_marital_recode}
table(d$marital_status, d$ms)
```
## Estimating with `brms`
Define the model using R's `formula` syntax and the `bf()` function from `brms`:
```{r model_1}
# (Intercepts in R formulas are assumed!)
# the categorical family automatically creates necessary
# multiple intercepts and multiple coefficients
# use `?brmsfamily` to see the large array of models supported by brms
model_1 <- bf(
ms ~ log_income,
family=categorical()
)
model_1
```
Priors in `brms` are specified *separately* with the `prior()` function. You can see the 'default' priors that would be used with `get_prior()`:
```{r default_priors_1}
# get_prior needs to know what data you're using
get_prior(model_1, data=d)
```
This table can be a little confusing to read! Each row represents a different parameter *or class of parameters* in the model.
The fifth row, for instance, tells us the prior for the `Intercept` of `mu2`, which is the intercept for category 2: `student_t(3,1,2.5)`.
The fourth row shows us the prior for `log_income` for category 2. But it just says "(flat)". By default, `brms` uses 'improper' flat priors for coefficients. *These should almost always be changed*!
The rows that do not specify a particular coefficient or intercept are 'classes' of priors. This allows you to set priors for "all of the coefficients" or "all of the intercepts."
Here we'll define a prior of $\mathrm{Norm}(0,3)$ for all of the intercepts and $\mathrm{Norm}(0,2)$ for all of the coefficients on `log_income`:
```{r prior_1}
# priors are grouped using the c() function:
prior_1 <- c(
prior(normal(0,2), class = 'b'),
prior(normal(0,3), class = 'Intercept')
)
#take a look
prior_1
```
We 'fit' the model using the `brm()` function. This can be slow! It's often a good idea to put your calls to `brm` in their own code chunk, and specify `cache=TRUE`. This means that if you change something else in your Rmd file that doesn't affect this code like some text, or something further on in the script, it will just used cached results instead of rerunning the fit.
```{r fit_1, cache=TRUE}
fit_1 <- brm(model_1, prior=prior_1, data=d, cores=4, chains=4)
```
```{r fit_1_inspect_prior}
# you can inspect the priors that were used
prior_summary(fit_1)
```
```{r fit_1_inspect_estimates}
# and get an overview of the estimates using summary (precis() was specific to rethinking)
summary(fit_1,prob=0.95)
```
Aside from the sign and confidence of the estimates, it is very hard to interpret these coefficients on their own. We can say that income is negatively associated with being single, divorced, or widowed *in comparison with* being married. That is, as income increases, the model predicts a higher probability of being married rather than single, of being married rather than divorced, and of being married rather than widowed.
We can get a better idea of the specific relationships by predicting outcomes for specific sets of covariates. We'll start by comparing people that make \$10,000 per year to people that make \$100,000 per year.
```{r fixef_1}
# get the predicted coefficients
# (brms uses the `fixef()` ("fixed effects") function for this rather than `coef()`)
est_1 <- fixef(fit_1)
est_1
```
The `predict()` function makes it very easy to talk about specific cases (like $10k vs $100k incomes):
```{r fit_1_cases}
cases <- data.frame(
log_income = c(log(1e3),log(1e4),log(1e5),log(1e10))
)
cases
```
```{r fit_1_predict}
pred <- predict(fit_1, newdata=cases)
pred
```
```{r fit_1_cases_viz1}
# visualizations can help a lot here
barplot(pred[2,],main='Expected probabilities, $10 thousand/year',ylim=c(0,1))
barplot(pred[3,],main='Expected probabilities, $100 thousand/year',ylim=c(0,1))
```
```{r fit_1_cases_viz2}
# side-by-side using R's `layout` function.
# (this will look different in ggplot)
layout(matrix(1:2,nrow=1))
barplot(pred[2,],main='Expected probabilities, $10 thousand/year',ylim=c(0,1))
barplot(pred[3,],main='Expected probabilities, $100 thousand/year',ylim=c(0,1))
layout(1)
```