---
title: "SOCI 620: Lab 7"
author: "Peter McMahan"
date: "2023-01-31"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```
# Calculating deviance and information criteria
Calculating information criteria in R is simple. AIC, BIC, DIC, WAIC, and others are included in either the base R, or else in any good inference packages (e.g. `rethinking` or `brms`).
To see how it is done, we will estimate some models using the US income data.
```{r get_income_data}
d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2016.csv")
# standardize age and prestige
d$age.s <-scale(d$age)
d$occ_income.s <- scale(d$occ_income)
head(d)
```
We will predict log income using standardized age, and occupational prestige. But first, we need to do somethign about the cases with missing data. We'll talk about other strategies for dealing with missing data later in the course, but for now we'll just drop them. The `complete.cases()` function is useful for this
```{r complete_cases}
# drop incomplete cases
comp <- complete.cases(d[c('log_income','age.s','occ_income.s')])
summary(comp)
d <- d[comp,]
```
```{r fit1}
model1 <- alist(
log_income ~ dnorm(mu,sigma),
mu <- a + b_age*age.s,
a ~ dnorm(9,3),
b_age ~ dnorm(0,3),
sigma ~ dunif(0,3)
)
fit1 <- quap(model1,data=d)
precis(fit1)
```
```{r fit2}
model2 <- alist(
log_income ~ dnorm(mu,sigma),
mu <- a + b_occ*occ_income.s,
a ~ dnorm(9,3),
b_occ ~ dnorm(0,3),
sigma ~ dunif(0,3)
)
fit2 <- quap(model2,data=d)
precis(fit2)
```
```{r fit12}
model12 <- alist(
log_income ~ dnorm(mu,sigma),
mu <- a + b_age*age.s + b_occ*occ_income.s,
a ~ dnorm(9,3),
b_occ ~ dnorm(0,3),
b_age ~ dnorm(0,3),
sigma ~ dunif(0,3)
)
fit12 <- quap(model12,data=d)
precis(fit12)
```
Look at the deviance for all three models.
```{r deviance_three_times}
deviance(fit1)
deviance(fit2)
deviance(fit12)
```
Even though the coefficients for age and occupational prestige are almost identical, the latter provides a much better fit with the data.
Next we will compute the AIC and WAIC for the three models.
```{r aic_three_times}
AIC(fit1)
AIC(fit2)
AIC(fit12)
```
```{r waic_three_times}
WAIC(fit1)
WAIC(fit2)
WAIC(fit12)
```
Notice that in this case, the two measures are very similar. This is because we have relatively uninformative priors, many observations, and our posterior is approximately normal.
We can also use the `compare` function to easily compare multiple WAIC values. Note that `compare` automatically puts the 'best' fitting model at the top of the list.
```{r waic_compared}
compare(fit1,fit2,fit12)
```
The model with both predictors is the unambiguous winner.
## Another comparison -- how to binarize education?
We can use this method to see which is a better predictor of income, college education or high-school education (controlling for age, occupational prestige, and gender).
```{r completer_cases}
# get complete cases across all of the variables in the data
d_c <- d[complete.cases(d),]
dim(d_c)
```
```{r fit_a_b}
model_college <- alist(
log_income ~ dnorm(mu,sigma),
mu <- a + b_age*age.s + b_occ*occ_income.s + b_f*female + b_col*ed_college,
a ~ dnorm(9,3),
b_occ ~ dnorm(0,3),
b_age ~ dnorm(0,3),
b_f ~ dnorm(0,3),
b_col ~ dnorm(0,3),
sigma ~ dunif(0,3)
)
fit_college <- quap(model_college, data=d_c)
model_hs <- alist(
log_income ~ dnorm(mu,sigma),
mu <- a + b_age*age.s + b_occ*occ_income.s + b_f*female + b_hs*ed_hs,
a ~ dnorm(9,3),
b_occ ~ dnorm(0,3),
b_age ~ dnorm(0,3),
b_f ~ dnorm(0,3),
b_col ~ dnorm(0,3),
sigma ~ dunif(0,3)
)
fit_hs <- quap(model_hs,data=d_c)
compare(fit_college,fit_hs)
```
The two models have _very_ similar fit, but high-school education does a slightly better sob predicting income than college education in this model. Does this mean you should definitely use high-school education as the predictor? Not necessarily! If your goal were simply to try to most accurately predict people's income base on age, income, gender, and education (and you are insistant on only using one of these two models), then the WAIC is a good guide. But if college education is of more theoretical relevance -- e.g. if you are considering the potential implications of a program that helps under-represented groups access college admissions -- then you should probably use the model that uses college education as the predictor.