---
title: "SOCI 620: Lab 6"
author: "Peter McMahan"
date: "1/27/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
# pander for printing output
# if it's not installed use `install.packages('pander')`
library(pander)
```
# Examining fit
This lab will introduce two ways of inspecting the "fit" of a model: posterior mean plots and posterior predictive plots.
Start by loading the data on 2016 annual income, and taking sample of 150 individuals to make our posterior distributions less precise.
```{r load_data}
d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2016.csv")
# take a random sample of 150 rows
d <- d[sample(nrow(d),150),]
pander(head(d))
```
We can start by building a simple model that predicts a person's log-income using just their age.
```{r model1}
m1 <- alist(
log_income ~ dnorm(mu,sigma),
mu <- a + b1*age,
a ~ dnorm(8,3),
b1 ~ dnorm(0,2),
sigma ~ dunif(0,5)
)
fit1 <- quap(m1,data=d)
# look at the estimates
precis(fit1, digits=4)
```
If the estimate for `age` seems small, that is because we are measuring age in years and the range of ages in the data is large.
Let's re-estimate using standardized age.
```{r scale_age}
# just to look at the age range in the data
range(d$age)
# construct standardized age
d$age.s <- scale(d$age)
range(d$age.s)
```
```{r model1_standardized}
# recreate fit1, using standardized age
fit1 <- quap(alist(
log_income ~ dnorm(mu,sigma),
mu <- a + b1*age.s,
a ~ dnorm(8,3),
b1 ~ dnorm(0,2),
sigma ~ dunif(0,5)
),data=d)
# look at the estimates
precis(fit1)
```
We can start to look at the predictions of our model by plotting our "point estimate" of $\mu$ as a function of standardized age.
```{r point_estimate_plot}
# set up plot window for projector visibility
options(repr.plot.width=12, repr.plot.height=8)
# first plot the data points
plot(d$age.s, d$log_income)
# note: here we used the plot(x,y) form of the plot function,
# which plots x on the horizontal axis against y on the vertical.
# There is another way to make the same scatter plot using the
# 'formula' form of plot:
# plot(log_income~age.s,data=d)
# This takes the form of plot(y~x, data=d). The two forms create
# identical output.
# get the coefficients
cf1 <- coef(fit1)
cf1
a_fit <- cf1['a']
b_fit <- cf1['b1']
# create a grid of hypothetical ages to predict
n_grid <- 400
grid <- seq(min(d$age.s),max(d$age.s),length.out=n_grid)
# plot the "deterministic" portion of the model:
# mu <- a + b1*age.s
lines(grid , a_fit + b_fit*grid)
```
This shows us the overall trend predicted by the data, but ignores the uncertainty in our estimates of $\mu$ that comes from uncertainty in `a` and `b1`.
One simple way to try to express this uncertainty is to draw the same line as above multiple times, using draws from our posterior rather than the maximum a-posteriori estimates.
```{r plot_multipoint_estimate}
# get a sample of coefficients
n_samp <- 300
s <- extract.samples(fit1,n_samp)
head(s)
# plot the data
plot(d$age.s,d$log_income)
# create a grid of hypothetical ages to predict
n_grid <- 400
grid <- seq(min(d$age.s),max(d$age.s),length.out=n_grid)
# create a loop to plot each of the lines
for(i in 1:n_samp){
# each time we go through this loop, i will take
# on one of the values between 1 and n_samp (30).
# get the coefficients from the i'th row of the sample s
a_fit <- s$a[i]
b_fit <- s$b1[i]
# plot the line, but make it slightly "translucent" for clarity
lines(grid , a_fit + b_fit*grid , col="#00000011")
}
```
This gives us an idea of the spread of uncertainty in our model, but can be hard to read or interpret. A common approach here is to replace the pile of spaghetti in this plot with a shaded region that will show, say, the 90% credible interval for our posterior.
The procedure for this is straightforward:
1. For every point along `grid`, draw a large number of posterior samples (say 1000) for `a` and `b1`.
2. Calculate the value of $\mu$ for each of those samples.
3. Calculate a credible interval on $\mu$ for each point long `grid`.
There are lots of ways to do this in R, and the `rethinking` package has some tools to make it easier. But we will first do it "manually" to make it clear what is going on.
```{r plot_mean_predicted_thehardway}
# create a grid of hypothetical ages to predict
n_grid <- 400
grid <- seq(min(d$age.s),max(d$age.s),length.out=n_grid)
# how many samples at each grid point?
n_samp <- 1000
s1 <- extract.samples(fit1,n_samp)
head(s1)
# use the `sapply()` function to calculate the 90% credible
# interval at each point x in the grid
mu_ci <- sapply(grid,function(x){
# this is like a loop, where x will take on every value
# in the grid
quantile(s1$a + s1$b1 * x , probs=c(.05,.95))
})
# see what the output of this looks like
mu_ci
# plot the data
plot(d$age.s,d$log_income,pch=16)
# use the `polygon` function to plot a shaded region
polygon(
x = c(grid,rev(grid)), # rev() returns its argument in reverse order
y = c(mu_ci[1,],rev(mu_ci[2,])),
border = NA , col = '#00000055')
```
Looks good! But a bit of a pain. Let's try the same thing using some functions from the `rethinking` package:
* `link()` will draw samples from posterior distribution of $\mu$ and calculated predicted values from the data data.
* `PI()` calculates the percentile interval on a vector.
* `shade()` is just a convenient wrapper around `polygon()` that we used above.
```{r plot_mean_predicted1}
# create a grid of hypothetical ages to predict
n_grid <- 400
# create a data frame listing our grid as possible values of age.s
pred_dat <- data.frame(
age.s = seq(min(d$age.s),max(d$age.s),length.out=n_grid)
)
head(pred_dat)
# how many samples at each grid point?
n_samp <- 1000
# use the 'link' function to take n_samp posterior samples at each point in pred_dat
mu_post <- link(fit1, data=pred_dat, n=n_samp)
# mu_post has one row for each of our posterior samples, and one column
# for each point in our grid.
dim(mu_post)
```
```{r plot_mean_predicted2}
# calculate the credible interval for every column of mu_post
# `apply()` here will call PI(mu[,i],prob=0.9) for every column i.
mu_ci <- apply(mu_post, 2, PI, prob=0.9)
```
```{r plot_mean_predicted3}
# plot the data
plot(d$age.s,d$log_income,pch=16)
# use the `shade()` function from rethinking to draw the shaded region
shade(mu_ci, pred_dat$age.s)
```
### Posterior predictive plots
Posterior predictive plots are constructed using pretting much the same procedure, with the important difference that we are incorporating two sources of randomness. We still use randomness to characterize our uncertainty about the posterior distribution of $\mu$, but now we also want to take into account the _modeled_ uncertainty in $\sigma$.
Again, we'll make the posterior predictive plot twice: once by hand and then using functions from the `rethinking` package.
```{r post_predictive_thehardway}
# create a grid of hypothetical ages to predict
n_grid <- 400
grid <- seq(min(d$age.s),max(d$age.s),length.out=n_grid)
# how many samples at each grid point?
n_samp <- 1000
s1 <- extract.samples(fit1,n_samp)
# use the `sapply()` function to draw a random sample
# from Norm(mu,sigma) at each point in the posterior
# Then calculate the 90% credible interval at each point x in the grid
sim_ci <- sapply(grid,function(x){
x_samp <- rnorm(n_samp,s1$a + s1$b1 * x,s1$sigma)
quantile(x_samp,probs=c(.05,.95))
})
# see what the output of this looks like
sim_ci
```
```{r post_predictive_thehardway2}
# plot the data
plot(d$age.s,d$log_income,pch=16)
# use the `polygon` function to plot a shaded region
# this is identical to the code from above, but
# plots `sim_ci` instead of `mu_ci`
polygon(
x = c(grid,rev(grid)),
y = c(sim_ci[1,],rev(sim_ci[2,])),
border = NA , col = '#00000055')
```
Again, the `rethinking` package makes this a bit simpler. Now we will use the `sim()` function to simulate predicted data from the model.
```{r post_predictive_rethinking}
n_grid <- 400
# create a data frame listing our grid as possible values of age.s
pred_dat <- data.frame(
age.s = seq(min(d$age.s),max(d$age.s),length.out=n_grid)
)
```
```{r post_predictive_rethinking2}
n_samp <- 1000
# use the 'sim' function to take n_samp posterior samples at each point in pred_dat
sim_samp <- sim(fit1, data=pred_dat, n=n_samp)
# mu_post has one row for each of our posterior samples, and one column
# for each point in our grid.
dim(sim_samp)
```
```{r post_predictive_rethinking3}
# calculate the credible interval for every column of mu_post
# `apply()` here will call PI(mu[,i],prob=0.9) for every column i.
sim_ci <- apply(sim_samp, 2, PI, prob=0.9)
#sim_ci
```
```{r post_predictive_rethinking4}
# plot the data
plot(d$age.s,d$log_income,pch=16)
# use the `shade()` function from rethinking to draw the shaded region
shade(sim_ci, pred_dat$age.s)
```