---
title: "SOCI 620: Lab 6"
author: "Peter McMahan"
date: "1/26/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```
# Prior predictive plots
This lab will follow the lecture, using (standardized) age to predict log income. We'll start by creating some prior predictive plots to choose a prior. To emphasize the 'prior' aspect of this, we can make these predictions before we load any data. Our model will look like this:
$$
\begin{aligned}
\log(\mathrm{Inc}_i) &\sim \mathrm{Norm}(\mu_i,\sigma) \\
\mu_i &= \alpha + \beta \mathrm{St}(\mathrm{Age}_i) \\
&\\
\alpha &\sim \mathrm{Norm}(?,?)\\
\beta &\sim \mathrm{Norm}(?,?) \\
\sigma &\sim \mathrm{Unif}(0,5)
\end{aligned}
$$
We're trying to fill in the hyperparameters for $\alpha$ and $\beta$.
First, we need to think of a reasonable range for our variables. For standardized age this is easy -- a standardized variable stays mostly in the range (-2, 2). For log income, we can set a very wide range for what we think the average income of any particular age group might be. Log incomes in the range (0, 20) correspond to dollar incomes between $1 and about $485 million per year.
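As a quick sanity check on that conversion, we can exponentiate the endpoints of the log-income range:

```{r check_range}
# log incomes of 0 and 20 on the dollar scale
exp(c(0, 20))
# about $1 and about $485 million
```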
```{r prior_ranges}
# age range
age_range <- c(-2, 2)
# log income range
logincome_range <- c(0, 20)
# a 'grid' of ages in the range
n_grid <- 400
age_grid <- seq(from = age_range[1], to = age_range[2], length.out = n_grid)
```
Now we can plot some hypothetical model predictions, based only on the prior. To make a point, let's start with a very bad prior: $\alpha \sim \mathrm{Norm}(8.0, 0.5)$ and $\beta \sim \mathrm{Norm}(-1.0, 0.5)$. To make a prior predictive plot with these, we will draw, say, 200 random samples from each of these prior distributions and plot the resulting prediction.
```{r prior_plot_1}
# create an 'empty' plot by plotting "NA"
# this means you have to give it explicit values
# for the range of the plot
plot(
  NA, xlim = age_range, ylim = logincome_range,
  xlab = "Age (standardized)",
  ylab = "Log income"
)
# define the number of prior samples
n_samp <- 200
# use a 'for' loop to plot each predicted line
for(i in 1:n_samp){
  # each time through the loop, i takes on one of
  # the values between 1 and n_samp (200).
  # draw one random sample from each of the priors
  a_prior <- rnorm(1, mean = 8.0, sd = 0.5)
  b_prior <- rnorm(1, mean = -1.0, sd = 0.5)
  # plot the line, but make it slightly "translucent" for clarity
  lines(age_grid, a_prior + b_prior*age_grid, col = "#00000011")
}
```
Let's try now with a more reasonable prior. This code is exactly the same as the last code block, only with different values for the normal priors.
```{r prior_plot_2}
# create an 'empty' plot by plotting "NA"
# this means you have to give it explicit values
# for the range of the plot
plot(
  NA, xlim = age_range, ylim = logincome_range,
  xlab = "Age (standardized)",
  ylab = "Log income"
)
# define the number of prior samples
n_samp <- 200
# use a 'for' loop to plot each predicted line
for(i in 1:n_samp){
  # each time through the loop, i takes on one of
  # the values between 1 and n_samp (200).
  # draw one random sample from each of the priors
  a_prior <- rnorm(1, mean = 10.0, sd = 2.0)
  b_prior <- rnorm(1, mean = 0, sd = 3.0)
  # plot the line, but make it slightly "translucent" for clarity
  lines(age_grid, a_prior + b_prior*age_grid, col = "#00000011")
}
```
# Examining fit
Next, we will introduce two ways of inspecting the "fit" of a model: posterior mean plots and posterior predictive plots.
Start by loading the data on 2016 annual income, and taking a random sample of 150 individuals to make our posterior distributions less precise.
```{r load_data}
d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2016.csv")
# take a random sample of 150 rows
d <- d[sample(nrow(d),150),]
head(d)
```
We can start by building a simple model that predicts a person's log-income using just their age.
```{r model1}
m0 <- alist(
  log_income ~ dnorm(mu, sigma),
  mu <- a + b1*age,
  a ~ dnorm(10, 2),
  b1 ~ dnorm(0, 3),
  sigma ~ dunif(0, 5)
)
fit0 <- quap(m0, data = d)
# look at the estimates
precis(fit0, digits = 4)
```
If the estimate for `b1` (the coefficient on age) seems small, that is because we are measuring age in years and the range of ages in the data is large.
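To see the effect of units concretely, here is a small simulated example (the data are made up): dividing a predictor by its standard deviation multiplies the slope by that same standard deviation, without changing the fit at all.

```{r rescale_demo}
# made-up age-like data to illustrate the effect of standardizing a predictor
set.seed(2)
x <- rnorm(100, mean = 40, sd = 15)
y <- 0.01 * x + rnorm(100, sd = 0.5)
b_raw <- coef(lm(y ~ x))[2]
b_std <- coef(lm(y ~ scale(x)))[2]
# the ratio of the two slopes equals sd(x): same model, different units
c(raw = unname(b_raw), standardized = unname(b_std), ratio = unname(b_std / b_raw))
```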
Let's re-estimate using standardized age.
```{r scale_age}
# just to look at the age range in the data
range(d$age)
# construct standardized age
d$age.s <- scale(d$age)
range(d$age.s)
```
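For reference, `scale()` is just subtracting the mean and dividing by the standard deviation; a minimal by-hand version (on a made-up vector):

```{r scale_byhand}
# standardize a small made-up vector by hand
age_demo <- c(25, 40, 55, 70)
age_demo_s <- (age_demo - mean(age_demo)) / sd(age_demo)
age_demo_s
# matches scale(), which returns a one-column matrix
all.equal(age_demo_s, as.numeric(scale(age_demo)))
```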
```{r model1_standardized}
# re-specify the model, now using standardized age
m1 <- alist(
  log_income ~ dnorm(mu, sigma),
  mu <- a + b1*age.s,
  a ~ dnorm(10, 2),
  b1 ~ dnorm(0, 3),
  sigma ~ dunif(0, 5)
)
fit1 <- quap(m1,data=d)
# look at the estimates
precis(fit1)
```
We can start to look at the predictions of our model by plotting our "point estimate" of $\mu$ as a function of standardized age.
```{r point_estimate_plot}
# set up plot window for projector visibility
options(repr.plot.width=12, repr.plot.height=8)
# first plot the data points
plot(d$age.s, d$log_income)
# note: here we used the plot(x,y) form of the plot function,
# which plots x on the horizontal axis against y on the vertical.
# There is another way to make the same scatter plot using the
# 'formula' form of plot:
# plot(log_income~age.s,data=d)
# This takes the form of plot(y~x, data=d). The two forms create
# identical output.
# get the coefficients using the `coef` function
cf1 <- coef(fit1)
cf1
a_fit <- cf1['a']
b_fit <- cf1['b1']
# create a grid of hypothetical ages to predict
n_grid <- 400
age_grid <- seq(min(d$age.s),max(d$age.s),length.out=n_grid)
# add a line to the plot for the
# "deterministic" portion of the model:
# mu <- a + b1*age.s
lines(age_grid , a_fit + b_fit*age_grid)
```
This shows us the overall trend predicted by the data, but ignores the uncertainty in our estimates of $\mu$ that comes from uncertainty in `a` and `b1`.
One simple way to express this uncertainty is to draw the same line as above multiple times, using draws from our posterior rather than the maximum a posteriori (MAP) estimates.
```{r plot_multipoint_estimate}
# get a sample of coefficients
n_samp <- 30
post_samp <- extract.samples(fit1,n_samp)
head(post_samp)
# plot the data in a fresh plot
plot(d$age.s,d$log_income)
# (re)create a grid of hypothetical ages to predict
n_grid <- 400
age_grid <- seq(min(d$age.s),max(d$age.s),length.out=n_grid)
# create a loop to plot each of the lines
for(i in 1:n_samp){
  # each time through the loop, i takes on one of
  # the values between 1 and n_samp (30).
  # get the coefficients from the i'th row of the samples
  a_fit <- post_samp$a[i]
  b_fit <- post_samp$b1[i]
  # plot the line, but make it slightly "translucent" for clarity
  lines(age_grid, a_fit + b_fit*age_grid, col = "#00000011")
}
```
This gives us an idea of the spread of uncertainty in our model, but can be hard to read or interpret. A common approach here is to replace the pile of spaghetti in this plot with a shaded region that will show, say, the 90% credible interval for our posterior.
The procedure for this is straightforward:
1. For every point along `age_grid`, draw a large number of posterior samples (say 1000) for `a` and `b1`.
2. Calculate the value of $\mu$ for each of those samples.
3. Calculate a credible interval on $\mu$ at each point along `age_grid`.
There are lots of ways to do this in R, and the `rethinking` package has some tools to make it easier. But we will first do it "manually" to make it clear what is going on.
```{r plot_mean_predicted_thehardway}
# create a grid of hypothetical ages to predict
n_grid <- 400
age_grid <- seq(min(d$age.s),max(d$age.s),length.out=n_grid)
# how many samples at each grid point?
n_samp <- 1000
s1 <- extract.samples(fit1,n_samp)
head(s1)
# use the `sapply()` function to calculate the 90% credible
# interval at each point x in the grid
mu_ci <- sapply(age_grid, function(x){
  # this works like a loop, where x takes on
  # every value in the grid
  quantile(s1$a + s1$b1 * x, probs = c(.05, .95))
})
# see what the output of this looks like
mu_ci
# plot the data
plot(d$age.s,d$log_income,pch=16)
# use the `polygon` function to plot a shaded region
polygon(
  x = c(age_grid, rev(age_grid)), # rev() returns its argument in reverse order
  y = c(mu_ci[1,], rev(mu_ci[2,])),
  border = NA, col = '#00000055'
)
```
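The `rev()` trick in the `polygon()` call deserves a closer look: the vertices must trace the lower edge of the band left to right, then the upper edge right to left, so the outline closes into a single region. A tiny made-up example shows the vertex ordering:

```{r polygon_demo}
# three grid points with made-up lower and upper bounds
x  <- c(1, 2, 3)
lo <- c(0, 1, 0)
hi <- c(2, 3, 2)
# vertex order: lower edge left-to-right, then upper edge right-to-left
cbind(x = c(x, rev(x)), y = c(lo, rev(hi)))
```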
Looks good! But a bit of a pain. Let's try the same thing using some functions from the `rethinking` package:
* `link()` draws samples from the posterior distribution of $\mu$ and calculates predicted values for the data we supply.
* `PI()` calculates the percentile interval on a vector.
* `shade()` is a convenient wrapper around the `polygon()` approach we used above.
```{r plot_mean_predicted1}
# create a grid of hypothetical ages to predict
n_grid <- 400
# create a data frame listing our grid as possible values of age.s
pred_dat <- data.frame(
  age.s = seq(min(d$age.s), max(d$age.s), length.out = n_grid)
)
head(pred_dat)
# how many samples at each grid point?
n_samp <- 1000
# use the 'link' function to take n_samp posterior samples at each point in pred_dat
mu_post <- link(fit1, data=pred_dat, n=n_samp)
# mu_post has one row for each of our posterior samples, and one column
# for each point in our grid.
dim(mu_post)
```
```{r plot_mean_predicted2}
# calculate the credible interval for every column of mu_post
# `apply()` here calls PI(mu_post[,i], prob=0.9) for every column i.
mu_ci <- apply(mu_post, 2, PI, prob=0.9)
```
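As far as its output is concerned, `PI()` behaves like a thin wrapper around `quantile()` with a centered interval. A minimal stand-in, assuming that behavior (the function name `pi_interval` is ours, not part of `rethinking`):

```{r pi_sketch}
# a minimal stand-in for rethinking's PI(), assuming it returns the
# central percentile interval of its input
pi_interval <- function(samples, prob = 0.9) {
  quantile(samples, probs = c((1 - prob) / 2, 1 - (1 - prob) / 2))
}
set.seed(1)
z <- rnorm(1000)
pi_interval(z, prob = 0.9)  # lower 5% and upper 95% quantiles
```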
```{r plot_mean_predicted3}
# plot the data
plot(d$age.s,d$log_income,pch=16)
# use the `shade()` function from rethinking to draw the shaded region
shade(mu_ci, pred_dat$age.s)
```
### Posterior predictive plots
Posterior predictive plots are constructed using pretty much the same procedure, with the important difference that we incorporate two sources of randomness. We still use randomness to characterize our uncertainty about the posterior distribution of $\mu$, but now we also take into account the _modeled_ sampling noise governed by $\sigma$.
Again, we'll make the posterior predictive plot twice: once by hand and then using functions from the `rethinking` package.
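Before the full version, a toy contrast (with made-up numbers) shows why the predictive interval is wider than the interval on $\mu$ alone: the observation noise gets added on top of the posterior uncertainty.

```{r predictive_toy}
set.seed(3)
# made-up posterior for mu: tightly concentrated around 10
mu_draws <- rnorm(1000, mean = 10, sd = 0.1)
# predictive draws add observation noise (sigma = 1) on top
y_draws <- rnorm(1000, mean = mu_draws, sd = 1)
# 90% interval widths: the predictive interval is much wider
c(mu = unname(diff(quantile(mu_draws, c(.05, .95)))),
  predictive = unname(diff(quantile(y_draws, c(.05, .95)))))
```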
```{r post_predictive_thehardway}
# create a grid of hypothetical ages to predict
n_grid <- 400
age_grid <- seq(min(d$age.s),max(d$age.s),length.out=n_grid)
# how many samples at each grid point?
n_samp <- 1000
s1 <- extract.samples(fit1,n_samp)
# use the `sapply()` function to draw a random sample
# from Norm(mu,sigma) at each point in the posterior
# Then calculate the 90% credible interval at each point x in the grid
sim_ci <- sapply(age_grid, function(x){
  x_samp <- rnorm(n_samp, s1$a + s1$b1 * x, s1$sigma)
  quantile(x_samp, probs = c(.05, .95))
})
# see what the output of this looks like
sim_ci
```
```{r post_predictive_thehardway2}
# plot the data
plot(d$age.s,d$log_income,pch=16)
# use the `polygon` function to plot a shaded region
# this is identical to the code from above, but
# plots `sim_ci` instead of `mu_ci`
polygon(
  x = c(age_grid, rev(age_grid)),
  y = c(sim_ci[1,], rev(sim_ci[2,])),
  border = NA, col = '#00000055'
)
```
Again, the `rethinking` package makes this a bit simpler. Now we will use the `sim()` function to simulate predicted data from the model.
```{r post_predictive_rethinking}
n_grid <- 400
# create a data frame listing our grid as possible values of age.s
pred_dat <- data.frame(
  age.s = seq(min(d$age.s), max(d$age.s), length.out = n_grid)
)
```
```{r post_predictive_rethinking2}
n_samp <- 1000
# use the 'sim' function to take n_samp posterior samples at each point in pred_dat
sim_samp <- sim(fit1, data=pred_dat, n=n_samp)
# sim_samp has one row for each of our posterior samples, and one column
# for each point in our grid.
dim(sim_samp)
```
```{r post_predictive_rethinking3}
# calculate the credible interval for every column of sim_samp
# `apply()` here calls PI(sim_samp[,i], prob=0.9) for every column i.
sim_ci <- apply(sim_samp, 2, PI, prob=0.9)
#sim_ci
```
```{r post_predictive_rethinking4}
# plot the data
plot(d$age.s,d$log_income,pch=16)
# use the `shade()` function from rethinking to draw the shaded region
shade(sim_ci, pred_dat$age.s)
```