---
title: "SOCI 620: Lab 9"
author: "Peter McMahan"
date: "2023-02-07"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```
## Logistic regression: prior predictive plots
We'll start by specifying a model that predicts whether or not a student has ever tried smoking cigarettes, using an aggregated score of "delinquent or undesirable behaviors" over the previous 12 months.
Start by loading the data:
```{r data}
d <- read.csv("https://soci620.netlify.com/data/delinquency.csv")
head(d)
```
Let's create a standardized version of the delinquency score to make things easier.
```{r delinq_hist}
d$delinquency_s <- as.numeric(scale(d$delinquency_agg))
summary(d$delinquency_s)
hist(d$delinquency_s)
```
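To see exactly what `scale()` is doing, here is a minimal check on made-up numbers: it subtracts the mean and divides by the sample standard deviation, so the result always has mean 0 and standard deviation 1.

```r
# scale() centers by the mean and divides by the sample SD
x <- c(0, 2, 4, 6, 8)
z <- as.numeric(scale(x))
mean(z)  # 0 (up to floating-point error)
sd(z)    # 1 (up to floating-point error)
```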
Our model looks like this, where $C_i$ is an indicator for having smoked, and $D_i$ is the standardized delinquency score.
$$
\begin{aligned}
C_i &\sim \mathrm{Bernoulli}(p_i)\\
\mathrm{logit}(p_i) &= \alpha + \beta D_i\\
\\
\alpha &\sim \mathrm{Norm}(0,\mbox{??})\\
\beta &\sim \mathrm{Norm}(0,\mbox{??})
\end{aligned}
$$
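The $\mathrm{logit}$ link maps probabilities in $(0,1)$ onto the whole real line, so its inverse maps any value of $\alpha + \beta D_i$ back to a valid probability. A quick base-R check of the inverse-logit formula (the `inv_logit()` function from `rethinking`, used below, computes the same thing as base R's `plogis()`):

```r
# inverse logit: p = 1 / (1 + exp(-x))
x <- c(-4, 0, 4)
p <- 1 / (1 + exp(-x))
round(p, 3)              # 0.018 0.500 0.982
all.equal(p, plogis(x))  # base R's plogis() is the same function
```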
How should we decide what to use for the "??" standard deviations in the priors? Prior predictive simulation is one good tool in making this kind of decision. It involves drawing random samples from our prior, and seeing what kind of outcomes our model would then predict *if we didn't update that prior with our data*.
Let's start by picking two relatively "flat" priors:
$$
\begin{aligned}
\alpha &\sim \mathrm{Norm}(0,5)\\
\beta &\sim \mathrm{Norm}(0,5)
\end{aligned}
$$
One way to assess these priors is to plot the prior predicted distribution of probability of smoking for a few 'example' values on the delinquency scale:
```{r flatpriors1}
# how many simulations to make (this can be big)
n_sim <- 1e4 # 10,000
# take n_sim samples from the prior for alpha and beta
a_sim <- rnorm(n_sim, mean = 0, sd = 5.0)
b_sim <- rnorm(n_sim, 0, 5.0)
# plot the prior predicted density at delinquency_s = 0
# (this is the mean delinquency score, and identical to
# just the prior on alpha)
# Also: give the `density` function `from` and `to` arguments,
# since inv_logit can't return anything outside that range
dens0 <- density( inv_logit(a_sim + b_sim * 0) ,from = 0, to = 1)
plot(
dens0,
main="Prior predictions at D=0",
xlab="prob. of smoking",
ylim = range(0,dens0$y)
)
# try the same for student with above-average delinquency
dens2 <- density( inv_logit(a_sim + b_sim * 2) ,from = 0, to = 1)
plot(
dens2,
main="Prior predictions at D=2",
xlab="prob. of smoking",
ylim = range(0,dens2$y)
)
# or how about for the least-delinquent student in the sample?
mindelinq_s <- min(d$delinquency_s, na.rm=TRUE)
densmin <- density( inv_logit(a_sim + b_sim * mindelinq_s) ,from = 0, to = 1)
plot(
densmin,
main="Prior predictions at min(D)",
xlab="prob. of smoking",
ylim = range(0,densmin$y)
)
```
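The "U" shapes in those density plots can also be summarized numerically. Here is one standalone sketch (with its own prior draws, and base R's `plogis()` standing in for `inv_logit()`): under the Norm(0, 5) priors, most of the prior mass at $D = 2$ ends up within 0.01 of the extremes 0 and 1.

```r
set.seed(1)  # for reproducibility
a_flat <- rnorm(1e4, 0, 5)
b_flat <- rnorm(1e4, 0, 5)
p2 <- plogis(a_flat + b_flat * 2)  # prior-predicted Pr(smoked) at D = 2
# share of prior mass within 0.01 of the extremes 0 and 1
mean(p2 < 0.01 | p2 > 0.99)        # roughly 0.68
```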
Those don't look especially 'uninformative,' particularly for non-zero values of delinquency. To see what's going on, it can be useful to plot a few of the prior-predicted relationships across all values of `delinquency_s`.
```{r flatpriors2}
# what is the reasonable range of our predictor variable?
# (we'll just use its actual range)
d_range <- range(d$delinquency_s,na.rm=TRUE)
# make a grid of hypothetical values for delinquency
d_grid <- seq(from=d_range[1], to=d_range[2],length.out=400)
# make an empty plot by plotting NA
# (when you do this, you have to tell it your axis ranges explicitly)
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
# loop over the first 100 values of a_sim and b_sim to plot the simulated relationship
for(i in 1:100){
lines(
d_grid,
inv_logit(a_sim[i] + b_sim[i] * d_grid),
col = '#00000055'
)
}
```
We can see that the priors we chose think the most likely relationship is either (a) students with low delinquency scores are *very* unlikely to have smoked and students with high delinquency scores are *very* likely to have smoked, or (b) the inverse. Our priors show a strong preference for a "threshold" relationship, in which students are either almost certain to have smoked or almost certain not to have, with some threshold value of `delinquency_s` at which they flip from one to the other.
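One way to quantify that threshold-like behavior (a sketch with fresh prior draws): since $\mathrm{logit}(0.88) \approx 2$, a logistic curve climbs from about 0.12 to about 0.88 over an interval of width $4/|\beta|$ on the predictor scale, so large draws of $\beta$ imply very abrupt transitions.

```r
set.seed(2)
b_flat <- rnorm(1e4, 0, 5)
# D-interval over which each prior curve climbs from ~0.12 to ~0.88
width <- 4 / abs(b_flat)
# share of prior curves that make that jump within one SD of delinquency
mean(width < 1)
```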
Let's try the same plot with some smaller standard deviations on our priors.
```{r, narrowpriors1}
# change the prior for alpha to have a standard deviation of 1.5
a_sim <- rnorm(n_sim,0,1.5)
b_sim <- rnorm(n_sim,0,5.0)
# with new values for a_sim and b_sim, the rest of the code
# for plotting is identical to above
# make an empty plot by plotting NA
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
# loop over the first 100 values of a_sim and b_sim to plot the simulated relationship
for(i in 1:100){
lines(
d_grid,
inv_logit(a_sim[i] + b_sim[i] * d_grid),
col = '#00000055'
)
}
```
```{r, narrowpriors2}
# change the prior for beta to have a standard deviation of 0.3, and put alpha back at 5.0
a_sim <- rnorm(n_sim,0,5.0)
b_sim <- rnorm(n_sim,0,0.3)
# make an empty plot by plotting NA
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
# loop over the first 100 values of a_sim and b_sim to plot the simulated relationship
for(i in 1:100){
lines(
d_grid,
inv_logit(a_sim[i] + b_sim[i] * d_grid),
col = '#00000055'
)
}
```
```{r, narrowpriors3}
# put both prior standard deviations at smaller values
a_sim <- rnorm(n_sim,0,1.5)
b_sim <- rnorm(n_sim,0,0.3)
# make an empty plot by plotting NA
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
# loop over the first 100 values of a_sim and b_sim to plot the simulated relationship
for(i in 1:100){
lines(
d_grid,
inv_logit(a_sim[i] + b_sim[i] * d_grid),
col = '#00000055'
)
}
```
The last simulation uses these priors:
$$
\begin{aligned}
\alpha &\sim \mathrm{Norm}(0,1.5)\\
\beta &\sim \mathrm{Norm}(0,0.3)
\end{aligned}
$$
It spreads the prior probability mass over a wide range of a priori reasonable relationships between delinquency and the probability of smoking.
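These priors can also be read directly as statements on the outcome scale (exact quantiles, computed here with base R's `plogis()` for the inverse logit):

```r
# 95% prior interval for the baseline probability of smoking (at D = 0)
plogis(qnorm(c(0.025, 0.975), mean = 0, sd = 1.5))  # about 0.05 to 0.95
# 95% prior interval for the odds ratio per 1-SD increase in delinquency
exp(qnorm(c(0.025, 0.975), mean = 0, sd = 0.3))     # about 0.56 to 1.80
```

So the priors allow almost any baseline rate of smoking, while keeping the effect of a one-SD change in delinquency within a plausible range of odds ratios.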
## Estimating the logistic regression using `ulam()`
Now we'll try estimating that logistic regression using the `ulam()` function. This function does a lot of work, and will always take some time to run, even on very simple models. This is because it writes a dedicated program in the Stan modeling language, which is then translated to C++ and compiled into an optimized program for your computer. Once this is done, the program runs very quickly. This is somewhat tedious for simple models, but is incredibly useful for more complex models.
```{r, fit1, cache=TRUE}
# ulam wants a version of the data with only the
# variables in the model
d_slim <- d[c('ever_smoked','delinquency_s')]
# drop incomplete cases
cc <- complete.cases(d[c('ever_smoked','delinquency_s')])
d_slim <- d_slim[cc,]
model1 <- alist(
ever_smoked ~ dbinom(1,p),
logit(p) <- a + bd*delinquency_s,
a ~ normal(0,1.5),
bd ~ normal(0,0.3)
)
# adding refresh=0 suppresses a lot of the output
fit1 <- ulam(model1, data=d_slim, chains = 4, cores = 4, iter = 6000, refresh=0)
precis(fit1)
```