---
title: "SOCI 620: Lab 9"
author: "Peter McMahan"
date: "2/10/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```
## Logistic regression: prior predictive plots
We'll start by specifying a model that predicts whether or not a student has ever tried smoking cigarettes, using an aggregated score of "delinquent or undesirable behaviors" over the previous 12 months.
Start by loading the data:
```{r, data}
d <- read.csv("https://soci620.netlify.com/data/delinquency.csv")
head(d)
```
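Before modeling, it is worth a quick look at the outcome variable. (This assumes `ever_smoked` is coded 0/1, as the Bernoulli model below requires.)
```{r, outcome_check}
# distribution of the binary outcome, including any missing values
table(d$ever_smoked, useNA = "ifany")
```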
Let's create a standardized version of the delinquency score to make things easier.
```{r, delinq_hist}
d$delinquency_s <- as.numeric(scale(d$delinquency_agg))
summary(d$delinquency_s)
hist(d$delinquency_s)
```
Our model looks like this, where $C_i$ is an indicator for having smoked, and $D_i$ is the standardized delinquency score.
$$
\begin{aligned}
C_i &\sim \mathrm{Bernoulli}(p_i)\\
\mathrm{logit}(p_i) &= \alpha + \beta D_i\\
\\
\alpha &\sim \mathrm{Norm}(0,\mbox{??})\\
\beta &\sim \mathrm{Norm}(0,\mbox{??})
\end{aligned}
$$
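Before choosing priors, it helps to recall what the logit link does. A quick sketch, using `inv_logit()` from the already-loaded `rethinking` package:
```{r, logit_demo}
# logit maps probabilities in (0,1) to the whole real line;
# inv_logit maps back: inv_logit(x) = 1/(1 + exp(-x))
inv_logit(c(-4, -1.5, 0, 1.5, 4))
# roughly 0.018, 0.182, 0.500, 0.818, 0.982
```
So $\alpha$ is the log-odds of having smoked at $D_i = 0$ (the mean delinquency score), and $\beta$ is the change in log-odds per standard deviation of delinquency.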
How should we decide what to use for the "??" standard deviations in the priors? Prior predictive simulation is one good tool in making this kind of decision. It involves drawing random samples from our prior, and seeing what kind of outcomes our model would then predict *if we didn't update that prior with our data*.
Let's start by picking two relatively "flat" priors:
$$
\begin{aligned}
\alpha &\sim \mathrm{Norm}(0,5)\\
\beta &\sim \mathrm{Norm}(0,5)
\end{aligned}
$$
```{r, flatpriors}
# how many simulations to make
n_sim <- 1e4 # 10,000
# what is the reasonable range of our predictor variable?
# (we'll just use its actual range)
d_range <- range(d$delinquency_s,na.rm=TRUE)
# make a grid of hypothetical values for delinquency
d_grid <- seq(from=d_range[1], to=d_range[2],length.out=400)
# take n_sim samples from the prior for alpha and beta
a_sim <- rnorm(n_sim, mean = 0, sd = 5.0)
b_sim <- rnorm(n_sim, 0, 5.0)
# For each point on d_grid, calculate the value of p for
# each of the sampled parameter values
p_sim <- sapply(
1:n_sim,
function(i){
# inv_logit() is exactly the same as logistic()
inv_logit(a_sim[i] + d_grid*b_sim[i])
}
)
dim(p_sim)
```
We have simulated 10,000 relationships between delinquency and smoking prevalence that our priors think are most reasonable, and calculated each of those relationships over 400 hypothetical values of `delinquency_s`. (Note, in our case `delinquency_s` only takes 16 distinct values. But we are treating it as a continuous variable for the sake of the model.)
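We can check that count of distinct values directly:
```{r, distinct_check}
# number of distinct (non-missing) values the predictor actually takes
length(unique(na.omit(d$delinquency_s)))
```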
We can use this to look at the distribution of smoking probabilities for various meaningful values of `delinquency_s` by looking at individual rows of `p_sim`.
```{r, flatpriors_meanplot}
# look at the predicted p for the mean value of delinquency
# (delinquency_s=0)
# Our grid doesn't contain a value for exactly zero.
# This is a trick to find the point in our grid that is closest to zero:
# take the absolute value of d_grid, and then find the location
# of the minimum value in that vector
min_d_grid <- which.min(abs(d_grid))
d_grid[min_d_grid]
# grab just that row and plot it
s <- p_sim[min_d_grid,]
plot(density(s,from=0,to=1),main=sprintf("Prior predictive density at D=%.2f",d_grid[min_d_grid]))
```
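To quantify how extreme these prior predictions are, we can check how much of the prior mass at $D=0$ piles up near the boundaries. (A sketch reusing `s` from the chunk above.)
```{r, flatpriors_masscheck}
# share of prior draws at D = 0 that put the smoking probability
# below 5% or above 95%
mean(s < 0.05 | s > 0.95)
```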
That doesn't look so good. Let's see what it predicts at the minimum and maximum values of delinquency.
```{r, flatpriors_extrnplot}
# minimum delinquency is just the first row of p_sim
s <- p_sim[1,]
plot(density(s,from=0,to=1),main=sprintf("Prior predictive density at D=%.2f",d_grid[1]))
# maximum delinquency is the 400th row of p_sim
s <- p_sim[400,]
plot(density(s,from=0,to=1),main=sprintf("Prior predictive density at D=%.2f",d_grid[400]))
```
Those are even worse. To see what's going on, it can be useful to plot a few of the prior-predicted relationships across all values of `delinquency_s`.
```{r, flatpriors_rangeplot}
# make an empty plot by plotting NA
# (when you do this, you have to tell it your axis ranges explicitly)
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
# loop over the first few columns of p_sim to plot the simulated relationship
for(i in 1:20){
lines(d_grid,p_sim[,i])
}
```
We can get a better idea of what is going on by plotting a few hundred of the simulated relationships.
```{r, flatpriors_rangeplot2}
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
# loop over the first few hundred columns of p_sim to plot the simulated relationship
for(i in 1:400){
lines(d_grid,p_sim[,i],col="#00000044")
}
```
We can see that the priors we chose think the most likely relationship is either (a) students with low delinquency scores are *very* unlikely to have smoked and students with high delinquency scores are *very* likely to have smoked, or (b) the inverse. Our priors show a strong preference for a "threshold" relationship, in which students are either almost certain to have smoked or almost certain not to have, with some threshold value of `delinquency_s` at which they flip from one to the other.
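To see why this happens, note that under $\beta \sim \mathrm{Norm}(0,5)$, moving across the observed range of `delinquency_s` (about 5.5 standard deviations) shifts the log-odds by $5.5\beta$. A quick sketch of how big that shift typically is:
```{r, flatprior_shift}
# typical absolute shift in log-odds across the predictor's range
# under beta ~ Normal(0, 5); the observed range is about 5.5 units
quantile(abs(rnorm(1e4, 0, 5) * 5.5), probs = c(0.5, 0.9))
```
A shift of even 5 log-odds units already moves the probability from about 0.08 to about 0.92, so this prior almost guarantees a near-step function.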
Let's try the same plot with some smaller standard deviations on our priors.
```{r, narrowpriors1}
# change the prior for alpha to have a standard deviation of 1.5
a_sim <- rnorm(n_sim,0,1.5)
b_sim <- rnorm(n_sim,0,5.0)
p_sim <- sapply(
1:n_sim,
function(i){
inv_logit(a_sim[i] + d_grid*b_sim[i])
}
)
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
for(i in 1:400){
lines(d_grid,p_sim[,i],col="#00000044")
}
```
```{r, narrowpriors2}
# change the prior for beta to have a standard deviation of 0.3, but leave alpha at 5.0
a_sim <- rnorm(n_sim,0,5.0)
b_sim <- rnorm(n_sim,0,0.3)
p_sim <- sapply(
1:n_sim,
function(i){
inv_logit(a_sim[i] + d_grid*b_sim[i])
}
)
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
for(i in 1:400){
lines(d_grid,p_sim[,i],col="#00000044")
}
```
```{r, narrowpriors3}
# put both prior standard deviations at smaller values
a_sim <- rnorm(n_sim,0,1.5)
b_sim <- rnorm(n_sim,0,0.3)
p_sim <- sapply(
1:n_sim,
function(i){
inv_logit(a_sim[i] + d_grid*b_sim[i])
}
)
plot(NA,xlim=d_range,ylim=c(0,1),xlab='delinquency_s',ylab='Pr(ever_smoked)')
for(i in 1:400){
lines(d_grid,p_sim[,i],col="#00000044")
}
```
```{r, narrowpriors_extrplots}
# minimum delinquency is just the first row of p_sim
s <- p_sim[1,]
plot(density(s,from=0,to=1),main=sprintf("Prior predictive density at D=%.2f",d_grid[1]))
# maximum delinquency is the 400th row of p_sim
s <- p_sim[400,]
plot(density(s,from=0,to=1),main=sprintf("Prior predictive density at D=%.2f",d_grid[400]))
```
The last simulation uses these priors:
$$
\begin{aligned}
\alpha &\sim \mathrm{Norm}(0,1.5)\\
\beta &\sim \mathrm{Norm}(0,0.3)
\end{aligned}
$$
It spreads the prior probability mass over a wide range of a priori reasonable relationships between delinquency and the probability of smoking.
## Estimating the logistic regression using `ulam()`
Now we'll try estimating that logistic regression using the `ulam()` function. This function does a lot of work, and will always take some time to run, even on very simple models. This is because it translates the model into a dedicated Stan program, which is then compiled (via C++) and optimized for your computer. Once this is done, the program runs very quickly. This is somewhat tedious for simple models, but is incredibly useful for more complex models.
```{r, fit1, cache=TRUE}
# find complete cases
d_slim <- d[c('ever_smoked','delinquency_s')]
cc <- !(is.na(d_slim$ever_smoked) | is.na(d_slim$delinquency_s))
d_slim <- d_slim[cc,]
# note: to make scaled values work with ulam, you may need to 'cast'
# them as numeric
d_slim$delinquency_s <- as.numeric(d_slim$delinquency_s)
m1 <- alist(
ever_smoked ~ dbinom(1,p),
logit(p) <- a + bd*delinquency_s,
a ~ normal(0,1.5),
bd ~ normal(0,0.3)
)
# adding refresh=0 suppresses a lot of the output
f1 <- ulam(m1,data=d_slim,chains = 4,cores=4, iter = 6000, refresh=0)
precis(f1)
```
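`precis()` reports the parameters on the log-odds scale. To interpret them as probabilities and odds, we can transform posterior samples with `inv_logit()` and `exp()`. (A brief sketch; exact numbers will vary from run to run.)
```{r, interpret}
# posterior samples from the fitted model
post <- extract.samples(f1)
# implied probability of having smoked at the mean delinquency score
quantile(inv_logit(post$a), probs = c(0.05, 0.5, 0.95))
# proportional change in odds per standard deviation of delinquency
quantile(exp(post$bd), probs = c(0.05, 0.5, 0.95))
```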