We'll start by specifying a model that predicts whether or not a student has ever tried smoking cigarettes, using an aggregated score of "delinquent or undesirable behaviors" over the previous 12 months.

Start by loading the data:

In [1]:

```
d <- read.csv("https://soci620.netlify.com/data/delinquency.csv")
head(d)
```

| | male | race_white | race_black | race_american_indian | race_asian_pacificislander | race_other | hispanic | grade | fam_income | fam_logincome | ever_smoked | num_smoked | ever_alcohol | num_drinks | ever_marijuana | ever_cocaine | delinquency_agg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | \<int\> | \<int\> | \<int\> | \<int\> | \<int\> | \<int\> | \<int\> | \<int\> | \<dbl\> | \<dbl\> | \<int\> | \<int\> | \<int\> | \<int\> | \<int\> | \<int\> | \<int\> |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | 55000 | 10.915107 | 1 | 1 | 1 | 5 | 1 | 0 | 5 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 11 | NA | NA | 1 | 8 | 1 | 10 | 0 | 0 | 4 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 10 | 45000 | 10.714440 | 1 | NA | 0 | NA | 0 | 0 | 1 |
| 4 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 12 | 9000 | 9.105091 | 1 | 1 | 0 | NA | 1 | NA | 4 |
| 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 12 | 2000 | 7.601402 | 1 | NA | 1 | 1 | 0 | 0 | 5 |
| 6 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 70000 | 11.156265 | 0 | NA | 0 | NA | 0 | 0 | 1 |

In [2]:

```
dim(d)
cor(d[,c('grade','race_white','delinquency_agg')],use = 'complete')
plot(jitter(d$grade),jitter(d$delinquency_agg),pch=16,col='#00000022')
```

- 6504
- 17

| | grade | race_white | delinquency_agg |
|---|---|---|---|
| grade | 1.000000000 | 0.002866783 | -0.009501075 |
| race_white | 0.002866783 | 1.000000000 | -0.034935030 |
| delinquency_agg | -0.009501075 | -0.034935030 | 1.000000000 |

Let's create a standardized version of the delinquency score to make things easier.

In [3]:

```
d$delinquency_s <- scale(d$delinquency_agg)
summary(d$delinquency_s)
# This command is only necessary for reshaping the plot
# to make it visible on the projector in class
options(repr.plot.width=12, repr.plot.height=8)
hist(d$delinquency_s)
```
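For clarity, `scale()` with its default arguments just centers a variable at its mean and divides by its sample standard deviation. A minimal sketch of that equivalence, using made-up numbers rather than the delinquency data:

```r
# scale(x) with defaults is (x - mean(x)) / sd(x), returned as a one-column matrix
x <- c(0, 1, 1, 2, 5, 7)                         # hypothetical scores
by_hand <- (x - mean(x)) / sd(x)
stopifnot(all.equal(as.numeric(scale(x)), by_hand))
```

One side effect worth knowing: `scale()` returns a one-column matrix, which is why `as.numeric()` (or `drop()`) is needed before comparing the result to a plain vector.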

Our model looks like this, where $C_i$ is an indicator for having ever smoked, and $D_i$ is the standardized delinquency score:

$$C_i \sim \mathrm{Bernoulli}(p_i)\\ \mathrm{logit}(p_i) = \alpha + \beta D_i\\ \alpha \sim \mathrm{Norm}(0,\mbox{??})\\ \beta \sim \mathrm{Norm}(0,\mbox{??})$$

How should we decide what to use for the "??" standard deviations in the priors? Prior predictive simulation is one good tool for making this kind of decision. It involves drawing random samples from our priors and seeing what kinds of outcomes our model would then predict *if we didn't update those priors with our data*.

Let's start by picking two relatively "flat" priors:

$$\alpha \sim \mathrm{Norm}(0,5)\\ \beta \sim \mathrm{Norm}(0,5)$$

In [4]:

```
# load rethinking package and set some options to make things faster later on
library(rethinking)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
```

In [5]:

```
# how many simulations to make
n_sim <- 1e4 # 10,000
# what is the reasonable range of our predictor variable?
# (we'll just use its actual range)
d_range <- range(d$delinquency_s,na.rm=TRUE)
# make a grid of hypothetical values for delinquency
d_grid <- seq(from=d_range[1], to=d_range[2],length.out=400)
# take n_sim samples from the prior for alpha and beta
a_sim <- rnorm(n_sim,mean=0,sd=5.0)
b_sim <- rnorm(n_sim,0,5.0)
# For each point on d_grid, calculate the value of p for
# each of the sampled parameter values
p_sim <- sapply(
  1:n_sim,
  function(i){
    # inv_logit() is exactly the same as logistic()
    inv_logit(a_sim[i] + d_grid*b_sim[i])
  }
)
dim(p_sim)
```

- 400
- 10000

We have simulated 10,000 relationships between delinquency and smoking prevalence that our priors think are most reasonable, and calculated each of those relationships over 400 hypothetical values of `delinquency_s`. (Note: in our case `delinquency_s` only takes 16 distinct values, but we are treating it as a continuous variable for the sake of the model.)

We can use this to look at the distribution of smoking probabilities for various meaningful values of `delinquency_s` by looking at individual rows of `p_sim`.
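The comment in the code above notes that `inv_logit()` is exactly the same as `logistic()`: both are just the inverse of the logit link, mapping any real-valued logit to a probability in $(0,1)$. A quick sketch of that function written by hand, independent of the rethinking package:

```r
# the inverse-logit (logistic) function by hand
inv_logit_manual <- function(x) 1 / (1 + exp(-x))

# a logit of 0 corresponds to a probability of exactly 0.5
stopifnot(all.equal(inv_logit_manual(0), 0.5))

# large positive/negative logits saturate toward 1 and 0
round(inv_logit_manual(c(-5, 0, 5)), 3)   # 0.007 0.500 0.993
```

This saturation is exactly why wide priors on the logit scale can behave badly: values like $\pm 5$ already map to near-certain probabilities.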

In [6]:

```
# look at the predicted p for the mean value of delinquency
# (delinquency_s=0)
# Our grid doesn't contain a value for exactly zero.
# This is a trick to find the point in our grid that is closest to zero:
# take the absolute value of d_grid, and then find the location
# of the minimum value in that vector
min_d_grid <- which.min(abs(d_grid))
d_grid[min_d_grid]
# grab just that row and plot it
s <- p_sim[min_d_grid,]
plot(density(s,from=0,to=1),main=sprintf("Prior predictive density at D=%.2f",d_grid[min_d_grid]))
```

0.00225689726488698
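As an aside, the `which.min(abs(...))` trick in the cell above generalizes: to find the grid point nearest *any* target value, use `which.min(abs(grid - target))`. A small self-contained illustration with a made-up grid (not `d_grid`):

```r
# index of the grid point closest to an arbitrary target value
nearest_idx <- function(grid, target) which.min(abs(grid - target))

g <- seq(-2, 2, length.out = 5)   # -2 -1 0 1 2
stopifnot(nearest_idx(g, 0)    == 3)
stopifnot(nearest_idx(g, 0.9)  == 4)
stopifnot(nearest_idx(g, -1.8) == 1)
```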

That doesn't look so good. Let's see what it predicts at the minimum and maximum values of delinquency.

In [7]:

```
# minimum delinquency is just the first row of p_sim
s <- p_sim[1,]
plot(density(s,from=0,to=1),main=sprintf("Prior predictive density at D=%.2f",d_grid[1]))
# maximum delinquency is the 400th row of p_sim
s <- p_sim[400,]
plot(density(s,from=0,to=1),main=sprintf("Prior predictive density at D=%.2f",d_grid[400]))
```
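The U-shaped prior predictive densities are a symptom of priors that are too wide on the logit scale: most draws of $\alpha$ land far from zero, and the inverse logit pushes them to probabilities near 0 or 1. To see this numerically, we can compare the share of prior mass for $p$ that falls at the extremes under $\mathrm{Norm}(0,5)$ versus a tighter alternative. The sd of 1.5 below is just an illustrative choice, not a value from the analysis above:

```r
# share of prior mass for p outside (0.05, 0.95) under two prior sds
set.seed(10)
inv_logit_manual <- function(x) 1 / (1 + exp(-x))
p_wide  <- inv_logit_manual(rnorm(1e5, mean = 0, sd = 5))    # alpha ~ Norm(0, 5)
p_tight <- inv_logit_manual(rnorm(1e5, mean = 0, sd = 1.5))  # alpha ~ Norm(0, 1.5), hypothetical
extreme <- function(p) mean(p < 0.05 | p > 0.95)
c(wide = extreme(p_wide), tight = extreme(p_tight))
```

With sd = 5, well over half of the prior mass for $p$ lands in the extremes, while the tighter prior leaves only a few percent there, which matches the intuition from the density plots above.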