Logistic regression: prior predictive plots

We'll start by specifying a model that predicts whether or not a student has ever tried smoking cigarettes, using an aggregate score of "delinquent or undesirable behaviors" over the previous 12 months.

Start by loading the data:

Let's create a standardized version of the delinquency score to make things easier.
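As a minimal sketch of what standardizing means here, the snippet below centers a score on its mean and divides by its sample standard deviation. The `delinquency` array holds made-up placeholder values, not the real data, and the use of NumPy is an assumption.

```python
import numpy as np

# Hypothetical raw scores for illustration only; the real values come from the dataset.
delinquency = np.array([0, 1, 1, 2, 3, 5, 8, 15], dtype=float)

# Center on the mean, then scale by the sample standard deviation (ddof=1).
delinquency_s = (delinquency - delinquency.mean()) / delinquency.std(ddof=1)
```

After this step a value of 0 corresponds to the average score, and a value of 1 to a score one standard deviation above average, which makes priors on $\alpha$ and $\beta$ easier to reason about.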

Our model looks like this, where $C_i$ is an indicator for having smoked, and $D_i$ is the standardized delinquency score.

$$C_i \sim \mathrm{Bernoulli}(p_i)\\ \mathrm{logit}(p_i) = \alpha + \beta D_i\\ \\ \alpha \sim \mathrm{Norm}(0,\mbox{??})\\ \beta \sim \mathrm{Norm}(0,\mbox{??}) $$

How should we decide what to use for the "??" standard deviations in the priors? Prior predictive simulation is a good tool for making this kind of decision. It involves drawing random samples from our priors and seeing what outcomes the model would predict if we never updated those priors with data.

Let's start by picking two relatively "flat" priors:

$$\alpha \sim \mathrm{Norm}(0,5)\\ \beta \sim \mathrm{Norm}(0,5)$$

We have simulated 10,000 relationships between delinquency and smoking prevalence that our priors consider plausible, and evaluated each of those relationships at 400 hypothetical values of delinquency_s. (Note that in our data delinquency_s takes only 16 distinct values, but we treat it as a continuous variable for the sake of the model.)
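A simulation of this kind could be sketched in NumPy roughly as follows. This is an illustration under stated assumptions, not necessarily the code used for the results here: the grid endpoints `-2.0` and `4.0` are placeholders for the observed range of delinquency_s, and the names `d_grid`, `alpha`, and `beta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_draws = 10_000  # number of (alpha, beta) pairs drawn from the priors
n_grid = 400      # number of hypothetical delinquency_s values

# Assumed grid; in practice use the observed min and max of delinquency_s.
d_grid = np.linspace(-2.0, 4.0, n_grid)

# Draw intercepts and slopes from the Norm(0, 5) priors.
alpha = rng.normal(0.0, 5.0, size=n_draws)
beta = rng.normal(0.0, 5.0, size=n_draws)

# logit(p) = alpha + beta * D, broadcast to shape (n_grid, n_draws).
logit_p = alpha[None, :] + d_grid[:, None] * beta[None, :]

# Inverse logit maps the linear predictor back to a probability.
p_sim = 1.0 / (1.0 + np.exp(-logit_p))
```

With this layout, each row of `p_sim` holds the 10,000 simulated smoking probabilities at one value of delinquency_s, and each column traces one simulated relationship across the grid.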

We can use this to look at the distribution of smoking probabilities for various meaningful values of delinquency_s by looking at individual rows of p_sim.

That doesn't look so good. Let's see what it predicts at the minimum and maximum values of delinquency.
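One way to check the extremes numerically is sketched below. It is self-contained and rests on assumptions: `d_min` and `d_max` are placeholder endpoints rather than the real minimum and maximum of delinquency_s, and the helper `frac_extreme` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_draws = 10_000

# Hypothetical endpoints; replace with delinquency_s.min() and .max() from the data.
d_min, d_max = -2.0, 4.0

# Draw from the Norm(0, 5) priors on the intercept and slope.
alpha = rng.normal(0.0, 5.0, size=n_draws)
beta = rng.normal(0.0, 5.0, size=n_draws)


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-x))


# Prior predictive smoking probabilities at the two endpoints.
p_at_min = inv_logit(alpha + beta * d_min)
p_at_max = inv_logit(alpha + beta * d_max)


def frac_extreme(p, eps=0.01):
    """Share of draws with probability below eps or above 1 - eps."""
    return np.mean((p < eps) | (p > 1.0 - eps))


print("share of extreme draws at min:", frac_extreme(p_at_min))
print("share of extreme draws at max:", frac_extreme(p_at_max))
```

With endpoints this far from zero, most of the prior mass should land very close to 0 or 1, meaning the priors imply that nearly all or nearly none of the students at those delinquency levels have tried smoking.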