---
title: "SOCI 620: Lab 8"
author: "Peter McMahan"
date: "2/1/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```
# Estimating population probabilities
In this lab, we will learn to use a simple logistic regression with no covariates to estimate the base probability of cocaine use in a population of students. The data come from Add Health, a well-known longitudinal study of adolescent health in the United States from the 1990s.
```{r get_addhealth_data}
d <- read.csv("https://soci620.netlify.app/data/delinquency.csv")
head(d)
```
The model is relatively straightforward:
$$
\begin{aligned}
C_i &\sim \mathrm{Bernoulli}(p)\\
\mathrm{logit}(p) &= \alpha\\
\alpha &\sim \mathrm{Norm}(0, 1.5)
\end{aligned}
$$
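Before estimating anything, it can help to see what the link function does. Below is a standalone sketch in base R: `logit()` and `inv_logit()` are written out by hand (rather than taken from `rethinking`), and the prior on $\alpha$ is pushed through the inverse link to see what it implies on the probability scale.

```r
# the logit link and its inverse, written out directly
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

# the inverse undoes the link
inv_logit(logit(0.25))  # 0.25

# what does the Normal(0, 1.5) prior on alpha imply about p?
set.seed(1)
prior_p <- inv_logit(rnorm(1e4, mean = 0, sd = 1.5))
range(prior_p)  # spans most of (0, 1)
```

This is a useful sanity check on priors for logistic models: a prior that looks flat on the logit scale can be far from flat on the probability scale.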
We use the `logit()` function as a link, and the `ulam()` function to estimate the model. The syntax for `ulam()` is almost identical to `quap()`, but it can estimate a much larger class of models. Note that we specify the Bernoulli distribution with `dbinom` in the model below. The Bernoulli distribution is a special case of the binomial distribution with a single trial, so many formulations of logistic regression use the more general binomial distribution. The two are mathematically identical.
(This function will take a little time to run. It is writing and compiling a C++ program to estimate our model for us.)
```{r m1}
library(rethinking)
# define a model
m1 <- alist(
ever_cocaine ~ dbinom(1,p),
logit(p) <- a,
a ~ dnorm(0,1.5)
)
# make a data frame with only the data for this model
d1 <- d[c('ever_cocaine')]
d1 <- na.omit(d1)
# estimate the posterior with `ulam`
f1 <- ulam(m1,data=d1,chains = 4, cores=4,iter = 10000)
precis(f1)
```
Use the `inv_logit()` function from the `rethinking` package to see what the estimate and credible interval mean in terms of probability. (Note that `logistic()` is the same as `inv_logit()`.)
```{r m1_to_percent}
# expected percentage
inv_logit(coef(f1)[1])
# about 3.4%
# can also use `logistic`
logistic(coef(f1)[1])
```
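If you want the same transformation without loading `rethinking`, base R's `plogis()` and `qlogis()` are the inverse-logit and logit functions (the logistic CDF and quantile function). The value `-3.35` below is an illustrative coefficient, not the actual fitted estimate:

```r
# plogis() is the logistic CDF, i.e. the inverse-logit transformation
plogis(0)      # 0.5
qlogis(0.5)    # 0
plogis(-3.35)  # roughly 0.034, in the neighborhood of the estimate above
```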
```{r m1_density}
# extract posterior samples
s <- extract.samples(f1)
summary(s$a)
# convert to percentage scale:
post_p <- inv_logit(s$a)
# plot the density of percentages from a sample
plot(density(post_p)) # posterior probability
```
## Reframing as a binomial model
In the previous model, each case was treated as a single draw from a binomial distribution with a sample size of 1. But at the beginning of the term, we had a model of unemployment that used aggregated data. The advantage of the framing above is that it will let us specify a different base probability for each individual, depending on their covariates. But for the sake of completeness, let's see what this model would look like using that binomial framing.
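The two framings yield the same posterior because their likelihoods differ only by a constant factor, $\binom{n}{k}$, which does not involve $p$. A quick check with made-up 0/1 data (the vector `x` below is hypothetical, not the Add Health outcome):

```r
x <- c(1, 0, 0, 1, 0, 0, 0, 1)   # hypothetical binary outcomes
n <- length(x); k <- sum(x)

# log-likelihood under the Bernoulli (one row per case) framing
bern_ll  <- function(p) sum(dbinom(x, size = 1, prob = p, log = TRUE))
# log-likelihood under the aggregated binomial framing
binom_ll <- function(p) dbinom(k, size = n, prob = p, log = TRUE)

# the difference is log(choose(n, k)) no matter what p is,
# so both likelihoods imply the same posterior over p
binom_ll(0.2) - bern_ll(0.2)
binom_ll(0.7) - bern_ll(0.7)
log(choose(n, k))
```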
First we'll create 'aggregated' data with one row listing the number of students and the number of students who have ever used cocaine:
```{r ag_data}
# get just non-missing values of `ever_cocaine`
ec <- d$ever_cocaine[!is.na(d$ever_cocaine)]
# get number of "trials" and the number of "successes" for the binomial model
n <- length(ec)
n_true <- sum(ec)
# put it in a data frame for the estimation
d_aggregate <- data.frame(n=n,n_true=n_true)
d_aggregate
```
Then it is a simple matter to specify and estimate the model. Note that this prior has slightly different implications than the one used above.
```{r m2}
# define the model
m2 <- alist(
n_true ~ dbinom(n,p),
p ~ dunif(0,1)
)
# estimate using ulam()
f2 <- ulam(m2,data=d_aggregate,cores=4,chains=4)
precis(f2,digits=4)
```
Compare the estimates from the logistic regression with the estimates from this binomial model.
```{r f1_f2_compare}
# regression mean
inv_logit(coef(f1)[1])
# binomial model mean
coef(f2)[1]
```
```{r f1_f2_compare2}
# regression PI
PI(inv_logit(extract.samples(f1)$a),prob=.9)
# binomial model PI
PI(extract.samples(f2)$p,prob=.9)
```
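The `PI()` helper from `rethinking` reports a central percentile interval, which can also be computed with base R's `quantile()`. A sketch using simulated normal draws as a stand-in for posterior samples (the draws here are made up, not the actual posterior):

```r
set.seed(42)
draws <- rnorm(1e4)   # stand-in for posterior samples

# a 90% percentile interval leaves 5% of the mass in each tail
quantile(draws, probs = c(0.05, 0.95))
```

Seeing the interval built by hand makes clear why it is called a *percentile* interval: it is defined by equal tail probabilities, not by being the narrowest interval containing 90% of the mass.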