---
title: "Assignment 1, SOCI 620, Winter 2022"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
_Due Tues, Jan 25_

In response to recent reports that the [COVID-19 test positivity rates in Alberta are well over 30%](https://www.cbc.ca/news/canada/calgary/alberta-covid-coronavirus-january-17-1.6317482), we will imagine that the Explorers and Producers Association of Canada (EPAC) publishes a press release claiming otherwise. In this imagined press release, they claim to have surveyed a number of patients who received the test, and performed a sophisticated Bayesian analysis that shows with 95% confidence that the positivity rate is _below_ 30%. They note that they included prior information from a panel of experts to help bolster their analysis, and they include the following depiction of the posterior probability density of positivity in their publication:

![](https://soci620.netlify.app/problem_sets/img/positivity.png)

In light of the study's conflict of interest (the oil industry does not want to shut down workplaces), you decide to investigate. The study gives links to the raw survey data, and you are able to extract data representing a grid approximation of the posterior shown in the figure, but there is no information about what prior they used other than a statement that it was based on "expert input." You believe you can use the information they did provide to reconstruct the prior.

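
The reconstruction rests on Bayes' theorem: on a grid of candidate positivity rates $\theta$, the published posterior satisfies

$$p(\theta \mid y) \propto p(\theta)\, p(y \mid \theta),$$

where $p(\theta)$ is the prior and $p(y \mid \theta)$ is the likelihood of the survey data $y$. Dividing the posterior by the likelihood therefore leaves something proportional to the prior.
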
## 1. Load the data
The raw survey data is at https://soci620.netlify.com/data/ps1/test_data.csv, and the 'reconstructed' posterior is at https://soci620.netlify.com/data/ps1/positivity.csv. Use the `read.csv()` command to load these, storing them in variables `surv_data` and `post_grid`, respectively.

```{r load}
surv_data <- read.csv('https://soci620.netlify.com/data/ps1/test_data.csv')
post_grid <- read.csv('https://soci620.netlify.com/data/ps1/positivity.csv')
```

## 2. Inspect the data
Inspect the data in `surv_data` and calculate the sample size and absolute number of respondents who tested positive. Report both of these numbers. Does the claim made in the study seem reasonable in light of the survey results?

```{r inspect}
summary(surv_data)
# sample size is the number of rows
ssize <- nrow(surv_data)
# how many positive?
numpos <- sum(surv_data$pos_test)
```
The sample has just `r ssize` respondents, and of these `r numpos` had a positive test. This means that `r round(100*numpos/ssize,1)`% of the surveyed respondents tested positive, making the study's claims very suspect.

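
As a quick cross-check that does not depend on any prior, an exact binomial confidence interval can be computed from the same two counts. The counts below are hypothetical stand-ins so the snippet runs on its own; substitute `ssize` and `numpos` from the chunk above.

```{r binom_check}
# Hypothetical counts for illustration only; replace with
# numpos and ssize computed from surv_data above.
k <- 20; n <- 50
bt <- binom.test(k, n)  # exact (Clopper-Pearson) test
bt$conf.int             # 95% CI for the positivity rate
```

If an interval like this sits mostly above 0.3, the press release's claim is hard to square with its own data.
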
## 3. Likelihood
Use the survey results calculated in the last step to construct a likelihood on the same grid provided in `post_grid` (you will probably want to use the `dbinom()` function). Plot the likelihood. Comparing the raw likelihood to the posterior published in the study, what do you suspect about the prior they might have used?

```{r likelihood}
grid <- post_grid$positivity
likelihood <- dbinom(numpos,ssize,grid)
plot(grid,likelihood, type='l', main="Likelihood", xlab='positivity')
```
The raw likelihood has a center of mass somewhere around 0.3, while the posterior shown in the study is centered around 0.2. This suggests that the study used a prior that put significant weight on lower values of the positivity rate.

## 4. Reconstruct the prior
Use the likelihood you just calculated and the posterior provided in `post_grid` to calculate values that are proportional to the prior used in the study. (Hint: if $A\propto B\times C$, then $C\propto A/B$.) Plot this prior. Does this look like a "reasonable" prior to you? Why or why not?

```{r calc_prior}
# divide out the likelihood (valid where the likelihood is positive;
# grid points with zero likelihood would give undefined ratios)
calc_prior <- post_grid$posterior_prob / likelihood
plot(grid,calc_prior, type='l', main="Study prior", xlab='positivity')
```
The prior used in the study is extremely informative, putting almost all of the probability mass on values below 0.1. This amounts to assuming a very low positivity rate before undertaking the study, which seems quite unreasonable.

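
As a sanity check on the division trick itself, it can be verified on a conjugate Beta-binomial example where the prior is known exactly. All of the numbers below are made up for illustration.

```{r prior_check}
# With a Beta(a, b) prior and k successes in n trials, the posterior
# is Beta(a + k, b + n - k); dividing that posterior by the binomial
# likelihood should recover the shape of the prior.
toy_grid <- seq(0.001, 0.999, length.out = 999)
a <- 2; b <- 8; k <- 20; n <- 50
post <- dbeta(toy_grid, a + k, b + n - k)
lik  <- dbinom(k, n, toy_grid)
rec  <- post / lik      # proportional to the prior
rec  <- rec / sum(rec)  # normalize on the grid
tru  <- dbeta(toy_grid, a, b)
tru  <- tru / sum(tru)
max(abs(rec - tru))     # essentially zero
```
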
## 5. Building a better posterior
Use a uniform prior over the proportion of people testing positive to construct your own approximate (grid-normalized) posterior, and take 10,000 random samples from it. What is the approximate mean of this new posterior? What is the posterior probability that more than 30% of the population tested positive? What do you conclude about the true positivity rate? Can you put some bounds on what you think the real positivity rate is, given the data?

```{r new_posterior}
# make the prior (this will be all 1s)
prior <- dunif(grid,0,1)
# calculate the posterior
# (since the prior is uniform, the posterior will
# end up equal to the likelihood, but we can
# calculate it anyway...)
posterior <- likelihood*prior
# normalize the posterior to turn it into
# a valid discrete distribution
posterior <- posterior / sum(posterior)
# sample (set a seed so the results are reproducible when knitting)
set.seed(620)
psamp <- sample(grid, 10000, replace = TRUE, prob = posterior)
# use the sample to describe the distribution
mean(psamp)
mean(psamp>.3)
ci90 <- quantile(psamp,c(.05,.95))
```
The posterior mean is approximately `r round(mean(psamp),3)`, and the posterior probability that the positivity rate is greater than 30% is about `r round(mean(psamp>.3),3)`. This provides weak evidence that the positivity rate is greater than 30%, but we cannot rule out that it might be less than 30% using this data alone. We can make a good case (with 90% posterior probability) that the positivity rate is somewhere between `r round(ci90[1]*100,1)`% and `r round(ci90[2]*100,1)`%.
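
Because the prior here is uniform, the grid-and-sample answers can also be checked analytically: a Uniform(0, 1) prior with a binomial likelihood gives an exact Beta$(k + 1,\, n - k + 1)$ posterior. Hypothetical counts are used below so the snippet is self-contained; substitute `numpos` and `ssize` to check the actual results above.

```{r exact_check}
# Exact posterior under a uniform prior: Beta(k + 1, n - k + 1).
# Hypothetical counts; replace with numpos and ssize from above.
k <- 20; n <- 50
p_gt30 <- 1 - pbeta(0.3, k + 1, n - k + 1)            # Pr(rate > 0.30)
ci90_exact <- qbeta(c(0.05, 0.95), k + 1, n - k + 1)  # exact 90% interval
p_gt30
ci90_exact
```

These should closely match the sample-based `mean(psamp > .3)` and `quantile(psamp, c(.05, .95))` computed above, up to Monte Carlo and grid-approximation error.
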