---
title: "SOCI 620: Worksheet 1"
author: "your name here"
due: 2023-01-19
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
In this worksheet, you will be working with some simulated data about the prevalence of self-reported [long COVID](https://en.wikipedia.org/wiki/Long_COVID) (defined here as experiencing COVID symptoms at least four weeks after initially contracting the disease) among adults in the United Kingdom. The data is _artificial_, but was created using actual estimates from the dataset ["Prevalence of ongoing symptoms following coronavirus (COVID-19) infection in the UK"](https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/datasets/alldatarelatingtoprevalenceofongoingsymptomsfollowingcoronaviruscovid19infectionintheuk).
The simulated data contains information on the age of hypothetical respondents, whether they self-report experiencing symptoms of long COVID ("Would you describe yourself as having 'long COVID', that is, you are still experiencing symptoms more than 4 weeks after you first had COVID-19, that are not explained by something else?"), and whether long COVID "reduce[s] your ability to carry-out day-to-day activities compared with the time before you had COVID-19?".
The simulated data is available online in CSV format at .
# Part 1: Explore the data
**Question 1.1:** Load the data directly from the URL provided above into a data frame named `longcovid`. (Replace the "..." below)
```{r q1.1}
longcovid <- ...
```
**Question 1.2:** Take a look at this data. How many rows and columns does it have? What do the rows and columns represent?
```{r q1.2}
# (your code here)
```
(your response here)
**Question 1.3:** What is the overall proportion of the sample that reports having symptoms of long COVID? Store this value in the variable named `sample_prop_lc`.
```{r q1.3}
sample_prop_lc <- ...
```
# Part 2: Generate a prior
You want to use this data to learn about the _probability_ that a person in the UK has symptoms of long COVID. This value is _unobserved_, meaning you do not know what it is. But you will need some way to talk about this unobserved probability, so you can refer to it as $p$.
Like a good Bayesian, you know that if you are going to learn anything about $p$, you will first need to determine a _prior_ probability distribution over the possible values of $p$.
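To make the idea of a prior over $p$ concrete, here is a minimal sketch on a deliberately coarse five-point grid. The names `toy_grid`, `toy_weights`, and `toy_prior`, and the weights themselves, are hypothetical and chosen only for illustration; they are not the grid or the priors you are asked to build below.

```{r example-toy-prior}
# Hypothetical toy example: a prior over five candidate values of p.
toy_grid <- seq(0, 1, length.out = 5)        # 0.00, 0.25, 0.50, 0.75, 1.00
toy_weights <- c(1, 3, 4, 3, 1)              # made-up relative plausibilities
toy_prior <- toy_weights / sum(toy_weights)  # normalize so the values sum to 1

# Ratio of the prior at p = 0.50 to the prior at p = 0.00.
# Normalizing does not change this ratio; it is the same as 4 / 1.
toy_prior[toy_grid == 0.5] / toy_prior[toy_grid == 0]
```

The normalization step changes the scale of the weights but not their ratios, which is why the relative values carry the information in a prior.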
**Question 2.1:** One way to quantify the relative probabilities of different values of $p$ is to define a discrete "grid" of values to approximate a continuous distribution, like we did in the lab in class. The code chunk below defines the size of this grid for you (`n_grid <- 100001`). Complete the code to create a variable named `grid` with `n_grid` evenly spaced values between 0.0 and 1.0.
```{r q2.1}
n_grid <- 100001
grid <- ...
```
**Question 2.2:** A common (but often not recommended) prior distribution for a bounded value like $p$ is the uniform distribution: $p \sim \mathrm{Unif}(0,1)$. Use the `dunif()` function to get the probability density of the uniform distribution at each value of the grid you just defined, saving the values to a variable named `prior_unif`.
```{r q2.2}
```
**Question 2.3:** The important information that a prior distribution holds is the _relative_ probability of different parameter values. In the code chunk below, calculate the ratio of the prior probability of $p=0$ to the prior probability of $p=0.5$: $Pr(p=0)/Pr(p=0.5)$. Do the same for $Pr(p=0)/Pr(p=1)$. (Note that this only works because each of these values is defined in the grid.) What do your answers indicate about the uniform prior?
```{r q2.3}
uniform_ratio_0.0_0.5 <- ...
uniform_ratio_0.0_1.0 <- ...
```
**Question 2.4:** Just to drive the point home, plot the density of the uniform prior. The horizontal axis should be the values of your `grid`, and the vertical axis should be the values of `prior_unif`. The vertical axis should include the value 0. (You will probably want to use the `type='l'` argument to `plot` to create a line plot rather than a scatter plot.)
```{r q2.4}
```
**Question 2.5:** Another common prior distribution for bounded parameters like $p$ is the [beta distribution](https://en.wikipedia.org/wiki/Beta_distribution). This distribution has two hyperparameters, named `shape1` and `shape2` in R, that define its shape. To get a sense for this, use the `dbeta()` function to calculate the values of a beta distribution with parameters `shape1 = 1.5` and `shape2 = 2.0` at each value of your `grid`. Store these values in a vector named `prior_beta`.
```{r q2.5}
```
**Question 2.6:** Using this new prior, report the ratios $Pr(p=0.1)/Pr(p=0.5)$ and $Pr(p=0.1)/Pr(p=0.9)$, and assign them to the variables in the code block below. Then describe what you can infer about this distribution from the ratios.
```{r q2.6}
beta_ratio_0.1_0.5 <- ...
beta_ratio_0.1_0.9 <- ...
```
**Question 2.7:** Plot this beta prior density as you did for the uniform prior.
```{r q2.7}
```
**Question 2.8:** In the R console or in a separate script (not in this worksheet), play around with the beta distribution, testing different values of `shape1` and `shape2` and plotting the result on `grid`. Come up with two prior distributions for $p$: one that aims to be "unopinionated" (does not express strong _a priori_ opinions about $p$), and another that is "opinionated" (expresses a strong opinion about what values $p$ will take). Assign the vectors of probability densities to variables named `prior_unopinionated` and `prior_opinionated`. (As a starting point, note that so-called 'non-informative' priors are often specified with shape parameters barely greater than 1.0.)
```{r q2.8}
```
**Question 2.9:** Use your grid approximations of the two priors (opinionated and unopinionated) to draw a random sample (12,000 draws each) from each of the priors. Assign these to variables named `s_prior_unopinionated` and `s_prior_opinionated`.
```{r q2.9}
```
**Question 2.10:** Use the samples from the previous question to calculate the prior probability that $p < 0.05$ for each prior. Save these probabilities to variables named `prior5_unopinionated` and `prior5_opinionated`. Describe in words what these values mean.
```{r q2.10}
```
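The general technique of sampling from a grid approximation can be sketched on a toy example. Everything here is hypothetical and for illustration only: the 101-point grid, the Beta(2, 2) density, the 1,000 draws, the 0.5 threshold, and the `toy_` variable names are not the values the questions above ask for.

```{r example-toy-sampling}
set.seed(620)  # arbitrary seed so the toy draws are reproducible

# Hypothetical coarse grid and density (not the worksheet's grid or priors).
toy_grid <- seq(0, 1, length.out = 101)
toy_density <- dbeta(toy_grid, shape1 = 2, shape2 = 2)

# Draw grid values with probability proportional to the density.
# sample() rescales the `prob` weights itself, so they need not sum to 1.
toy_draws <- sample(toy_grid, size = 1000, replace = TRUE, prob = toy_density)

# The share of draws below a threshold approximates the probability
# that the parameter falls below that threshold.
mean(toy_draws < 0.5)
```

Because the draws are random, the estimated proportion will vary slightly with the seed and becomes more stable as the number of draws grows.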
# Part 3: Likelihood
Now that you have established a prior (or two), it is time to define your model likelihood. Remember that the likelihood describes the probability of your observed data (prevalence of long COVID in your sample) for any particular value of your parameters ($p$, or the marginal probability of having long COVID for any single person in the UK).
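As a small illustration of "the probability of the data for a particular parameter value", the sketch below uses a made-up dataset of 3 successes in 10 trials (not the long COVID data) and evaluates the binomial probability of that outcome at three arbitrarily chosen candidate values of $p$. All `toy_` names are hypothetical.

```{r example-toy-likelihood}
# Hypothetical toy data: 3 "successes" out of 10 trials.
toy_successes <- 3
toy_trials <- 10

# Candidate parameter values to compare (chosen arbitrarily).
toy_p <- c(0.1, 0.3, 0.8)

# Probability of observing exactly 3 successes in 10 trials
# under each candidate value of p.
toy_lik <- dbinom(toy_successes, size = toy_trials, prob = toy_p)
toy_lik
```

Of these three candidates, the data are most probable under $p = 0.3$, the one that matches the observed proportion.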
**Question 3.1:** What is the _observed_ data in this dataset? What probability distribution can be used to describe the data based on $p$?
(your response here)
**Question 3.2:** Calculate the number of respondents who reported having symptoms of long COVID, and assign the result to the variable `num_lc`. Assign the total sample size to the variable `ssize`.
```{r q3.2}
```
**Question 3.3:** Now use the `dbinom` function to calculate the _likelihood_ of the data for each value of $p$ in `grid`. Store this vector in the variable `likelihood`.
```{r q3.3}
```
**Question 3.4:** Now plot this likelihood for each value of `grid`. Describe what you see. What does the vertical axis (the likelihood) represent here? How does the _likelihood_ plotted here differ from the _probability densities_ you plotted in Part 2?
```{r q3.4}
```
# Part 4: Building a posterior
Recall that your original goal was to learn about the "true" value of $p$ using this data. In Bayesian statistics, this means you want to describe the _posterior distribution_ of $p$, conditional on the data $D$. With a prior and a likelihood, it is trivial to calculate values that are proportional to the posterior distribution you are after. Bayes' rule tells us that the posterior distribution $Pr(p | D)$ is proportional to the product of the likelihood $Pr(D | p)$ and the prior distribution $Pr(p)$:
$$
Pr(p | D) \propto Pr(D | p) \times Pr(p)
$$
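The proportionality above can be sketched numerically on a toy problem. The example assumes made-up data (6 successes in 9 trials), a hypothetical Beta(2, 2) prior, and a coarse 101-point grid; none of these are the worksheet's data, priors, or grid, and the `toy_` names are illustrative only.

```{r example-toy-posterior}
# Hypothetical coarse grid of candidate values for p.
toy_grid <- seq(0, 1, length.out = 101)

# Hypothetical prior density and the binomial likelihood of the made-up data.
toy_prior <- dbeta(toy_grid, shape1 = 2, shape2 = 2)
toy_lik <- dbinom(6, size = 9, prob = toy_grid)

# Likelihood times prior is proportional to the posterior.
toy_unnormalized <- toy_lik * toy_prior

# Dividing by the total turns the products into probabilities over the grid.
toy_post <- toy_unnormalized / sum(toy_unnormalized)

# Grid value with the highest posterior probability.
toy_grid[which.max(toy_post)]
```

Normalizing by the sum is optional when the posterior is only used as sampling weights, but it makes the grid values directly interpretable as probabilities.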
**Question 4.1:** Calculate two versions of the posterior distribution: one that uses the "unopinionated" prior from Part 2, and the other that uses the "opinionated" prior. Assign these to variables named `post_unopinionated` and `post_opinionated`.
```{r q4.1}
```
**Question 4.2:** Plot these two posteriors. How much do they differ visually? Does it look like the opinionated prior had much of an effect?
```{r q4.2}
```
**Question 4.3:** Now draw random samples of size 12,000 from each of the (approximate) posterior distributions. Assign these to variables named `s_post_unopinionated` and `s_post_opinionated`.
```{r q4.3}
```
**Question 4.4:** Use these two samples to describe the posterior distributions based on the unopinionated and opinionated priors numerically. Use at least one measure of "center" (e.g. mean, median) and at least one measure of "spread" (e.g. credible interval, HPDI). What do these tell you about the (marginal) probability of having long COVID in the UK?
```{r q4.4}
```