---
title: "Problem Set 0"
author: "Your Name"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The aim of this problem set is to give students who are less familiar with R and RMarkdown some practice working with simple data structures.
For a very good introduction to R, I recommend Norm Matloff's [fasteR](https://github.com/matloff/fasteR) tutorial. If you can get through part 1 of this then you will be more than prepared for this class.
For RMarkdown, there are a number of good resources online. [RStudio provides a step-by-step tutorial](https://rmarkdown.rstudio.com/lesson-1.html) that goes into much more detail than you will need for this class. It also provides a handy [2-page 'cheat sheet'](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf) and a slightly less dense [reference guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf).
Note: *You will be turning in all of your problem sets as 'raw' RMarkdown files like this. You should be able to press the 'knit' button at the top of this panel and see a nicely formatted version of this document at any point.*
# COVID data
> **Content warning**: The content below will examine deaths from COVID-19.
This problem set will guide you through a simple descriptive analysis of the history of COVID cases and deaths in Québec. We'll be using data from Données Québec, which I've made available on the course website at <https://soci620.netlify.app/data/COVID19_Qc.csv>.
## Q1. Load the data
The first step will be to load the data. You *could* download it to your computer and load it from there, but one nice feature of R is that you can easily load data from the web directly into R.
### Q1.1
In the code block below, use the `read.csv()` function with the URL above to download the data and assign it to a variable named `covid`.
```{r q1_1}
# Your code here
covid <- read.csv("https://soci620.netlify.app/data/COVID19_Qc.csv")
```
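`read.csv()` parses a URL, a local file path, or literal text through the same interface, so you can experiment with it offline. Below is a minimal sketch using a made-up three-row table. The column names `day` and `cases` are hypothetical and are not taken from the course dataset.

```r
# Hypothetical CSV content, standing in for a file on disk or on the web
csv_text <- "day,cases
1,10
2,15
3,12"

# The `text` argument parses a string as if it were the contents of a file
toy <- read.csv(text = csv_text)

toy$cases   # columns are accessed with `$`, just as with `covid`
nrow(toy)   # number of rows that were read
```

Swapping `text = csv_text` for a quoted URL or file path is the only change needed to read real data.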
### Q1.2
When you encounter a new dataset, it is always a good idea to figure out its structure. In the code block below, use functions like `head()`, `nrow()`, and `summary()` to look at the overall contents of the data frame you just created. How many observations (rows) does it contain? How many variables? Can you tell what the variables represent just by their names? (There is a [data description in French at Données Québec](https://www.donneesquebec.ca/recherche/dataset/a2073e4a-9426-4946-95b5-c560d43c216e/resource/95fa42e1-f036-420d-aeb9-1f48937bb21e/download/listevariables_notesmetho_tsp_20201119.pdf).)
```{r q1_2}
# Your code here
nrow(covid)
head(covid)
summary(covid)
```
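A few other base R functions answer the same structural questions more directly. The sketch below runs them on a small toy data frame (hypothetical column names, not the course data) so that the output is easy to read.

```r
# Toy data frame; the column names are hypothetical
toy <- data.frame(day = 1:4, cases = c(10, 15, 12, 20))

dim(toy)     # number of rows and number of columns together
ncol(toy)    # number of variables
names(toy)   # variable names
str(toy)     # type and first few values of each variable
```

The same calls work unchanged on `covid`.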
## Q2. Active cases
The variable `Nb_Cas_Actifs` contains the daily number of active COVID cases between Jan 23, 2020 and Jan 8, 2022.
### Q2.1
What was the lowest reported number of active cases over this time period? The highest? Over this time period, what were the mean and median number of active cases? Calculate these values in the code block below, and then give a plain-language description of the results below that.
```{r q2_1}
# Your code here
min(covid$Nb_Cas_Actifs)
max(covid$Nb_Cas_Actifs)
median(covid$Nb_Cas_Actifs)
mean(covid$Nb_Cas_Actifs)
```
The daily number of active cases ranged from a minimum of `r min(covid$Nb_Cas_Actifs)` to a maximum of `r max(covid$Nb_Cas_Actifs)`. The mean over the time period was `r round(mean(covid$Nb_Cas_Actifs), 1)`, while the median was `r median(covid$Nb_Cas_Actifs)`.
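
These summary functions return `NA` whenever a column contains a missing value. Whether `Nb_Cas_Actifs` has any missing values is not stated here, so treat the following as a precaution rather than a known issue. The sketch uses a hypothetical vector with one missing report.

```r
# Hypothetical daily counts with one missing report
cases <- c(10, 15, NA, 20)

mean(cases)                  # NA: a missing value propagates
mean(cases, na.rm = TRUE)    # mean of the three observed values
range(cases, na.rm = TRUE)   # minimum and maximum in one call
sum(is.na(cases))            # how many values are missing
```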
### Q2.2
Now visualize the number of active cases over time. Since the dataset is already in temporal order (the first row is the earliest day and the last row is the latest day) you can simply `plot()` the `Nb_Cas_Actifs` variable. You will probably want to specify `type='l'` to tell R to do a `l`ine plot instead of a standard scatter plot. What trends do you see?
```{r q2_2}
# Your code here
plot(covid$Nb_Cas_Actifs, type = 'l',
     xlab = 'day number', ylab = 'number of active cases')
```
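Plotting against the row position works because the rows are in temporal order, but the x-axis then shows day numbers rather than calendar dates. If a dataset stores its dates as text, one option is to convert them with `as.Date()` and pass them as the x values. This is a sketch on hypothetical data: the `date` and `cases` columns below are made up, and the name and format of the date column in the course dataset are not assumed here.

```r
# Hypothetical stand-in for real data: four consecutive days
toy <- data.frame(
  date  = c("2020-01-23", "2020-01-24", "2020-01-25", "2020-01-26"),
  cases = c(0, 1, 3, 2)
)

# Convert the text dates so that R treats them as calendar dates
toy$date <- as.Date(toy$date)

# With Date values on the x-axis, R labels the axis with dates
plot(toy$date, toy$cases, type = 'l',
     xlab = 'date', ylab = 'number of active cases')
```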
## Q3. Cases and deaths
Finally, we will take a look at the relationship between daily active cases and deaths attributed to COVID-19. The variable `Nb_Nvx_Deces_Total` contains the total number of new deaths recorded on each day.
### Q3.1
What were the minimum, maximum, and mean numbers of daily deaths over the time period covered in the dataset?
```{r q3_1}
# Your code here
min(covid$Nb_Nvx_Deces_Total)
max(covid$Nb_Nvx_Deces_Total)
mean(covid$Nb_Nvx_Deces_Total)
```
### Q3.2
What proportion of days had zero COVID-attributed deaths?
```{r q3_2}
# Your code here
mean(covid$Nb_Nvx_Deces_Total==0)
```
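Taking the `mean()` of a comparison works because R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic, so the mean of a logical vector is the proportion of `TRUE` values. A small sketch with hypothetical counts makes the steps visible:

```r
# Hypothetical daily death counts
deaths <- c(0, 2, 0, 5, 1)

deaths == 0         # logical vector: TRUE where the count is zero
sum(deaths == 0)    # TRUE counts as 1, so this is the number of zero days
mean(deaths == 0)   # number of zero days divided by the number of days
```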
### Q3.3
Plot the number of new deaths over time (like you did with active cases in Q2.2). What patterns do you see?
```{r q3_3}
# Your code here
plot(covid$Nb_Nvx_Deces_Total, type = 'l',
     xlab = 'day number', ylab = 'number of new deaths')
```
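Daily counts can be jagged, which sometimes hides the longer-run pattern. One optional way to smooth them is a centered moving average, sketched below with `stats::filter()` on hypothetical counts. The 7-day window is an assumption chosen for illustration, not something the problem set asks for.

```r
# Hypothetical noisy daily counts
deaths <- c(2, 0, 5, 1, 4, 3, 6, 2, 7, 5, 8, 4, 9, 6)

# Centered 7-day moving average; the first and last 3 days are NA
# because a full window is not available there
smooth <- stats::filter(deaths, rep(1 / 7, 7), sides = 2)

# Raw series in grey, smoothed series drawn on top
plot(deaths, type = 'l', col = 'grey',
     xlab = 'day number', ylab = 'number of new deaths')
lines(smooth, lwd = 2)
```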
### Q3.5
What is the correlation between daily active cases and new deaths? Is it positive? What does the sign of the number mean (loosely) in plain language?
```{r q3_5}
# Your code here
cor(covid$Nb_Cas_Actifs, covid$Nb_Nvx_Deces_Total)
```
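By default, `cor()` computes the Pearson correlation, which is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this by hand on two short hypothetical vectors.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# Pearson correlation: covariance divided by both standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

r_by_hand
cor(x, y)   # same value as the by-hand calculation
```

Because standard deviations are never negative, the sign of the correlation always comes from the covariance: positive when the two variables tend to be above (or below) their means together.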
### Q3.6
Create a scatter plot comparing daily active cases to new deaths. Using what you saw in the rest of this problem set, what patterns do you see? What potential explanations for those patterns would you propose?
```{r q3_6}
# Your code here
plot(covid$Nb_Cas_Actifs, covid$Nb_Nvx_Deces_Total)
```
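A scatter plot discards the time ordering of the observations. One way to bring it back is to color points by when they occurred. The sketch below does this on hypothetical data, splitting the period into two halves; the split point and the variable names are assumptions made for illustration only.

```r
# Hypothetical data: eight days of paired counts
cases  <- c(10, 20, 30, 40, 10, 20, 30, 40)
deaths <- c( 2,  4,  6,  8,  1,  1,  2,  2)

# Label each day as belonging to the first or second half of the period
half <- ifelse(seq_along(cases) <= length(cases) / 2,
               'first half', 'second half')

# Color the points by period and add a matching legend
plot(cases, deaths,
     col = ifelse(half == 'first half', 'blue', 'red'),
     xlab = 'number of active cases', ylab = 'number of new deaths')
legend('topleft', legend = c('first half', 'second half'),
       col = c('blue', 'red'), pch = 1)
```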
## Q4. Epistemological perspective
(*You don't necessarily need to respond to this question, but it is good to spend some time considering the relationship between the data you just analyzed and the social world it aims to describe.*)
Reflect for a moment on the source of this data and the variables it provides. How did the data's structure (choice of variables, observational frame, etc.) affect the kinds of analyses we undertook? Are the categories it relies on unambiguous in the 'real world'? What kinds of knowledge about the COVID-19 pandemic are easy to construct with this data? What kinds are difficult?