---
title: "SOCI 620: Lab 2"
author: "Peter McMahan"
date: 2023-01-10
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```

# Working with random samples

Samples from random variables are central to Bayesian data analysis. As we discussed in class, the exact posterior distribution for a given model is most often difficult or impossible to calculate. But it is relatively straightforward to calculate a non-normalized function that is *proportional* to the exact posterior. For decades, researchers have been coming up with clever ways to use that non-normalized posterior to ***generate random samples*** from the exact posterior. The upshot is that, as statisticians, we can say just about anything we want about a distribution using just a large sample from the posterior.

We'll start by taking a sample of 2,000 values from a Beta distribution with parameters $\alpha=2,\beta=50$: $\mathrm{Beta}(2,50)$

```{r beta_sample}
a <- 2
b <- 50
s <- rbeta(2000, a, b)

# the head() function shows you the first several values from a large vector or table
head(s, 20)
```

(As a side note, the Beta distribution is the exact posterior for a binomial model like our unemployment model if $p$ has a Beta prior. Because $\mathrm{Beta}(1,1)$ is just the uniform distribution, $\mathrm{Beta}(2,50)$ is the posterior under a uniform prior if we have seen exactly 49 employed people and 1 unemployed person.)

## Histograms and density plots

Once you have a random sample (pretend we don't know where it came from, for now), you can use a histogram to get a sense of the shape of the distribution. Note that the `hist()` function tries to make a good guess about the number of 'bins' to use, but you can give it a hint by supplying a `breaks` argument. Try increasing `breaks` to see how the plot starts to be less useful.
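(We can also check the side note above numerically. The sketch below, which is extra code not required for the lab, compares the non-normalized posterior for 1 unemployed person out of 50, a flat prior times the binomial likelihood, to the exact $\mathrm{Beta}(2,50)$ density. If the side note is right, the ratio between the two should be the same at every value of $p$.)

```{r check_beta_posterior}
p <- seq(0.01, 0.99, by=0.01)

# non-normalized posterior: flat prior times binomial likelihood
# (1 unemployed person out of 50)
unnorm <- dunif(p) * dbinom(1, size=50, prob=p)

# ratio to the exact Beta(2,50) density
ratio <- dbeta(p, 2, 50) / unnorm
range(ratio)
```

(The ratio is constant in $p$: it is exactly the normalizing factor that we usually avoid computing directly.)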
```{r histogram}
hist(s, xlim=c(0,1), breaks=10)
```

Another way to see the overall shape of a distribution is using R's `density()` function, which takes a random sample from a continuous distribution and tries to construct a 'smooth' probability density function from it. By default `density()` assumes the sample is from an *unbounded* continuous random variable. If you have strict upper and lower bounds like we do (Beta is bounded in $[0,1]$), it is important to let the function know this information. (What happens if you leave off the `from` and `to` arguments in the code below?)

The `dens()` function from the `rethinking` package produces a reasonable density plot with just one command:

```{r dens}
dens(s)
```

R's built-in `density()` function requires two steps:

```{r density}
d <- density(s, from=0, to=1)
plot(d)

# since we know the exact distribution in this case (we usually don't),
# overlay it for comparison
lines(d$x, dbeta(d$x,a,b), lty=3)
```

## Mean, median, and mode

Measures of "central tendency" try to summarize where the "middle" of a random sample is. Calculating the mean and median of the sample is simple, but calculating the mode is a little more complicated.

```{r centers}
# R has built-in functions for mean and median
smean <- mean(s)
smedian <- median(s)

# we can approximate the mode by looking for the value
# of p that maximizes the density we calculated before
smode <- d$x[which.max(d$y)]

centers <- c(smean, smedian, smode)
names(centers) <- c('mean','median','mode')
print(centers)
```

```{r centers_plot}
# visualize these (zooming in on the left side of the
# density for visual clarity)
plot(d, xlim=c(0,0.2), lwd=2)
abline(v=centers, lty=c(1,2,3))
legend('topright', legend=names(centers), lty=c(1,2,3))
```

## Intervals

We can use R's `quantile()` function to easily calculate percentile intervals. We should expect the percentile interval to be approximately centered around the median.
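To see what a percentile interval really is, here is a manual version of the same calculation. (This sketch is extra code: it assumes the sample `s` from above, and uses a simple index rule rather than the interpolation that `quantile()` does by default, so the answers may differ very slightly.)

```{r manual_quantile}
# sort the sample and pick the values 25% and 75% of the way
# through the ordered list
s_sorted <- sort(s)
n <- length(s_sorted)
perc_manual <- c(s_sorted[ceiling(0.25*n)], s_sorted[ceiling(0.75*n)])
print(perc_manual)
```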
```{r quantile}
# for a 50% credible interval, we want to leave 25% on either end.
# (This function just puts the sample in order and finds the values that
# are 25% and 75% of the way through the ordered list.)
perc <- quantile(s, probs=c(0.25,0.75))
print(perc)
```

```{r quantile_plot}
plot(d, xlim=c(0,0.2), lwd=2)
abline(v=perc, lty=3)
legend('topright', legend='50% percentile interval', lty=3)
```

The highest posterior density interval is more complex to calculate, but we can use the `HPDI()` function from the `rethinking` package. (Note: this function is just an easier way to call the `HPDinterval()` function in the `coda` package.) We should expect the HPDI estimate to be approximately centered around the mode, and so a little to the left of the percentile interval.

```{r hpdi}
library(rethinking)
hpdi <- HPDI(s, .5)
print(hpdi)
```

```{r hpdi_plot}
plot(d, xlim=c(0,0.2), lwd=2)
abline(v=hpdi, lty=3)
legend('topright', legend='50% HPD interval', lty=3)
```

We can also easily use the sample to assign probabilities to particular research questions. What is the posterior probability that the unemployment rate is higher than 0.1? About 3%.

```{r hypothesis}
mean(s > 0.1)
```

## Drawing samples from a grid approximation

Usually we don't have a function like `rbeta()` to easily draw samples from our posterior distribution. We will look at several ways to draw random samples from an arbitrary posterior distribution later in the semester. For now, we will learn how to approximate samples from a posterior using the grid approximations we have been working with.

First let's construct a grid approximation for the situation described above (49 employed people and 1 unemployed person), which corresponds to a sample size of 50 and a binomial count of just 1. The code below does this with a grid of only 20 points; we'll want to increase that for a better approximation.
```{r make_grid}
# make a small grid
n_grid <- 20
grid <- seq(from=0, to=1, length.out=n_grid)

# make the data
ssize <- 50
n_unemp <- 1

# prior and likelihood
prior <- dunif(grid)
lik <- dbinom(n_unemp, size=ssize, prob=grid)

# posterior
post <- prior*lik
post <- post/sum(post)

plot(grid, post, type='l', xlab='p', ylab='')
# show thin 'grid' lines
abline(v=grid, lwd=.3, col='gray')
```

The grid approximation works by constructing a discrete distribution that approximates the continuous distribution that is our posterior. This means we can use R's `sample()` function to take a random discrete sample.

```{r grid_sample}
# sample the points along `grid` with probability defined by `post`.
# It is vital that you set `replace=TRUE` here.
s_grid <- sample(grid, size=10000, prob=post, replace=TRUE)
```

Now let's calculate some descriptive stats to see how they compare.

### Centers

```{r grid_centers}
gmean <- mean(s_grid)
gmedian <- median(s_grid)

d_grid <- density(s_grid, from=0, to=1)
gmode <- d_grid$x[which.max(d_grid$y)]

centers_grid <- c(gmean, gmedian, gmode)
print(cbind(centers, centers_grid))
```

```{r grid_centers_plot}
plot(d_grid)
```

### Intervals

```{r grid_quantile}
perc_grid <- quantile(s_grid, probs=c(0.25,0.75))
print(perc)
print(perc_grid)
```

```{r grid_hpdi}
hpdi_grid <- HPDI(s_grid, .5)
print(hpdi)
print(hpdi_grid)
```

### Hypotheses

```{r grid_hypothesis}
print(mean(s > 0.1))
print(mean(s_grid > 0.1))
```
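Finally, as suggested above, a finer grid gives a better approximation. The sketch below is extra code that rebuilds the posterior on a 1,000-point grid and compares two sample summaries against the exact answers, which we happen to know in this example because the posterior is $\mathrm{Beta}(2,50)$:

```{r fine_grid}
# rebuild the grid posterior with a much finer grid
n_grid_fine <- 1000
grid_fine <- seq(from=0, to=1, length.out=n_grid_fine)
post_fine <- dunif(grid_fine) * dbinom(1, size=50, prob=grid_fine)
post_fine <- post_fine / sum(post_fine)

s_fine <- sample(grid_fine, size=10000, prob=post_fine, replace=TRUE)

# compare the sampled mean and tail probability to the exact
# Beta(2,50) values (mean = 2/52; tail probability from pbeta())
print(c(sample=mean(s_fine), exact=2/52))
print(c(sample=mean(s_fine > 0.1), exact=1 - pbeta(0.1, 2, 50)))
```

With 1,000 grid points the discrete sample reproduces the exact posterior summaries to within sampling error.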