Samples from random variables are central to Bayesian data analysis. As we discussed in class, the exact posterior distribution for a given model is most often difficult or impossible to calculate. But it is relatively straightforward to calculate a non-normalized function that is *proportional* to the exact posterior. For decades, researchers have been coming up with clever ways to use that non-normalized posterior to **generate random samples** from the exact posterior. The upshot is that, as statisticians, we can say just about anything we want about a distribution using just a large sample from the posterior.
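To make that concrete, here is a minimal sketch (my addition, not one of the notebook's cells) showing that a summary computed from a large sample converges on the exact value for the distribution. The seed and sample size are arbitrary illustrative choices:

```
# illustrative: the mean of a large sample from Beta(a,b) approaches
# the exact mean a/(a+b)
set.seed(1)                      # arbitrary seed, for reproducibility
big <- rbeta(1e5, 2, 50)         # 100,000 draws from Beta(2,50)
c(sample_mean = mean(big),
  exact_mean  = 2 / (2 + 50))    # exact mean of Beta(2,50) is 2/52
```

The two numbers agree to about three decimal places, and the agreement improves as the sample grows.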

Start by taking a sample of 2,000 values from a Beta distribution with parameters $\alpha=2,\beta=50$: $\mathrm{Beta}(2,50)$.

In [1]:

```
a <- 2
b <- 50
s <- rbeta(2000,a,b)
# the head() function shows you the first several values from a large vector or table
head(s,20)
```

- 0.0274213474961461
- 0.0345631400343158
- 0.121625658219538
- 0.0173781753936176
- 0.0880282267047069
- 0.0121741664570732
- 0.0400917389359398
- 0.00494634439647034
- 0.0568854628658444
- 0.0412966961543777
- 0.0293739987951746
- 0.00194028469677279
- 0.0751044727559561
- 0.0113516232127686
- 0.0235191245772273
- 0.0207784157761254
- 0.0112467025215983
- 0.0476260280500133
- 0.0277503337289659
- 0.0451730194900177

(As a side note, the Beta distribution is the exact posterior for a binomial model like our unemployment model if $p$ has a Beta prior. Because $\mathrm{Beta}(1,1)$ is just the uniform distribution, $\mathrm{Beta}(2,50)$ is the posterior with a uniform prior if we have seen exactly 49 employed people and 1 unemployed person.)
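As a quick numerical sanity check on that claim (my addition, not part of the notebook): with a flat prior, the posterior is proportional to the likelihood, so the $\mathrm{Beta}(2,50)$ density should differ from the binomial likelihood of 1 "success" in 50 trials only by a constant factor:

```
# sketch: the Beta(2,50) density and the binomial likelihood for
# 1 unemployed person out of 50 should have the same *shape*,
# i.e. their ratio should be constant in p
p <- seq(0.01, 0.20, by = 0.01)
ratio <- dbeta(p, 2, 50) / dbinom(1, 50, p)
range(ratio)   # constant across p (it works out to 51 here)
```

The ratio is flat, confirming that the two curves are proportional and hence normalize to the same posterior.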

Use a histogram to get a sense of the shape of the distribution.

In [2]:

```
# this command tells Jupyter what dimensions to use for its plots.
options(repr.plot.width=12, repr.plot.height=8)
# the hist() function tries to make a good guess about the number of
# bins to use, but you can give it a hint by supplying a `breaks` argument.
# Try increasing breaks to see how it starts to be less useful.
hist(s,xlim=c(0,1),breaks=50)
```

Another way to see the overall shape of a distribution is using R's `density()` function, which takes a random sample from a continuous distribution and tries to construct a 'smooth' probability density function from it. By default `density()` assumes the sample is from an *unbounded* continuous random variable. If you have strict upper and lower bounds like we do (Beta is bounded in $[0,1]$), it is important to let the function know this information. (What happens if you leave off the `from` and `to` arguments in the code below?)

In [3]:

```
library(rethinking)
dens(s)
d <- density(s,from=0,to=1)
plot(d)
# since we know the exact distribution in this case (we usually don't),
# overlay it for comparison
lines(d$x,dbeta(d$x,a,b),lty=3)
names(d)
```

```
Loading required package: rstan
Loading required package: StanHeaders
Loading required package: ggplot2
rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
Loading required package: parallel
rethinking (Version 2.13)

Attaching package: ‘rethinking’

The following object is masked from ‘package:stats’:

    rstudent
```

- 'x'
- 'y'
- 'bw'
- 'n'
- 'call'
- 'data.name'
- 'has.na'

Calculating the mean and median of the sample is simple, but calculating the mode is a little more complicated.

In [4]:

```
# R has built-in functions for mean and median
smean <- mean(s)
smedian <- median(s)
# we can approximate the mode by looking for the value
# of p that maximizes the density we calculated before
smode <- d$x[which.max(d$y)]
centers <- c(smean,smedian,smode)
names(centers) <- c('mean','median','mode')
print(centers)
# visualize these (zooming in on the left side of the
# density for visual clarity)
plot(d,xlim=c(0,0.2),lwd=2)
abline(v=centers,lty=c(1,2,3))
legend('topright',legend=names(centers),lty=c(1,2,3))
```

```
      mean     median       mode 
0.03956546 0.03302253 0.02152642 
```
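For $\mathrm{Beta}(a,b)$ with $a,b>1$ the mode is actually available in closed form as $(a-1)/(a+b-2)$, so in this (unusual) case we can check the density-based estimate. This check is my addition, not part of the original notebook:

```
# exact mode of Beta(2,50) via the closed form (a-1)/(a+b-2);
# it equals 1/50 = 0.02, close to the density-based estimate
# of roughly 0.0215 above
(2 - 1) / (2 + 50 - 2)
```

The small gap between the two comes from the smoothing that `density()` applies; a larger sample or smaller bandwidth narrows it.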

We can use R's `quantile()` function to easily calculate percentile intervals. We should expect the percentile interval to be approximately centered around the median.

In [5]:

```
# for a 50% credible interval, we want to leave 25% on either end.
# (This function just puts the sample in order and finds the values that
# are 25% and 75% of the way through the ordered list.)
perc <- quantile(s,probs=c(0.25,0.75))
print(perc)
plot(d,xlim=c(0,0.2),lwd=2)
abline(v=perc,lty=3)
legend('topright',legend='50% percentile interval',lty=3)
```

```
       25%        75% 
0.01913428 0.05390006 
```
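The "sort the sample and find the values 25% and 75% of the way through" description in the comment above can be done by hand. This sketch (my addition) re-draws its own sample with an arbitrary seed so it stands alone, so its numbers differ slightly from the output above:

```
# sketch: a 50% percentile interval computed directly from order statistics
set.seed(42)                       # arbitrary seed; standalone sample
s2 <- sort(rbeta(2000, 2, 50))
n  <- length(s2)
manual <- c(s2[round(0.25 * n)], s2[round(0.75 * n)])
manual
# quantile() interpolates between adjacent order statistics, so it
# agrees with the manual version up to that interpolation rule
quantile(s2, probs = c(0.25, 0.75))
```

With 2,000 draws the two versions agree to several decimal places.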

The highest posterior density interval is more complex, but we can use the `HPDI()` function from the `rethinking` package. (Note: this function is just an easier way to call the `HPDinterval()` function in the `coda` package.) We should expect the HPDI estimate to be approximately centered around the mode, and so a little to the left of the percentile interval.

In [6]:

```
library(rethinking)
hpdi <- HPDI(s,.5)
print(hpdi)
plot(d,xlim=c(0,0.2),lwd=2)
abline(v=hpdi,lty=3)
legend('topright',legend='50% HPD interval',lty=3)
```

```
      |0.5       0.5| 
0.01054474 0.03900621 
```
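Under the hood, an HPD interval for a sample is just the *narrowest* interval containing the requested mass, which we can compute directly from the sorted sample. This sketch (my addition) re-draws its own sample with an arbitrary seed, so its numbers differ slightly from the `HPDI()` output above:

```
# sketch: the 50% HPD interval as the narrowest window covering
# half of the sorted sample
set.seed(42)                         # arbitrary seed; standalone sample
s2 <- sort(rbeta(2000, 2, 50))
n  <- length(s2)
k  <- floor(0.5 * n)                 # each window spans k+1 sorted points
widths <- s2[(k + 1):n] - s2[1:(n - k)]
i <- which.min(widths)               # left edge of the narrowest window
c(s2[i], s2[i + k])                  # lower and upper HPD bounds
```

Because it picks the *shortest* qualifying interval, the result is never wider than the 50% percentile interval, and for a right-skewed density like this one it sits to the left of it, hugging the mode.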