Samples from random variables are central to Bayesian data analysis. As we discussed in class, the exact posterior distribution for a given model is most often difficult or impossible to calculate. But it is relatively straightforward to calculate a non-normalized function that is *proportional* to the exact posterior. For decades, researchers have been coming up with clever ways to use that non-normalized posterior to **generate random samples** from the exact posterior. The upshot is that, as statisticians, we can say just about anything we want about a distribution using just a large sample from the posterior.
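To make that concrete, here is a minimal sketch (my addition, not one of the notebook's cells) showing that a summary computed from a large sample converges on the exact value for the distribution. The seed and sample size are arbitrary illustrative choices:

```
# illustrative: the mean of a large sample from Beta(a,b) approaches
# the exact mean a/(a+b)
set.seed(1)                      # arbitrary seed, for reproducibility
big <- rbeta(1e5, 2, 50)         # 100,000 draws from Beta(2,50)
c(sample_mean = mean(big),
  exact_mean  = 2 / (2 + 50))    # exact mean of Beta(2,50) is 2/52
```

The two numbers agree to about three decimal places, and the agreement improves as the sample grows.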

Start by taking a sample of 2,000 values from a Beta distribution with parameters $\alpha=2,\beta=50$: $\mathrm{Beta}(2,50)$.

In [1]:

```
a <- 2
b <- 50
s <- rbeta(2000,a,b)
# the head() function shows you the first several values from a large vector or table
head(s,20)
```

- 0.0274213474961461
- 0.0345631400343158
- 0.121625658219538
- 0.0173781753936176
- 0.0880282267047069
- 0.0121741664570732
- 0.0400917389359398
- 0.00494634439647034
- 0.0568854628658444
- 0.0412966961543777
- 0.0293739987951746
- 0.00194028469677279
- 0.0751044727559561
- 0.0113516232127686
- 0.0235191245772273
- 0.0207784157761254
- 0.0112467025215983
- 0.0476260280500133
- 0.0277503337289659
- 0.0451730194900177

(As a side note, the Beta distribution is the exact posterior for a binomial model like our unemployment model if $p$ has a Beta prior. Because $\mathrm{Beta}(1,1)$ is just the uniform distribution, $\mathrm{Beta}(2,50)$ is the posterior with a uniform prior if we have seen exactly 49 employed people and 1 unemployed person.)
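As a quick numerical sanity check on that claim (my addition, not part of the notebook): with a flat prior, the posterior is proportional to the likelihood, so the $\mathrm{Beta}(2,50)$ density should differ from the binomial likelihood of 1 "success" in 50 trials only by a constant factor:

```
# sketch: the Beta(2,50) density and the binomial likelihood for
# 1 unemployed person out of 50 should have the same *shape*,
# i.e. their ratio should be constant in p
p <- seq(0.01, 0.20, by = 0.01)
ratio <- dbeta(p, 2, 50) / dbinom(1, 50, p)
range(ratio)   # constant across p (it works out to 51 here)
```

The ratio is flat, confirming that the two curves are proportional and hence normalize to the same posterior.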

Use a histogram to get a sense of the shape of the distribution.

In [2]:

```
# this command tells Jupyter what dimensions to use for its plots.
options(repr.plot.width=12, repr.plot.height=8)
# the hist() function tries to make a good guess about the number of
# bins to use, but you can give it a hint by supplying a `breaks` argument.
# Try increasing breaks to see how it starts to be less useful.
hist(s,xlim=c(0,1),breaks=50)
```

Another way to see the overall shape of a distribution is using R's `density()` function, which takes a random sample from a continuous distribution and tries to construct a 'smooth' probability density function from it. By default `density()` assumes the sample is from an *unbounded* continuous random variable. If you have strict upper and lower bounds like we do (Beta is bounded in $[0,1]$), it is important to let the function know this information. (What happens if you leave off the `from` and `to` arguments in the code below?)

In [3]:

```
library(rethinking)
dens(s)
d <- density(s,from=0,to=1)
plot(d)
# since we know the exact distribution in this case (we usually don't),
# overlay it for comparison
lines(d$x,dbeta(d$x,a,b),lty=3)
names(d)
```

```
Loading required package: rstan
Loading required package: StanHeaders
Loading required package: ggplot2
rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
Loading required package: parallel
rethinking (Version 2.13)

Attaching package: ‘rethinking’

The following object is masked from ‘package:stats’:

    rstudent
```

- 'x'
- 'y'
- 'bw'
- 'n'
- 'call'
- 'data.name'
- 'has.na'

Calculating the mean and median of the sample is simple, but calculating the mode is a little more complicated.

In [4]:

```
# R has built-in functions for mean and median
smean <- mean(s)
smedian <- median(s)
# we can approximate the mode by looking for the value
# of p that maximizes the density we calculated before
smode <- d$x[which.max(d$y)]
centers <- c(smean,smedian,smode)
names(centers) <- c('mean','median','mode')
print(centers)
# visualize these (zooming in on the left side of the
# density for visual clarity)
plot(d,xlim=c(0,0.2),lwd=2)
abline(v=centers,lty=c(1,2,3))
legend('topright',legend=names(centers),lty=c(1,2,3))
```

```
      mean     median       mode 
0.03956546 0.03302253 0.02152642 
```
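For $\mathrm{Beta}(a,b)$ with $a,b>1$ the mode is actually available in closed form as $(a-1)/(a+b-2)$, so in this (unusual) case we can check the density-based estimate. This check is my addition, not part of the original notebook:

```
# exact mode of Beta(2,50) via the closed form (a-1)/(a+b-2);
# it equals 1/50 = 0.02, close to the density-based estimate
# of roughly 0.0215 above
(2 - 1) / (2 + 50 - 2)
```

The small gap between the two comes from the smoothing that `density()` applies; a larger sample or smaller bandwidth narrows it.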

We can use R's `quantile()` function to easily calculate percentile intervals. We should expect the percentile interval to be approximately centered around the median.

In [5]:

```
# for a 50% credible interval, we want to leave 25% on either end.
# (This function just puts the sample in order and finds the values that
# are 25% and 75% of the way through the ordered list.)
perc <- quantile(s,probs=c(0.25,0.75))
print(perc)
plot(d,xlim=c(0,0.2),lwd=2)
abline(v=perc,lty=3)
legend('topright',legend='50% percentile interval',lty=3)
```

```
       25%        75% 
0.01913428 0.05390006 
```
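The "sort the sample and find the values 25% and 75% of the way through" description in the comment above can be done by hand. This sketch (my addition) re-draws its own sample with an arbitrary seed so it stands alone, so its numbers differ slightly from the output above:

```
# sketch: a 50% percentile interval computed directly from order statistics
set.seed(42)                       # arbitrary seed; standalone sample
s2 <- sort(rbeta(2000, 2, 50))
n  <- length(s2)
manual <- c(s2[round(0.25 * n)], s2[round(0.75 * n)])
manual
# quantile() interpolates between adjacent order statistics, so it
# agrees with the manual version up to that interpolation rule
quantile(s2, probs = c(0.25, 0.75))
```

With 2,000 draws the two versions agree to several decimal places.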

The highest posterior density interval is more complex, but we can use the `HPDI()` function from the `rethinking` package. (Note: this function is just an easier way to call the `HPDinterval()` function in the `coda` package.) We should expect the HPDI estimate to be approximately centered around the mode, and so a little to the left of the percentile interval.

In [6]:

```
library(rethinking)
hpdi <- HPDI(s,.5)
print(hpdi)
plot(d,xlim=c(0,0.2),lwd=2)
abline(v=hpdi,lty=3)
legend('topright',legend='50% HPD interval',lty=3)
```

```
      |0.5       0.5| 
0.01054474 0.03900621 
```
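Under the hood, an HPD interval for a sample is just the *narrowest* interval containing the requested mass, which we can compute directly from the sorted sample. This sketch (my addition) re-draws its own sample with an arbitrary seed, so its numbers differ slightly from the `HPDI()` output above:

```
# sketch: the 50% HPD interval as the narrowest window covering
# half of the sorted sample
set.seed(42)                         # arbitrary seed; standalone sample
s2 <- sort(rbeta(2000, 2, 50))
n  <- length(s2)
k  <- floor(0.5 * n)                 # each window spans k+1 sorted points
widths <- s2[(k + 1):n] - s2[1:(n - k)]
i <- which.min(widths)               # left edge of the narrowest window
c(s2[i], s2[i + k])                  # lower and upper HPD bounds
```

Because it picks the *shortest* qualifying interval, the result is never wider than the 50% percentile interval, and for a right-skewed density like this one it sits to the left of it, hugging the mode.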