In this notebook, we will learn how to estimate linear regression models in a Bayesian framework. This may seem like a lot of extra work, since `lm()` is fast and easy to use, but learning these techniques on a model you are already familiar with will help later on, when we start working with more complex models (I promise).
We will start by loading the same data as last time.
d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2017.csv")
# again, subsample for less precise estimates: d <- d[sample(nrow(d),200),]
head(d)
|23922|0|52|Other Asian or Pacific Islander|0|4+ years college|137500|11.831379|
|31081|0|31|Two major races|0|4+ years college|48000|10.778956|
|10788|1|49|Black/African American/Negro|0|Some college|50000|10.819778|
|33079|0|25|Two major races|0|HS|15000|9.615805|
Let's build our first model. We'll make an `alist` describing a model predicting log income as normally distributed, but with a mean that depends on a person's gender.
(Note: the US census, which provided this data, does not collect information on non-binary gender—everyone is either a man or a woman. This is an unfortunate limitation of a great deal of existing survey data and is only just beginning to change. Although I recoded and formatted this data, I left the label "female" as it is in the raw data.)
m1 <- alist(
    log_income ~ dnorm(mu, sigma),
    mu <- a + b*female,
    a ~ dnorm(0, 30),
    b ~ dnorm(0, 30),
    sigma ~ dunif(0, 50)
)
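In standard statistical notation, this `alist` corresponds to the following model (writing $\alpha$ and $\beta$ for `a` and `b`):

$$
\begin{aligned}
\text{log\_income}_i &\sim \text{Normal}(\mu_i, \sigma)\\
\mu_i &= \alpha + \beta \cdot \text{female}_i\\
\alpha &\sim \text{Normal}(0, 30)\\
\beta &\sim \text{Normal}(0, 30)\\
\sigma &\sim \text{Uniform}(0, 50)
\end{aligned}
$$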
The big difference between this model and models we've been using so far is that `mu` is defined using `<-` instead of `~`. In the `rethinking` package and in Stan, the "tilde" `~` is used for stochastic or random relations, while the "assignment arrow" `<-` is used for deterministic relations. This tells the model that `log_income` is considered random, even if you know `mu` and `sigma`, but that you can calculate `mu` exactly as long as you have specific values for `a`, `b`, and `female`.
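To see the distinction between `<-` and `~` in action, here is a small sketch using made-up values for `a`, `b`, and `sigma` (these are not estimates from the model): `mu` comes out exactly the same every time you compute it, while draws of `log_income` differ even though `mu` and `sigma` are held fixed.

```r
# hypothetical parameter values, for illustration only
a <- 10.5
b <- -0.7
sigma <- 0.9
female <- 1

# deterministic (<-): mu is fully determined by a, b, and female
mu <- a + b*female    # always exactly 9.8

# stochastic (~): log_income varies from draw to draw,
# even with mu and sigma held fixed
set.seed(1)
rnorm(3, mean=mu, sd=sigma)
```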
We can estimate this model using the `quap()` function:

library(rethinking)
fit1 <- quap(m1, data=d)
The `precis()` function will describe the marginal posterior distribution of each of our parameters, producing output similar to a regular regression table:
# this command tells jupyter what dimensions to use for its plots
options(repr.plot.width=12, repr.plot.height=8)

# `prob` specifies the size of the credible interval to calculate
# `digits` specifies how many digits to show for each result
precis(fit1, prob=.95, digits=3)
This means that the mean of the marginal posterior for `a` is about 10.496.
We can extract the coefficients from this object to get the expected income for men and women, conditional on the data we have seen.
# the objects produced by the quap() function have an attribute
# called "coef" that holds the mean estimates for each parameter;
# you can access these with the coef() function
coef(fit1)

# we access individual estimates by indexing with the parameter name
a_mean <- coef(fit1)['a']
b_mean <- coef(fit1)['b']

# it is simple to calculate the mean predicted income for men and women
# (remembering that our model predicts log income)
mean_income_men <- exp(a_mean)
mean_income_women <- exp(a_mean + b_mean)

# display
print(mean_income_men)
print(mean_income_women)
       a
36170.8
       a
17686.68

(Both values print under the name `a` because R carries the name of the first element through the arithmetic and the call to `exp()`.)
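One convenient consequence of modelling log income is that `b` has a ratio interpretation: since $\exp(a+b)/\exp(a)=\exp(b)$, $\exp(b)$ is women's predicted income as a fraction of men's. We can check this against the two values printed above:

```r
# ratio of the two predicted incomes printed above
17686.68 / 36170.8    # about 0.489
# the same number comes straight from the coefficient as exp(b_mean),
# since exp(a + b) / exp(a) = exp(b)
```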
What about the joint posterior? We can tell something about it by examining the variance-covariance matrix of the coefficients. A variance-covariance matrix tells you the variance (standard deviation squared) of each variable in a joint distribution, along with the covariance of each pair of variables. (Covariance is similar to correlation, but it has not been standardized to fit into [-1,1]).
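If you want to move between the two scales, base R's `cov2cor()` converts a variance-covariance matrix into the corresponding correlation matrix. A quick sketch with toy data (unrelated to our model):

```r
set.seed(42)
x <- rnorm(1000)
y <- -0.5*x + rnorm(1000, sd=0.5)   # y is negatively related to x

vc <- cov(cbind(x, y))   # variance-covariance matrix of x and y
cov2cor(vc)              # same information, rescaled to fall in [-1,1]
# the off-diagonal element of cov2cor(vc) equals cor(x, y)
```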
vc <- vcov(fit1)
# rounding to 8 digits for legibility
round(vc,8)
In the above variance-covariance matrix, the elements along the diagonal tell us that $Var(a)=0.01527501$, $Var(b)=0.02882057$, and $Var(\sigma)=0.00358973$ (these are just the standard deviations reported above, squared). But more interestingly, it tells us that `a` and `b` have a substantial negative covariance, $Cov(a,b)=-0.01527478$. This means that higher values for `a` tend to go along with lower values for `b`, and vice versa.
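We can put this covariance on the familiar correlation scale by dividing by the product of the two standard deviations, using the values reported above:

```r
# values copied from the variance-covariance matrix above
cov_ab <- -0.01527478
var_a  <-  0.01527501
var_b  <-  0.02882057

cov_ab / sqrt(var_a * var_b)   # roughly -0.73
```

A correlation this strongly negative is typical for an intercept and the coefficient on a 0/1 predictor: `a` alone has to fit the men, so when `a` drifts up, `b` has to drift down to keep `a + b` fitting the women.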
We can look a little more closely at this distribution by taking a sample from it.
samp1 <- extract.samples(fit1,3000)
head(samp1)
cor(samp1)
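The samples are useful for more than plotting: once you have them, posterior summaries are just descriptive statistics. For example (the exact numbers will vary slightly from run to run):

```r
# a 95% credible interval for b, computed directly from the samples
# (it should be close to the interval precis() reported)
quantile(samp1$b, c(.025, .975))

# the posterior probability that b is negative
mean(samp1$b < 0)
```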
# plotting a data frame or list plots each pair of variables
# (`pch=16` changes the shape of each point to a solid circle,
# `cex=.8` makes each point 80% as big as it would normally be,
# and `col="#00000022"` makes the points translucent)
plot(samp1, pch=16, cex=.8, col="#00000022")

plot(samp1$a, samp1$b)