--- title: "SOCI 620: Lab 4" author: "Peter McMahan" date: "1/19/2023" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(rethinking) ``` # Bayesian regressions In this notebook, we will learn how to estimate linear regression models in a Bayesian framework. This all may seem like a lot of extra work since `lm()` is fast and easy to use, but learning these techniques on a model you are familiar with will help later on when we start working with more complex models (I promise). We will start by loading the same data as last time. ```{r load_data} d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2017.csv") # create subsample so our estimates will be less precise: d <- d[sample(nrow(d),200),] head(d) ``` Let's build our first model. We'll make an `alist` describing a model predicting log income as normally distributed, but with a mean that depends on a person's gender. (Note: the US census, which provided this data, allowed for exactly two gender categories for respondents—everyone is either "male" or "female". This is an unfortunate limitation of a great deal of existing survey data and is only just beginning to change. Although I recoded and formatted this data, I left the label "female" as it is in the raw data.) ```{r define_model} m1 <- alist( log_income ~ dnorm(mu,sigma), mu <- a + b*female, a ~ dnorm(0,30), b ~ dnorm(0,30), sigma ~ dunif(0,50) ) ``` The big difference between this model and models we've been using so far is that `mu` is defined using `<-` instead of `~`. In the `rethinking` package and in Stan, the "tilde" `~` is used for *stochastic* or *random* relations, while the "assignment arrow" `<-` is used for *deterministic* relations. This tells the model that `log_income` is considered random, even if you know `mu` and `sigma`, but that you can calculate `mu` exactly as long as you have speific values for `a`, `b`, and `female`. We can estimate this model using `quap()`: ```{r fit1} fit1 <- quap(m1,data=d) ``` The `precis()` function will describe the marginal posterior distributions for each of our parameters, producing output similar to a regular regression: ```{r precis1} # `prob` specifies the size of the credible interval to calculate # `digits` specifies how many digits to show for each result precis(fit1, prob=.95, digits=3) ``` This means that the mean of the marginal posterior for `a` is about `r round(coef(fit1)['a'],3)`. We can extract the coefficients from this object to get the expected income for men and women, conditional on the data we have seen. ```{r coef1} # the objects produced by the map() function have an attribute # called "coef" that holds the mean estimates for each parameter print(coef(fit1)) # we access individual estimates by indexing with the parameter name a_mean = coef(fit1)['a'] b_mean = coef(fit1)['b'] # It is simple to calculate the mean predicted income for men and women # (remembering that our model predicts log income) mean_income_men <- exp(a_mean) mean_income_women <- exp(a_mean + b_mean) print(c(men=mean_income_men,women=mean_income_women)) ``` What about the joint posterior? We can tell something about it by examining the ***variance-covariance matrix*** of the coefficients. A variance-covariance matrix tells you the variance (standard deviation squared) of each variable in a joint distribution, along with the *covariance* of each pair of variables. (Covariance is similar to correlation, but it has not been standardized to fit into [-1,1]). 
```{r cov1}
vc <- vcov(fit1)
# rounding to 8 digits for legibility
round(vc, 8)
```

In the above variance-covariance matrix, the elements along the diagonal tell us that $Var(a)=`r vc['a','a']`$, $Var(b)=`r vc['b','b']`$, and $Var(\sigma)=`r vc['sigma','sigma']`$ (these are just the standard deviations reported above, squared). More interestingly, it tells us that `a` and `b` have a substantial *negative* covariance, $Cov(a,b)=`r vc['a','b']`$. This means that higher values for `a` tend to go along with *lower* values for `b`, and vice versa. We can look a little more closely at this distribution by taking a sample from it.

```{r samp1_head}
samp1 <- extract.samples(fit1, 3000)
head(samp1)
```

```{r samp1_cor}
cor(samp1)
```

```{r samp1_plotall}
# plotting a data frame or list plots each pair of variables
plot(samp1, pch=16, cex=.8, col="#00000022")
# (`pch=16` changes the shape of each point to a solid circle,
# `cex=.8` makes each point 80% as big as it would normally be,
# and `col="#00000022"` makes the points translucent)

# this plots just a versus b
plot(samp1$a, samp1$b, pch=16, cex=.8, col="#00000022")
```

Now that we have these samples, we can also easily test specific relationships, like the posterior probability that women earn less than men on average. (Recall that our subsample is very small, so our certainty is artificially deflated.)

```{r gendergap}
mean(samp1$b < 0)
```

We can also work with the samples to create posteriors for parameters we didn't explicitly estimate. For example, it could be useful to compare the posterior distribution of mean log earnings for men (`a`) with the posterior distribution of mean log earnings for women (`a + b`).

```{r posterior_by_gender}
posterior_mu_men <- density(samp1$a)
posterior_mu_women <- density(samp1$a + samp1$b)

# get good limits for the plot
xlim <- range(posterior_mu_men$x, posterior_mu_women$x)
ylim <- range(posterior_mu_men$y, posterior_mu_women$y)

# plot men's density
plot(
  posterior_mu_men,
  xlab='log earnings',
  xlim=xlim, ylim=ylim,
  main="Posterior distribution, mean log earnings")
# add women's density
lines(posterior_mu_women, lty=3)
# add a legend
legend('topleft', lty=c(1,3), legend=c("men","women"))
```

The beginnings of a `ggplot2` version of the same figure (`ggplot2` is loaded in the setup chunk):

```{r posterior_by_gender_ggplot}
ggplot(samp1) +
  geom_density(aes(x = a, linetype = "men")) +
  geom_density(aes(x = a + b, linetype = "women")) +
  xlab("log earnings") +
  ggtitle("Posterior distribution, mean log earnings")
```

We can also easily create posterior distributions for the means in non-log dollars, simply using the `exp()` function to exponentiate the sampled values.

```{r posterior_by_gender_exp}
posterior_expmu_men <- density(exp(samp1$a))
posterior_expmu_women <- density(exp(samp1$a + samp1$b))

# get good limits for the plot
xlim <- range(posterior_expmu_men$x, posterior_expmu_women$x)
ylim <- range(posterior_expmu_men$y, posterior_expmu_women$y)

plot(posterior_expmu_men, xlab='earnings', xlim=xlim, ylim=ylim,
     main="Posterior distribution, mean earnings")
lines(posterior_expmu_women, lty=3)
# add a legend
legend('topright', lty=c(1,3), legend=c("men","women"))
```
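Because the exponentiated samples are themselves posterior draws, we can summarize any derived quantity directly. Here is a sketch (the chunk name and the particular summaries are my additions, not part of the original lab): the posterior median and a 95% credible interval for the dollar-scale gap in mean earnings.

```{r gendergap_dollars}
# a derived quantity: the posterior distribution of the gap in mean
# earnings between men and women, on the dollar scale
gap_dollars <- exp(samp1$a) - exp(samp1$a + samp1$b)
# posterior median and 95% credible interval for the gap
quantile(gap_dollars, c(.025, .5, .975))
```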
At the risk of cramming too much into one figure, we can plot our *posterior distributions of the mean earnings of men and women* alongside the *density of earnings of men and women in the data*. This can look a little confusing, but it is important to understand the difference between the two. In the figure below, the thick lines represent the distribution of actual earnings we see in the full data we are using (`d`). The thin, narrow distributions show us what we know about the *average* earnings of men and women.

```{r posterior_vs_data}
posterior_expmu_men <- density(exp(samp1$a))
posterior_expmu_women <- density(exp(samp1$a + samp1$b))
data_men <- density(d$income[d$female==0], from=0)
data_women <- density(d$income[d$female==1], from=0)

# get good limits for the plot
xlim <- range(posterior_expmu_men$x, posterior_expmu_women$x,
              data_men$x, data_women$x)
ylim <- range(posterior_expmu_men$y, posterior_expmu_women$y,
              data_men$y, data_women$y)

plot(posterior_expmu_men, xlab='earnings', xlim=xlim, ylim=ylim, main="")
lines(posterior_expmu_women, lty=3)
lines(data_men, lwd=3, col="#00000066")
lines(data_women, lwd=3, lty=3, col="#00000066")
# add a legend
legend('topright', lty=c(1,3,1,3),
       lwd=c(1,1,3,3), col=c('black','black',"#00000066","#00000066"),
       legend=c("posterior mean (men)", "posterior mean (women)",
                "data distribution (men)", "data distribution (women)"))
```

### Adding covariates

To make the model more complete, let's add covariates for age and college education.

```{r fit2}
# construct a binary variable for college education
d$college <- d$education == "4+ years college"

# make a new model
m2 <- alist(
  log_income ~ dnorm(mu, sigma),
  mu <- a + bf*female + ba*age + bc*college,
  a ~ dnorm(0, 30),
  bf ~ dnorm(0, 30),
  ba ~ dnorm(0, 30),
  bc ~ dnorm(0, 30),
  sigma ~ dunif(0, 50)
)
fit2 <- quap(m2, data=d)
precis(fit2, prob=.95, digits=3)
vcov(fit2)
```
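As with the first model, we can sample from this joint posterior to summarize quantities the `precis()` table does not report directly. A minimal sketch (the chunk name and the particular summaries below are my additions): the posterior probability of a positive college coefficient, and the implied multiplicative effect on earnings.

```{r samp2_sketch}
# sample from the joint posterior of the covariate model
samp2 <- extract.samples(fit2, 3000)
# posterior probability that college graduates earn more,
# holding gender and age constant
mean(samp2$bc > 0)
# implied multiplicative earnings premium for college,
# with posterior median and 95% credible interval
quantile(exp(samp2$bc), c(.025, .5, .975))
```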