In this notebook, we will learn how to estimate linear regression models in a Bayesian framework. This all may seem like a lot of extra work since `lm()` is fast and easy to use, but learning these techniques on a model you are familiar with will help later on when we start working with more complex models (I promise).

We will start by loading the same data as last time.

In [1]:

```
d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2017.csv")
# again, subsample for less precise estimates:
d <- d[sample(nrow(d),200),]
head(d)
```
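Note that `sample()` draws a different subsample on every run, so your numbers will differ slightly from those shown below. If you want reproducible results, you can fix the RNG seed before subsampling (the seed value here is an arbitrary choice):

```r
# Run before the sample() call above: a fixed seed makes the subsample
# -- and hence every estimate below -- reproducible across runs
set.seed(620)
```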

| | female | age | race | hispanic | education | income | log_income |
|---|---|---|---|---|---|---|---|
| | <int> | <int> | <chr> | <int> | <chr> | <int> | <dbl> |
| 23922 | 0 | 52 | Other Asian or Pacific Islander | 0 | 4+ years college | 137500 | 11.831379 |
| 31081 | 0 | 31 | Two major races | 0 | 4+ years college | 48000 | 10.778956 |
| 4683 | 1 | 87 | White | 0 | No HS | 26100 | 10.169691 |
| 10788 | 1 | 49 | Black/African American/Negro | 0 | Some college | 50000 | 10.819778 |
| 9822 | 1 | 46 | White | 0 | HS | 8800 | 9.082507 |
| 33079 | 0 | 25 | Two major races | 0 | HS | 15000 | 9.615805 |

Let's build our first model. We'll make an `alist` describing a model predicting log income as normally distributed, but with a mean that depends on a person's gender.

(Note: the US census, which provided this data, does not collect information on non-binary gender—everyone is either a man or a woman. This is an unfortunate limitation of a great deal of existing survey data and is only just beginning to change. Although I recoded and formatted this data, I left the label "female" as it is in the raw data.)

In [2]:

```
m1 <- alist(
  log_income ~ dnorm(mu, sigma),
  mu <- a + b*female,
  a ~ dnorm(0, 30),
  b ~ dnorm(0, 30),
  sigma ~ dunif(0, 50)
)
```
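Since this workflow was compared to `lm()` above, it may help to see the classical analogue. A quick sketch (assuming the data frame `d` from above): the same regression fit by ordinary least squares, whose estimates should land close to the posterior means we get below.

```r
# Classical counterpart of m1: OLS regression of log income on gender.
# With flat-ish priors like dnorm(0, 30), the Bayesian posterior means
# should be very close to these maximum-likelihood estimates.
fit_ols <- lm(log_income ~ female, data = d)
coef(fit_ols)  # "(Intercept)" plays the role of a, "female" of b
```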

The big difference between this model and the models we've been using so far is that `mu` is defined using `<-` instead of `~`. In the `rethinking` package and in Stan, the "tilde" `~` is used for *stochastic* or *random* relations, while the "assignment arrow" `<-` is used for *deterministic* relations. This tells the model that `log_income` is considered random, even if you know `mu` and `sigma`, but that you can calculate `mu` exactly as long as you have specific values for `a`, `b`, and `female`.
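The distinction is easiest to see generatively. A minimal sketch with made-up parameter values: `mu` follows exactly from `a`, `b`, and `female`, while `log_income` is a random draw even once `mu` and `sigma` are known.

```r
# Hypothetical parameter values, for illustration only
a <- 10.5; b <- -0.7; sigma <- 1.2
female <- c(0, 1)
# Deterministic: mu is computed exactly from its inputs (the `<-` line in m1)
mu <- a + b * female
# Stochastic: a different draw every time (the `~` lines in m1)
log_income <- rnorm(length(mu), mean = mu, sd = sigma)
```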

We can estimate this model using `quap()`:

In [3]:

```
library(rethinking)
fit1 <- quap(m1,data=d)
```

The `precis()` function will describe the marginal posterior distributions for each of our parameters, producing output similar to a regular regression:

In [4]:

```
# this command tells jupyter what dimensions to use for its plots.
options(repr.plot.width=12, repr.plot.height=8)
# `prob` specifies the size of the credible interval to calculate
# `digits` specifies how many digits to show for each result
precis(fit1, prob=.95, digits=3)
```

| | mean | sd | 2.5% | 97.5% |
|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> |
| a | 10.4960075 | 0.12359212 | 10.253771 | 10.7382436 |
| b | -0.7154405 | 0.16976623 | -1.048176 | -0.3827048 |
| sigma | 1.1982904 | 0.05991435 | 1.080860 | 1.3157203 |

This means that the mean of the marginal posterior for `a` is about 10.496.

We can extract the coefficients from this object to get the expected income for men and women, conditional on the data we have seen.

In [5]:

```
# the objects produced by the quap() function have an attribute
# called "coef" that holds the mean estimates for each parameter
# you can access these with the `coef()` function
coef(fit1)
# we access individual estimates by indexing with the parameter name
a_mean <- coef(fit1)['a']
b_mean <- coef(fit1)['b']
# It is simple to calculate the mean predicted income for men and women
# (remembering that our model predicts log income)
mean_income_men <- exp(a_mean)
mean_income_women <- exp(a_mean + b_mean)
# display
print(mean_income_men)
print(mean_income_women)
```

```
         a          b      sigma
10.4960075 -0.7154405  1.1982904

       a
36170.8
       a
17686.68
```

(The stray `a` labels in the printed means are just the names carried over from the named coefficient vector.)

What about the joint posterior? We can tell something about it by examining the **variance-covariance matrix** of the coefficients. A variance-covariance matrix tells you the variance (standard deviation squared) of each variable in a joint distribution, along with the covariance between each pair of variables.

In [6]:

```
vc <- vcov(fit1)
# rounding to 8 digits for legibility
round(vc,8)
```

| | a | b | sigma |
|---|---|---|---|
| a | 0.01527501 | -0.01527478 | -0.00000114 |
| b | -0.01527478 | 0.02882057 | 0.00000121 |
| sigma | -0.00000114 | 0.00000121 | 0.00358973 |

In the above variance-covariance matrix, the elements along the diagonal tell us that $Var(a)=0.01527501$, $Var(b)=0.02882057$, and $Var(\sigma)=0.00358973$ (these are just the standard deviations reported above, squared). But more interestingly, it tells us that `a` and `b` have a substantial negative *covariance*, $Cov(a,b)=-0.01527478$. This means that higher values for `a` tend to go along with lower values for `b`, and vice versa.
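One way to put that covariance on a more interpretable scale is to convert it to a correlation by dividing by the product of the two standard deviations. A quick sketch (assuming `fit1` from above):

```r
vc <- vcov(fit1)
# correlation = covariance / (sd of a * sd of b)
vc["a", "b"] / sqrt(vc["a", "a"] * vc["b", "b"])  # roughly -0.73 here
```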

We can look a little more closely at this distribution by taking a sample from it.

In [7]:

```
samp1 <- extract.samples(fit1,3000)
head(samp1)
cor(samp1)
```

| | a | b | sigma |
|---|---|---|---|
| | <dbl> | <dbl> | <dbl> |
| 1 | 10.49599 | -0.8778490 | 1.203274 |
| 2 | 10.63402 | -0.8984670 | 1.224987 |
| 3 | 10.35110 | -0.6601713 | 1.287654 |
| 4 | 10.58294 | -0.6143084 | 1.116734 |
| 5 | 10.63870 | -0.9638076 | 1.206363 |
| 6 | 10.38241 | -0.7968158 | 1.288667 |

| | a | b | sigma |
|---|---|---|---|
| a | 1.000000000 | -0.74172575 | 0.009667356 |
| b | -0.741725749 | 1.00000000 | -0.031414546 |
| sigma | 0.009667356 | -0.03141455 | 1.000000000 |

In [8]:

```
# plotting a data frame or list plots each pair of variables
plot(samp1,pch=16,cex=.8,col="#00000022")
# (`pch=16` changes the shape of each point to a solid circle,
# `cex=.8` makes each point 80% as big as it would normally be,
# and `col="#00000022"` makes the points 'translucent')
plot(samp1$a,samp1$b)
```
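Samples like these are useful for more than plotting: any quantity you compute row by row inherits a full posterior distribution. A short sketch (assuming `samp1` from above), summarizing the female-to-male income ratio implied by the model:

```r
# exp(b) is the multiplicative effect of `female` on income
ratio <- exp(samp1$b)
# posterior median and 95% credible interval for the ratio
quantile(ratio, probs = c(0.025, 0.5, 0.975))
```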