Estimating a model of normally distributed log income

In this notebook, we will model log annual income in the United States using a simple Gaussian (normal) distribution. Because the model has two parameters, it will be easier to use a maximum a posteriori estimate of the posterior distribution than to use the grid approximation as we have been doing.

We will be using data from the US Census' public-use microdata sample, a 1% representative sample of the US population, which was cleaned and organized by IPUMS-USA, University of Minnesota (www.ipums.org):

Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, and Matthew Sobek. IPUMS USA: Version 8.0 [dataset]. Minneapolis, MN: IPUMS, 2018. https://doi.org/10.18128/D010.V8.0

I have prepared a subsample of the data with only a few variables. There are just over 35,000 observations in the dataset. We begin by loading the data and looking at the first few rows.
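A minimal sketch of that step (the file name census_subsample.csv and the use of read.csv are assumptions; use whatever file accompanies this lab):

```r
# Load the prepared IPUMS subsample and inspect the first few rows.
d <- read.csv("census_subsample.csv")
head(d)
```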

Normally, we would want to use all the data we have, which, with 35,000 observations, would give us very precise estimates. Instead, we will use only 200 observations in this lab so that the uncertainty in the model is easier to see.
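The subsampling step might look like this (a sketch; the seed is arbitrary and only makes the draw reproducible):

```r
# Keep a random subsample of 200 observations so that posterior
# uncertainty is large enough to see in the plots.
set.seed(42)
d <- d[sample(nrow(d), 200), ]
```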

For now we will be ignoring all the variables except for log_income. Let's plot the density of log income so we know what we are looking at.
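One way to draw the plot, using the dens() helper from the rethinking package (base R's plot(density(...)) would work just as well):

```r
library(rethinking)

# Density plot of log income in the 200-observation subsample.
dens(d$log_income)
```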

Although a normal approximation of the data is not perfect, it does a pretty good job. It's worth noting that our data is skewed slightly to the left.

Building a model

The rethinking package has a very nice syntax for building the kinds of models we will be using in this class. Recall from the slides that our model is specified like this:

$$
\begin{aligned}
y_i &\sim \mathrm{Norm}(\mu, \sigma) \\
\mu &\sim \mathrm{Norm}(0, 30) \\
\sigma &\sim \mathrm{Unif}(0, 50)
\end{aligned}
$$

The rethinking package allows us to copy this notation almost exactly into our R code. We will start by defining the model:
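In code, that model might look like the following (a sketch; the name income_model is my choice, and log_income is the column name in our data frame):

```r
# Each line of the alist mirrors one line of the formal model.
income_model <- alist(
  log_income ~ dnorm(mu, sigma),  # y_i ~ Norm(mu, sigma)
  mu ~ dnorm(0, 30),              # prior on mu
  sigma ~ dunif(0, 50)            # prior on sigma
)
```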

Models are specified by building an alist with the different relationships defined in our model. (An alist is just a special kind of list in R that does not try to calculate anything from the arguments you give it.) For our data, which we called $y_i$ in the formal model specification, we use the name of the variable in our data frame. The program is smart enough to figure out that mu and sigma are not part of the data and so must be parameters. Finally, notice that the distributions are specified using their "d" form: dnorm, dunif, etc.

Once we have the model built, it is simple to ask R to calculate the maximum a posteriori estimate given the data:
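Something like this (fit_map is the name this notebook uses for the fitted model later on):

```r
# Fit the model by quadratic approximation of the posterior.
fit_map <- quap(income_model, data = d)
```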

That was easy! quap() is a very fast function for estimating simple models, but it unfortunately will not work for some of the more complex models we will be using, because it makes some pretty strong assumptions about the posterior distribution. MAP/QUAP works by searching over the unnormalized posterior of the model until it finds what it thinks is the highest point on that posterior (this is why it is called "maximum a posteriori"). It then looks at how pointy the posterior is at that spot, and uses that curvature to approximate the posterior with a simple normal distribution (whose log-density is a parabola, hence "quadratic approximation").

This kind of MAP procedure can fail for a few reasons. One common failure mode is a posterior that is not well approximated by a normal distribution, in particular one that is not symmetric; think of the heavily right-skewed posteriors for the unemployment rate. MAP will also not work if the posterior is multimodal (has more than one mode). This is not very common, but it does happen with some complex models, and when it does, MAP estimation fails spectacularly.

But our model is very simple, and, as we saw in the slides, the posterior looks very much like a normal distribution. Still, we haven't actually seen what the model tells us yet. To do that, we can use the precis function (the prob argument tells it what size of credible interval to calculate).
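For a 95% credible interval:

```r
# Summarize the posterior; prob sets the width of the credible interval.
precis(fit_map, prob = 0.95)
```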

What is this output? It can be read in a similar way to the output from a standard regression. We have two parameters that we are trying to learn about: mu and sigma. The precis function finds the posterior mean of each of these, and estimates the standard deviation based on a normal approximation. It then calculates a credible interval of size prob (because we used quap() to fit the model, the credible interval here is the percentile range).

So this is telling us that a good guess for the value of $\mu$ is 10.28, and that we are pretty certain that we're not far off. We're 95% sure that $\mu$ lies somewhere in the interval [10.13,10.44]. Similarly, we can guess that $\sigma$ is between 1.00 and 1.22, with 1.11 making a good single estimate. Note: because we chose the subsample of the data randomly, your estimates will be different.

Let's see how a normal curve built from those point estimates fits the data:
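A sketch of the overlay (coef() returns the MAP estimates of mu and sigma from a quap fit):

```r
# Empirical density with the best-guess normal curve on top.
est <- coef(fit_map)
dens(d$log_income)
curve(dnorm(x, est["mu"], est["sigma"]), add = TRUE, lwd = 2)
```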

Again we see that a Gaussian model isn't a perfect fit for our data, but it comes close enough that we can probably trust its estimates for the mean and standard deviation of the population data.

Using random samples to talk about uncertainty

One useful way to quantify our uncertainty is to draw random samples from the posterior distribution. This is straightforward using the fitted model stored in fit_map.
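For example, using rethinking's extract.samples():

```r
# Draw 10,000 samples from the (approximate) posterior.
post <- extract.samples(fit_map, n = 1e4)
head(post)  # one column per parameter: mu and sigma
```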

We can use these samples to recreate the credible intervals reported by precis(). Note, however, that the two methods use different approximate procedures to calculate the intervals. precis approximates the posterior with a Gaussian distribution and then calculates exact percentiles for that approximation. Here, we take random samples from the (also approximate) posterior and use those to estimate the percentiles. The two methods should produce very similar results.
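A sketch using rethinking's PI() helper, with the same 95% width as the precis() call above:

```r
# Percentile intervals computed directly from the posterior samples.
PI(post$mu, prob = 0.95)
PI(post$sigma, prob = 0.95)
```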

How can we visualize the uncertainty in this model? One way is to reproduce the previous figure (comparing our empirical distribution to the best-guess normal approximation), but using a collection of samples of $\mu$ and $\sigma$ from our posterior instead of the mean $\mu$ and $\sigma$.

One trick in the code below is setting the color of the lines built from the samples using col='#00000020'. This looks cryptic, but the last two hex digits set the opacity, telling R to make the lines translucent. That way, when the lines stack on top of each other, it is easier to see where many of them fall in the same spot.
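A sketch of that figure (the choice of 100 samples is arbitrary):

```r
# Empirical density, overlaid with normal curves drawn from 100
# posterior samples. The 8-digit hex color is black with a low alpha,
# so overlapping lines darken where many samples agree.
dens(d$log_income)
for (i in 1:100) {
  curve(dnorm(x, post$mu[i], post$sigma[i]), add = TRUE, col = '#00000020')
}
```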