Creating indicator (dummy) variables

To include a categorical variable in a regression, you usually need to construct a series of indicator variables, one for each category in that variable except one. Many of the analytical functions in R (such as lm(), which performs an OLS linear regression) will do this conversion automatically when the variable is stored as a factor, using the first factor level (alphabetically first by default) as the reference category. But very often you will need to make indicator variables yourself. We'll look at ways to do that here.
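As a rough illustration of the automatic conversion, the sketch below uses a tiny made-up data frame. The three marital_status labels are taken from this tutorial; the income column and all values are hypothetical. model.matrix() shows the indicator columns that lm() would build from a factor.

```r
# Toy data: the marital_status labels come from the text,
# the income values are invented for illustration.
d <- data.frame(
  marital_status = c("Never married/single", "Married, spouse present",
                     "Separated", "Married, spouse present",
                     "Never married/single", "Separated"),
  income = c(30, 52, 41, 60, 28, 39)
)

# Storing the variable as a factor lets R build the indicators itself.
d$marital_status <- factor(d$marital_status)

# By default the first level (alphabetical order) is the reference.
levels(d$marital_status)

# relevel() sets the reference category explicitly.
d$marital_status <- relevel(d$marital_status, ref = "Never married/single")

# model.matrix() returns the design matrix lm() would use:
# an intercept plus one indicator column per non-reference category.
X <- model.matrix(~ marital_status, data = d)
colnames(X)
```

The reference category has no column of its own; it is represented by the rows where every indicator is zero.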

We will start with a slightly different dataset than last week. It is from the same database, but is a much smaller sample, from 2016 rather than 2017, and contains several additional variables.

Binary categories

We will start with the construction of simple, binary indicators. These are variables that have only two categories: married versus not married; over 35 years old versus 35 years old and younger. You can construct these variables with simple logical tests in R.
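A minimal sketch of such a logical test, using the "over 35" example. The age column and its values are hypothetical stand-ins, not the tutorial's data.

```r
# Hypothetical ages for illustration.
d <- data.frame(age = c(22, 35, 36, 58, 41))

# The comparison returns TRUE/FALSE for each row;
# as.integer() turns that into a 1/0 indicator.
d$over_35 <- as.integer(d$age > 35)

d$over_35
# 0 0 1 1 1
```

Note that a person who is exactly 35 gets a 0, matching the "35 years old and younger" group.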

Now we will construct a single married variable from the marital_status variable.
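One way this could look is sketched below. Which categories count as "married" is an assumption here: "Married, spouse present" appears in this tutorial, while "Married, spouse absent" and "Divorced" are hypothetical labels added so the example has something to collapse.

```r
# Toy data; only some of these labels are confirmed by the text.
d <- data.frame(
  marital_status = c("Never married/single", "Married, spouse present",
                     "Separated", "Married, spouse absent", "Divorced")
)

# Assumed set of categories that should count as married.
married_labels <- c("Married, spouse present", "Married, spouse absent")

# %in% tests membership row by row; as.integer() gives a 1/0 indicator.
d$married <- as.integer(d$marital_status %in% married_labels)

# Cross-tabulate to check that each category landed where intended.
table(d$marital_status, d$married)
```

The cross-tabulation is a cheap check worth running whenever you recode a variable. Be aware that %in% returns FALSE for missing values, so rows with an NA status would be coded 0 rather than NA.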

Variables with more than two categories

Very often, we want to keep all of the categories in a variable like marital_status for use in a regression. To do this, you need to first pick one reference category against which the others will be compared. You then construct indicator variables for each of the other categories.

We will do this for marital_status. There is no single obvious choice for a reference category here, and the right one often depends on your research question. Are we interested in how things are distinctive for people who have never been married? Then "Never married/single" is probably a good reference category. Or are we concerned with how the job market is different for people in a cohabiting, "nuclear" family situation than it is for people in other living situations? In that case "Married, spouse present" makes more sense as a reference category. It is usually a good idea for the reference category to be well represented in the data, so using "Separated" as the reference here (only 67 cases in our data) is probably not appropriate.

We will use "Never married/single" as our reference here.
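A sketch of that construction follows, on toy data restricted to the three categories named in this tutorial. The indicator names married_present and separated are hypothetical, and the real variable will likely have more categories, each needing its own indicator.

```r
# Toy data using only category labels that appear in the text.
d <- data.frame(
  marital_status = c("Never married/single", "Married, spouse present",
                     "Separated", "Married, spouse present",
                     "Never married/single")
)

# One indicator per non-reference category.
# "Never married/single" gets no indicator of its own.
d$married_present <- as.integer(d$marital_status == "Married, spouse present")
d$separated       <- as.integer(d$marital_status == "Separated")

d
```

Rows in the reference category are the ones where every indicator equals zero.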

Including these in a regression

Now that we have indicator variables, it is easy to include them in a regression:
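The sketch below uses lm() on simulated data purely to show the mechanics; the tutorial's own model may be fit with a different function. The outcome log_income, the indicator names, and the simulated effect sizes are all hypothetical.

```r
set.seed(1)
n <- 200

# Simulate a three-category status variable (labels from the text).
status <- sample(
  c("Never married/single", "Married, spouse present", "Separated"),
  n, replace = TRUE, prob = c(0.45, 0.45, 0.10)
)

# Indicators for the two non-reference categories.
d <- data.frame(
  married_present = as.integer(status == "Married, spouse present"),
  separated       = as.integer(status == "Separated")
)

# Hypothetical outcome with made-up group differences.
d$log_income <- 10 + 0.3 * d$married_present - 0.1 * d$separated +
  rnorm(n, sd = 0.5)

# The indicators enter the formula like any other predictor.
fit <- lm(log_income ~ married_present + separated, data = d)
coef(fit)
```

With a full set of indicators and no other predictors, the intercept is the mean outcome in the reference category, and each coefficient is the difference between that category's mean and the reference mean.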

Often, visualizing a table like this can help. The rethinking package has a built-in plot method for the output of the precis() function. It produces a forest plot, a very common way of representing estimates in Bayesian analysis:
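A hedged sketch of what this might look like is given below. It assumes the model is fit with quap() from rethinking (an assumption; the tutorial's own fitting call is not shown here), and the data, variable names, and priors are hypothetical. The rethinking code is guarded so that the block still runs on a machine where the package is not installed.

```r
set.seed(2)
n <- 200

# Hypothetical data: one binary indicator and a simulated outcome.
d <- data.frame(married = rbinom(n, 1, 0.5))
d$log_income <- 10 + 0.3 * d$married + rnorm(n, sd = 0.5)

# Only run the rethinking part if the package is available.
if (requireNamespace("rethinking", quietly = TRUE)) {
  library(rethinking)

  # Quadratic approximation of the posterior; priors are illustrative.
  m <- quap(
    alist(
      log_income ~ dnorm(mu, sigma),
      mu <- a + b_married * married,
      a ~ dnorm(10, 5),
      b_married ~ dnorm(0, 1),
      sigma ~ dexp(1)
    ),
    data = d
  )

  # precis() summarizes the estimates; plot() draws them as a forest plot.
  plot(precis(m))
}
```

Each row of the plot shows one parameter's point estimate with its interval, which makes it easy to see at a glance which intervals overlap zero.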

Standardizing variables

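Standardizing variables is often vital for regression analysis. Standardized variables have a mean of zero and a standard deviation of one, so the coefficient on a standardized predictor is the expected change in the outcome for a one-standard-deviation change in that predictor. This can help immensely in interpreting and comparing regression coefficients.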

It is simple to standardize a variable: subtract the variable's mean from each value and divide the result by the variable's standard deviation.
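Done by hand, that calculation might look like the following sketch. The age values are hypothetical.

```r
# Hypothetical values to standardize.
age <- c(22, 35, 36, 58, 41)

# Subtract the mean, then divide by the standard deviation.
age_z <- (age - mean(age)) / sd(age)

# The result has mean 0 (up to rounding error) and standard deviation 1.
mean(age_z)
sd(age_z)
```

If the variable contains missing values, mean() and sd() need na.rm = TRUE, otherwise the whole result will be NA.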

R has a built-in function scale() that will do this for you.
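A short sketch of scale() on the same kind of hypothetical values:

```r
# Hypothetical values to standardize.
age <- c(22, 35, 36, 58, 41)

# scale() centers and rescales in one step. It returns a one-column
# matrix and stores the mean and sd it used as attributes.
age_scaled <- scale(age)
attr(age_scaled, "scaled:center")  # the mean that was subtracted
attr(age_scaled, "scaled:scale")   # the sd that was divided by

# as.numeric() drops the matrix structure so the result can be stored
# as an ordinary column in a data frame.
age_z <- as.numeric(age_scaled)
```

The stored attributes are handy later if you want to convert a standardized estimate back to the original units.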