# this command tells jupyter what dimensions to use for its plots.
options(repr.plot.width=12, repr.plot.height=8)
To include a categorical variable in a regression, you most often need to construct a series of indicator variables for all but one of the categories contained in that variable. Many of the analytical functions in R (such as `lm()`, which performs an OLS linear regression) will do this conversion automatically, trying to pick a good reference category. But very often you will need to make indicator variables yourself. We'll look at ways to do that here.
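As a quick sketch of that automatic behavior (using a small made-up data frame, not our dataset), `lm()` converts a character or factor variable into indicator columns on its own, taking the first level in alphabetical order as the reference category:

```r
# toy data: a numeric outcome that differs across three made-up groups
set.seed(1)
toy <- data.frame(
  group  = rep(c("married", "single", "widowed"), each = 20),
  income = rnorm(60, mean = rep(c(50, 40, 45), each = 20), sd = 5)
)

# `lm()` builds indicators for "single" and "widowed" automatically;
# "married" (first alphabetically) becomes the reference category
fit <- lm(income ~ group, data = toy)
coef(fit)

# `relevel()` lets you choose a different reference category
toy$group <- relevel(factor(toy$group), ref = "single")
coef(lm(income ~ group, data = toy))
```

The coefficient names (`groupsingle`, `groupwidowed`) show which indicator columns `lm()` created behind the scenes.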
We will start with a slightly different dataset than last week. It is from the same database, but is a much smaller sample, from 2016 rather than 2017, and contains several additional variables.
d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2016.csv")
summary(d)
```
       id           family_size     children_in_hh      female      
 Min.   :   1545   Min.   : 1.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 807813   1st Qu.: 2.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1592909   Median : 2.000   Median :0.0000   Median :1.0000  
 Mean   :1598862   Mean   : 2.831   Mean   :0.6637   Mean   :0.5104  
 3rd Qu.:2392786   3rd Qu.: 4.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :3156468   Max.   :16.000   Max.   :7.0000   Max.   :1.0000  

      age         marital_status     times_married        race          
 Min.   :15.00   Length:3313        Min.   :0.0000   Length:3313       
 1st Qu.:32.00   Class :character   1st Qu.:0.0000   Class :character  
 Median :47.00   Mode  :character   Median :1.0000   Mode  :character  
 Mean   :47.68                      Mean   :0.9119                     
 3rd Qu.:61.00                      3rd Qu.:1.0000                     
 Max.   :96.00                      Max.   :3.0000                     

    hispanic        hispanic_origin      health_ins     health_ins_priv 
 Min.   :0.00000   Length:3313        Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.00000   Class :character   1st Qu.:1.0000   1st Qu.:0.0000  
 Median :0.00000   Mode  :character   Median :1.0000   Median :1.0000  
 Mean   :0.05192                      Mean   :0.9107   Mean   :0.7114  
 3rd Qu.:0.00000                      3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.00000                      Max.   :1.0000   Max.   :1.0000  

 health_ins_employer   in_school      ed_highschool      ed_college    
 Min.   :0.000       Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.000       1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
 Median :1.000       Median :0.0000   Median :1.0000   Median :0.0000  
 Mean   :0.569       Mean   :0.1087   Mean   :0.8952   Mean   :0.1155  
 3rd Qu.:1.000       3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.000       Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
                                      NA's   :49       NA's   :49      

     income         poverty          occ_income      log_income    
 Min.   :     4   Mode :logical   Min.   : 3.00   Min.   : 1.386  
 1st Qu.: 13000   FALSE:2911      1st Qu.:20.00   1st Qu.: 9.473  
 Median : 28000   TRUE :340       Median :25.00   Median :10.240  
 Mean   : 44506   NA's :62        Mean   :27.69   Mean   :10.106  
 3rd Qu.: 55000                   3rd Qu.:35.00   3rd Qu.:10.915  
 Max.   :747000                   Max.   :80.00   Max.   :13.524  
 NA's   :643                                                      
```
We will start with the construction of simple, binary indicators. These are variables that have only two categories: married versus not married; over 35 years old versus 35 years old and younger. You can construct these variables with simple logical tests in R.
# anyone with age > 35
# `d$age > 35` gives us a vector of TRUE and FALSE values
# `as.numeric()` converts that to a vector of 1s and 0s
d$over_35 <- as.numeric(d$age > 35)
# note: another way you may see this done is with the `ifelse()` function:
#d$over_35 <- ifelse(d$age > 35, 1, 0)
# anyone aged 20 to 30, inclusive (note the `&` combining two conditions)
d$age_20_30 <- as.numeric(d$age >= 20 & d$age <= 30)
# plot age against our new variable
# (the `jitter()` function adds some random noise to a variable,
# which can make variables with only a few discrete values
# easier to read on a plot)
plot(jitter(d$age),jitter(d$age_20_30,.3),pch=16,cex=.7,col='#00000033')
# compare to the same plot without jitter
plot(d$age,d$age_20_30,pch=16,cex=.7,col='#00000033')
Now we will construct a single `married` variable from the `marital_status` variable.
# what are the current categories in `marital_status`
table(d$marital_status)
```
               Divorced  Married, spouse absent Married, spouse present 
                    416                      90                    1576 
   Never married/single               Separated                 Widowed 
                    965                      67                     199 
```
# we want to find respondents who are "Married, spouse absent" or "Married, spouse present"
married_categories <- c("Married, spouse absent","Married, spouse present")
# the special `%in%` operator asks whether values in the first argument are
# present anywhere in the second argument
d$married <- as.numeric(d$marital_status %in% married_categories)
# cross-tabulate the two variables to make sure the recoding was right
table(d$marital_status, d$married)
```
                             0    1
  Divorced                 416    0
  Married, spouse absent     0   90
  Married, spouse present    0 1576
  Never married/single     965    0
  Separated                 67    0
  Widowed                  199    0
```
Very often, we want to keep all of the categories in a variable like `marital_status` for use in a regression. To do this, you need to first pick one reference category against which the others will be compared. You then construct indicator variables for each of the other categories.
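Base R can also generate this full set of indicators in one step with `model.matrix()`. A minimal sketch, using a small made-up status vector rather than our data:

```r
# a small made-up vector of categories
status <- c("single", "married", "divorced", "married", "single")

# model.matrix() builds one indicator column per category except the
# reference, which is the first factor level ("divorced" alphabetically here)
ind <- model.matrix(~ status)
ind

# drop the intercept column if you want only the indicators themselves
ind[, -1]
```

This produces exactly the same 0/1 columns you would build by hand, which can save typing when a variable has many categories.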
We will do this for `marital_status`. There is no one obvious choice for a reference category here, and it often depends on your research question. Are we interested in how things are distinctive for people who have never been married? Then "Never married/single" is probably a good reference category. Or are we concerned with how the job market is different for people in a cohabiting, "nuclear" family situation than it is for people in other living situations? In that case "Married, spouse present" makes more sense as a reference category. In general, it is usually a good idea for the reference category to be well represented in the data, so using "Separated" as the reference here (only 67 cases in our data) is probably not appropriate.
We will use "Never married/single" as our reference here.
# For a variable with only a few categories, it's often easiest to construct
# each indicator on its own:
d$m_divorced <- as.numeric(d$marital_status=='Divorced')
d$m_spouse_absent <- as.numeric(d$marital_status=='Married, spouse absent')
d$m_spouse_present <- as.numeric(d$marital_status=='Married, spouse present')
d$m_separated <- as.numeric(d$marital_status=='Separated')
d$m_widowed <- as.numeric(d$marital_status=='Widowed')
# use the `unique()` function to make sure we got this right
unique(d[c('marital_status','m_divorced','m_spouse_absent','m_spouse_present','m_separated','m_widowed')])
# how this works:
# first: `d[c('a','b')]` returns the data frame `d`, but only with columns named 'a' and 'b'
# second: `unique(d)` shows only rows in `d` that are unique (dropping repeated, identical rows)
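A minimal illustration of those two steps, with a small made-up data frame:

```r
# four rows, but only two distinct (a, b) combinations
toy <- data.frame(a = c(1, 1, 2, 2),
                  b = c("x", "x", "y", "y"),
                  c = 9:12)

toy[c("a", "b")]          # keep only columns 'a' and 'b'
unique(toy[c("a", "b")])  # drop the repeated rows; two rows remain
```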
|    | marital_status          | m_divorced | m_spouse_absent | m_spouse_present | m_separated | m_widowed |
|----|-------------------------|-----------:|----------------:|-----------------:|------------:|----------:|
| 1  | Married, spouse present | 0 | 0 | 1 | 0 | 0 |
| 2  | Never married/single    | 0 | 0 | 0 | 0 | 0 |
| 3  | Widowed                 | 0 | 0 | 0 | 0 | 1 |
| 5  | Separated               | 0 | 0 | 0 | 1 | 0 |
| 8  | Divorced                | 1 | 0 | 0 | 0 | 0 |
| 53 | Married, spouse absent  | 0 | 1 | 0 | 0 | 0 |
Now that we have indicator variables, it is easy to include them in a regression:
library(rethinking)
mymodel <- alist(
log_income ~ dnorm(mu,sigma),
mu <- a + b_35*over_35 + b_divorced*m_divorced + b_spouse_absent*m_spouse_absent +
b_spouse_present*m_spouse_present + b_separated*m_separated + b_widowed*m_widowed,
a ~ dnorm(0,20),
b_35 ~ dnorm(0,20),
b_divorced ~ dnorm(0,20),
b_spouse_absent ~ dnorm(0,20),
b_spouse_present ~ dnorm(0,20),
b_separated ~ dnorm(0,20),
b_widowed ~ dnorm(0,20),
sigma ~ dunif(0,25)
)
mfit <- quap(mymodel,data=d)
precis(mfit,prob=.95)
|                  |       mean |         sd |        2.5% |     97.5% |
|------------------|-----------:|-----------:|------------:|----------:|
| a                | 9.45602114 | 0.04030230 |  9.37703008 | 9.5350122 |
| b_35             | 0.42485766 | 0.05228474 |  0.32238145 | 0.5273339 |
| b_divorced       | 0.37668010 | 0.07475174 |  0.23016939 | 0.5231908 |
| b_spouse_absent  | 0.02111592 | 0.12883972 | -0.23140529 | 0.2736371 |
| b_spouse_present | 0.63178493 | 0.05501536 |  0.52395681 | 0.7396130 |
| b_separated      | 0.35985627 | 0.14801948 |  0.06974342 | 0.6499691 |
| b_widowed        | 0.02170162 | 0.09710952 | -0.16862954 | 0.2120328 |
| sigma            | 1.15573581 | 0.01419809 |  1.12790807 | 1.1835636 |
Often, visualizing a table like this can help. `rethinking` has a built-in 'method' for plotting the output of the `precis()` function. It produces a forest plot, a very common way of representing estimates in Bayesian analysis:
plot(
precis(
mfit,prob=.95,
pars = c('b_35','b_divorced','b_spouse_absent','b_spouse_present','b_separated','b_widowed')
)
)
Standardizing variables is often vital for regression analysis. Standardized variables have zero mean and unit standard deviation, which can help immensely in the interpretation of regression coefficients.
It is simple to standardize a variable: subtract the variable's mean, then divide the result by the variable's standard deviation.
d$age.s <- (d$age - mean(d$age))/sd(d$age)
plot(density(d$age.s))
R has a built-in function, `scale()`, that will do this for you.
d$occ_income.s <- scale(d$occ_income)
plot(density(na.omit(d$occ_income.s)))
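One caveat worth knowing about `scale()`: it returns a one-column matrix rather than a plain numeric vector, which can occasionally confuse functions that expect a vector. Wrapping the result in `as.numeric()` gives an ordinary vector. A quick sketch with made-up values:

```r
x <- c(10, 20, 30, 40)

# scale() returns a one-column matrix (with centering/scaling attributes)
z <- scale(x)
is.matrix(z)

# as.numeric() strips that down to a plain vector
z_vec <- as.numeric(scale(x))

# either way, the standardized values have mean 0 and sd 1
mean(z_vec)   # 0 (up to floating-point error)
sd(z_vec)     # 1
```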