---
title: "SOCI 620: Lab 5"
author: "Peter McMahan"
date: "1/24/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```
# Creating indicator (dummy) variables
To include a categorical variable in a regression, you most often need to construct a series of indicator variables for all but one of the categories contained in that variable. Many of the analytical functions in R (such as `lm()`, which performs an OLS linear regression) will do this conversion automatically, trying to pick a sensible reference category. But very often you will need to make indicator variables yourself. We'll look at ways to do that here.
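For example, `lm()` converts a factor to indicator variables behind the scenes, using the first factor level as the reference. A minimal sketch with made-up data (the `toy` data frame and its columns are hypothetical, not the lab dataset):

```{r lm_factor_sketch}
# hypothetical toy data: an outcome and a categorical predictor
toy <- data.frame(
  income = c(1, 2, 3, 4, 5, 6),
  status = c("single", "married", "single", "widowed", "married", "single")
)
# lm() silently converts `status` to indicators; the first factor level
# (alphabetically, "married") becomes the reference category
fit <- lm(income ~ status, data = toy)
names(coef(fit))
# "(Intercept)" "statussingle" "statuswidowed"
```

Note that no `statusmarried` coefficient appears: the reference category is absorbed into the intercept.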
We will start with a slightly different dataset from last week's. It comes from the same database, but it is a much smaller sample, drawn from 2016 rather than 2017, and it contains several additional variables.
```{r get_data}
d <- read.csv("https://soci620.netlify.com/data/USincome_subset_2016.csv")
summary(d)
```
## Binary categories
We will start with the construction of simple, binary indicators. These are variables that have only two categories: married versus not married; over 35 years old versus 35 years old and younger. You can construct these variables with simple logical tests in R.
```{r binary}
# anyone with age > 35
# `d$age > 35` gives us a vector of TRUE and FALSE values
# `as.numeric()` converts that to a vector of 1s and 0s
d$over_35 <- as.numeric(d$age > 35)
# note: another way you may see this done is with the `ifelse()` function:
d$over_35 <- ifelse(d$age > 35, 1, 0)
# plot age against our new variable
# (the `jitter()` function adds some random noise to a variable,
# which can make variables with only a few, discrete values
# easier to read on a plot)
plot(d$age, jitter(d$over_35,.3),
     pch=16,cex=.7,col='#00000033')
abline(v=35)
# compare to the same plot without jitter
plot(d$age,d$over_35,
     pch=16,cex=.7,col='#00000033')
```
Now we will construct a single `married` variable from the `marital_status` variable.
```{r married_examine}
# what are the current categories in `marital_status`?
# `table()` gives the raw count for each category
table(d$marital_status)
```
```{r build_married_binary}
# we want to find respondents who are "Married, spouse absent" or "Married, spouse present"
married_categories <- c("Married, spouse absent","Married, spouse present")
# the special `%in%` operator asks whether values in the first argument are
# present anywhere in the second argument
d$married <- as.numeric(d$marital_status %in% married_categories)
# cross-tabulate the two variables to make sure we got it right
table(d$marital_status, d$married)
```
## Variables with more than two categories
Very often, we want to keep all of the categories in a variable like `marital_status` for use in a regression. To do this, you need to first pick one *reference category* against which the others will be compared. You then construct indicator variables for each of the other categories.
We will do this for `marital_status`. There is no one obvious choice for a reference category here, and it often depends on your research question. Are we interested in how things are distinctive for people who have never been married? Then "Never married/single" is probably a good reference category. Or are we concerned with how the job market is different for people in a cohabitating, "nuclear" family situation than it is for people in other living situations? In that case "Married, spouse present" makes more sense as a reference category. In general, it is usually a good idea for the reference category to be well represented in the data, so using "Separated" as the reference here (only 67 cases in our data) is probably not appropriate.
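As an aside: if you let R build the indicators itself (as `lm()` does with factors), you control the reference category by reordering the factor's levels with `relevel()`. A sketch with a short hypothetical vector (not the lab data):

```{r relevel_sketch}
# hypothetical vector of marital statuses
status <- factor(c("Divorced", "Never married/single", "Separated",
                   "Married, spouse present"))
# by default the reference is the first level, which is alphabetical
levels(status)[1]
# relevel() moves the chosen category to the front of the levels,
# making it the reference in model-fitting functions
status <- relevel(status, ref = "Never married/single")
levels(status)[1]
```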
We will use "Never married/single" as our reference here.
```{r build_married_cat}
# For a variable with only a few categories, it's often easiest to construct
# each indicator on its own:
d$m_divorced <- as.numeric(d$marital_status=='Divorced')
d$m_spouse_absent <- as.numeric(d$marital_status=='Married, spouse absent')
d$m_spouse_present <- as.numeric(d$marital_status=='Married, spouse present')
d$m_separated <- as.numeric(d$marital_status=='Separated')
d$m_widowed <- as.numeric(d$marital_status=='Widowed')
```
```{r check_married_cat}
# use the `unique()` function to make sure we got this right
unique(d[c('marital_status','m_divorced','m_spouse_absent','m_spouse_present','m_separated','m_widowed')])
# how this works:
# first: `d[c('a','b')]` returns the data frame `d`, but only with columns named 'a' and 'b'
# second: `unique(d)` shows only rows in `d` that are unique (dropping repeated, identical rows)
```
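For categorical variables with many levels, building each indicator by hand gets tedious. Base R's `model.matrix()` can generate the full set of indicators in one call. A sketch with a toy factor (with the lab data you would pass `~ marital_status` and the data frame `d`):

```{r model_matrix_sketch}
# toy factor with three categories
status <- factor(c("single", "married", "widowed", "single"))
# model.matrix() returns an intercept column plus one indicator per
# non-reference category ("married", the first level, is the reference)
mm <- model.matrix(~ status)
colnames(mm)
```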
## Including these in a regression
Now that we have indicator variables, it is easy to include them in a regression:
```{r regression}
model1 <- alist(
  log_income ~ dnorm(mu,sigma),
  mu <- a + b_35*over_35 + b_divorced*m_divorced + b_spouse_absent*m_spouse_absent +
    b_spouse_present*m_spouse_present + b_separated*m_separated + b_widowed*m_widowed,
  a ~ dnorm(0,20),
  b_35 ~ dnorm(0,20),
  b_divorced ~ dnorm(0,20),
  b_spouse_absent ~ dnorm(0,20),
  b_spouse_present ~ dnorm(0,20),
  b_separated ~ dnorm(0,20),
  b_widowed ~ dnorm(0,20),
  sigma ~ dunif(0,25)
)
fit1 <- quap(model1,data=d)
precis(fit1,prob=.95)
```
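One way to read this table: with dummy coding, the intercept `a` is the expected `log_income` for the reference group (never-married respondents aged 35 and under), and each `b_` coefficient is a category's shift away from that reference. The same arithmetic can be sketched in base R with made-up data (using `lm()` rather than `quap()`, but the coefficient logic is identical):

```{r dummy_coef_sketch}
# toy data: outcome shifted up by 0.5 for the indicator group
set.seed(1)
g <- rep(c(0, 1), each = 50)
y <- 1 + 0.5 * g + rnorm(100, sd = 0.1)
fit <- lm(y ~ g)
# with a single dummy, the intercept equals the reference-group mean...
c(coef(fit)[1], mean(y[g == 0]))
# ...and intercept + coefficient equals the indicator-group mean
c(sum(coef(fit)), mean(y[g == 1]))
```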
Often, visualizing a table like this makes the coefficients easier to compare.
```{r visualize_output}
# `precis()` creates a special kind of object
class(precis(fit1))
plot(
  precis(fit1, prob = .95)
)
# or, to tell it which coefficients you want, give precis the `pars` argument
plot(
  precis(
    fit1, prob = .95,
    pars = c('b_35','b_divorced','b_spouse_absent','b_spouse_present','b_separated','b_widowed')
  )
)
```
# Standardizing variables
Standardizing variables is often vital for regression analysis. Standardized variables have zero mean and unit standard deviation, which can help immensely in the interpretation of regression coefficients.
Standardizing a variable is simple: subtract the variable's mean, then divide the result by the variable's standard deviation.
```{r standardize1}
d$age.s <- (d$age - mean(d$age))/sd(d$age)
plot(density(d$age.s))
```
R has a built-in function `scale()` that will do this for you. (Note that `scale()` returns a one-column matrix rather than a plain vector; wrap it in `as.numeric()` if you need a vector.)
```{r standardize2}
d$occ_income.s <- scale(d$occ_income)
# the na.omit() function returns its argument with missing
# data removed. Use with care!
plot(density(na.omit(d$occ_income.s)))
```
Side note: if you hate nesting many functions as above, R also has a "pipe" operator (`|>`) that some people find more intuitive. The line below takes `d$occ_income`, "pipes" it through the `na.omit()` function, "pipes" the result of that through the `density()` function, and then "pipes" the final result through the `plot()` function. This is exactly equivalent to the plot call in the previous code chunk.
```{r standardize2_pipes}
d$occ_income |> na.omit() |> density() |> plot()
```
Let's try a regression with these standardized variables. But first, there's a problem! The `occ_income.s` variable has missing values (`NA`), and `quap()` will complain and refuse to give us an answer. We'll discuss the intricacies of missing data later, but for now we will just ignore any respondent with missing `occ_income.s`. The `complete.cases()` function is great for this. If you give it a data frame, it will return a logical (TRUE/FALSE) vector with value `FALSE` for any row with missing data. You can then use that vector to subset your data to include only complete observations.
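To see exactly what `complete.cases()` returns before applying it to the lab data, here is a minimal sketch on a toy data frame:

```{r complete_cases_sketch}
# toy data frame with one missing value in `x`
toy <- data.frame(x = c(1, 2, NA), y = c(10, 20, 30))
complete.cases(toy)
# TRUE TRUE FALSE
# subsetting with the logical vector drops the incomplete row
toy[complete.cases(toy), ]
```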
```{r regression2}
# occ_income.s has missing values, so we first find
# rows with complete cases. We only ask about the variables
# we are using, because missing data in other variables
# won't affect this regression
comp <- complete.cases(d[c('log_income','age.s','occ_income.s')])
# define the model with an `alist`
model2 <- alist(
  log_income ~ dnorm(mu,sigma),
  mu <- a + b_a*age.s + b_o*occ_income.s + b_ao*age.s*occ_income.s,
  a ~ dnorm(0,10),
  b_a ~ dnorm(0,10),
  b_o ~ dnorm(0,10),
  b_ao ~ dnorm(0,10),
  sigma ~ dunif(0,20)
)
# now we fit a model, using `data = d[comp,]` to only use complete rows
fit2 <- quap(model2, data = d[comp,])
precis(fit2, prob=.95)
```