SOCI 620: Quantitative methods 2

Agenda

Parsimony
& overfitting

  1. Administrative
  2. Parsimony & Occam’s Razor
  3. Overfitting vs. underfitting
  4. Test & training data
  5. Information criteria
  6. Hands on: Comparing information criteria in R

Slides are licensed under CC BY-NC-SA 4.0

Parsimony
& Occam’s razor

A man in an early 20th century mailroom outfit holding up a creased piece of paper with a simple circle drawn in the center.

Occam’s Razor

How many buildings?

Occam’s Razor

M1:
Four buildings

M2:
Five buildings

A priori justification

Simpler models are easier to interpret or more compelling on their own

Model likelihood justification

Simpler models rely less on coincidence to produce specific data

Assessing fit

A man (David Byrne) standing in front of a blue background while wearing an absurdly large grey suit

Assessing fit

Assessing fit

Linear

Quadratic

A quadratic model seems like it might be a better fit.

But how can we measure that?

Assessing fit: deviance

Assessing fit: deviance

Deviance, D(θ),* is minus two times the log likelihood of the data y, given the model and a point estimate for the model parameters (θ):

D(θ) = −2 log p(y | θ)

* Note: a common definition of deviance requires a comparison to a ‘saturated’ model. For clarity, we use this simpler definition.
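The definition above can be made concrete with a short sketch. The hands-on session uses R; this is an illustrative Python version for a Normal model, with made-up data and a made-up residual standard deviation of 0.5:

```python
import math

def normal_loglik(y, mu, sigma):
    """Log likelihood of data y under a Normal(mu, sigma) model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (yi - mu) ** 2 / (2 * sigma ** 2) for yi in y)

def deviance(y, mu, sigma):
    """Deviance: minus two times the log likelihood at a point estimate."""
    return -2 * normal_loglik(y, mu, sigma)

y = [2.1, 1.9, 2.4, 2.0]          # hypothetical observations
mu_hat = sum(y) / len(y)          # ML estimate of the mean
print(deviance(y, mu_hat, 0.5))   # smaller deviance = better fit
```

Because deviance is −2 × log likelihood, the parameter value that maximizes the likelihood also minimizes the deviance.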

Assessing fit: deviance

Goodness of fit

Underfit

  • Predictions err in systematic ways
  • Misses meaningful patterns in the relationship between predictor(s) and outcome

Overfit

  • Takes random variation to be systematic
  • Predicts cases in the sample well, but tends to predict new data very poorly

Overfitting

Test and training data

Training data

Fit the model on a subset of the data (e.g. 50%)

Test data

Assess model fit on the held-out portion of the data
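The train/test idea can be sketched in a few lines. This is an illustrative Python example (the course's hands-on is in R) with simulated data and a simple least-squares line; all values here are invented for the demonstration:

```python
import random

random.seed(1)
# simulated data: a linear truth (intercept 1.0, slope 0.5) plus noise
x = [i / 10 for i in range(40)]
y = [1.0 + 0.5 * xi + random.gauss(0, 0.3) for xi in x]

# split the sample 50/50 into training and test sets
idx = list(range(len(x)))
random.shuffle(idx)
train, test = idx[:20], idx[20:]

def fit_ols(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / \
        sum((xi - mx) ** 2 for xi in xs)
    return my - b * mx, b

# fit on the training subset only
a, b = fit_ols([x[i] for i in train], [y[i] for i in train])

def mse(ids):
    """Mean squared prediction error over a set of observation indices."""
    return sum((y[i] - (a + b * x[i])) ** 2 for i in ids) / len(ids)

print("train MSE:", mse(train))
print("test MSE:", mse(test))   # held-out error is usually (not always) larger
```

An overfit model would shrink the training MSE while the test MSE grows; the gap between the two is the signature of overfitting.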

Akaike information criterion (AIC)

Interpretation 1

Penalize the deviance score by some ‘reasonable’ value for each added parameter.

Interpretation 2

Model the average difference in deviance between training and test data.

Assumptions:

  • Sample size ≫ number of parameters (k)
  • Posterior is approximately (multivariate) normal
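Interpretation 1 can be shown with arithmetic: AIC adds 2 to the deviance for every parameter. The deviance values below are hypothetical, chosen only to show that a more complex model can still win if its fit improves enough:

```python
# AIC = deviance at the ML/MAP estimate + 2 * (number of parameters)
def aic(deviance_hat, k):
    return deviance_hat + 2 * k

# hypothetical fits: linear (k = 3: intercept, slope, sigma)
# vs. quadratic (k = 4: one extra coefficient)
print(aic(100.0, 3))  # 100.0 + 6 = 106.0
print(aic(97.0, 4))   # 97.0 + 8 = 105.0: lower AIC despite the extra parameter
```

The quadratic model's 3-point improvement in deviance outweighs its 2-point penalty, so AIC prefers it.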

Information criteria

Criterion · Fit · Penalty

  • Akaike Information Criterion (AIC)
    Fit: deviance at the MAP/ML estimate (usually)
    Penalty: 2 × #parameters
  • “Bayesian” Information Criterion (BIC)
    Fit: deviance at the MAP/ML estimate
    Penalty: #parameters × log(#observations)
  • Deviance Information Criterion (DIC)
    Fit: deviance averaged across posterior
    Penalty: “effective” #parameters (posterior)
  • Widely Applicable Information Criterion (WAIC)
    Fit: deviance averaged across posterior and observations
    Penalty: “effective” #parameters (posterior & obs.)
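Because BIC's penalty grows with the sample size, AIC and BIC can disagree on the same pair of models. A sketch with hypothetical deviances (same fits as before, n = 100 observations assumed for illustration):

```python
import math

def aic(dev, k):
    """AIC: deviance plus 2 per parameter."""
    return dev + 2 * k

def bic(dev, k, n):
    """BIC: deviance plus log(n) per parameter."""
    return dev + k * math.log(n)

n = 100
for label, k, dev in [("linear", 3, 100.0), ("quadratic", 4, 97.0)]:
    print(label, "AIC:", aic(dev, k), "BIC:", round(bic(dev, k, n), 1))
# AIC prefers the quadratic model; BIC's heavier per-parameter
# penalty (log(100) ~ 4.6 > 2) tips the choice to the linear model
```

With large samples, BIC therefore tends to favor more parsimonious models than AIC does.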

Using information criteria

Strategy 1

Pick the model with the lowest value

WAIC(M1) = 209.0; WAIC(M2) = 208.1
→ M2 is the winner (lower is better)

Strategy 2

Report several models along with their information-criterion values

Multi-model table showing estimates for different combinations of coefficients, along with WAIC

Strategy 3

Average predictions across models

Simultaneous posterior predictions of new data from all models, weighted by WAIC
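One standard way to turn WAIC scores into averaging weights is the Akaike-weight formula, w_i ∝ exp(−ΔWAIC_i / 2), where ΔWAIC_i is each model's WAIC minus the best (lowest) WAIC. This Python sketch applies it to the WAIC values from the slide's example; the formula is the conventional one, not necessarily the exact scheme used in the hands-on session:

```python
import math

# WAIC values from the slide's example
waic = {"M1": 209.0, "M2": 208.1}

# Akaike-style weights: w_i proportional to exp(-0.5 * delta_i)
best = min(waic.values())
raw = {m: math.exp(-0.5 * (v - best)) for m, v in waic.items()}
total = sum(raw.values())
weights = {m: r / total for m, r in raw.items()}
print(weights)  # M2 gets the larger share of the averaging weight
```

Note that a 0.9-point WAIC gap yields weights of roughly 0.61 vs. 0.39, not all-or-nothing: model averaging keeps the runner-up's predictions in play.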

Building models

Considerations when building a model
(i.e. choosing covariates)

Theoretical relevance

  • Independent variables chosen to address theoretical concerns
  • E.g. test theoretical predictions, account for theorized connections

Causal inference

  • Independent variables chosen to make robust causal claims
  • Requires including confounders, omitting colliders, and thinking through the role of moderating and mediating variables

Predictive accuracy

  • Independent variables chosen to maximize predictive power
  • Gains accuracy of out-of-sample predictions, at the cost of
    interpreting models with many moving parts

Image credit

Figures by Peter McMahan (source code)

A man (David Byrne) standing in front of a blue background while wearing an absurdly large grey suit

David Byrne by Deborah Feingold