---
title: "Assignment 2, SOCI 620, Winter 2022"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```
_Due Tues, Feb 1_
In this assignment, you will be looking at income inequality in a cross-national perspective. The data for the assignment is available from .
0. Load the data into R using the following commands. This data has a number of missing values, which we will ignore for the moment. The second line in the code below will remove any rows with missing values in the `latest_gini` or `colonized` columns. (A description of the variables and the data sources are listed at the bottom of this document).
```{r data_load}
d <- read.csv('https://soci620.netlify.app/data/dev_and_col.csv')
d <- d[!is.na(d$latest_gini) & !is.na(d$colonized),]
```
1. The Gini index is a measure of resource inequality in a population. A value of 0 would mean that each individual in the population has exactly the same amount of that resource, while a population in which one person had all of the resources would have a Gini value close to 100 (a Gini index of exactly 100 is impossible in a finite population). The variable `latest_gini` is the most recent World Bank's estimate the country-level inequality in income.
Create a density plot for `latest_gini` and describe what you observe.
```{r q1}
# (your code here)
gini_dens <- density(d$latest_gini)
plot(gini_dens)
```
There is quite a bit of variation in the Gini index, which ranges from about 20 to about 65 or 70. The bulk of the countries have a Gini index between about 30 and about 45.
2. Build a Gaussian model of income inequality that will estimate the average Gini index and the standard deviation in the Gini index across countries. Write out the full model, including all priors and any stochastic or deterministic relationships.
Then, estimate the model using the `quap()` function from the `rethinking` package. Describe the posterior distributions of the mean and standard deviation, including the mean of the marginal posteriors and 90% credible intervals. What do these mean in plain language?
$$
\begin{aligned}
G_i &\sim Norm(\mu,\sigma) \\
\mu &\sim Norm(50,20) \\
\sigma &\sim Unif(0,30)
\end{aligned}
$$
```{r q2}
# define the model
model1 <- alist(
latest_gini ~ dnorm(mu,sigma),
mu ~ dnorm(50,20),
sigma ~ dunif(0,30)
)
# fit with quap
fit1 <- quap(model1,data=d,start = startvals)
# print marginal distributions
# (use the `pander` package to make nicer output)
library(pander)
suppressWarnings(pander(precis(fit1,prob=0.9)))
```
The cross-national average Gini index is about 38.4, but could be as low as 37.4 or as high as 39.5 (90% credible interval). There is significant variation in this number across countries, with a posterior mean of the standard deviation of 7.9. The 90% credible interval on this standard deviation is [7.19, 8.63].
3. The variable `colonized` contains an indicator of whether the country "was a dependency ruled by a foreign power before achieving independence" (quote from original codebook of Hensel, 2018). We want to see if there is a systematic difference in income inequality between former colonies and countries that were never colonized.
Adapt your model from the last question to create a regression model allowing for different average inequalities for colonized versus never-colonized nations. Again, write out the full model, including all priors and any stochastic or deterministic relationships.
Estimate this model using `quap()`. Describe the posterior distributions the regression coefficients, including the mean of the marginal posteriors and 90% credible intervals. What do these mean in plain language?
$$
\begin{aligned}
G_i &\sim Norm(\mu_i,\sigma) \\
\mu_i &= \beta_0 + \beta_1 C_i \\
\beta_0 &\sim Norm(50,20) \\
\beta_1 &\sim Norm(0,10) \\
\sigma &\sim Unif(0,30)
\end{aligned}
$$
```{r q3}
# define the model
model2 <- alist(
latest_gini ~ dnorm(mu,sigma),
mu <- b0 + b1*colonized,
b0 ~ dnorm(50,20),
b1 ~ dnorm(0,10),
sigma ~ dunif(0,30)
)
# fit with quap
fit2 <- quap(model2,data=d)
# print marginal distributions
suppressWarnings(pander(precis(fit2,prob=0.9)))
```
The predicted Gini index for non-colonized countries is about 34.84, with a 90% credible interval of [33.28, 36.40]. Colonized countries have on average a Gini index 5.70 points higher than non-colonized countries, though that difference could be as low as 3.74 or as high as 7.66.
4. Use the `extract.samples()` function to draw a few thousand samples from the joint posterior of the regression model you just estimated. Use the sample to recreate the posterior mean and 90% credible intervals from the previous question. Do they match up?
Use the same sample to create a single figure comparing the posterior density of average inequality in for the two types of countries (former colonies and non-colonized countries). Do these posterior densities tell the same story as the results from the previous question?
```{r q3a}
samp2 <- extract.samples(fit2,n=8000)
# you can use `sapply()` to "apply" a function to
# every column in a data frame
post_mean <- sapply(samp2, mean)
post_pi <- sapply(samp2, quantile, probs=c(.05,.95))
pander(post_mean)
pander(post_pi)
```
The mean and credible intervals are substantially the same.
```{r q3b}
dens_noncol <- density(samp2$b0)
dens_col <- density(samp2$b0 + samp2$b1)
# get reasonable limits
xlim <- range(dens_noncol$x, dens_col$x)
ylim <- range(dens_noncol$y, dens_col$y)
# plot noncolonized and set up plot region
plot(dens_noncol,
xlim=xlim, ylim=ylim,
main = "", xlab="Mean Gini index", ylab="Posterior density")
# add colonized
lines(dens_col,lty=2)
# and a legend for good measure
legend('topleft',lty=1:2,legend=c('non-colonized','colonized'))
```
5. Create another regression model using a variable of your choice from the data set (see variable descriptions below). The new model should include both `colonized` and your new choice of variable as predictors. Estimate the model using `quap()`, and describe the marginal posterior distributions of your parameters. What does this expanded model tell you that the previous model did not?
_Note: Several of the variables in the data set have missing values. You may need to exclude rows from the dataset that have a missing value (`NA`) in the covariate you choose. See step (0) above for one way to do this._
We will add a covariate for (normalized) per-capita GDP ($P_i$):
$$
\begin{aligned}
G_i &\sim Norm(\mu_i,\sigma) \\
\mu_i &= \beta_0 + \beta_1 C_i + \beta_2*P_i \\
\beta_0 &\sim Norm(50,20) \\
\beta_1, \beta_2 &\sim Norm(0,10) \\
\sigma &\sim Unif(0,30)
\end{aligned}
$$
```{r q5}
# create a scaled gdp variable
d$pcgdp <- scale(d$gdp_percap_usd2010_2015)
# make a version of the dataset without missing values
# in the variables we use:
d3 <- d[c('latest_gini','colonized','pcgdp')]
d3 <- na.omit(d3)
# define the model
model3 <- alist(
latest_gini ~ dnorm(mu,sigma),
mu <- b0 + b1*colonized + b2*pcgdp,
b0 ~ dnorm(50,20),
b1 ~ dnorm(0,10),
b2 ~ dnorm(0,10),
sigma ~ dunif(0,30)
)
# fit with quap
fit3 <- quap(model3,data=d3)
# print marginal distributions
suppressWarnings(pander(precis(fit3,prob=0.9)))
```
### Data description ###
-------------------------------------------------------------------------------
Variable Description Source
-------------------------------- -------------------------------- -------------
`country_name` Name of country World Bank
`country_code` Three-letter country code World Bank
`latest_gini` Most recent World bank estimate World Bank
of income inequality since 2010
(Gini index)
`latest_gini_year` Year of Gini index estimate World Bank
`population_2015` Total population (2015) World Bank
`population_growth_2015` Percent population growth (2015) World Bank
`population_density_2015` Population per sq. km (2015) World Bank
`population_pct_urban_2015` Percent urban pop. (2015) World Bank
`population_pct_immigrant_2015` Percent immigrant pop. (2015) World Bank
`gdp_usd2010_2015` GDP in 2010 USD (2015) World Bank
`gdp_percap_usd2010_2015` GDP per capita in 2010 USD World Bank
(2015)
`fertility_rate_2015` Births per woman (2015) World Bank
`independence_year` Year of national independence ICOW
`independece_from` Country Independence won from ICOW
`independence_violent` Violent independence indicator ICOW
`colonized` Was colony before independence ICOW
`primary_colonial_ruler` Primary colonial ruler ICOW
-------------------------------------------------------------------------------
[World Bank data from World Development Indicators database (accessed January 22, 2021).](https://databank.worldbank.org/reports.aspx?source=World-Development-Indicators)
[ICOW data from R. Hensel (2018). "ICOW Colonial History Data Set, version 1.1."](http://www.paulhensel.org/icowcol.html)