In this exercise, we’ll be learning how to randomly assign treatment status in a way that is transparent and reproducible. After assigning treatments, we’ll check whether we’ve succeeded in creating a treatment group and a control group that are comparable in terms of their observable characteristics.
Start by creating a script that runs the following code:
library(tidyverse)
datasize <- 4
data <- tibble(
id = 1:datasize,
rand_num = rnorm(datasize)
)
data <- arrange(data, rand_num)
data$treatment <- rep(0:1, length.out = datasize)
data <- arrange(data, id)
print(data)
What happens when you run the code? Which ID numbers are assigned to treatment? Run the code several times. Are the same ID numbers assigned to treatment each time?
The code above contains the three key parts of every randomization script:
The idea behind random assignment is that we can generate a variable using R’s pseudo-random number generator and then sort the data set based on that variable; when we do this, the observations in the data set are listed in a random order. If we want to randomly assign observations to treatment and comparison groups, we can assign every other observation to treatment after we’ve sorted them based on our random variable, rand_num.
The command
data$treatment <- rep(0:1, length.out = datasize)
generates a repeating sequence from 0 to 1: the first row (observation) in the data set will get a 0, the second row will get a 1, the third row will get a 0, and so on. How might you assign observations in your data set to four different treatment groups?
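To see how rep() recycles the 0/1 pattern, it can help to run it on its own. The short sketch below uses an odd number of rows (the value 5 is chosen only for illustration and is not part of the exercise) to show that the pattern is simply cut off when it reaches length.out, so the two groups can differ in size by one.

```r
# Illustrative only: an odd number of rows, chosen for this example
n_rows <- 5

# rep() repeats the vector 0:1 until the result has n_rows elements
assignment <- rep(0:1, length.out = n_rows)

print(assignment)
# [1] 0 1 0 1 0

# Count how many rows receive each value
print(table(assignment))
```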
In the example above, we failed to set the seed, so each time we run our code, we get a completely new random treatment assignment. Insert the line
set.seed(8675309)
before you generate the random numbers. This will guarantee that R uses the same sequence of pseudo-random numbers every time you run the code. Run the code a few times to confirm that this is the case.
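A quick way to convince yourself of this is to draw random numbers twice within one session, resetting the seed in between. The sketch below is a minimal illustration; the object names are made up for this example.

```r
# Set the seed, then draw four random numbers
set.seed(8675309)
first_draw <- rnorm(4)

# Resetting the seed restarts the pseudo-random sequence
set.seed(8675309)
second_draw <- rnorm(4)

# Without resetting, R continues further along the same sequence
third_draw <- rnorm(4)

print(identical(first_draw, second_draw))  # TRUE: same seed, same draws
print(identical(first_draw, third_draw))   # FALSE: the sequence has moved on
```

Note that the seed must be set before the first call that generates random numbers; setting it afterwards does not make the earlier draws reproducible.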
We’re going to use the same data set on potential microfinance clients in urban India that we worked with in Empirical Exercise 6. The data set comes from the paper The Miracle of Microfinance? Evidence from a Randomized Evaluation by Abhijit Banerjee, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan. The authors worked with an Indian MFI (microfinance institution) called Spandana that was expanding into the city of Hyderabad. Spandana identified 104 neighborhoods where it would be willing to open branches. They couldn’t open branches in all the neighborhoods simultaneously, so they worked with the researchers to assign half of them to a treatment group where branches would be opened immediately. Spandana held off on opening branches in the control neighborhoods until after the study.
The data set contains information on 6,853 households. Suppose you want to work with a local NGO to offer business training and mentoring to microentrepreneurs, and you want to stratify treatment assignments by treatment status in the Spandana RCT to see whether impacts depend on the availability of microcredit.
Write a program that reads the Spandana data from Empirical Exercise 6 into R. Then extend your code so that it randomizes treatment assignments, as described below.
You want to stratify treatment assignments in your evaluation by four variables:
Construct stratification cells based on these four variables.
Hint 1: drop observations that are missing values for any of the stratification variables.
Hint 2: the code
data <- data %>%
group_by(x1, x2) %>%
mutate(group_id = cur_group_id()) %>%
ungroup()
generates ID numbers for the groups defined by all the observed combinations of the values of variables x1 and x2.
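To see what the two hints do in practice, here is a small sketch on made-up data. The tibble toy and its values are invented for this example; x1 and x2 simply stand in for your real stratification variables.

```r
library(tidyverse)

# Toy data invented for this example; the last row is missing x1
toy <- tibble(
  x1 = c(0, 0, 1, 1, 0, NA),
  x2 = c("a", "b", "a", "a", "b", "a")
)

toy <- toy %>%
  # Hint 1: drop rows that are missing any stratification variable
  filter(!is.na(x1), !is.na(x2)) %>%
  # Hint 2: number the observed combinations of x1 and x2
  group_by(x1, x2) %>%
  mutate(group_id = cur_group_id()) %>%
  ungroup()

print(toy)
# group_id is 1, 2, 3, 3, 2: rows with the same (x1, x2) pair share an ID
```

Only combinations that actually appear in the data receive an ID, so the number of cells can be smaller than the product of the number of values each variable takes.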
Now randomly assign the households in the sample to treatment and control, stratifying by the four variables described above.
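One possible way to approach this, sketched on made-up data rather than the Spandana data, is to repeat the sort-and-alternate logic from the first part of the exercise separately within each stratification cell. The tibble toy and its group_id values are hypothetical, and this is only one of several reasonable implementations.

```r
library(tidyverse)

set.seed(8675309)

# Toy data invented for this example: 10 units in two hypothetical strata
toy <- tibble(
  id = 1:10,
  group_id = c(rep(1, 4), rep(2, 6))
)

toy <- toy %>%
  group_by(group_id) %>%
  # One random draw per unit
  mutate(rand_num = rnorm(n())) %>%
  # Sort by the random draw within each stratum
  arrange(rand_num, .by_group = TRUE) %>%
  # Alternate 0/1 down the randomly ordered rows of each stratum
  mutate(treatment = rep(0:1, length.out = n())) %>%
  ungroup() %>%
  arrange(id)

# Check that each stratum is split between treatment and control
print(count(toy, group_id, treatment))
```

Because the pattern always starts at 0, a stratum with an odd number of units ends up with one more control than treated unit; you may want to think about whether that matters for your design.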
Once you have randomly assigned treatment status, you will typically want to check whether your treatment and comparison groups look similar in terms of observable characteristics. Make a balance check table that reports, for each of a set of covariates,
To do this, you can adapt the code that you wrote for Empirical Exercise 8. Report tests for balance for each of your stratification variables plus the variables capturing whether a household operates a business (as of Endline 1), the number of household businesses, business assets, business revenues, business expenses, and business profits. Save a copy of your finished balance check table as a PDF so that you can upload it to Gradescope.
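If you do not have your earlier code to hand, the sketch below shows one simple way such a table might be assembled. It uses simulated data and hypothetical covariate names (hh_size, any_business), and it compares group means with a two-sample t-test; the statistics your own table should report, and the test you use, should follow the assignment and your code from Empirical Exercise 8.

```r
library(tidyverse)

set.seed(8675309)

# Simulated data invented for this example; covariate names are hypothetical
toy <- tibble(
  treatment = rep(0:1, length.out = 200),
  hh_size = rnorm(200, mean = 5, sd = 2),
  any_business = rbinom(200, size = 1, prob = 0.3)
)

# Build one row of the table: group means, their difference, and a p-value
balance_row <- function(var, df) {
  treated <- df[[var]][df$treatment == 1]
  control <- df[[var]][df$treatment == 0]
  test <- t.test(treated, control)  # Welch two-sample t-test
  tibble(
    covariate = var,
    mean_treatment = mean(treated),
    mean_control = mean(control),
    difference = mean(treated) - mean(control),
    p_value = test$p.value
  )
}

covariates <- c("hh_size", "any_business")
balance_table <- bind_rows(lapply(covariates, balance_row, df = toy))

print(balance_table)
```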
If you tested 1,000 baseline covariates for balance, how many of those variables would you expect to be imbalanced enough that you could reject the hypothesis that the mean in the treatment group was equal to the mean in the control group at the 95 percent confidence level?
This exercise is part of the module Randomization in Practice.