In this exercise, we’ll be learning how to randomly assign treatment status in a way that is transparent and reproducible. After assigning treatments, we’ll check whether we’ve succeeded in creating a treatment group and a control group that are comparable in terms of their observable characteristics.
You can access the in-class activity as a pdf.
You can also access the empirical exercise as a pdf.
Start by creating a new do file that runs the following Stata code:
clear
set obs 4
gen id = _n
gen rand_num = rnormal()
sort rand_num
egen treatment = seq(), from(0) to(1)
sort id
What happens when you run the code? Use Stata’s data editor to view the (very small) data set you created. Which ID numbers are assigned to treatment? Run the code several times. Are the same ID numbers assigned to treatment each time?
The code above contains the three key parts of every randomization do file:
The idea behind random assignment is that we can generate a variable using Stata’s pseudo-random number generator and then sort the data set based on that variable; when we do this, the observations in the data set are listed in a random order. If we want to randomly assign observations to treatment and comparison groups, we can assign every other observation to treatment after we’ve sorted them based on our random variable (rand_num in the code above).
The command
egen treatment = seq(), from(0) to(1)
generates a repeating sequence from 0 to 1: the first row (observation) in the data set will get a 0, the second row will get a 1, the third row will get a 0, and so on. Familiarize yourself with this command. How might you assign observations in your data set to four different treatment groups?
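One way to get familiar with seq() is to generate a few sequences side by side in a small scratch data set and compare them. The sketch below uses the from(), to(), and block() options of the seq() function; the variable names alt, count3, and pairs are made up for illustration and are not part of the exercise.

```stata
* Scratch data set for experimenting with egen's seq() function
clear
set obs 8
gen id = _n

* Repeating 0, 1, 0, 1, ... (the pattern used in the example above)
egen alt = seq(), from(0) to(1)

* Repeating 1, 2, 3, 1, 2, 3, ...
egen count3 = seq(), from(1) to(3)

* block(2) repeats each value twice: 0, 0, 1, 1, 0, 0, ...
egen pairs = seq(), from(0) to(1) block(2)

list id alt count3 pairs
```

Changing the values passed to from() and to() changes how many distinct values the sequence cycles through.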
In the example above, we failed to set the seed, so each time we run our code, we get a completely new random treatment assignment. Insert the command
set seed 1234
between clear and set obs 4. This will guarantee that Stata uses the same sequence of pseudo-random numbers every time you run the file (you can also set the version for additional confidence in your code’s reproducibility). Run the file a few times to confirm that this is the case.
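If you would rather have Stata confirm reproducibility than compare the data editor by eye, one possible approach is to run the assignment twice with the same seed and compare the results. The sketch below does this; the version number, the variable name first_run, and the temporary file name first are assumptions chosen for illustration.

```stata
version 14              // assumption: replace with the Stata version you are running
clear
set seed 1234
set obs 4
gen id = _n
gen rand_num = rnormal()
sort rand_num
egen treatment = seq(), from(0) to(1)
sort id

* Keep a copy of the first assignment (first_run is a hypothetical name)
rename treatment first_run
keep id first_run
tempfile first
save `first'

* Repeat exactly the same steps with the same seed
clear
set seed 1234
set obs 4
gen id = _n
gen rand_num = rnormal()
sort rand_num
egen treatment = seq(), from(0) to(1)
sort id

* Compare the two runs: assert stops with an error if they differ
merge 1:1 id using `first', nogenerate
assert treatment == first_run
list id treatment first_run
```

If you remove the second set seed command, the assert will usually fail, because the second run then continues from a different point in the pseudo-random sequence.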
We’re going to use the same data set on potential microfinance clients in urban India that we worked with in Empirical Exercise 6. The data set comes from the paper The Miracle of Microfinance? Evidence from a Randomized Evaluation by Abhijit Banerjee, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan. The authors worked with an Indian MFI (microfinance institution) called Spandana that was expanding into the city of Hyderabad. Spandana identified 104 neighborhoods where it would be willing to open branches. They couldn’t open branches in all the neighborhoods simultaneously, so they worked with the researchers to assign half of them to a treatment group where branches would be opened immediately. Spandana held off on opening branches in the control neighborhoods until after the study.
The data set contains information on 6,853 households. Suppose you want to work with a local NGO to offer business training and mentoring to microentrepreneurs, and you want to stratify treatment assignments by treatment status in the Spandana RCT to see whether impacts depend on the availability of microcredit.
Create a do file that reads the Spandana data from Empirical Exercise 6 into Stata. Then extend your do file so that it randomizes treatment assignments, as described below.
You want to stratify treatment assignments in your evaluation by four variables:
Construct stratification cells based on these four variables.
Now randomly assign the households in the sample to treatment and control, stratifying by the four variables described above.
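The two steps above (building cells, then randomizing within them) can be sketched on simulated data. This is a minimal illustration, not the solution for the Spandana data: the variable names hhid, s1 through s4, and stratum are hypothetical placeholders, and the four binary variables are constructed mechanically so that the example runs on its own.

```stata
* Simulated data with hypothetical variable names (not the Spandana data)
clear
set seed 1234
set obs 800
gen hhid = _n

* Four placeholder binary stratification variables
gen byte s1 = mod(_n, 2)
gen byte s2 = mod(floor(_n / 2), 2)
gen byte s3 = mod(floor(_n / 4), 2)
gen byte s4 = mod(floor(_n / 8), 2)

* One stratification cell per unique combination of the four variables
egen stratum = group(s1 s2 s3 s4)

* Sort on a unique ID before drawing random numbers, so that the
* draws land on the same households every time the file is run
sort hhid
gen double rand_num = runiform()

* Within each cell, order households randomly and alternate 0, 1, 0, 1, ...
bysort stratum (rand_num): gen byte treatment = mod(_n - 1, 2)

sort hhid
tab stratum treatment
```

Storing the random number as a double makes ties much less likely than the default float type, which matters more as the sample grows. Note that with this alternating rule, any cell with an odd number of households ends up with one more control than treatment observation.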
Once you have randomly assigned treatment status, you will typically want to check whether your treatment and comparison groups look similar in terms of observable characteristics. Make a balance check table that reports, for each of a set of covariates,
To do this, you can adapt the Stata program that you wrote for Empirical Exercise 8. Report tests for balance for each of your stratification variables plus the variables capturing whether a household operates a business (as of Endline 1), the number of household businesses, business assets, business revenues, business expenses, and business profits. Save a copy of your finished balance check table as a pdf so that you can upload it to Gradescope.
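If you do not have your program from Empirical Exercise 8 to hand, the core of a balance check can be sketched as a loop over covariates. The example below is an assumption about one common layout (control mean, treatment mean, and the p-value from a two-sample t-test), run on simulated data; the covariate names x1 and x2 and the placeholder treatment indicator are hypothetical.

```stata
* Simulated data with hypothetical covariates (not the Spandana data)
clear
set seed 1234
set obs 200
gen byte treatment = mod(_n, 2)   // placeholder treatment indicator
gen double x1 = rnormal()
gen double x2 = runiform()

* Print one row per covariate: group means and p-value for equal means
display "variable" _col(12) "control" _col(24) "treatment" _col(36) "p-value"
foreach v of varlist x1 x2 {
    quietly ttest `v', by(treatment)
    display "`v'" _col(12) %9.3f r(mu_1) _col(24) %9.3f r(mu_2) _col(36) %9.3f r(p)
}
```

After ttest with the by() option, r(mu_1) holds the mean for the group with the lower value of the grouping variable (here, control) and r(mu_2) the mean for the other group. With stratified assignment, you may prefer a regression of each covariate on treatment that also controls for the stratification cells.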
If you tested 1,000 baseline covariates for balance, how many of those variables would you expect to be imbalanced enough that you could reject the hypothesis that the mean in the treatment group was equal to the mean in the control group at the 95 percent confidence level?
This exercise is part of the module Randomization in Practice.