R sandbox

photo: World Bank/Peter Kapuscinski (2015)

Instructor:
Pamela Jakiela


Empirical Exercise 2, Part 2

In this exercise, we’ll use R’s rnorm() function to generate draws from a normally-distributed random variable. This approach - simulating data according to a known data-generating process - is an incredibly useful tool in empirical microeconomics, for checking both your econometric intuitions and your analysis code.

We’ll use variables to easily change the number of observations and other parameters of our data set. This will allow us to explore the properties of randomly-assigned treatment groups in larger and smaller samples.

Please upload your answers to Gradescope after completing the exercise. You can also download the entire activity as an R Script.


Getting Started

First, you’ll start an R Script, putting the following command at the top and running it:

set.seed(523)

This will ensure that the rnorm() function (which, as mentioned, randomly samples from a normal distribution) generates the same numbers every time you run your code from the top. If you don’t do this, your numbers will not match the expected answers!
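
As a minimal sketch of what the seed does (the names first_draw, second_draw, and third_draw are illustrative and not part of the exercise), resetting the seed to the same value reproduces exactly the same draws, while drawing again without resetting it continues the random number stream:

```r
# Illustrative only: these variable names are hypothetical.
set.seed(523)
first_draw <- rnorm(3)

# Resetting the seed restarts the random number stream from the same point.
set.seed(523)
second_draw <- rnorm(3)

# A further call without resetting the seed continues the stream instead.
third_draw <- rnorm(3)

identical(first_draw, second_draw)  # TRUE
identical(first_draw, third_draw)   # FALSE
```

Because these extra draws advance the random number stream, try this in a separate session, or rerun your exercise script from the top afterwards.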

Now we are going to generate a data set that contains 500 observations of two normally-distributed variables, y and z. We’ll start by defining a variable myObs that indicates the number of observations we want in our data set. We can use a variable in an R Script file to set a parameter (like the number of observations, or the name of our dependent variable) that will be used repeatedly throughout the program. We can define the variable at the top of the R Script, and then - if we want to change it - we need only update the script in a single place. Note that every time you change the value of the variable, you’ll need to rerun the line that defines it (and the lines that use it) for the change to take effect throughout the script.

Here, we define myObs, setting it equal to 500. If we want to see that we’ve defined the variable correctly, we can just type myObs into the script and run it.

# define a variable to indicate the number of observations
myObs <- 500

Now, we’ll use R’s rnorm() function to create a variable, y, that is normally-distributed with mean zero and variance one (i.e. a standard normal). We can also shift and scale a standard normal to create a variable with a different mean and/or variance: multiplying by 5 and adding 10, as we do for z below, gives a variable with mean 10 and standard deviation 5.

# define some variables
y <- rnorm(myObs)
z <- 5 * rnorm(myObs) + 10
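
The scaling used to define z can be checked against rnorm()’s own mean and sd arguments. The sketch below uses a hypothetical seed and hypothetical names (n_draws, z_scaled, z_direct); it suggests that multiplying a standard normal by 5 and adding 10 gives the same draws as asking rnorm() directly for mean 10 and standard deviation 5.

```r
n_draws <- 500  # hypothetical sample size for this illustration

set.seed(1)     # hypothetical seed, used only for this comparison
z_scaled <- 5 * rnorm(n_draws) + 10

set.seed(1)     # reset so both approaches see the same underlying draws
z_direct <- rnorm(n_draws, mean = 10, sd = 5)

# Both approaches produce the same draws (up to floating-point rounding).
all.equal(z_scaled, z_direct)
```

Note that the sd argument is a standard deviation, not a variance, so the variance of these draws is 25. Since this snippet resets the seed, run it separately from your exercise script.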

Use the summary() command to familiarize yourself with y and z. What is the estimated mean of each variable? What is the estimated standard deviation? (summary() does not report the standard deviation; use the sd() function for that.) What is the standard error associated with the estimate of the mean of each variable?

Use the hist() function to plot a histogram of each variable. Does this look like a normal distribution? Rerun your R Script, changing the number of observations from 500 to 50,000. How do the histograms of y and z change as you increase the sample size? What happens to the estimates of the mean, the standard deviation, and the standard error of the sample mean as you increase the sample size?
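
One way to build intuition for these questions is to simulate the sample mean many times at two sample sizes and compare how much it varies from sample to sample. This is only a sketch: the helper simulate_means(), the seed, the number of replications, and the two sample sizes are all assumptions chosen for illustration, not part of the exercise.

```r
# Hypothetical helper: draw n_reps samples of size n_obs from a standard
# normal and return the mean of each sample.
simulate_means <- function(n_obs, n_reps = 1000) {
  replicate(n_reps, mean(rnorm(n_obs)))
}

set.seed(2)  # hypothetical seed for this illustration
means_small <- simulate_means(n_obs = 50)
means_large <- simulate_means(n_obs = 5000)

# Compare the spread of the simulated sample means at each sample size.
sd(means_small)
sd(means_large)
```

Passing either vector to hist() shows the same contrast graphically. As with any code that draws random numbers, run this separately from your exercise script so that it does not change your answers.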


Empirical Exercise

Set the sample size back to 500 and rerun your code.

Question 1

What is the mean of z?

Question 2

Use the command mean_z = mean(z) to generate a new variable equal to the mean of z. What is the standard deviation of your new variable, mean_z? Think about why this might be the case.

Question 3

Generate another variable, diff_z, equal to the difference between z and mean_z. What is the mean of this variable?

Question 4

Generate yet another new variable, this one equal to diff_z squared. Call this variable diff2_z. Now use the code below to calculate the sum of diff2_z across all observations, and to transform that sum into the standard deviation of z by dividing by the number of observations minus one and then taking the square root.

sd_z <- sum(diff2_z)
sd_z <- sd_z / (myObs - 1)
sd_z <- sqrt(sd_z)

What is the value of sd_z? It should be nearly identical to the standard deviation of z as reported by the sd() function.
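
To see why these steps should reproduce what sd() computes, the same calculation can be run on a small made-up vector. The vector x and the names below are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```r
# Hypothetical data: eight values with mean 5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

diff_x  <- x - mean(x)  # deviations from the mean
diff2_x <- diff_x^2     # squared deviations; they sum to 32

# Sample standard deviation: divide by n - 1, then take the square root.
sd_x <- sqrt(sum(diff2_x) / (n - 1))

sd_x                    # sqrt(32 / 7), about 2.138
all.equal(sd_x, sd(x))  # TRUE
```

Dividing by n - 1 rather than n is what makes the result match sd(), which reports the sample standard deviation.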

Question 5

Our estimator of the population mean of z is the sample mean of z, and the standard error of that estimator is the sample standard deviation (that you calculated above) divided by the square root of the number of observations in your sample. Write a line of code to generate a new variable, se_mean_z, equal to the standard error of the mean of z.

Question 6

What is the standard error of the mean of z? Confirm that the answer generated by the code you wrote for Question 5 is the same as the answer you’d get from the t.test(z)$stderr command.

Question 7

What happens when we randomly assign treatment? Random assignment should generate two groups (a treatment group and a control group) that look similar in terms of their observable characteristics. We can use the code below to assign half the observations in our sample to a treatment group.

# assign half the sample (observations 1 to myObs / 2) to treatment 
treatment <- c(rep(1, floor(myObs / 2)), rep(0, ceiling(myObs / 2)))

If we were randomly assigning treatment in a data set that we had not just generated, we would first want to sort our data into a random order - but here, that is not necessary since y and z are randomly-generated to begin with.
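
For a data set whose rows are not already in random order, one common approach is to shuffle the treatment vector itself with sample(), which has the same effect as assigning treatment after sorting the data into a random order. The sketch below is not needed for this exercise; n_units, the seed, and the treatment_ordered and treatment_shuffled names are hypothetical.

```r
n_units <- 11  # hypothetical (odd) number of observations
set.seed(3)    # hypothetical seed for this illustration

# Build the 1/0 pattern with floor() and ceiling(), then shuffle its order.
treatment_ordered  <- c(rep(1, floor(n_units / 2)), rep(0, ceiling(n_units / 2)))
treatment_shuffled <- sample(treatment_ordered)

# Shuffling changes who is treated, not how many are treated.
table(treatment_shuffled)
```

With an odd number of observations, floor() and ceiling() place the extra observation in the comparison group, so the two groups always add up to the full sample.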

Now that you have generated a treatment dummy variable, test the hypothesis that the mean of z is the same in the treatment group and the comparison group. What is the p-value associated with this hypothesis test?