## ECON 370 LAB 6: LASSO
## NAME:
## DATE:

# step 0: preliminaries ----------------------------------------------------------------
## install needed packages, libraries, etc.

# step 1: load data --------------------------------------------------------------------
## load lab6data from the ECON 370 github page, the file ECON370-lab6-data.csv
## this is data on N = 200 children included in the EMERGE study
## familiarize yourself with the data

# step 2: prepare data ---------------------------------------------------------------
## define Y as the literacy column from lab6data
## enumerator and strata are IDs for the surveyor and the randomization stratum
## generate dummy variables for these (fundamentally categorical) variables
## R hint: I do this by converting a variable to character format w/ as.character()
## and then using x_dummies <- model.matrix(~x - 1, df) to define dummies for
## the set of values of the variable x in the data frame df
## Python hint: use pd.get_dummies()
## define a data frame X that combines lab6data and the strata and enumerator dummies
## drop literacy (your Y variable) and the original (character) enumerator and strata variables
## R users, make X a matrix (so that you can run lasso)

# step 3: OLS ---------------------------------------------------------------
## run an OLS regression of Y on X
## which variables are statistically significant predictors of literacy (95% level)?
## which variable has the lowest p-value?
## what is the OLS coefficient associated with that variable?

# step 4: lasso and ridge regression -----------------------------------------

# step 4a (Python only): rescaling the Xs -----------------------------------
## Python users: use scikit-learn's StandardScaler() to rescale the X variables
## (since ridge and lasso are not scale invariant)
## Save the names of the columns of X as a data frame X_names
## R users: skip this step (R does this automatically)

# step 4b: actually fitting ridge and lasso ---------------------------------
## Estimate a ridge regression with a very low value (0.0001) of the tuning
## parameter (lambda in lecture, ISL, and R; alpha in Python)
## (one possible R version of this step is sketched at the end of step 4b)
## How does the ridge coefficient on the variable with the lowest OLS p-value
## compare to the OLS coefficient?
## Python users: you need to convert the coefficients back to the original scale
## by dividing them by scaler.scale_ (just ignore the intercept)
## Save them in a data frame with the names in X_names
## Estimate a ridge regression with a higher tuning/penalty parameter of 1
## Make sure to set the seed immediately before estimating the model
## How does the coefficient on the variable of interest (from above) change?
## Now estimate lasso by setting the alpha (R) or l1_ratio (Python) to 1
## Set the tuning parameter back to 0.0001
## Which variables are included in the model?
## Now estimate lasso with a tuning parameter of 1
## Which variables are included in the model now?
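## A rough sketch of one way to do step 4b in R, assuming glmnet is installed,
## X is the numeric matrix from step 2, and Y is the literacy outcome; the seed
## value (370) and the object names below are illustrative, not required
library(glmnet)

set.seed(370)                                            # seed right before estimating
ridge_low <- glmnet(X, Y, alpha = 0, lambda = 0.0001)    # ridge (alpha = 0), tiny penalty
coef(ridge_low)                                          # compare to the OLS coefficients

set.seed(370)
ridge_high <- glmnet(X, Y, alpha = 0, lambda = 1)        # ridge, penalty = 1
coef(ridge_high)

set.seed(370)
lasso_low <- glmnet(X, Y, alpha = 1, lambda = 0.0001)    # lasso (alpha = 1), tiny penalty
coef(lasso_low)                                          # non-zero rows are "included"

set.seed(370)
lasso_high <- glmnet(X, Y, alpha = 1, lambda = 1)        # lasso, penalty = 1
coef(lasso_high)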
# step 5: cross-validation -----------------------------------------
## cv.glmnet() in R and LassoCV in Python estimate cross-validated lasso
## define a grid of tuning parameter values to try
## set the seed and fit cross-validated lasso
## R users:
## cv.glmnet takes the same arguments as glmnet
## use grid as your lambda
## define lasso_cv as the lasso model that you fit
## type plot(lasso_cv) afterward to see MSE as a function of lambda
## (one possible R version of steps 5-7 is sketched at the end of this script)
## Python users:
## LassoCV takes alphas as an argument instead of alpha, use your grid for that
## set cv to 10 (the number of folds)
## set the seed by setting the random_state argument
## to make a scatter plot of your results, use lasso_cv.alphas_ as x
## and np.mean(lasso_cv.mse_path_, axis=1) as y
## and add a vertical line at lasso_cv.alpha_ (not alphas_)
## to see where the test MSE is minimized
## save your graph as a pdf

# step 6: the tuning parameter that minimizes test MSE --------------------------
## What value of the tuning parameter minimizes test MSE?
## You can access this parameter using lasso_cv$lambda.min in R or lasso_cv.alpha_ in Python
## At this value of the tuning parameter, which variables are included in the model?
## R users: use predict(lasso_cv, type = "coefficients", s = [your lambda value])
## Python users: the coefficients are in lasso_cv.coef_; a variable is included if
## its coefficient is non-zero

# step 7: the 1SE tuning parameter --------------------------------------------
## which variables are included in the model if you use the "1SE" tuning parameter,
## i.e. the largest value whose test MSE is within 1 SE of the minimized test MSE?
## R users: this value is stored as lasso_cv$lambda.1se
## Python users: as far as I can tell, scikit-learn doesn't do this,
## so you need to do it by hand (please, prove me wrong!)
## lasso_cv.mse_path_ is a matrix of test MSE values with one row per tuning
## parameter value in lasso_cv.alphas_ and one column per CV fold
## calculate the mean test MSE across folds,
## then find the index of the row with the minimum MSE
## the value of lasso_cv.alphas_ at this index should be lasso_cv.alpha_
## calculate the SD of the within-fold test MSEs at this index
## (ie the standard deviation of lasso_cv.mse_path_[mse_best_index])
## the SE of the CV estimate of the MSE is this SD / sqrt(# folds)
## add the SE to the mean test MSE at this MSE-minimizing value
## the 1SE tuning parameter is the largest value that yields a test MSE below that sum
## use the code from step 4 to estimate lasso with this tuning parameter

# step 8: data-driven lasso of Belloni et al. -----------------------
## R users: you can access the data-driven lasso model using rlasso() (in the hdm package)
## the formula is Y ~ X, and set post = FALSE
## use summary() to look at the results
## which variables are included in the model?
## Python users skip this step
## (AFAIK, Belloni et al. data-driven lasso is not available in Python)

# step 9 (optional): adding noise variables -----------------------------------
## add 50 additional X variables that are iid standard normals
## run lasso
## how many are chosen using the CV-selected tuning parameter?
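## A rough sketch of one way to do steps 5-7 in R (referenced in step 5 above),
## assuming X, Y, and glmnet from the step 4b sketch; the grid of lambda values,
## the seed, and the pdf file name are illustrative choices, not requirements
grid <- 10^seq(2, -4, length.out = 100)                  # illustrative grid of lambda values

set.seed(370)
lasso_cv <- cv.glmnet(X, Y, alpha = 1, lambda = grid)    # 10-fold CV is the default

pdf("lab6-lasso-cv.pdf")                                 # save the CV plot as a pdf
plot(lasso_cv)                                           # test MSE as a function of lambda
dev.off()

lasso_cv$lambda.min                                      # step 6: MSE-minimizing lambda
predict(lasso_cv, type = "coefficients", s = lasso_cv$lambda.min)

lasso_cv$lambda.1se                                      # step 7: 1SE lambda
predict(lasso_cv, type = "coefficients", s = lasso_cv$lambda.1se)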