## ECON 370 LAB 6: LASSO
## NAME:
## DATE:

# step 0: preliminaries ----------------------------------------------------------------
## install needed packages, libraries, etc.

# step 1: load data --------------------------------------------------------------------
## load lab6data from the ECON 370 github page, the file ECON370-lab6-data.csv
## this is data on N = 200 children included in the EMERGE study
## familiarize yourself with the data

# step 2: prepare data ---------------------------------------------------------------
## define Y as the literacy column from lab6data
## enumerator and strata are IDs for the surveyor and the randomization stratum
## generate dummy variables for these (fundamentally categorical) variables
## R hint: I do this by converting a variable to character format w/ as.character()
## and then using x_dummies <- model.matrix(~x - 1, df) to define dummies for
## the set of values of the variable x in the data frame df
## Python hint: use pd.get_dummies()
## define a data frame X that combines lab6data and the strata and enumerator dummies
## drop literacy (your Y variable) and the original (character) enumerator and strata variables
## R users, make X a matrix (so that you can run lasso)

# step 3: OLS ---------------------------------------------------------------
## run an OLS regression of Y on X
## which variables are statistically significant predictors of literacy (95% level)?
## which variable has the lowest p-value?
## what is the OLS coefficient associated with that variable?

# step 4: lasso and ridge regression -----------------------------------------

# step 4a (Python only): rescaling the Xs -----------------------------------
## Python users: use scikit-learn's StandardScaler() to rescale the X variables
## (since ridge and lasso are not scale invariant)
## Save the names of the columns of X as a data frame X_names
## R users: skip this step (R does this automatically)

# step 4b: actually fitting ridge and lasso ---------------------------------
## Estimate a ridge regression with a very low value (0.0001) of the tuning
## parameter (lambda in lecture, ISL, and R; alpha in Python)
## (one possible R version of this step is sketched at the end of step 4b)
## How does the ridge coefficient on the variable with the lowest OLS p-value
## compare to the OLS coefficient?
## Python users: you need to convert the coefficients back to the original scale
## by dividing them by scaler.scale_ (just ignore the intercept)
## Save them in a data frame with the names in X_names
## Estimate a ridge regression with a higher tuning/penalty parameter of 1
## Make sure to set the seed immediately before estimating the model
## How does the coefficient on the variable of interest (from above) change?
## Now estimate lasso by setting the alpha (R) or l1_ratio (Python) to 1
## Set the tuning parameter back to 0.0001
## Which variables are included in the model?
## Now estimate lasso with a tuning parameter of 1
## Which variables are included in the model now?
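## A rough sketch of one way to do step 4b in R, assuming glmnet is installed,
## X is the numeric matrix from step 2, and Y is the literacy outcome; the seed
## value (370) and the object names below are illustrative, not required
library(glmnet)

set.seed(370)                                            # seed right before estimating
ridge_low <- glmnet(X, Y, alpha = 0, lambda = 0.0001)    # ridge (alpha = 0), tiny penalty
coef(ridge_low)                                          # compare to the OLS coefficients

set.seed(370)
ridge_high <- glmnet(X, Y, alpha = 0, lambda = 1)        # ridge, penalty = 1
coef(ridge_high)

set.seed(370)
lasso_low <- glmnet(X, Y, alpha = 1, lambda = 0.0001)    # lasso (alpha = 1), tiny penalty
coef(lasso_low)                                          # non-zero rows are "included"

set.seed(370)
lasso_high <- glmnet(X, Y, alpha = 1, lambda = 1)        # lasso, penalty = 1
coef(lasso_high)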
# step 5: cross-validation -----------------------------------------
## cv.glmnet() in R and LassoCV in Python estimate cross-validated lasso
## define a grid of tuning parameter values to try
## set the seed and fit cross-validated lasso
## R users:
## cv.glmnet takes the same arguments as glmnet
## use grid as your lambda
## define lasso_cv as the lasso model that you fit
## type plot(lasso_cv) afterward to see MSE as a function of lambda
## (one possible R version of steps 5-7 is sketched at the end of this script)
## Python users:
## LassoCV takes alphas as an argument instead of alpha, use your grid for that
## set cv to 10 (the number of folds)
## set the seed by setting the random_state argument
## to make a scatter plot of your results, use lasso_cv.alphas_ as x
## and np.mean(lasso_cv.mse_path_, axis=1) as y
## and add a vertical line at lasso_cv.alpha_ (not alphas_)
## to see where the test MSE is minimized
## save your graph as a pdf

# step 6: the tuning parameter that minimizes test MSE --------------------------
## What value of the tuning parameter minimizes test MSE?
## You can access this parameter using lasso_cv$lambda.min in R or lasso_cv.alpha_ in Python
## At this value of the tuning parameter, which variables are included in the model?
## R users: use predict(lasso_cv, type = "coefficients", s = [your lambda value])
## Python users: the coefficients are in lasso_cv.coef_; a variable is included if
## its coefficient is non-zero

# step 7: the 1SE tuning parameter --------------------------------------------
## which variables are included in the model if you use the "1SE" tuning parameter,
## i.e. the largest value whose test MSE is within 1 SE of the minimized test MSE?
## R users: this value is stored as lasso_cv$lambda.1se
## Python users: as far as I can tell, scikit-learn doesn't do this,
## so you need to do it by hand (please, prove me wrong!)
## lasso_cv.mse_path_ is a matrix of test MSE values with one row per tuning
## parameter value in lasso_cv.alphas_ and one column per CV fold
## calculate the mean test MSE across folds,
## then find the index of the row with the minimum MSE
## the value of lasso_cv.alphas_ at this index should be lasso_cv.alpha_
## calculate the SD of the within-fold test MSEs at this index
## (ie the standard deviation of lasso_cv.mse_path_[mse_best_index])
## the SE of the CV estimate of the MSE is this SD / sqrt(# folds)
## add the SE to the mean test MSE at this MSE-minimizing value
## the 1SE tuning parameter is the largest value that yields a test MSE below that sum
## use the code from step 4 to estimate lasso with this tuning parameter

# step 8: data-driven lasso of Belloni et al. -----------------------
## R users: you can access the data-driven lasso model using rlasso() (in the hdm package)
## the formula is Y ~ X, and set post = FALSE
## use summary() to look at the results
## which variables are included in the model?
## Python users skip this step
## (AFAIK, Belloni et al. data-driven lasso is not available in Python)

# step 9 (optional): adding noise variables -----------------------------------
## add 50 additional X variables that are iid standard normals
## run lasso
## how many are chosen using the CV-selected tuning parameter?
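## A rough sketch of one way to do steps 5-7 in R (referenced in step 5 above),
## assuming X, Y, and glmnet from the step 4b sketch; the grid of lambda values,
## the seed, and the pdf file name are illustrative choices, not requirements
grid <- 10^seq(2, -4, length.out = 100)                  # illustrative grid of lambda values

set.seed(370)
lasso_cv <- cv.glmnet(X, Y, alpha = 1, lambda = grid)    # 10-fold CV is the default

pdf("lab6-lasso-cv.pdf")                                 # save the CV plot as a pdf
plot(lasso_cv)                                           # test MSE as a function of lambda
dev.off()

lasso_cv$lambda.min                                      # step 6: MSE-minimizing lambda
predict(lasso_cv, type = "coefficients", s = lasso_cv$lambda.min)

lasso_cv$lambda.1se                                      # step 7: 1SE lambda
predict(lasso_cv, type = "coefficients", s = lasso_cv$lambda.1se)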