Deadline

Wednesday, November 22, 2017, 11:59PM

Important notes

  • When you push your Notebook to GitHub, all the cells must already have been evaluated.
  • Don’t forget to add a textual description of your thought process and of any assumptions you’ve made.
  • Please write all your comments in English, and use meaningful variable names in your code.

Question 1: Propensity score matching

In this exercise, you will apply propensity score matching, which we discussed in lecture 5 (“Observational studies”), in order to draw conclusions from an observational study.

We will work with a by-now classic dataset from Robert LaLonde’s study “Evaluating the Econometric Evaluations of Training Programs” (1986). The study investigated the effect of a job training program (“National Supported Work Demonstration”) on the real earnings of an individual, a couple of years after completion of the program. Your task is to determine the effectiveness of the “treatment” represented by the job training program.

Dataset description

  • treat: 1 if the subject participated in the job training program, 0 otherwise
  • age: the subject’s age
  • educ: years of education
  • race: categorical variable with three possible values: Black, Hispanic, or White
  • married: 1 if the subject was married at the time of the training program, 0 otherwise
  • nodegree: 1 if the subject has earned no school degree, 0 otherwise
  • re74: real earnings in 1974 (pre-treatment)
  • re75: real earnings in 1975 (pre-treatment)
  • re78: real earnings in 1978 (outcome)

If you want to brush up your knowledge on propensity scores and observational studies, we highly recommend Rosenbaum’s excellent book on the “Design of Observational Studies”. Even just reading the first chapter (18 pages) will help you a lot.

1. A naive analysis

Compare the distribution of the outcome variable (re78) between the two groups, using plots and numbers. To summarize and compare the distributions, you may use the techniques we discussed in lectures 4 (“Read the stats carefully”) and 6 (“Data visualization”).
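
Hint: as a starting point, a minimal sketch of such a comparison could look as follows, assuming the data has been loaded into a pandas DataFrame called df with the columns listed above (the name df and the plotting choices are placeholders, not a required approach):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# assume df has been loaded, e.g. df = pd.read_csv(...), with the columns described above

# summary statistics of the outcome, split by treatment status
print(df.groupby('treat')['re78'].describe())

# side-by-side view of the two outcome distributions
sns.boxplot(x='treat', y='re78', data=df)
plt.show()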

What might a naive “researcher” conclude from this superficial analysis?

2. A closer look at the data

You’re not naive, of course (and even if you are, you’ve learned certain things in ADA), so you aren’t content with a superficial analysis such as the above. You’re aware of the dangers of observational studies, so you take a closer look at the data before jumping to conclusions.

For each feature in the dataset, compare its distribution in the treated group with its distribution in the control group, using plots and numbers. As above, you may use the techniques we discussed in class for summarizing and comparing the distributions.
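
Hint: the same groupby-based summaries extend naturally to all features; a minimal sketch, again assuming a DataFrame df as above:

import pandas as pd

# numerical summaries of each pre-treatment feature, split by treatment status
for feature in ['age', 'educ', 'married', 'nodegree', 're74', 're75']:
    print(df.groupby('treat')[feature].describe(), '\n')

# for the categorical feature, normalized counts are easier to compare
print(pd.crosstab(df['race'], df['treat'], normalize='columns'))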

What do you observe? Describe what your observations mean for the conclusions drawn by the naive “researcher” from his superficial analysis.

3. A propensity score model

Use logistic regression to estimate propensity scores for all points in the dataset. You may use sklearn to fit the logistic regression model and apply it to each data point to obtain propensity scores:

from sklearn import linear_model
logistic = linear_model.LogisticRegression()

Recall that the propensity score of a data point represents its probability of receiving the treatment, based on its pre-treatment features (in this case, age, education, pre-treatment income, etc.). To brush up on propensity scores, you may read chapter 3.3 of the above-cited book by Rosenbaum or this article.

Note: you do not need a train/test split here. Train and apply the model on the entire dataset. If you’re wondering why this is the right thing to do in this situation, recall that the propensity score model is not used in order to make predictions about unseen data. Its sole purpose is to balance the dataset across treatment groups. (See p. 74 of Rosenbaum’s book for an explanation why slight overfitting is even good for propensity scores. If you want even more information, read this article.)
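
Hint: putting the pieces together, a minimal sketch could look as follows, assuming df is the full DataFrame and that the categorical race column has already been encoded numerically (e.g. via pd.get_dummies); all column names below are placeholders:

from sklearn import linear_model

# use only pre-treatment features: the outcome re78 must not enter the model
features = ['age', 'educ', 'married', 'nodegree', 're74', 're75']  # plus the encoded race columns
X = df[features]
y = df['treat']

logistic = linear_model.LogisticRegression()
logistic.fit(X, y)

# predict_proba returns [P(treat=0), P(treat=1)] for each row; keep the second column
df['propensity'] = logistic.predict_proba(X)[:, 1]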

4. Balancing the dataset via matching

Use the propensity scores to match each data point from the treated group with exactly one data point from the control group, while ensuring that each data point from the control group is matched with at most one data point from the treated group. (Hint: you may explore the networkx package in Python for predefined matching functions.)

Your matching should maximize the similarity between matched subjects, as captured by their propensity scores. In other words, the sum (over all matched pairs) of absolute propensity-score differences between the two matched subjects should be minimized.
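
Hint: one possible way to set this up with networkx is via max_weight_matching. Since propensity scores lie between 0 and 1, turning the absolute difference into a similarity (1 minus the difference) makes maximizing the total weight equivalent to minimizing the total difference. A minimal sketch, assuming a 'propensity' column as in the previous sketch:

import networkx as nx

treated = df[df['treat'] == 1]
control = df[df['treat'] == 0]

# complete bipartite graph: one node per subject, edges only between treated and control
G = nx.Graph()
for t_idx, t_score in treated['propensity'].items():
    for c_idx, c_score in control['propensity'].items():
        G.add_edge(('t', t_idx), ('c', c_idx), weight=1 - abs(t_score - c_score))

# maxcardinality=True ensures that every subject in the smaller group gets matched
matching = nx.max_weight_matching(G, maxcardinality=True)
# matching is a set of pairs of node labels such as (('t', i), ('c', j))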

After matching, you should have as many treated as control subjects. Compare the outcomes (re78) between the two groups (treated and control).

Also, compare again the feature-value distributions between the two groups, as you’ve done in part 2 above, but now only for the matched subjects. What do you observe? Are you closer to being able to draw valid conclusions now than you were before?

5. Balancing the groups further

Based on your comparison of feature-value distributions from part 4, are you fully satisfied with your matching? Would you say your dataset is sufficiently balanced? If not, in what ways could the “balanced” dataset you have obtained still not allow you to draw valid conclusions?

Improve your matching by explicitly making sure that you match only subjects that have the same value for the problematic feature. Argue with numbers and plots that the two groups (treated and control) are now better balanced than after part 4.
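
Hint (purely for illustration): suppose the feature that remains imbalanced turned out to be race (whether that is actually the case is for you to determine from part 4). The construction from the previous sketch then needs only one change: edges are added solely between subjects that share the same value of that feature.

import networkx as nx

treated = df[df['treat'] == 1]
control = df[df['treat'] == 0]

G = nx.Graph()
for t_idx, t_row in treated.iterrows():
    for c_idx, c_row in control.iterrows():
        if t_row['race'] == c_row['race']:  # only allow matches between identical values
            G.add_edge(('t', t_idx), ('c', c_idx),
                       weight=1 - abs(t_row['propensity'] - c_row['propensity']))

matching = nx.max_weight_matching(G, maxcardinality=True)
# note: if one value of the feature is over-represented among the treated,
# some treated subjects may remain unmatched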

6. A less naive analysis

Compare the outcomes (re78) between treated and control subjects, as you’ve done in part 1, but now only for the matched dataset you’ve obtained from part 5. What do you conclude about the effectiveness of the job training program?


Question 2: Applied ML

We are going to build a classifier that assigns news articles to one of 20 newsgroup categories. Note that the pipeline you will build in this exercise could be of great help during your project if you plan to work with text!

  1. Load the 20 Newsgroups dataset. It is, again, a classic dataset that can directly be loaded using sklearn (link).
    TF-IDF, short for term frequency–inverse document frequency, is of great help when it comes to computing textual features. Indeed, it gives more weight to terms that occur frequently in a given article (TF) while reducing the weight of terms that are frequent across the entire corpus (IDF). Compute TF-IDF features for every article using TfidfVectorizer. Then, split your dataset into a training, a validation, and a testing set (10% for validation and 10% for testing); see the first sketch after this list. Each observation should be paired with its corresponding label (the article category).

  2. Train a random forest on your training set. Fine-tune the parameters of your predictor on your validation set using a simple grid search over the number of estimators (“n_estimators”) and the maximum depth of the trees (“max_depth”). Then, display a confusion matrix of your classification pipeline. Lastly, once you have assessed your model, inspect the feature_importances_ attribute of your random forest and discuss the results you obtain. (A sketch for this step follows the one for step 1 below.)
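
Hint for step 1: a minimal sketch, assuming default TfidfVectorizer parameters and the split proportions from the statement (the random_state values are arbitrary):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# load all 20 categories
newsgroups = fetch_20newsgroups(subset='all')

# TF-IDF features for every article, paired with its category label
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# 80% train, 10% validation, 10% test, obtained via two successive splits
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)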
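
Hint for step 2: a minimal sketch that reuses the variables from the sketch above; the grid values are placeholders and deliberately small:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# simple grid search: keep the combination that scores best on the validation set
best_score, best_params = 0.0, None
for n_estimators in [10, 50, 100]:
    for max_depth in [None, 10, 50]:
        clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                     random_state=42)
        clf.fit(X_train, y_train)
        score = accuracy_score(y_val, clf.predict(X_val))
        if score > best_score:
            best_score, best_params = score, (n_estimators, max_depth)

# retrain with the best parameters and inspect the confusion matrix on the test set
clf = RandomForestClassifier(n_estimators=best_params[0], max_depth=best_params[1],
                             random_state=42)
clf.fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))

# map the most important features back to the corresponding terms
# (on older scikit-learn versions, use get_feature_names() instead)
terms = vectorizer.get_feature_names_out()
top = np.argsort(clf.feature_importances_)[::-1][:20]
print([terms[i] for i in top])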