Text as Data Project

Objective: to answer a social science question using text data

Topic and Scope

To complete this project, your group must identify a text data source and analyze it. You can either build a novel data set (via web scraping or more conventional means) or use an existing data set that contains text data. Web scraping will typically involve text data in at least a trivial sense, since most scraped data arrives (at least partially) in string format and will require appropriate preprocessing. Any other type of data set that you put together or identify must contain at least some text variables or components.
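
As a rough illustration of the kind of preprocessing that scraped strings tend to need, here is a minimal Python sketch. The function names and the sample string are made up for this example, and the regular-expression tag stripping is a simplifying assumption; a real scraper would normally use a proper HTML parser.

```python
import html
import re

def clean_scraped_text(raw: str) -> str:
    """Turn a raw scraped string into plain lowercase text (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop anything that looks like an HTML tag
    text = html.unescape(text)            # convert entities such as &amp; into characters
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace into one space
    return text.strip().lower()

def tokenize(text: str) -> list[str]:
    """Split cleaned text into simple word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text)

# Made-up sample of what a scraped course listing might look like.
raw = "<p>Prof.&nbsp;Smith teaches   <b>ECON 101</b> &amp; ECON 255.</p>"
cleaned = clean_scraped_text(raw)
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```

Whether to lowercase, keep numbers, or drop punctuation depends on your research question, so treat these choices as defaults to revisit rather than rules.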

If you are using a pre-existing data set, your project must use at least some of the text analysis tools that we will cover in class: converting text into appropriate numeric variables (e.g. tf-idf), clustering, sentiment analysis, etc.

Your project must do at least one of these things, but ideally you should do more than one. That said, I would like you to find a topic that you are excited about, and I would rather relax the details of the project's structure if that encourages you to do something innovative and ambitious.
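
To make "converting text into appropriate numeric variables" concrete, here is a minimal tf-idf sketch in plain Python. It assumes one common textbook definition (term frequency as a share of the document's tokens, times the natural log of the number of documents over the number of documents containing the term); libraries such as scikit-learn use smoothed variants by default, so their numbers will differ. The three short documents and the function name are invented for illustration.

```python
import math
from collections import Counter

# Made-up toy corpus: document name -> text.
docs = {
    "a": "tax policy and growth",
    "b": "tax policy and inequality",
    "c": "monetary policy and inflation",
}

def tf_idf(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Compute unsmoothed tf-idf weights for each term in each document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[name] = {
            # tf = share of the document's tokens; idf = log(N / df)
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

weights = tf_idf(docs)
for name, row in weights.items():
    print(name, {term: round(w, 3) for term, w in row.items()})
```

Under this definition, a word that appears in every document (here "policy" and "and") gets a weight of zero, while a word unique to one document gets the largest weight. The resulting numbers can then serve as inputs to clustering or regression.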

Some potential data sources:

NBER data on all economics working papers in the NBER Discussion Paper series
Any data set about Williams College (e.g. about faculty, courses, or students) that you build by extending the tools used in Lab 10.
Melissa Dell at Harvard has a large data set (probably too large) of U.S. newspaper headlines
SEC filings

The Finished Product

You will make an in-class presentation that should last between 15 and 18 minutes. Your presentation should be based on slides that summarize your data and your findings, and you will also upload the code used to process your data and generate your outputs. Your presentation should outline your data sources, describe your preprocessing and analysis, and explain your findings.

As with the Exploratory Data Analysis project, your goal is to articulate a clear research question (or questions) and provide a compelling answer to it using data. Your classmates and I will be evaluating both the quality of your question and the quality of your answer. I am also looking for a set of slides that is well-formatted, polished, and complete, supported by replication files that transform the raw data into all of the final outputs that you present.