Getting the sources you need for your text mining project

November 16, 2018

Stacy Reardon

You have a great research question that you want to answer with text data mining (TDM) methods, and you've got some Python under your belt or you've decided to see what you can learn from a browser-based tool like Voyant. You're ready to get started on a computational text analysis project. But wait!

Where do you get the texts?

Finding usable data (full-text collections of novels, newspaper articles, scholarly papers, or other content) can be challenging because of license restrictions and other roadblocks. (And please don't scrape an entire library database: providers will typically respond by shutting down access for the entire campus.)

Fortunately, the Library is here to help!

The Library regularly negotiates with content providers to get you the material you need, and we collaborate with campus partners like the D-Lab to manage access to more restrictive datasets. We've compiled a list of TDM-friendly digital collections, databases, and web sources in our handy Text Mining & Computational Text Analysis Source guide. The guide includes a flowchart that walks you through the best way to get your content. (Quick summary: if it's a library resource, see if it's on our guide or contact us for help. If it's not a library resource, try to find an API or check the site's terms of use before scraping.)
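If you think in code, the quick summary above can be read as a small decision procedure. Here is a minimal Python sketch of it. The function name, its parameters, and the wording of the returned suggestions are our own illustration and not part of the guide; the flowchart in the guide remains the authoritative version.

```python
def next_step(is_library_resource, on_tdm_guide=False, has_api=False):
    """Suggest a next step for getting text data from a source.

    Hypothetical helper that paraphrases the quick summary above;
    all parameter names are assumptions made for illustration.
    """
    if is_library_resource:
        if on_tdm_guide:
            # The source is already listed as TDM-friendly.
            return "Follow the access information on the guide."
        # Library resource, but not listed: ask before doing anything else.
        return "Contact the Library for help."
    if has_api:
        # Not a library resource, but the site offers an API.
        return "Use the site's API."
    # No API: read the rules before collecting anything.
    return "Check the site's terms of use before scraping."


print(next_step(is_library_resource=True, on_tdm_guide=False))
print(next_step(is_library_resource=False, has_api=True))
```

The order of the checks matters: the library question comes first because licensed databases have their own access rules, whether or not they also expose an API.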

Here are some popular choices from our guide:

HathiTrust Research Center (HTRC): Download ngrams for 14 million books (similar to the content in Google Books) or analyze HTRC's full-text collection through its Data Capsule program.
Project Gutenberg's mirrored sites: Over 50,000 public domain ebooks, with a strength in literature. The mirrored sites allow you to scrape books at scale.
The New York Times Annotated Corpus: 1.8 million articles from the New York Times between January 1, 1987 and June 19, 2007.
JSTOR Data for Research: Download word frequencies, citations, key terms, and ngrams for scholarly journal articles in JSTOR.
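Some of these sources provide ngrams or word frequencies rather than readable full text. If ngrams are new to you, this short Python sketch shows what such counts look like when computed from a snippet of text. The tokenizer (lowercase, letters and apostrophes only) and the function name are simplifying assumptions for illustration; HTRC and JSTOR define and package their own files, so check their documentation for the actual formats.

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count n-word sequences (ngrams) in a string.

    Illustrative only: real datasets use their own tokenization rules.
    """
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n tokens across the list by zipping offset copies.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sample = "It was the best of times, it was the worst of times."
counts = ngram_counts(sample, n=2)
print(counts["it was"])   # how often the bigram "it was" occurs
print(counts.most_common(3))
```

Setting `n=1` gives plain word frequencies, which is the kind of table you would work with when a provider offers counts instead of the texts themselves.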

Campus experts are available by email at tdm-access@berkeley.edu to answer questions or help you figure out access to data not already on our list. Happy computing!
