Legal scholars mining millions of bankruptcy case pages

February 20, 2018

Large corporate bankruptcy cases don’t easily lend themselves to empirical research, according to UC Berkeley Law Professor Ken Ayotte, because “sample sizes are small, and the financial data that’s available on the company leading up to bankruptcy is usually sparse and unreliable. We know when the company files, we have some basic background information about it, and we see whether the company reorganizes or liquidates at the end of the case, but we know very little about what happens during the case to drive those outcomes.”

Assembling a corpus of bankruptcy case filings

To discover what the parties in bankruptcy cases are really after -- individually, and in the aggregate across many Chapter 11 cases -- Ayotte teamed up with Associate Professor Jared Ellias of UC Hastings School of Law, Chris Hench of the D-Lab, and Research IT to assemble a large corpus of bankruptcy court documents that they will interrogate using algorithmic Natural Language Processing (NLP) methodologies. Professor Ayotte explains:

“... we think that natural language processing can help us look inside the black box of a Chapter 11 bankruptcy case to get a more detailed and nuanced view of what outcomes the various constituencies -- secured creditors, unsecured creditors, and debtors -- are trying to achieve, and how these objectives can change throughout the case... Existing research based in financial economics gives us some predictions about what outcomes the parties are trying to achieve in the case based on the claims they hold, but we have no direct evidence that the parties are actually advocating for these outcomes in front of the judge. We hope that text analysis will help us uncover these patterns and help us develop more nuanced hypotheses for future research.”

The team obtained scanned images of bankruptcy case filings from some of the federal districts that handle the largest volume of these cases -- including Delaware and the Southern District of New York -- via the online service Public Access to Court Electronic Records (PACER), run by the Federal Judiciary. The yearlong data-gathering phase of their project yielded a corpus of millions of pages from more than half a million documents.

What kinds of useful or interesting conclusions could be drawn from algorithmic examination of this large corpus of textual data? Professor Ellias suggests, for example, “We could try to look to understand why some bankruptcy cases last longer than others, and what sort of exogenous pressures contribute to some cases lasting months and some cases lasting years, and to what extent are those predictable based on things you can find out early on, and can you use text analysis to build a better predictive model than you can if you’re just looking at metadata.”

Turning document scans into analyzable text

When legal documents are scanned for submission to a court via PACER, the digitized result is an image of the document’s pages. To ready them for examination using NLP software, each of these page images must be analyzed by optical character recognition (OCR) software, which translates the page images into parseable, searchable text. Because of the well-structured nature of legal documents, significant categorizing metadata can usually be associated with their OCRed data, such as case numbers, dates, names of the parties and their legal representatives, the type of document section within which a given span of text occurs, and attribution of what each party’s attorneys said to the judge.
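As a rough illustration of how metadata might be attached to OCRed text, the Python sketch below pulls a case number, judge initials, and chapter from a page caption using regular expressions. The caption layout, the patterns, the `extract_metadata` helper, and the sample text are all assumptions made for this example; they are not the research team's actual pipeline, and real filings vary more than this.

```python
import re

# Assumed caption layout: "Case No. 17-12345 (ABC)", i.e. a two-digit year,
# a five-digit sequence number, and optional judge initials in parentheses.
CASE_NO = re.compile(r"Case\s+No\.\s*(\d{2}-\d{5})(?:\s*\(([A-Z]{2,4})\))?")
# Chapters of the Bankruptcy Code most likely to appear in a caption.
CHAPTER = re.compile(r"Chapter\s+(7|11|13)\b")


def extract_metadata(page_text):
    """Return a dict of caption metadata found in one page of OCRed text."""
    case = CASE_NO.search(page_text)
    chapter = CHAPTER.search(page_text)
    return {
        "case_number": case.group(1) if case else None,
        "judge_initials": case.group(2) if case else None,
        "chapter": int(chapter.group(1)) if chapter else None,
    }


# Hypothetical caption text, invented for this example.
sample = "In re: Example Holdings Inc., Debtor.  Chapter 11  Case No. 17-12345 (ABC)"
print(extract_metadata(sample))
```

Fields that cannot be found come back as `None`, so pages with damaged or missing captions can be flagged for review rather than silently mislabeled.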

To address technical challenges, Ellias worked with Chris Hench, a consultant at the D-Lab who recently completed his PhD in German Literature and Culture and Medieval Studies at UC Berkeley. In early phases of the project, Ellias and Hench tried OCR technologies that they eventually deemed less than ideal. ABBYY FineReader, a commercial product that is well suited to performing OCR on documents written in languages other than English and in non-Roman alphabets, accurately rendered the bankruptcy documents as text files but could not be scaled to a very large number of pages at an affordable cost. Tesseract -- software originally developed and open-sourced by Hewlett-Packard, then developed since 2006 by Google -- was free, but its results were not accurate enough to use. Only with the emergence of early (alpha) versions of Tesseract 4, employing recurrent neural networks in an LSTM (long short-term memory) engine, did the software’s OCR results become sufficiently accurate. As Prof. Ellias acknowledged, “It’s absolutely crazy how the technology that made this project possible was being developed while we were doing it.” Tesseract 4 is expected to be released later in 2018.

Berkeley Research Computing (BRC) Cloud Architect Aaron Culich partnered with the research team to obtain access to national infrastructure computing resources run by XSEDE. First, in his capacity as UC Berkeley’s XSEDE Campus Champion, Aaron provisioned a starter allocation to allow the team to test-drive Jetstream, an XSEDE cloud computing resource, as a vehicle suitable for obtaining PACER’s data at the desired scale. Next, he helped the researchers apply for a free allocation on that resource. Virtual machines (VMs) on Jetstream did most of the work of gathering bankruptcy document scans from PACER.

BRC Research Infrastructure Architect Maurice Manning supported the OCR aspect of the project by installing and configuring Tesseract 4 (alpha) in a Singularity container, which was a prerequisite for running the software on UC Berkeley’s shared condo cluster, Savio. Manning then built a Jupyter Notebook from which Hench and Ellias could launch Tesseract jobs on the cluster. The researchers used Professor Ayotte’s Faculty Computing Allowance to perform OCR on Savio.
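To give a sense of what launching a containerized Tesseract job can involve, the Python sketch below assembles the command line for OCRing one page image. It is an illustration under stated assumptions, not the notebook Manning built: the container file name `tesseract4.simg`, the directory layout, and the `tesseract_command` helper are hypothetical. The `singularity exec` subcommand and the Tesseract options `-l` (language) and `--oem 1` (LSTM engine only) are standard, but the exact options the team used are not given in this article.

```python
import shlex
from pathlib import Path


def tesseract_command(image_path, out_dir, container="tesseract4.simg"):
    """Build the argument list for running Tesseract inside a Singularity container.

    The container name and paths are placeholders chosen for this example.
    """
    image = Path(image_path)
    # Tesseract appends ".txt" to the output base name it is given.
    out_base = (Path(out_dir) / image.stem).as_posix()
    return [
        "singularity", "exec", container,
        "tesseract", image.as_posix(), out_base,
        "-l", "eng",   # English language model
        "--oem", "1",  # OCR engine mode 1: neural-net LSTM engine only
    ]


# Hypothetical page image; print the command instead of running it.
cmd = tesseract_command("scans/doc_0001_page_001.png", "ocr_text")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string makes it straightforward to hand to `subprocess.run` or to write one line per page into a cluster job script.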

Next steps: Natural Language Processing at scale

Now that Professors Ayotte and Ellias have assembled their corpus as OCRed text, they are positioned to move into the analytical stages of their research. To do so, they plan to consult with campus faculty whose expertise in Natural Language Processing (NLP) can direct them to the most advanced and suitable techniques to extract meaningful conclusions from their massive pile of data.

Chris Hench points out characteristics of the data he and the law professors assembled over the past year that are likely to excite the interest of NLP experts: “The size is obviously really attractive. The domain, of people talking about why companies go bankrupt, is interesting. The fact that you have actual dialog is rare -- not only is it dialog, but it’s pretty well recorded by transcribers [...] real back and forth between a lawyer and a judge. And I think the most exciting part for a computational linguist is having so much text that is decently structured” -- in which different parts of standard-format legal documents are consistently labeled across a large number of documents from different cases, which is not characteristic of most large corpora of digitized text.

Asked whether their project is likely to inspire other researchers to mine well-structured legal corpora, Prof. Ellias responds, “Text is certainly a research frontier in law, as it is across the social sciences -- and especially in finance, which is the closest discipline to our particular silo in law.” Professor Ayotte notes that papers have been published on textual analysis of corpora of contracts, but that text analysis in the business litigation context is relatively new.

Berkeley Research Computing looks forward to further partnership with Professors Ayotte and Ellias, and with D-Lab, to support this ongoing research project. If you have a research project that involves Optical Character Recognition at scale, we hope to support your work as well. Please send a note to, and we’ll be happy to set up a consultation!