Digital Humanist aims to run OCR over a terabyte of rare book scans

April 20, 2017

Emilia Malachowski

Since his college days at Brigham Young University (BYU), Adam Anderson has been measuring evenings and weekends in pages rather than hours. “You can scan about 400 pages an hour, once you get in the groove,” he explains. Anderson, a Mellon Postdoctoral Fellow in Digital Humanities at UC Berkeley, has spent his career scanning texts in order to draw upon secondary literature in archaeology and computational linguistics. He describes the collection he has built through scanning as a “personal digital library I can take wherever.” Anderson is using Optical Character Recognition (OCR) software to convert the page images he has scanned into machine-readable text, and using the results to quantify social and economic landscapes emerging during the late third to early second millennia BCE in the ancient Near East. The computational methods that underpin his research, including natural language processing and social network analysis, have been highlighted through venues like the Harvard Horizons symposium.

While BYU has a sizable collection of materials on ancient and Near Eastern studies, the foundational works in Anderson’s field are mostly written in French and German, date to the late 19th century, and are available only at a few elite research universities. It was a boon to Anderson’s collection when he began his graduate studies at Harvard, which gave him access to one of the country’s premier research collections in his field. Research trips to Hebrew University, Ludwig Maximilian University of Munich, and the University of Copenhagen, always with a small flatbed scanner in tow, further expanded his personal digital library: he used the scanner to capture images of the texts he was working with on site. Today, Anderson is processing the page images and converting them to machine-readable text using ABBYY FineReader and Tesseract on computing resources provided by Berkeley Research Computing (BRC). Running these OCR engines on powerful servers has increased both the rate at which page images can be processed in bulk and the accuracy with which Sumerian text can be identified within multi-language documents. Typically, it takes around 25 seconds to run OCR software over a one-page image in a PDF file. Open source engines like Tesseract are valuable because they can handle page images containing multiple languages and can process documents on a large scale without human intervention. Ideally, Anderson hopes to access all of his scanned and OCR-digitized materials in the cloud, where he could listen to them on a headset and see them in a 3D format. Anderson noted that “this doesn’t seem too far off,” given how rapidly OCR engines are improving and the quantity of page images he has scanned since 1999: all 1.5 terabytes of it (with which he has nearly filled the storage quotas of two Google accounts).
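The figure of roughly 25 seconds per page lends itself to a back-of-the-envelope estimate of how long a bulk OCR run might take. The sketch below is only illustrative: the page count and the number of parallel workers are hypothetical values (the article does not give either), and it assumes the work divides evenly across workers, which real jobs rarely do.

```python
def estimate_ocr_hours(num_pages, seconds_per_page=25.0, workers=1):
    """Estimate wall-clock hours to OCR `num_pages` page images.

    seconds_per_page defaults to the ~25 s figure quoted in the article.
    workers is a hypothetical count of parallel processes; perfect
    scaling across workers is assumed for simplicity.
    """
    if num_pages < 0 or workers < 1:
        raise ValueError("num_pages must be >= 0 and workers must be >= 1")
    total_seconds = num_pages * seconds_per_page
    return total_seconds / workers / 3600.0


# Hypothetical corpus size, chosen only to illustrate the arithmetic.
hypothetical_pages = 1_000_000
print(f"{estimate_ocr_hours(hypothetical_pages):,.0f} hours on one process")
print(f"{estimate_ocr_hours(hypothetical_pages, workers=20):,.0f} hours on 20 processes")
```

At 25 seconds per page, a single process gets through 144 pages an hour, well below the roughly 400 pages an hour Anderson reports for scanning, which is one reason running many OCR processes side by side matters for a collection of this size.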

Having easy access to his personal digital library anywhere there’s an internet connection has done much more for Anderson than reduce his number of trips to the library. Both the legal framework and accepted scholarly practice surrounding archaeological findings have evolved considerably since the 19th century, when Western scholars took artifacts from where they were discovered to wherever their research was based. Now, excavated objects remain in their country of origin, and laws restrict Western scholars from publishing about objects until scholars in the source country have done so first. While Anderson was photographing tablets in a museum in Turkey last summer, the museum director approached him, claiming the tablets had not been scanned previously and thus were not available for Anderson to photograph, as publication rights were reserved for Turkish scholars. Anderson’s collection came in handy: he was able to pull up PDF files of papers in which the tablets had already been published, showing they had already been scanned and thus sidestepping the legal complications that would have arisen had he been unable to prove prior publication.

It was in the process of scanning primary and secondary texts for his upcoming book that Anderson became involved with Berkeley Research Computing. While going through his documents, Anderson found that he needed a way to identify Sumerian text within the documents quickly and automatically, so that he did not have to review each document manually. Anderson is using open source OCR engines running on BRC servers to “structure unstructured data.” Rather than a collection of page images grouped in PDFs, Anderson seeks to create a corpus of machine-readable, well-categorized ancient texts using high-powered OCR. The primary goal in Anderson’s current work with BRC is to identify cuneiform primary sources within documents quickly. Tesseract is faster than ABBYY FineReader at processing multiple documents, because ABBYY FineReader requires that each OCR conversion be initiated manually while Tesseract does not. Anderson is therefore using Tesseract running on Savio, the campus High Performance Computing cluster, to run OCR over a representative sample of the ancient texts.
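To illustrate what running OCR without manual initiation can look like, here is a minimal sketch that builds one Tesseract command line per page image in a directory. It is not the project’s actual pipeline. The directory names, the PNG file format, and the `fra+deu` language selection are assumptions made for the example, and the function names are hypothetical. The command shape (`tesseract IMAGE OUTPUTBASE -l LANGS tsv`) follows Tesseract’s standard command-line usage, and it requires that the French and German language data be installed.

```python
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image_path: Path, output_dir: Path,
                            languages: str = "fra+deu") -> List[str]:
    """Build the command that OCRs one page image into a TSV file.

    Tesseract appends the extension itself, so the output base is the
    image name without its suffix.
    """
    output_base = output_dir / image_path.stem
    return ["tesseract", str(image_path), str(output_base),
            "-l", languages, "tsv"]


def ocr_directory(image_dir: Path, output_dir: Path,
                  languages: str = "fra+deu",
                  dry_run: bool = True) -> List[List[str]]:
    """Queue every PNG page image in image_dir for OCR.

    With dry_run=True (the default) the commands are only returned,
    not executed, so the sketch can be inspected without Tesseract.
    """
    commands = [build_tesseract_command(path, output_dir, languages)
                for path in sorted(image_dir.glob("*.png"))]
    if not dry_run:
        output_dir.mkdir(parents=True, exist_ok=True)
        for command in commands:
            # check=True stops the batch if Tesseract reports an error.
            subprocess.run(command, check=True)
    return commands
```

Calling `ocr_directory(Path("page_images"), Path("ocr_output"), dry_run=False)` would process a whole (hypothetical) directory in one go, which is the property that makes a command-line engine convenient for bulk work on a cluster.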

Given that there are hundreds of thousands of documents on Anderson’s Google Drive, it’s not yet clear what portion of his data can be processed under the Faculty Computing Allowance (FCA) that Professor Niek Veldhuis generously offered for Anderson’s project. To work within this limitation, small batches of documents are being moved onto the Savio cluster gradually for processing, rather than in one large data transfer. After the documents are moved onto the Savio cluster, BRC’s Cyberinfrastructure Engineer Maurice Manning executes trial runs of Tesseract on the files; once the OCR run is complete, the documents are moved back onto the Google Drive and deleted from Savio’s scratch (temporary) storage. During the OCR process, Tesseract gives each transcribed word a score: a low score indicates that the system did not recognize the word well and has likely transcribed it incorrectly, while a high score means the word was easily detected and transcribed. Since Tesseract does not deal well with horizontal pages or graphics/tables, the software was modified for this project so that it assigns low scores to words with dashes between them (similar to hyphenated words), which is characteristic of how cuneiform text is transliterated. This modification increases the chance that low-scoring words are cuneiform text. The initial trial runs of Tesseract over Anderson’s corpus will make it possible to estimate how computationally intensive it will be to process the entire corpus. This will help predict whether Anderson’s project can fit within the FCA limits, or whether he’ll need to apply for an allocation of national infrastructure resources to realize his goal.
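The idea of using per-word scores to surface transliterated cuneiform can be approximated without touching Tesseract itself. The sketch below is not the modification described above. It post-processes stock Tesseract TSV output, which carries a `conf` and a `text` column for each word, and keeps words that are both low-scoring and hyphen-joined. The threshold of 60, the function name, and the sample rows are all hypothetical choices for illustration.

```python
import csv
import io
import re
from typing import List, Tuple

# Two or more word-like segments joined by hyphens, e.g. "lugal-e".
HYPHENATED = re.compile(r"^\w+(?:-\w+)+$")


def candidate_transliterations(tsv_text: str,
                               max_conf: float = 60.0) -> List[Tuple[str, float]]:
    """Return (word, confidence) pairs that look like transliteration.

    A word qualifies when its confidence is at or below max_conf
    (a hypothetical cutoff) and it is a hyphen-joined token.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    hits = []
    for row in reader:
        # Drop surrounding punctuation so "lugal-e," still matches.
        text = (row.get("text") or "").strip().strip(".,;:()[]")
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue
        # Tesseract uses conf = -1 for rows that are not words.
        if conf < 0 or not text:
            continue
        if conf <= max_conf and HYPHENATED.match(text):
            hits.append((text, conf))
    return hits


# Invented sample rows in Tesseract's TSV column layout.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t50\t12\t95\tMuseum\n"
    "5\t1\t1\t1\t1\t2\t70\t10\t80\t12\t41\tlugal-e\n"
    "5\t1\t1\t1\t1\t3\t160\t10\t90\t12\t92\tmulti-language\n"
    "4\t1\t1\t1\t1\t0\t10\t10\t300\t12\t-1\t\n"
)
print(candidate_transliterations(SAMPLE_TSV))
```

In this invented sample only the low-scoring hyphenated token is kept; the confidently recognized hyphenated English word is filtered out by the threshold. A filter like this would still need checking against real pages, since ordinary hyphenated words that OCR reads poorly would pass it too.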

Overall, Anderson notes that his work is “constantly testing the bounds of technology,” bringing the past into our present, merging cultures, languages, and stories together in one interwoven web, and helping us understand our infinite connections to ancient times. To learn more about Adam and the work that he does, listen to his Harvard Horizons Talk or visit the Berkeley Digital Humanities website. If you are interested in learning more about BRC or any of the open source OCR engines available through the program, please email research-it@berkeley.edu.