The increasing prominence of interdisciplinary studies, coupled with the mass digitization and dissemination of primary sources and scholarly literature alike, has made it possible for scholars to pursue boundary-defying research agendas in ways that were previously impossible. However, language remains a significant barrier to interdisciplinary work. Particularly when pursuing research questions that stretch across space and time, scholars inevitably encounter materials in languages with which they are unfamiliar. Without a translation, it may be impossible to tell if a given document is relevant to the researcher’s work. At the same time, professional translation is expensive, and sending translators every potentially relevant document would be prohibitively costly and time-consuming. To resolve this conundrum, researchers need some way to understand the gist of documents in order to prioritize them for translation.
In fall 2017, Maurice Manning, Research IT’s cyberinfrastructure engineer, began to explore whether Research IT could address this need. Manning had previously worked with humanities and law scholars to develop Jupyter notebooks for optical character recognition (OCR) tasks, which convert an image of a book or document into editable, machine-readable text. OCR would be a necessary prerequisite for any machine-translation work. Manning worked with Nicolas Chan, a freshman interested in pursuing computer science and one of the operations interns for Berkeley Research Computing (BRC), on the initial stages of the project. With Manning’s departure from Research IT in December 2017, Chan took over as lead developer.
Unlike some of the previous projects Research IT had supported, where middling OCR quality was acceptable for a computational analysis of millions of pages of text, this project required precise OCR. Many researchers are already familiar with using Google Translate for machine translation of short snippets, and Research IT staff adopted it for this project because it offers an API, which means it could be integrated into a computational workflow. The more text you can give Google Translate, the better it performs, so long as the input text is accurate. However, OCR errors, even in function words (e.g., “the”, “if”, “then”) that would be ignored in computational text analysis, can significantly disrupt the accuracy of Google Translate.
Research IT offers a number of options for OCR, including the Tesseract OCR engine on Savio (which had underpinned Research IT’s earlier OCR projects). Unfortunately, Tesseract’s OCR quality can be inconsistent, causing problematic inaccuracies in translation. The OCR desktop, which provides access to the professional-quality ABBYY FineReader OCR software via an Analytics Environments on Demand (AEoD) virtual machine, requires more work and time on the researcher’s part, but yields much better results. Chan developed a workflow in which researchers can store their documents in bDrive (Google Drive cloud storage), download them to an AEoD virtual desktop, perform OCR using FineReader and correct the OCR as needed, then return plain text files of the OCR results to a specified output folder in bDrive. He then wrote a script that would look for new plain text files in a specified input folder, send the text to the Google Translate API (one of Google’s cloud services), and return translated plain text files to bDrive. While Chan did not have previous experience working with Google’s cloud APIs, he was able to adapt and expand code that Manning had developed for moving data back and forth from bDrive to develop the workflow.
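The shape of that script can be illustrated with a minimal sketch. This is not Chan’s code: local directories stand in for the bDrive folders, and the `translate` argument is a hypothetical placeholder for whatever function actually calls the translation service. The function name `translate_new_files` is likewise an assumption made for illustration.

```python
from pathlib import Path
from typing import Callable, List


def translate_new_files(
    input_dir: Path,
    output_dir: Path,
    translate: Callable[[str], str],
) -> List[str]:
    """Translate every .txt file in input_dir that has no counterpart in output_dir.

    `translate` is a placeholder for the real translation call (assumption).
    Returns the names of the files written on this run.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(input_dir.glob("*.txt")):
        target = output_dir / source.name
        if target.exists():
            continue  # already translated on an earlier run
        text = source.read_text(encoding="utf-8")
        target.write_text(translate(text), encoding="utf-8")
        written.append(target.name)
    return written
```

Skipping files that already exist in the output folder means the script can be rerun safely as new OCR results arrive, without paying to translate the same document twice.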
Early, small-scale tests of this approach were successful, but Chan quickly ran into challenges once he began to scale up to longer documents. The Google Translate API accepts a maximum of 5,000 characters per request, but splitting a document into chunks of a set character length (e.g., 2,000 characters) could split words into nonsense fragments. Even splitting at word breaks could render an entire sentence unintelligible to Google’s translation algorithms. To address this problem, Chan turned to a Python library (nltk) that uses language-specific features to identify sentence boundaries. To enable the script to determine which language to use for parsing when splitting a document into chunks, Chan proposed a naming convention that prepended the appropriate three-letter ISO 639-2 language code to each file name.
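A minimal sketch of sentence-aware chunking follows; it illustrates the idea and is not the project’s actual code. To stay self-contained it uses a naive regex splitter as a stand-in. In practice a language-aware splitter such as nltk’s `sent_tokenize(text, language="german")` could be passed as the `split` argument. The code-to-language table, the function names, and the example file name are all assumptions made for illustration.

```python
import re
from typing import Callable, List

MAX_CHARS = 5000  # per-request limit cited in the article

# Hypothetical mapping from ISO 639-2 prefixes to nltk punkt language names.
PUNKT_LANGUAGES = {
    "eng": "english",
    "deu": "german", "ger": "german",
    "fra": "french", "fre": "french",
}


def language_from_filename(name: str) -> str:
    """Read the three-letter code prepended to a file name (assumed convention)."""
    return PUNKT_LANGUAGES[name[:3].lower()]


def naive_sentences(text: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_sentence(
    text: str,
    split: Callable[[str], List[str]] = naive_sentences,
    limit: int = MAX_CHARS,
) -> List[str]:
    """Group whole sentences into chunks of at most `limit` characters."""
    chunks: List[str] = []
    current = ""
    for sentence in split(text):
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than `limit` is kept whole here;
            # a real script would need a fallback for that case.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

For example, `chunk_by_sentence("One. Two. Three.", limit=9)` groups the first two sentences together and puts the third in its own chunk, so no sentence is ever cut in the middle.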
After addressing sentence parsing, the biggest challenges Chan faced were connected to Google’s cloud APIs themselves. He had to make multiple adjustments to the code to accommodate Google’s rate limiting and request blocking. Every step of developing the notebook required testing, and Chan went through $150 of the $300 in free credits available through Google (the Google Translate API is not available through Google’s “always free” tier). Once the code was functioning as expected, however, Google’s $300 in free credits would provide a researcher with approximately 6,000 pages of machine translation at no cost. A beta version of Chan’s code is now available on Research IT’s GitHub, along with instructions for using it.
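The article does not say how the code was adjusted for rate limiting. One common technique is to retry a throttled request with exponential backoff, sketched below under stated assumptions. `RateLimitError` is a hypothetical stand-in for whatever error a given client raises when throttled, and `request` is any zero-argument function that performs one API call.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the error raised when a request is throttled."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying with exponentially growing waits when throttled."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Wait base_delay * 1, 2, 4, ... plus random jitter so that
            # parallel clients do not all retry at the same moment.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable only so the function can be exercised without real delays; by default it uses `time.sleep`.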
Carla Shapreau, lecturer in the Law School, senior fellow in the Institute of European Studies, and curator in the Department of Music, beta-tested the code. She credits Research IT with supporting her multilingual translation research on campus. She noted, “the output can often provide useful content in the preliminary stages of the translation process and contributes to the selection of relevant records for human translation, necessary for accuracy and refinement.” The impact of the notebook that Research IT and Nicolas Chan designed particularly struck Shapreau when a colleague sent her a 116-page thesis in a foreign language while she was at a symposium. In about 45 minutes she was able to review “an imperfect but useful version” in machine-translated English. Shapreau said, “Research IT is a remarkable campus resource. Rick Jaffe thoughtfully steered me to Quinn Dombrowski and Maurice Manning, who launched this notebook project. Quinn managed and kept the project on course, while our talented and skilled freshman, Nicolas Chan, wrote and rewrote code, continuously tweaking the notebook to keep it running on an array of varied documents in a host of languages. This productive intersection between campus humanities projects and Research IT has resulted in a useful research tool in the translation process that will undoubtedly be of interest and use to other scholars.”
Chan looks forward to returning to Research IT in the fall: “Working with researchers is both fulfilling and motivating because I can immediately tell that the work I do is meaningful and helpful for advancing research. This project in particular provided me with the opportunity to design a workflow, refine my skills in Python notebook programming, and interact with Google APIs, all of which will likely be useful in creating future workflows. Research IT has provided me with the opportunity to learn new skills while making positive impact on research in a variety of fields across UC Berkeley, and I can’t wait to continue doing so during my time here.”
The AEoD OCR desktop is currently undergoing an extended maintenance period for the summer as BRC reassesses its service offerings in response to campus needs. ABBYY FineReader is still available on the desktop computers in the D-Lab. If having access to FineReader from anywhere via a virtual desktop would be useful to your research, please contact us at email@example.com.