Digitizing 17th-century Dutch documents

May 10, 2017

As a Digital Humanities Project Archivist, Julie van den Hout has made good use of the Berkeley Research Computing (BRC) Program’s free, virtualized instance of ABBYY FineReader to create a searchable corpus of seventeenth-century Dutch-language documents. These documents, from the Bancroft Library’s Engel Sluiter Historical Documents Collection, represent only a fraction of the ninety-two cartons of materials on seventeenth-century trade amassed by the late UC Berkeley Professor Sluiter. They happen to be among the most interesting to van den Hout -- who is passionate about seventeenth-century Dutch language and writing, the Dutch Golden Age, and the Colony of New Netherland (now New York) -- and they were a tractable quantity of material for her digitization project.

In 2015, after a twenty-year career in medicine as a cardiovascular perfusionist, van den Hout completed a Bachelor of Arts degree in Linguistics and Dutch Studies at UC Berkeley. This coming fall she will begin a Master’s program in History at San Francisco State University, where she plans to work with a little-known account of the funeral of England’s Queen Elizabeth I, as well as other materials written by a Dutch ambassador to London from 1585 until several years into the reign of King James I. Before she completes the first year of her Master’s program, her biography of Adriaen van der Donck of the colony of New Netherland will be published by SUNY Press (forthcoming, April 2018).

The seven hundred documents the Bancroft Library digitized as part of van den Hout’s project -- funded through a Digital Humanities at Berkeley collaborative research grant between Bancroft and the Dutch Studies program -- include many contracts, a full-length book transcribed by Prof. Sluiter from a handwritten manuscript, testimonies of court cases heard aboard trading vessels, and occasional letters. Seventeenth-century Dutch used a different orthography from the modern language (words were spelled differently), and most Optical Character Recognition (OCR) software relies on modern dictionaries, so the OCR output would require proofreading. Manual interventions and corrections by van den Hout would therefore be necessary to accurately transform digital page scans into digitized (OCR’d) text. ABBYY FineReader best fit the project’s needs, and a licensed copy of the software in a virtualized research environment, available remotely via the BRC Program’s AEoD service, fit van den Hout’s need to work on a well-provisioned computational resource from her home office, a couple of hours’ drive from the Berkeley campus. Remote and in-person consultations with BRC’s digital humanities specialist, Quinn Dombrowski, helped van den Hout resolve problems ranging from loading large page image files to cropping out Prof. Sluiter’s handwritten marginalia and underlining. Ultimately, despite the manual steps needed to correct the texts, van den Hout was impressed by how well ABBYY FineReader recognized seventeenth-century Dutch text.

Now that the materials have been converted to digitized text, they will be made available to other researchers. In the next phase of the project, keyword extraction, frequency counting, and topic modeling software will undoubtedly help surface still-undiscovered themes and patterns in the corpus. As a website is developed to house the digitized documents, van den Hout looks forward to writing about the collection with an eye to inviting scholarly inquiry.
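For readers curious what frequency counting on a digitized corpus might look like in practice, here is a minimal sketch in Python. It is not the project's actual tooling. The stopword list, the function names `tokenize` and `keyword_counts`, and the sample sentence are all hypothetical and made up for illustration; a real analysis of seventeenth-century Dutch would likely need a fuller, period-appropriate stopword list and some handling of variant spellings.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"de", "het", "een", "en", "van", "in", "op", "te"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def keyword_counts(text, stopwords=STOPWORDS, min_length=3):
    """Count words that are not stopwords and have at least min_length letters."""
    tokens = tokenize(text)
    return Counter(
        t for t in tokens if len(t) >= min_length and t not in stopwords
    )


# Made-up sample sentence, not taken from the Sluiter collection.
sample = "Het schip van de compagnie en de lading van het schip"
counts = keyword_counts(sample)

# Show the most frequent remaining words with their counts.
for word, n in counts.most_common(3):
    print(word, n)
```

Run over the OCR'd text of a whole document, or over each document in turn, the same counts could serve as a first, rough view of which terms dominate the corpus before moving on to topic modeling.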

Please contact Research IT at research-it@berkeley.edu with any questions or interest in using ABBYY FineReader or other Optical Character Recognition tools.