OCR desktop

Interested in converting PDFs or images of text to searchable, editable text with more accuracy than Adobe Acrobat? Sign up to use the AEoD optical character recognition (OCR) research desktop! This research desktop has ABBYY FineReader installed, which supports complex formatting (columns, tables, etc.) and 190 languages (including Arabic, Chinese, Cyrillic, Greek, Hebrew, Japanese, and Thai scripts). Export formats include Microsoft Word and Excel. (If you run into permissions problems with the sign-up form, here's how to troubleshoot them.) You can also review the documentation on how to use the AEoD OCR research desktop.

If you have thousands of PDFs to OCR and precision is less important, you might want to use Tesseract OCR on Savio, Berkeley's high-performance computing cluster. Tesseract is an open-source OCR engine. It doesn't perform as well as ABBYY FineReader on documents with complex layouts, but it makes it possible to OCR large corpora of texts in bulk. Contact us at brc@berkeley.edu to talk about how you can get access to Tesseract on Savio.
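To give a rough idea of what bulk OCR with Tesseract looks like, here is a minimal Python sketch that builds one `tesseract` command per page image in a folder and then runs them in sequence. It is an illustration under stated assumptions, not the Savio workflow: the function names, folder layout, and default language are hypothetical, and it assumes the `tesseract` command-line tool is installed. Tesseract reads image files rather than PDFs, so the sketch also assumes your PDF pages have already been converted to PNG or TIFF images.

```python
from pathlib import Path
import subprocess

# File types this sketch treats as page images (an assumption for illustration).
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}


def build_tesseract_commands(image_dir, out_dir, lang="eng"):
    """Return one tesseract command (as an argument list) per image in image_dir.

    Each command follows the CLI form:
        tesseract <image> <output base> -l <lang> txt
    which writes the recognized text to <output base>.txt.
    """
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    commands = []
    for image in sorted(image_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not a page image
        out_base = out_dir / image.stem  # tesseract appends ".txt" itself
        commands.append(
            ["tesseract", str(image), str(out_base), "-l", lang, "txt"]
        )
    return commands


def run_all(commands):
    """Run each command in turn; stop with an error if any page fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_all(build_tesseract_commands("pages", "text"))` would OCR every image in a hypothetical `pages` folder into an existing `text` folder. On a cluster, the list of commands would more likely be split across jobs than run in a single loop.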

Curious about which OCR software is right for your project? Research IT has a video that compares OCR software.