Contact: Ben Bongalon ([email protected])
Retro-digitization is the process of converting a paper-based historical publication into an electronic format suitable for publishing online or for sharing as a digital resource. In this tutorial, you will learn the workflow we developed to digitize a 1953 bilingual dictionary. For details, see our paper "Using Open-Source Tools to Digitize Lexical Resources for Low-Resource Languages" (upcoming).
We designed the workflow so that even those with modest budgets can conduct their own retro-digitization projects. In doing so, we hope to encourage more communities, especially speakers of minority and indigenous languages, to build e-dictionaries and other digital lexical resources for their mother tongue.
You will use sample pages from Harold Conklin's 1953 Hanunoo-English dictionary. Hanunoo (IPA: "hanunuʔɔ") is an indigenous language spoken by ~25,000 Hanunoo Mangyan people in the Philippines. Although the Hanunoo have a native writing system called Surat Mangyan, the dictionary prints Hanunoo words in Roman letters and denotes their pronunciations with characters outside the basic Roman alphabet. These include five vowels with diacritical marks (á é í ó ú), the eng character 'ŋ', and the glottal stop symbol 'ʔ'. Here are two sample entries from the dictionary where you can see them used.
You will train the open-source Tesseract OCR engine to recognize the special character 'ŋ' since no existing engine can. (The glottal stop symbol will be handled differently, and Tesseract already has a language model that recognizes the vowels with diacritical marks.) You will also format the OCR-ed pages into XML, then load, edit, and display them in a locally installed Lexonomy dictionary server. How cool is that? :-)
Example dictionary hosted in Lexonomy
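If you want to see exactly which characters are involved, the short Python sketch below lists the special characters mentioned above with their Unicode code points and names. It is only an illustration and not part of the tutorial's workflow: the names `SPECIAL_CHARS`, `describe`, and `find_special` are made up for this example. It uses only the standard `unicodedata` module, and it normalizes text to NFC on the assumption that you want an accented vowel typed as "letter + combining accent" to be treated the same as the single precomposed character.

```python
import unicodedata

# The special characters named in this tutorial (assumed set for illustration).
# Normalizing to NFC guarantees each accented vowel is one precomposed character.
SPECIAL_CHARS = unicodedata.normalize("NFC", "áéíóúŋʔ")


def describe(chars):
    """Return (character, code point, Unicode name) for each character."""
    return [(c, f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


def find_special(text):
    """Return the sorted special characters that occur in text."""
    normalized = unicodedata.normalize("NFC", text)
    return sorted(set(normalized) & set(SPECIAL_CHARS))


if __name__ == "__main__":
    for char, code_point, name in describe(SPECIAL_CHARS):
        print(char, code_point, name)
```

A helper like `find_special` can be handy later for spot-checking whether a line of OCR output contains any of these characters at all.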
- Computer running Ubuntu 18.04 or later (see Note below)
- Python 3 installed
- Admin privileges to install software
- Familiarity with running commands in a console
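As a quick sanity check of the list above, a small script along these lines can confirm that you are running Python 3 and that a given command-line tool is on your `PATH`. This is a minimal sketch, not something shipped with the project: the function name `check_prerequisites` is hypothetical, and checking only for `git` by default is our assumption, based on the clone step in this tutorial.

```python
import shutil
import sys


def check_prerequisites(tools=("git",)):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    results = {"python3": sys.version_info >= (3,)}
    for tool in tools:
        # shutil.which returns None when the executable is not found on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

You can pass other tool names in the `tools` tuple once you know which commands later steps rely on.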
To follow along, clone the Git project into your working directory.
$ git clone https://github.com/isawika/retro-digitization.git
$ cd retro-digitization
Note: The tutorial should run on other Linux systems with only minor tweaks, but we have not tested this. Running on Mac or Windows should also be possible but needs more work. Contact us if you want to discuss.
We follow the technical steps outlined in the DariahTeach project, highlighted as blue-ish boxes below: