How to Retro-Digitize a Historical Dictionary

Contact: Ben Bongalon ([email protected])

Retro-digitization is the process of converting a paper-based historical publication into an electronic format suitable for publishing online or for sharing as a digital resource. In this tutorial, you will learn the workflow we developed to digitize a 1953 bilingual dictionary. For details, see our paper "Using Open-Source Tools to Digitize Lexical Resources for Low-Resource Languages" (upcoming).

We designed the workflow so that even those with modest budgets can conduct their own retro-digitization projects. In doing so, we hope to encourage more communities, especially speakers of minority and indigenous languages, to build e-dictionaries and other digital lexical resources for their mother tongue.

What You'll Do

You will use sample pages from Harold Conklin's 1953 Hanunoo-English dictionary. Hanunoo (IPA: "hanunuʔɔ") is an indigenous language spoken by ~25,000 Hanunoo Mangyan people in the Philippines. Although they have a native writing system called Surat Mangyan, the dictionary prints Hanunoo words in Roman letters and denotes their pronunciations with additional non-Roman characters. These include five vowels with diacritical marks (á é í ó ú), the eng character 'ŋ', and the glottal stop symbol 'ʔ'. Here are two sample entries from the dictionary where you can see them used.
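To make the character set concrete, here is a minimal Python sketch that lists the special characters named above with their Unicode code points and names, and finds them in a piece of text. The helper names `describe` and `find_special` are our own illustration and are not part of the project's scripts.

```python
import unicodedata

# The special characters named in the paragraph above.
SPECIAL_CHARS = "áéíóúŋʔ"


def describe(chars):
    """Return (character, code point, Unicode name) for each character."""
    return [(c, f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


def find_special(text, specials=SPECIAL_CHARS):
    """List the special characters found in `text`, in order of appearance.

    The text is NFC-normalized first, so a vowel typed as a base letter
    plus a combining acute accent matches its precomposed form.
    """
    normalized = unicodedata.normalize("NFC", text)
    return [c for c in normalized if c in specials]
```

Calling `describe(SPECIAL_CHARS)` returns one tuple per character, which is handy when checking which glyphs an OCR model needs to cover.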

You will train the open-source Tesseract OCR engine to recognize the special character 'ŋ', since no existing engine can. (The glottal stop symbol will be handled differently, and Tesseract already has a language model that recognizes the vowels with diacritical marks.) You will also format the OCR-ed pages into XML, then load, edit, and display them in a locally installed Lexonomy dictionary server. How cool is that? :-)

Example dictionary hosted in Lexonomy
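As a rough illustration of the "format into XML" step, the sketch below builds one dictionary entry with Python's standard `xml.etree.ElementTree` module. The element names (`entry`, `headword`, `pronunciation`, `definition`) and both function names are assumptions made for this example; they are not necessarily the schema produced by `conklin2xml.py` or expected by your Lexonomy configuration.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(headword, pronunciation, definition):
    """Build one <entry> element from three proofread text fields.

    The element names are hypothetical placeholders.
    """
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "pronunciation").text = pronunciation
    ET.SubElement(entry, "definition").text = definition
    return entry


def entry_to_string(entry):
    """Serialize the element as a Unicode string (no XML declaration)."""
    return ET.tostring(entry, encoding="unicode")
```

Because the serializer escapes characters such as `&` and `<` for you, building the tree this way is usually safer than concatenating strings by hand.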

Prerequisites

Computer running Ubuntu 18.04 or later (see Note below)
Python 3 installed
Admin privileges to install software
Familiarity with running commands in a console

To follow along, clone the Git project into your working directory.

$ git clone https://github.com/isawika/retro-digitization.git
$ cd retro-digitization

Note: The tutorial should run on other Linux systems with only minor tweaks, but we have not tested this. Running on Mac or Windows should also be possible but needs more work. Contact us if you want to discuss.

The Workflow

We follow the technical steps outlined in the DariahTeach project, shown as the blue-ish boxes in the workflow diagram and listed below:

Step 1: Planning
Step 2: Image Capture
Step 3: Text Capture
- 3.1 - Prepare the Training Data
- 3.2 - Finetune Tesseract (train the OCR language model)
- 3.3 - Transcribe the dictionary pages with the trained model
- 3.4 - Proofread the pages
Step 4: Data Modeling & Enrichment
Step 5: Publish
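Step 3.3 comes down to running Tesseract over each page image with the fine-tuned model. The sketch below only assembles the command line, so you can inspect it before running anything. It is not the content of `runocr.sh`; the function name `build_ocr_command`, the file paths, and the model name passed as `lang` are placeholders you would replace with your own.

```python
from pathlib import Path


def build_ocr_command(image, output_dir, lang, psm=3):
    """Assemble a Tesseract command line for one page image.

    `lang` is the name of the traineddata model to use (a placeholder
    here). `psm` is Tesseract's page segmentation mode; 3 is its default
    (fully automatic), and the best value depends on the page layout.
    """
    image = Path(image)
    # Tesseract appends the extension (.txt by default) to this base name.
    output_base = Path(output_dir) / image.stem
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--psm", str(psm),
    ]
```

To actually transcribe a page, pass the returned list to `subprocess.run(cmd, check=True)` on a machine where Tesseract and your trained model are installed.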
