Saumya Shah edited this page Aug 14, 2018 · 9 revisions

Free UK Genealogy - Google Summer of Code 2018

To view all the projects being undertaken by Free UK Genealogy in GSoC 2018, click here.

Probate Parsing

Aim

Free UK Genealogy aims to launch a new project to expose genealogical information from wills and probate books. These books record the date and location of people's deaths, their occupations, and often the same information about the family members who executed the wills.

In previous projects, all this material was transcribed manually by volunteers, as the source documents were handwritten. The probate books are different, however, in that they are printed and thus are accessible to OCR. We should be able to use OCR text to seed a database by parsing the text for names, dates, occupations, and relationships. We should also be able to use OCR bounding box coordinates to associate regions of a scanned page with an entry for presentation to researchers or for correction by volunteers.

Project Background

Free UK Genealogy has a simple mission, “Human transcription of family data”, but carrying it out presents many difficulties. One of the biggest problems such organizations face is the curation and management of data. To this day, many communities record their family history with pen and paper. Moreover, families wishing to explore their lineage are left helpless by the lack of a system for keeping accurate records. With scope for millions of records, technology has a clear role to play in these systems.

I believe this project is very relevant to the organization as most of the genealogical data present in the world today is either handwritten or transcribed in books. While there is a dire need to digitize the information, the process of consolidating data in print to records in a database requires tedious labor from volunteers. Instead of manually entering data into the system, we can automate it from start to end.

Repository Link

https://github.com/FreeUKGen/ProbateParsing

Approach

The system I have implemented is an end-to-end pipeline that extracts text from probate books and seeds a database with the entities found in it, such as name, county, date, and relationships. The system can be broken down into four phases:

Bounding Boxes using Image Processing

In phase 1, the scanned images of the probate books are refined using image processing methods. A single page of a probate book contains many entries, so the aim of this phase is to isolate each entry by cropping the original image entry by entry, so that the OCR step produces the best possible output despite its discrepancies.
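As a rough illustration of entry-wise cropping, the sketch below splits a page wherever a run of blank pixel rows separates two blocks of text (a horizontal projection profile). This is an assumed technique chosen for illustration, not necessarily the method used in the repository; the names `find_row_bands` and `crop_entries` and the `threshold` and `min_gap` defaults are hypothetical.

```python
def find_row_bands(ink_per_row, min_gap=3, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink
    and are separated from each other by at least `min_gap` blank rows."""
    bands = []
    start = None      # first row of the band currently being read
    blank_run = 0     # consecutive blank rows seen inside that band
    for y, ink in enumerate(ink_per_row):
        if ink >= min_ink:
            if start is None:
                start = y
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # The gap is wide enough: close the band before the gap.
                bands.append((start, y - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        bands.append((start, len(ink_per_row) - blank_run))
    return bands


def crop_entries(page_path, threshold=128, min_gap=12):
    """Crop a scanned page into one image per band of text rows.

    Assumes dark text on a light background. Pillow is imported here so
    that find_row_bands can be used without it.
    """
    from PIL import Image

    page = Image.open(page_path).convert("L")  # 8-bit grayscale
    width, height = page.size
    pixels = page.load()
    # Count the dark pixels in every row of the page.
    ink_per_row = [
        sum(1 for x in range(width) if pixels[x, y] < threshold)
        for y in range(height)
    ]
    return [
        page.crop((0, top, width, bottom))
        for top, bottom in find_row_bands(ink_per_row, min_gap=min_gap)
    ]
```

On real scans the gap between entries may be no wider than the gap between lines, so `min_gap` would need tuning per book, or a different cue such as indentation.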

Text extraction using Optical Character Recognition

In phase 2, the cropped images are passed through an Optical Character Recognition engine to extract the text of each entry.
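A minimal sketch of this step, assuming the `pytesseract` wrapper is used to call Tesseract-OCR (the page lists Tesseract-OCR but does not say which binding is used). It keeps the word-level bounding boxes that Tesseract reports, so that text can later be tied back to a region of the scan. The helper names `words_with_boxes` and `ocr_entry` are hypothetical.

```python
def words_with_boxes(data, min_conf=0.0):
    """Turn the dict returned by pytesseract.image_to_data into a list of
    {"text", "box"} items, where box is (left, top, right, bottom)."""
    words = []
    for i, text in enumerate(data["text"]):
        # Skip empty rows (block/line records) and low-confidence words.
        if not text.strip() or float(data["conf"][i]) < min_conf:
            continue
        left, top = data["left"][i], data["top"][i]
        box = (left, top, left + data["width"][i], top + data["height"][i])
        words.append({"text": text, "box": box})
    return words


def ocr_entry(image):
    """Run Tesseract on one cropped entry (a PIL image).

    Returns the entry text and the per-word boxes. Requires the
    pytesseract package and a Tesseract installation.
    """
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    words = words_with_boxes(data)
    return " ".join(w["text"] for w in words), words
```

The boxes are relative to the cropped entry, so adding the crop's offset on the page would be needed to map them back to the original scan.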

Named Entity Recognition using Language Processing

Phase 3 is one of the most challenging parts of the project: extracting meaningful information from the text generated in phase 2. For the algorithm to extract particular fields such as name, relationships, and occupation, it must “learn” the semantics of each probate entry. Methods that tackle such problems belong to Named Entity Recognition, an integral part of Information Extraction. The model is intended to learn continually, improving with every data item added to the database. For an organization that handles millions of data records, such a solution would automate a great deal of data curation and would also be far more scalable than manual entry.
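To make the "learning" step more concrete, the sketch below prepares labelled probate text in the character-offset form that spaCy's NER training works from, and packs it into a `DocBin` (spaCy v3 API). The labels, the helper names `make_example` and `to_docbin`, and the sample sentence are all invented for illustration; the repository may use a different spaCy version or label set.

```python
def make_example(text, labelled_phrases):
    """Build one training example: (text, {"entities": [(start, end, label)]}).

    `labelled_phrases` is a list of (phrase, label) pairs; each phrase is
    located by its first occurrence in `text`.
    """
    entities = []
    for phrase, label in labelled_phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not found in text: {phrase!r}")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


def to_docbin(examples):
    """Pack examples into a spaCy DocBin for training (requires spaCy v3)."""
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # tokenizer only; no pretrained model needed
    doc_bin = DocBin()
    for text, annotations in examples:
        doc = nlp.make_doc(text)
        spans = [
            doc.char_span(start, end, label=label)
            for start, end, label in annotations["entities"]
        ]
        # char_span returns None when offsets do not align with tokens.
        doc.ents = [span for span in spans if span is not None]
        doc_bin.add(doc)
    return doc_bin


# Invented sample entry, not taken from a real probate book.
sample = make_example(
    "SMITH John of Leeds died 4 March 1885.",
    [("SMITH John", "PERSON"), ("Leeds", "GPE"), ("4 March 1885", "DATE")],
)
```

Entries corrected by volunteers could be converted with the same helper and added to the training set, which is one way the model could keep improving as records accumulate.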

Database Seeding based on the entities generated

In phase 4, the database is populated with the entities generated in phase 3 for each probate record.
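A minimal sketch of the seeding step, assuming a SQLite database and a single flat table. The page does not name the database or schema, so the table name `probate_entries`, the column list, and the function `seed_database` are hypothetical.

```python
import sqlite3

# Hypothetical column set, based on the entities mentioned in this page.
FIELDS = ("name", "county", "death_date", "occupation", "relationship")


def seed_database(conn, records):
    """Insert one row per probate record; missing fields are stored as NULL.

    `records` is a list of dicts keyed by the names in FIELDS.
    Returns the number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS probate_entries ("
        "id INTEGER PRIMARY KEY, name TEXT, county TEXT, "
        "death_date TEXT, occupation TEXT, relationship TEXT)"
    )
    rows = [tuple(record.get(field) for field in FIELDS) for record in records]
    conn.executemany(
        "INSERT INTO probate_entries "
        "(name, county, death_date, occupation, relationship) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

For example, `seed_database(sqlite3.connect(":memory:"), [{"name": "John Smith", "county": "Yorkshire"}])` inserts one row with the remaining columns left empty.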

Technologies used

Python, Tesseract-OCR, Pillow, SpaCy, Pandas