Skip to content

Layout analysis and OCR from 17th books in TEI files and their ODD.

Notifications You must be signed in to change notification settings

Heresta/CORPUS17plus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 

Repository files navigation

TEI17

This repository contains layout analysis and OCR from 17th books in TEI files and their ODD.

Production

Thoses files were created thanks to a pipeline :

  1. Segmentation and transcription with eScriptorium, using models from datasetsOCRSegmenter17 github repository

  2. Manual correction of ALTO4 files extracted from eScriptorium

  3. Python script pipeline to transform those ALTO4 files in a unique TEI file (see Extractor repository) , adding some metadata (extracted from manifest IIIF and SPARQL requests in data.bnf.fr).

How TEI file is built ?

This TEI file tries to stick at most to TEI all documentation.

So it contains :

  1. teiHeader in which there is all metadata recovered with manifest IIIF and SPARQL request, some information about encoding (use of SegmOnto vocabulary, some information about book's printer(s)

  2. facsimile in which is all layout informations about different zones, lines, and baselines, with pixels coordinates and links to IIIF images

  3. text in which is all transcription, linked to the concerned line

Credits

Documents have been encoded by Claire Jahan with the help of Simon Gabay, as part of the E-ditiones project.

Contact

Claire Jahan : claire.jahan[at]chartes.psl.eu

Simon Gabay : Simon.Gabay[at]unige.ch

Licence

This repository is CC-BY.
Creative Commons License

Cite this repository

Claire Jahan, Simon Gabay. 2021. CORPUS17+ - Corpus of TEI encoded 17th French prints., Paris/Geneva: ENS Paris/UniGE, 2021, https://github.com/Heresta/CORPUS17plus.

About

Layout analysis and OCR from 17th books in TEI files and their ODD.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published