Handwritten documents and ALTO encoding - how to make ALTO more suitable for such documents - ideas #81

Open
cipriandinu opened this issue Oct 14, 2022 · 3 comments

Comments

@cipriandinu
Member

Handwritten documents appear more and more often in current projects, and although ALTO can already be used to describe the page layout and text of this type of material, I think there is still room for improvement. One recent change concerned the baseline definition, which was changed from a float value (the y coordinate of the line) to PointsType, since the baseline of handwritten text is not a straight line. There are probably many more issues related to this topic that we can discuss and improve.
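To illustrate the baseline change, here is a minimal sketch of a reader that accepts both forms. It assumes the points are serialized as whitespace-separated "x,y" pairs, which should be checked against the schema version in use; the function name `parse_baseline` and its `hpos`/`width` parameters are hypothetical and only serve the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def parse_baseline(value: str, hpos: float, width: float) -> List[Point]:
    """Return a baseline as a polyline (list of (x, y) points).

    A single number is treated as the older form (one y coordinate) and is
    expanded to a horizontal segment spanning the line's bounding box.
    Anything else is assumed to be whitespace-separated "x,y" pairs.
    """
    tokens = value.split()
    if len(tokens) == 1 and "," not in tokens[0]:
        y = float(tokens[0])
        return [(float(hpos), y), (float(hpos + width), y)]

    points: List[Point] = []
    for token in tokens:
        x_str, y_str = token.split(",")
        points.append((float(x_str), float(y_str)))
    return points


straight = parse_baseline("120", hpos=10, width=200)
curved = parse_baseline("10,120 110,124 210,119", hpos=10, width=200)
```

A consumer written this way can treat every baseline as a polyline, regardless of which form the producer wrote.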

This topic is intended as a place to collect ideas for further discussion; from here we will pick the most important topics and create individual issues.

@cipriandinu
Member Author

I asked some people from Transkribus why they chose PAGE instead of ALTO, and what ALTO is missing to be a better format for the handwriting community. Here is the answer:

"As far as I remember, we chose PAGE as

  • it was designed specifically with GT in mind and there was already a good amount of training data available in that format.
  • researchers in the project often already had IO libraries for the format
  • it allows to define polygonal regions/lines for cropping (I think the major OCR formats only allowed rectangular blocks back then. Correct me if I am wrong)
    • to capture bent/twisted lines in handwriting
    • to separate overlapping lines as far as possible (e.g. ascenders/descenders of characters might still cross other lines)
  • baselines with multiple points were added quickly on request in 2013
  • the text representation can be added on any level (regions, lines, words) without the need to go into more detail if not needed. Most tools in the project worked on line level only and therefore this was most important to have."

From this I see one topic we may want to think about in the future (some of the features that were missing at one point have already been added, such as polyline baselines and polygonal shapes on all levels):

  1. Allow CONTENT on any level, without the need to go deeper into the structure if not needed (e.g. the full text of a line directly on the TextLine). The discussion would be whether we keep the deeper structure mandatory for ALTO producers while making life easier for consumers, or make the details optional on every level (this could lead to a very simple ALTO containing just plain text as part of a single block). This might be useful from a GT perspective; from the point of view of presentation systems it may not be useful at all.
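A rough sketch of what this proposal could mean for a consumer. The `CONTENT` attribute on `TextLine` is hypothetical (it is the idea under discussion, not part of ALTO today), and the helper `line_text` and the sample fragment are made up for illustration. Hyphenation (`HYP`) is ignored to keep the example short.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"

# l1 uses the current word-level structure; l2 uses the HYPOTHETICAL
# line-level CONTENT attribute proposed in this thread.
SAMPLE = f"""
<TextBlock xmlns="{ALTO_NS}">
  <TextLine ID="l1">
    <String CONTENT="Dear"/><SP/><String CONTENT="Sir"/>
  </TextLine>
  <TextLine ID="l2" CONTENT="I remain yours"/>
</TextBlock>
"""


def line_text(line: ET.Element) -> str:
    """Return the text of a TextLine, preferring line-level content if present."""
    direct = line.get("CONTENT")  # hypothetical attribute, not in the schema
    if direct is not None:
        return direct
    parts = []
    for child in line:
        if child.tag == f"{{{ALTO_NS}}}String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == f"{{{ALTO_NS}}}SP":
            parts.append(" ")
    return "".join(parts)


block = ET.fromstring(SAMPLE)
texts = {
    line.get("ID"): line_text(line)
    for line in block.findall(f"{{{ALTO_NS}}}TextLine")
}
```

The fallback branch shows the cost on the consumer side: once both forms are allowed, every reader has to handle both.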

@jukervin
Member

Could different writers be recorded with Tags?
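As a sketch of how that might look: the fragment below assumes one `OtherTag` per writer, referenced from text lines via `TAGREFS`. The `TYPE="writer"` value, the labels, and the helper `lines_by_writer` are assumptions made for this example, not an agreed convention, and the fragment is trimmed and not meant to be schema-valid.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"

# Illustrative fragment only; TYPE="writer" is an assumed convention.
SAMPLE = f"""
<alto xmlns="{ALTO_NS}">
  <Tags>
    <OtherTag ID="writer1" TYPE="writer" LABEL="Hand A"/>
    <OtherTag ID="writer2" TYPE="writer" LABEL="Hand B"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine ID="l1" TAGREFS="writer1"/>
    <TextLine ID="l2" TAGREFS="writer2"/>
    <TextLine ID="l3" TAGREFS="writer1"/>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
"""


def lines_by_writer(root: ET.Element) -> dict:
    """Group TextLine IDs by the label of the writer tag they reference."""
    ns = {"a": ALTO_NS}
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.findall("a:Tags/a:OtherTag[@TYPE='writer']", ns)
    }
    grouped: dict = {}
    for line in root.iter(f"{{{ALTO_NS}}}TextLine"):
        # TAGREFS may hold several whitespace-separated tag IDs.
        for ref in line.get("TAGREFS", "").split():
            if ref in labels:
                grouped.setdefault(labels[ref], []).append(line.get("ID"))
    return grouped


writers = lines_by_writer(ET.fromstring(SAMPLE))
```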

@M3ssman

M3ssman commented Dec 15, 2022

When working with Transkribus-SWT to generate GT, my colleagues and I ran into trouble several times because we forgot to synchronize text line and word contents. The major advantage (IMHO) of ALTO over PAGE is its single storage point for OCR content, especially when one aims to create GT at least at word level, as we do.
Allowing content at text line level might also introduce problems with reading order when RTL and LTR languages are mixed in the same line.
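The kind of synchronization check this implies can be sketched as follows, using a small PAGE-like fragment. The helper `out_of_sync_lines` is hypothetical, and joining words with a single space is an assumption that ignores exactly the mixed RTL/LTR ordering problem mentioned above. The `{*}` namespace wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# PAGE-like sample: l1 is consistent, l2 has diverging line and word text.
SAMPLE = """
<TextRegion xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <TextLine id="l1">
    <Word id="w1"><TextEquiv><Unicode>Dear</Unicode></TextEquiv></Word>
    <Word id="w2"><TextEquiv><Unicode>Sir</Unicode></TextEquiv></Word>
    <TextEquiv><Unicode>Dear Sir</Unicode></TextEquiv>
  </TextLine>
  <TextLine id="l2">
    <Word id="w3"><TextEquiv><Unicode>yours</Unicode></TextEquiv></Word>
    <TextEquiv><Unicode>Yours</Unicode></TextEquiv>
  </TextLine>
</TextRegion>
"""


def out_of_sync_lines(region: ET.Element) -> list:
    """Return IDs of lines whose own text differs from their joined word texts."""
    mismatched = []
    for line in region.findall("{*}TextLine"):
        words = [
            word.findtext("{*}TextEquiv/{*}Unicode", default="")
            for word in line.findall("{*}Word")
        ]
        # Only the TextEquiv that is a direct child of the line is read here.
        line_text = line.findtext("{*}TextEquiv/{*}Unicode")
        if words and line_text is not None and " ".join(words) != line_text:
            mismatched.append(line.get("id"))
    return mismatched


bad_lines = out_of_sync_lines(ET.fromstring(SAMPLE))
```

With a single storage point for content, as in ALTO today, this class of check is not needed at all.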
