Restrict float attribute values where possible to allow for better xml-validation. #62

Open
jukervin opened this issue Aug 20, 2019 · 5 comments

@jukervin
Member

jukervin commented Aug 20, 2019

For example, the ALTO schema allows negative float values in attributes like WIDTH, HEIGHT, HPOS and VPOS, where values should be positive.

Validating against the schema doesn't catch documents in which software has produced nonsensical values.

xsd:float is used in the following attributes:

PageType

  • HEIGHT type="xsd:float"
  • WIDTH type="xsd:float"
  • PHYSICAL_IMG_NR type="xsd:float" : The number of the page within the document.
    -> should values always be positive integers?
  • ACCURACY type="xsd:float" : Estimated percentage of OCR Accuracy in range from 0 to 100
    -> should values be limited to floats in the range 0-100?

ParagraphStyle

  • LEFT type="xsd:float" : Left indent of the paragraph in relation to the column.
  • RIGHT type="xsd:float" : Right indent of the paragraph in relation to the column.
  • LINESPACE type="xsd:float" : Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline.
  • FIRSTLINE type="xsd:float" : Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right.

BlockType

  • HEIGHT type="xsd:float"
  • WIDTH type="xsd:float"
  • HPOS type="xsd:float"
  • VPOS type="xsd:float"
  • ROTATION type="xsd:float" : Tells the rotation of e.g. text or illustration within the block. The value is in degree counterclockwise.
    -> Limit to 0-360 float?

SPType

  • HEIGHT type="xsd:float"
  • WIDTH type="xsd:float"
  • HPOS type="xsd:float"
  • VPOS type="xsd:float"

StringType

  • HEIGHT type="xsd:float"
  • WIDTH type="xsd:float"
  • HPOS type="xsd:float"
  • VPOS type="xsd:float"

PageSpaceType

  • HEIGHT type="xsd:float"
  • WIDTH type="xsd:float"
  • HPOS type="xsd:float"
  • VPOS type="xsd:float"

EllipseType

  • HPOS type="xsd:float"
  • VPOS type="xsd:float"
  • HLENGTH type="xsd:float"
  • VLENGTH type="xsd:float"
  • ROTATION type="xsd:float"
    An ellipse shape. HPOS and VPOS describe the center of the ellipse. HLENGTH and VLENGTH are the width and height of the described ellipse.
    The attribute ROTATION tells the rotation of e.g. text or illustration within the block. The value is in degrees counterclockwise.
    -> Limit to 0-360 float?

CircleType
A circle shape. HPOS and VPOS describe the center of the circle.

  • HPOS type="xsd:float"
  • VPOS type="xsd:float"
  • RADIUS type="xsd:float"

formattingAttributeGroup

  • FONTSIZE type="xsd:float" : The font size, in points (1/72 of an inch).

HYP

  • HEIGHT type="xsd:float"
  • WIDTH type="xsd:float"
  • HPOS type="xsd:float"
  • VPOS type="xsd:float"

TextLine

  • HEIGHT type="xsd:float"
  • WIDTH type="xsd:float"
  • HPOS type="xsd:float"
  • VPOS type="xsd:float"
  • BASELINE type="xsd:float"

GlyphType

  • HEIGHT type="xsd:float"
  • WIDTH type="xsd:float"
  • HPOS type="xsd:float"
  • VPOS type="xsd:float"
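
The simple restrictions asked about above could be expressed with XSD 1.0 facets alone. A minimal sketch, assuming two hypothetical type names (`NonNegativeFloatType`, `PercentageFloatType`) that are not part of the current ALTO schema, and assuming that 0 should stay valid for positions and sizes:

```xml
<!-- Hypothetical helper types; the names are illustrative and not taken from the ALTO schema. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Floats that must be zero or greater, e.g. for WIDTH, HEIGHT, HPOS, VPOS. -->
  <xsd:simpleType name="NonNegativeFloatType">
    <xsd:restriction base="xsd:float">
      <xsd:minInclusive value="0"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Floats limited to the range 0 to 100, e.g. for ACCURACY. -->
  <xsd:simpleType name="PercentageFloatType">
    <xsd:restriction base="xsd:float">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="100"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- An attribute declaration would then reference the new type:
       <xsd:attribute name="WIDTH" type="NonNegativeFloatType"/> -->
</xsd:schema>
```

If zero should be rejected as well, `xsd:minExclusive` could replace `xsd:minInclusive`. For PHYSICAL_IMG_NR the built-in `xsd:positiveInteger` would cover the "positive integers" question, but a value written as `1.0` would then no longer validate.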
@jukervin
Member Author

Based on the discussion in the meeting on 2019-12-13, ROTATION can have valid negative values.

@jukervin jukervin self-assigned this Dec 13, 2019
@jukervin
Member Author

Limiting values to positive will break backward compatibility, so this can be changed in the ALTO 5.0 release at the earliest.

@Ra1phM
Member

Ra1phM commented Dec 17, 2019

I agree that limiting WIDTH, HEIGHT, HPOS, VPOS to positive values would make sense. From my understanding, HPOS and VPOS are always in relation to the entire page, so values outside of the page's real dimensions (e.g. (-100, -100)) should not exist.

However, for the ParagraphStyle LEFT and RIGHT indents, I am not sure positive values should be enforced. The indent value is relative to the paragraph, so even if a negative value is not good practice, it would still be valid in the same way that negative values (for positions, margins and paddings) are accepted in HTML.

@cipriandinu cipriandinu added this to the v5.0 milestone Oct 14, 2022
@jukervin
Member Author

Maybe XSD 1.1 schema asserts could be used to create validations that rely on other element values, e.g. PrintSpace can't be larger than Page.
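
As a rough illustration of that idea, an `xsd:assert` on the page type could compare the PrintSpace size with the page size. This is only a sketch under assumptions: `PageSketchType` and `PageSpaceSketchType` are trimmed-down, hypothetical stand-ins for the real ALTO types, the sketch has no target namespace, and it needs an XSD 1.1 processor.

```xml
<!-- Sketch only: simplified stand-in types, not the real ALTO declarations. Requires XSD 1.1. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="PageSpaceSketchType">
    <xsd:attribute name="HEIGHT" type="xsd:float"/>
    <xsd:attribute name="WIDTH" type="xsd:float"/>
  </xsd:complexType>

  <xsd:complexType name="PageSketchType">
    <xsd:sequence>
      <xsd:element name="PrintSpace" type="PageSpaceSketchType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="HEIGHT" type="xsd:float"/>
    <xsd:attribute name="WIDTH" type="xsd:float"/>
    <!-- Skip the check when the page has no HEIGHT; otherwise PrintSpace must not be taller than the page. -->
    <xsd:assert test="not(@HEIGHT) or (every $ps in PrintSpace[@HEIGHT] satisfies xsd:float($ps/@HEIGHT) le xsd:float(@HEIGHT))"/>
    <!-- Same check for WIDTH. -->
    <xsd:assert test="not(@WIDTH) or (every $ps in PrintSpace[@WIDTH] satisfies xsd:float($ps/@WIDTH) le xsd:float(@WIDTH))"/>
  </xsd:complexType>

  <xsd:element name="Page" type="PageSketchType"/>
</xsd:schema>
```

In a schema with a target namespace, the element name `PrintSpace` in the XPath would additionally need `xpathDefaultNamespace="##targetNamespace"` (or a prefixed name) to match.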

cipriandinu added a commit that referenced this issue Jan 16, 2024
First sample of assert usage for restricting Page values
@cipriandinu
Member

A new branch (issue-62) has been added for changes on this topic. There are several things to discuss, since the only way to implement this completely is to use XSD 1.1 (to restrict not only some values to be positive, but also to express relative restrictions like height of a block + vpos < page height). So far the new branch has just five restrictions at Page level, and there is much more to be done.

One topic to clarify is whether we want to go with this solution and enforce validation with XSD 1.1 processors, or whether we implement only simple restrictions (positive values, no relative restrictions). If we go for the full solution, the switch to 5.0 would probably be a good moment.
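
To make the trade-off concrete, here is a sketch of what a relative restriction might look like. It is not the content of the issue-62 branch: the type names are hypothetical, the types are heavily trimmed, and it uses `le` (so a block may end exactly at the page edge), which is an assumption rather than the strict `<` mentioned above.

```xml
<!-- Sketch only: hypothetical, trimmed types. Requires an XSD 1.1 processor. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="BlockSketchType">
    <xsd:attribute name="HEIGHT" type="xsd:float"/>
    <xsd:attribute name="VPOS" type="xsd:float"/>
  </xsd:complexType>

  <xsd:complexType name="PageSketchType">
    <xsd:sequence>
      <xsd:element name="TextBlock" type="BlockSketchType"
                   minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:attribute name="HEIGHT" type="xsd:float"/>
    <!-- For every descendant with both VPOS and HEIGHT: VPOS + HEIGHT must not exceed the page HEIGHT.
         The check is skipped when the page itself has no HEIGHT. -->
    <xsd:assert test="not(@HEIGHT) or (every $b in .//*[@VPOS and @HEIGHT] satisfies xsd:float($b/@VPOS) + xsd:float($b/@HEIGHT) le xsd:float(@HEIGHT))"/>
  </xsd:complexType>

  <xsd:element name="Page" type="PageSketchType"/>
</xsd:schema>
```

An XSD 1.0 processor does not know `xsd:assert` and would reject such a schema, whereas simple facets like `xsd:minInclusive` work with both 1.0 and 1.1 processors. That is the practical difference between the two options.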
