Add hOCR output format #1275

Open: wants to merge 1 commit into base: master


@sliedes commented Jul 2, 2024

This change adds rudimentary hOCR output support. Notes:

  • Currently it writes only bounding boxes to the hOCR output, not baselines (which hOCR also supports)

  • It doesn't add any semantic layout information; instead, it just represents each word as an ocrx_word

  • Some of the metadata could be improved, for example by adding the real image name and perhaps the EasyOCR version number

  • I didn't check whether EasyOCR supports multipage inputs; if it does, this will certainly break on them

  • I left this comment in the source code; I'm not sure what to do with it (probably shouldn't be enabled by default):

# In order to get a browser-renderable HTML file, you can add this before the closing </body> tag:
#
# <script src="https://unpkg.com/hocrjs"></script>

Other than that, I validated the output with hocr-check from https://github.com/ocropus/hocr-tools and also checked that it validates as XHTML.
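
To make the word-level mapping described above concrete, here is a minimal sketch. It is not the code in this PR. It assumes the input is a list of (quad, text, confidence) tuples, where quad is four [x, y] corner points and confidence lies between 0 and 1, which is the shape EasyOCR's `readtext` returns by default. The function name `to_hocr`, the caller-supplied page size, the `ocr-system` value, and the scaling of confidence to a 0-100 `x_wconf` are all illustrative assumptions.

```python
from html import escape


def to_hocr(results, page_width, page_height):
    """Render word-level OCR results as a single-page hOCR (XHTML) string.

    results: iterable of (quad, text, confidence); quad is four [x, y] points.
    Hypothetical helper for illustration only.
    """
    spans = []
    for i, (quad, text, conf) in enumerate(results, start=1):
        # Collapse the four corner points into an axis-aligned bounding box.
        xs = [int(round(p[0])) for p in quad]
        ys = [int(round(p[1])) for p in quad]
        title = (
            f"bbox {min(xs)} {min(ys)} {max(xs)} {max(ys)}; "
            f"x_wconf {int(round(conf * 100))}"  # assumed 0..1 -> 0..100
        )
        # Every word becomes a flat ocrx_word; no lines or paragraphs.
        spans.append(
            f'   <span class="ocrx_word" id="word_1_{i}" '
            f'title="{title}">{escape(text)}</span>'
        )

    body = "\n".join(spans)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
        " <head>\n"
        "  <title></title>\n"
        '  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
        '  <meta name="ocr-system" content="easyocr" />\n'  # placeholder value
        '  <meta name="ocr-capabilities" content="ocr_page ocrx_word" />\n'
        " </head>\n"
        " <body>\n"
        f'  <div class="ocr_page" id="page_1" '
        f'title="bbox 0 0 {page_width} {page_height}">\n'
        f"{body}\n"
        "  </div>\n"
        " </body>\n"
        "</html>\n"
    )
```

In this sketch the page size has to be passed in by the caller, and only one `ocr_page` is ever emitted, which mirrors the single-page limitation noted in the list above.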
