Query regarding OCR engine #2672

anankus1 · 2024-03-20T14:38:46Z

According to the documentation, the "ocr_only" partition strategy runs the document through Tesseract for OCR and then runs the raw text through partition_text.

Is there any option to integrate a different OCR engine?

ericfeunekes · 2024-04-13T09:19:33Z

I've been wondering the same thing. My guess is that the OCR engine is tightly coupled to the partition library, because the OCR engine outputs data in a specific way.

My biggest issue is that Tesseract doesn't support GPUs, which slows down extraction if you want to use high-res, for example, and particularly for large documents.

There are some newer approaches, like https://github.com/VikParuchuri/surya, that are promising: they allow parallel processing and can run on multiple GPUs if needed. It would be interesting to see if something like that could be supported.
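To make the coupling question concrete, here is a minimal sketch of what an engine-agnostic "ocr_only"-style pipeline could look like if the only contract between the OCR step and the partitioning step were raw text. Everything in it is hypothetical and for illustration only: `OCREngine`, `FakeEngine`, `split_paragraphs`, and `ocr_only` are made-up names, not part of the partition library, and `split_paragraphs` is only a toy stand-in for partition_text.

```python
from dataclasses import dataclass
from typing import List, Protocol


class OCREngine(Protocol):
    """Hypothetical interface: anything that turns image bytes into raw text."""

    def image_to_text(self, image: bytes) -> str: ...


@dataclass
class FakeEngine:
    """Stand-in engine for illustration; a real one would wrap Tesseract, surya, etc."""

    canned_text: str

    def image_to_text(self, image: bytes) -> str:
        # Ignores the image and returns fixed text so the sketch runs anywhere.
        return self.canned_text


def split_paragraphs(raw_text: str) -> List[str]:
    """Toy stand-in for a text partitioner: split on blank lines, drop empties."""
    return [p.strip() for p in raw_text.split("\n\n") if p.strip()]


def ocr_only(image: bytes, engine: OCREngine) -> List[str]:
    """OCR first, then partition the raw text; the engine is swappable."""
    return split_paragraphs(engine.image_to_text(image))


elements = ocr_only(b"", FakeEngine("Title\n\nFirst paragraph.\n\n  Second paragraph.  "))
print(elements)  # ['Title', 'First paragraph.', 'Second paragraph.']
```

If an engine also has to supply bounding boxes or layout information, as a high-res pipeline presumably needs, the interface would have to carry more than a string, which is where the tight coupling I'm guessing at would come from.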

JoshC8C7 · 2024-04-26T13:11:59Z

This is possible now, just not documented. See here
