Filtering Non-Latin Text During Detection to Improve OCR Accuracy in docTR #1774
Shamyukthaaa started this conversation in General
Replies: 1 comment
-
Hi @Shamyukthaaa 👋🏼, currently there is no way to filter Latin from non-Latin text, unfortunately. A lazy workaround would be to filter the OCR results afterwards by their confidence values, which should be quite low for non-Latin and incorrect detections. Best,
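For illustration, here is a minimal sketch of that post-filtering idea, assuming a standard docTR pipeline; the confidence threshold, the Latin-only regex, and the input file name are illustrative assumptions, not anything built into docTR:

```python
import re

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Assumed values for illustration: tune the threshold and character set on your own data.
LATIN_ONLY = re.compile(r"^[A-Za-z0-9\s.,;:!?'\"()\-]+$")
CONF_THRESHOLD = 0.5

predictor = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("mixed_language_image.jpg")  # hypothetical input
result = predictor(doc)

kept_words = []
for page in result.pages:
    for block in page.blocks:
        for line in block.lines:
            for word in line.words:
                # Drop low-confidence predictions and anything containing non-Latin characters.
                if word.confidence >= CONF_THRESHOLD and LATIN_ONLY.match(word.value):
                    kept_words.append(word.value)

print(" ".join(kept_words))
```

Note that this only discards non-Latin predictions after the fact; the detection and recognition models still process those regions.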
-
I’m working on an OCR solution using docTR and am focusing on recognizing only English (Latin) text from images. In cases where images contain text in multiple languages (e.g., English and Hindi), I’m seeing higher recognition errors as the model tries to interpret non-Latin scripts as well. This results in false OCR outputs for text I’m not interested in.
Here’s an example image that illustrates the issue:
In this image, there’s English text ("SILENCE IS THE BEST ANSWER TO ANGER") as well as Hindi text ("क्रोध को जीतने में मौन सबसे अधिक सहायक है", roughly "silence is the most helpful in conquering anger"). My goal is to configure the model to detect and recognize only the English text, ignoring any non-Latin scripts.
Currently, I’m using the following setup:
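(For reference, a representative docTR setup along these lines is sketched below; the specific detection and recognition architectures and the input path are assumptions and may differ from the exact configuration used.)

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Assumed architectures for illustration; docTR's default recognition vocab is Latin-based.
predictor = ocr_predictor(
    det_arch="db_resnet50",
    reco_arch="crnn_vgg16_bn",
    pretrained=True,
    detect_language=True,  # language detection is reported per page, not per word
)

doc = DocumentFile.from_images("mixed_language_image.jpg")  # hypothetical input
result = predictor(doc)
print(result.pages[0].language)  # page-level language estimate only
```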
As per the documentation, the detect_language feature works at the page level, not the word level, so I can't selectively filter out non-English words.
My Question: Is there a way to filter out non-Latin text at the detection stage itself, so that only English (Latin) words are detected and passed to the recognition model? Ideally, the detection model would ignore any regions containing non-Latin scripts, which would reduce false OCR results.
For reference, I am specifically looking to keep only characters in the basic Latin (English) character set.
Any guidance on achieving this or suggestions for potential workarounds would be greatly appreciated!