Filtering Non-Latin Text During Detection to Improve OCR Accuracy in docTR #1774
Shamyukthaaa started this conversation in General
Replies: 1 comment
-
Hi @Shamyukthaaa 👋🏼, currently there is no way to filter Latin from non-Latin text, unfortunately. A lazy workaround would be to filter the OCR results afterwards by their confidence values, which should be quite low for non-Latin and incorrect detections. Best,
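For illustration, here is a minimal sketch of that post-filtering idea, assuming a standard docTR pipeline; the confidence threshold, the Latin-only regex, and the input file name are illustrative assumptions, not anything built into docTR:

```python
import re

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Assumed values for illustration: tune the threshold and character set on your own data.
LATIN_ONLY = re.compile(r"^[A-Za-z0-9\s.,;:!?'\"()\-]+$")
CONF_THRESHOLD = 0.5

predictor = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("mixed_language_image.jpg")  # hypothetical input
result = predictor(doc)

kept_words = []
for page in result.pages:
    for block in page.blocks:
        for line in block.lines:
            for word in line.words:
                # Drop low-confidence predictions and anything containing non-Latin characters.
                if word.confidence >= CONF_THRESHOLD and LATIN_ONLY.match(word.value):
                    kept_words.append(word.value)

print(" ".join(kept_words))
```

Note that this only discards non-Latin predictions after the fact; the detection and recognition models still process those regions.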
-
I’m working on an OCR solution using docTR and am focusing on recognizing only English (Latin) text from images. In cases where images contain text in multiple languages (e.g., English and Hindi), I’m seeing higher recognition errors as the model tries to interpret non-Latin scripts as well. This results in false OCR outputs for text I’m not interested in.
Here’s an example image that illustrates the issue:
In this image, there’s English text ("SILENCE IS THE BEST ANSWER TO ANGER") as well as Hindi text ("क्रोध को जीतने में मौन सबसे अधिक सहायक है", roughly "silence is the most helpful in conquering anger"). My goal is to configure the model to detect and recognize only the English text, ignoring any non-Latin scripts.
Currently, I’m using the following setup:
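(For reference, a representative docTR setup along these lines is sketched below; the specific detection and recognition architectures and the input path are assumptions and may differ from the exact configuration used.)

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Assumed architectures for illustration; docTR's default recognition vocab is Latin-based.
predictor = ocr_predictor(
    det_arch="db_resnet50",
    reco_arch="crnn_vgg16_bn",
    pretrained=True,
    detect_language=True,  # language detection is reported per page, not per word
)

doc = DocumentFile.from_images("mixed_language_image.jpg")  # hypothetical input
result = predictor(doc)
print(result.pages[0].language)  # page-level language estimate only
```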
As per the documentation, the detect_language feature works at the page level, not the word level, so I can't selectively filter out non-English words.
My Question: Is there a way to filter out non-Latin text at the detection stage itself, so that only English (Latin) words are detected and passed to the recognition model? Ideally, the detection model would ignore any regions containing non-Latin scripts, which would reduce false OCR results.
For reference, I am specifically looking to keep only characters in the basic Latin (English) character set.
Any guidance on achieving this or suggestions for potential workarounds would be greatly appreciated!