How do I fix one page that fails to OCR? #1242
-
You can try I should probably change the default threshold for all cases sometime, but inevitably there are some images out there where the classic algorithm will be better.
-
(This didn't feel like an "issue", which is why I've asked it here as a discussion question instead.)
I'm using `ocrmypdf` on pdf scans of magazines. It mostly works brilliantly (thanks!). Except, I'm having problems with one particular page (which, annoyingly, is one of just two in the 139 pdf pages I actually need the OCR for ...).
It has white text on a sky blue background and gets OCR'd as just a few bits of random text.
Curiously, the other (facing) page is the same white on blue and that OCRs fine (below, left).
Besides text, the "failing" page (below, right) also has a photo/graphics that occupies maybe half of the page and I suspected that is confusing `tesseract`.
(Obviously, these pics here are all very much reduced, just for illustration purposes; the files I'm working with are A4 scans, multi MB, high dpi etc.)
I used `pdftk` to extract the troublesome page and ran it through `ocrmypdf` with `--keep-temporary-files`. `debug.log` showed me the `tesseract` command, which I invoked manually, with the addition of `-c tessedit_write_images=1`, which produced `tessinput.tif` for me (below)...
... which confirms my suspicion that `tesseract` has become confused and looks to have used the wrong threshold to convert from colour to monochrome?
If I manually edit the png to convert it to greyscale and invert it (i.e. now black text on grey background, below), `tesseract` then OCRs it just fine.
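For anyone wanting to see what the manual "greyscale and invert" edit does to the pixel values, here is a minimal sketch in plain Python. It is only an illustration of the arithmetic, not what `tesseract` or `ocrmypdf` do internally. The luma weights (BT.601) and the sample sky-blue colour are assumptions; a real workflow would use an image editor or an imaging library on the actual png.

```python
def to_inverted_grey(pixel):
    """Convert one (R, G, B) pixel to greyscale, then invert it."""
    r, g, b = pixel
    # ITU-R BT.601 luma weights: one common (assumed) choice for RGB -> grey.
    grey = round(0.299 * r + 0.587 * g + 0.114 * b)
    # Inversion flips light and dark: white text becomes black text.
    return 255 - grey


def invert_page(rows):
    """Apply the conversion to an image given as rows of (R, G, B) tuples."""
    return [[to_inverted_grey(p) for p in row] for row in rows]


# Hypothetical sample values: white text on a sky-blue background.
WHITE_TEXT = (255, 255, 255)
SKY_BLUE = (135, 206, 235)

page = [[SKY_BLUE, WHITE_TEXT, SKY_BLUE]]
result = invert_page(page)
print(result)
```

With these sample values the white text maps to 0 (black) and the blue background to a mid-dark grey, which matches the "black text on grey background" result described above.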
That gets me the text I need, but it would be kinda nice to put that correct OCR text into the full pdf too, instead of the failed OCR text that's currently there.
I can see two ways that might be possible...
i) use `ocrmypdf --tesseract-config` to give `tesseract` some "magic" config (that presumably manually sets/adjusts the colour>monochrome threshold?) to make the OCR work (presumably, with something like `--redo-ocr` and `--pages` to just redo the one failing page).
ii) somehow "insert"/"merge" the pdf (or text?) produced by `tesseract` from the manually manipulated (i.e. greyscale and inverted) page png into the full pdf
... but I can't really see a way to do either of those?
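To make the two options concrete, here is a rough sketch that only builds the command lines (it does not run anything). The file names, the page number 42 and the config file name are hypothetical. The `ocrmypdf` flags are the ones named in the question, and the `pdftk` part uses its `cat` operation with input handles. What should go inside the config file for option (i) is left open, since nothing here establishes which setting would help.

```python
import shlex


def redo_ocr_command(src, dst, page, config_file):
    """Option (i): redo OCR on one page, passing an extra Tesseract config file."""
    return [
        "ocrmypdf",
        "--redo-ocr",
        "--pages", str(page),
        "--tesseract-config", config_file,
        src, dst,
    ]


def swap_page_command(full_pdf, fixed_page_pdf, page, total_pages, dst):
    """Option (ii): replace one page of full_pdf with a single-page PDF."""
    if not 1 <= page <= total_pages:
        raise ValueError("page out of range")
    ranges = []
    if page > 1:
        # Pages before the replaced one, taken from the original (handle A).
        ranges.append("A1" if page == 2 else f"A1-{page - 1}")
    # The separately OCR'd replacement page (handle B).
    ranges.append("B1")
    if page < total_pages:
        # Pages after the replaced one, again from the original.
        ranges.append(f"A{page + 1}-end")
    return ["pdftk", f"A={full_pdf}", f"B={fixed_page_pdf}",
            "cat", *ranges, "output", dst]


# Hypothetical names: page 42 of the 139-page scan is the failing one.
print(shlex.join(redo_ocr_command("magazine.pdf", "magazine_redo.pdf", 42, "fix.cfg")))
print(shlex.join(swap_page_command("magazine.pdf", "page42_fixed.pdf", 42, 139, "merged.pdf")))
```

One caveat with the swap approach: if the replacement page was produced from the greyscale, inverted png, the merged pdf would show that manipulated image on that page rather than the original scan.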
Could someone give me some clues or point me in the right direction please? Thanks in advance!