How do I fix one page that fails to OCR? #1242

aldennisa15 · 2024-01-28T17:47:15Z

(This didn't feel like an "issue", which is why I've asked it here as a discussion question instead.)

I'm using ocrmypdf on pdf scans of magazines. It mostly works brilliantly (thanks!).

However, I'm having problems with one particular page (which, annoyingly, is one of just two of the 139 pdf pages that I actually need the OCR for ...).

It has white text on a sky blue background and gets OCR'd as just a few bits of random text.

Curiously, the other (facing) page is the same white on blue and that OCRs fine (below, left).

Besides text, the "failing" page (below, right) also has a photo/graphic that occupies maybe half of the page, and I suspect that this is what confuses tesseract.

(obviously, these pics here are all very much reduced, just for illustration purposes, the files I'm working with are A4 scans, multi MB, high dpi etc etc)

I used pdftk to extract the troublesome page and ran it through ocrmypdf with --keep-temporary-files.

debug.log showed me the tesseract command, which I invoked manually, with the addition of -c tessedit_write_images=1, which produced tessinput.tif for me (below)...

... which confirms my suspicion that tesseract has become confused and looks to have used the wrong threshold to convert from colour to monochrome?
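
The debugging steps described above can be sketched as a small helper that builds the three command lines. This is only a rough sketch: the file names, the page number 57 and the helper names are hypothetical, and the real tesseract arguments should be copied from debug.log rather than from here.

```python
import shlex
from typing import List


def extract_page_cmd(src_pdf: str, page: int, out_pdf: str) -> List[str]:
    """pdftk command that copies a single page into its own PDF."""
    return ["pdftk", src_pdf, "cat", str(page), "output", out_pdf]


def ocr_keep_temp_cmd(in_pdf: str, out_pdf: str) -> List[str]:
    """ocrmypdf command that keeps its temporary files (including debug.log)."""
    return ["ocrmypdf", "--keep-temporary-files", in_pdf, out_pdf]


def tesseract_dump_input_cmd(image: str, out_base: str) -> List[str]:
    """tesseract command that also writes tessinput.tif for inspection."""
    return ["tesseract", image, out_base, "-c", "tessedit_write_images=1"]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is executed by accident.
    for cmd in (
        extract_page_cmd("magazine.pdf", 57, "page.pdf"),  # hypothetical names
        ocr_keep_temp_cmd("page.pdf", "page_ocr.pdf"),
        tesseract_dump_input_cmd("page.png", "page_out"),
    ):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to compare them with what debug.log shows before running anything.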

If I manually edit the png to convert it to greyscale and invert it (i.e. now black text on grey background, below), tesseract then OCRs it just fine.
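
To make the greyscale-and-invert step concrete, here is a minimal sketch of the per-pixel arithmetic. The Rec. 601 luma weights and the sky-blue value (135, 206, 235) are assumptions for illustration only; the post does not say which tool or conversion was actually used.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def to_grey(pixel: RGB) -> int:
    """Convert an 8-bit RGB pixel to a grey level (Rec. 601 weights, assumed)."""
    r, g, b = pixel
    # Integer arithmetic with rounding keeps the result in 0..255.
    return (299 * r + 587 * g + 114 * b + 500) // 1000


def invert(level: int) -> int:
    """Invert an 8-bit grey level, so white becomes black and vice versa."""
    return 255 - level


def grey_inverted(pixel: RGB) -> int:
    """Greyscale first, then invert: white text ends up black."""
    return invert(to_grey(pixel))


WHITE_TEXT: RGB = (255, 255, 255)
SKY_BLUE: RGB = (135, 206, 235)  # assumed background colour

print(grey_inverted(WHITE_TEXT))  # 0, i.e. black text
print(grey_inverted(SKY_BLUE))    # 67, i.e. a grey background
```

In practice an image editor or a command-line image tool would apply the same operation to the whole page at once.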

That gets me the text I need, but it would be kinda nice to put that correct OCR text into the full pdf too, instead of the failed OCR text that's currently there.

I can see two ways that might be possible...
i) use ocrmypdf --tesseract-config to give tesseract some "magic" config (presumably one that manually sets or adjusts the colour-to-monochrome threshold) so that the OCR works, presumably together with something like --redo-ocr and --pages to redo just the one failing page.
ii) somehow "insert"/"merge" the pdf (or text?) produced by tesseract from the manually manipulated (i.e. greyscale and inverted) page png into the full pdf
... but I can't really see a way to do either of those?
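
For option ii, one possible route (an untested assumption, not something confirmed in this thread) is to have tesseract write a text-only PDF from the manually fixed image, stamp that invisible text onto the original page with pdftk, and then swap the page back into the full document. The sketch below only builds the command lines; all file names and helper names are hypothetical, and it assumes the text-only PDF comes out at the same page size as the original page.

```python
from typing import List


def textonly_pdf_cmd(image: str, out_base: str) -> List[str]:
    """tesseract command producing <out_base>.pdf with text but no image."""
    return ["tesseract", image, out_base, "-c", "textonly_pdf=1", "pdf"]


def stamp_text_cmd(page_pdf: str, text_pdf: str, out_pdf: str) -> List[str]:
    """pdftk command that overlays the text-only PDF on the scanned page."""
    return ["pdftk", page_pdf, "stamp", text_pdf, "output", out_pdf]


def replace_page_cmd(full_pdf: str, fixed_page_pdf: str, page: int,
                     total_pages: int, out_pdf: str) -> List[str]:
    """pdftk command that rebuilds the document with one page swapped."""
    if not 1 <= page <= total_pages:
        raise ValueError("page is outside the document")
    ranges = []
    if page > 1:
        # Pages before the replaced one come from the full document (handle A).
        ranges.append("A1" if page == 2 else f"A1-{page - 1}")
    ranges.append("B")  # the fixed single-page PDF
    if page < total_pages:
        ranges.append(f"A{page + 1}-end")
    return ["pdftk", f"A={full_pdf}", f"B={fixed_page_pdf}",
            "cat", *ranges, "output", out_pdf]
```

For example, replace_page_cmd("full.pdf", "fixed.pdf", 57, 139, "out.pdf") yields a pdftk call with the ranges A1-56, B and A58-end. The stamped page should be built from the un-OCR'd scan of that page, otherwise the failed text layer would still be present underneath.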

Could someone give me some clues or point me in the right direction please? Thanks in advance!

jbarlow83 · 2024-01-29T08:35:56Z

Maintainer

You can try --tesseract-thresholding adaptive-otsu or sauvola, which will hopefully get a better intermediate image and avoid the need to merge. For your magazines, you can probably safely use one of those two as your default.
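
As a rough illustration of that suggestion, the sketch below assembles an ocrmypdf command line with one of the two thresholding methods named above. The helper name, file names and page number are hypothetical, and combining the option with --pages and --redo-ocr follows the idea raised in the question rather than anything verified here.

```python
from typing import List, Optional

# Only the two methods named in the reply above are accepted by this sketch.
SUGGESTED_METHODS = ("adaptive-otsu", "sauvola")


def ocrmypdf_threshold_cmd(in_pdf: str, out_pdf: str,
                           method: str = "sauvola",
                           pages: Optional[str] = None,
                           redo_ocr: bool = False) -> List[str]:
    """Build an ocrmypdf command that uses an alternative thresholding method."""
    if method not in SUGGESTED_METHODS:
        raise ValueError(f"unexpected thresholding method: {method}")
    cmd = ["ocrmypdf", "--tesseract-thresholding", method]
    if pages:
        cmd += ["--pages", pages]  # restrict OCR to the failing page(s)
    if redo_ocr:
        cmd.append("--redo-ocr")   # replace an existing OCR text layer
    cmd += [in_pdf, out_pdf]
    return cmd
```

A call such as ocrmypdf_threshold_cmd("magazine.pdf", "out.pdf", "adaptive-otsu", pages="57", redo_ocr=True) produces the argument list for redoing a single page; the list can be passed to subprocess.run once it looks right.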

I should probably change the default threshold for all cases sometime, but inevitably there are some images out there where the classic algorithm will be better.

aldennisa15 Jan 29, 2024
Author

Ah, ok thanks. The box I'm running this on is Debian 'oldstable' (Bullseye/v11) so has older (off-the-shelf) versions of ocrmypdf and tesseract...

# ocrmypdf --version;tesseract --version
10.3.1+dfsg
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8

... maybe this is the killer reason to (finally) get it updated! It looks as though Bookworm/v12 should give me ocrmypdf v14.0.1 and tesseract v5.3.0, and --tesseract-thresholding seems to have appeared in v13.1.0.
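
A quick way to check whether an installed ocrmypdf is new enough might look like the sketch below. The 13.1.0 minimum is taken from the observation above, not independently confirmed, and the helper names are hypothetical.

```python
import re
from typing import Tuple

# Minimum version for --tesseract-thresholding, as stated in this thread.
MIN_VERSION = (13, 1, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from output such as '10.3.1+dfsg'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_thresholding(version_text: str) -> bool:
    """True if the reported ocrmypdf version is at least MIN_VERSION."""
    return parse_version(version_text) >= MIN_VERSION


print(supports_thresholding("10.3.1+dfsg"))  # False (Bullseye package)
print(supports_thresholding("14.0.1"))       # True (Bookworm package)
```

Distribution suffixes such as +dfsg are ignored because only the leading three numbers are compared.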

Thanks for the pointer, I'll try that out (eventually... 😉)
