[Feature]: Option to remove OCR #1435

user1823 · 2024-11-24T10:51:45Z

Describe the proposed feature

Sometimes, I want to remove the OCR layer from a PDF. However, there is no good way of doing that yet.

Running gs -o out.pdf -sDEVICE=pdfwrite -dFILTERTEXT in.pdf works, but it sometimes increases the file size (which is also mentioned in the GS docs). That is not what I want: I am only removing information, so I would expect the file to get smaller.
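To make the size complaint concrete, here is a minimal sketch that runs the Ghostscript command quoted above and reports how the file size changed. The helper names (`gs_filtertext_command`, `describe_size_change`, `strip_text_and_report`) are made up for illustration; only the `gs` flags come from this thread, and the script assumes `gs` is on the PATH.

```python
import os
import subprocess


def gs_filtertext_command(src, dst):
    """Build the Ghostscript command from this thread (rewrites the PDF without text)."""
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def describe_size_change(before, after):
    """Format the size difference in bytes and, when possible, as a percentage."""
    delta = after - before
    if before == 0:
        return f"{delta:+d} bytes"
    pct = delta / before * 100
    return f"{delta:+d} bytes ({pct:+.1f}%)"


def strip_text_and_report(src, dst):
    """Run Ghostscript, then print whether the output grew or shrank."""
    subprocess.run(gs_filtertext_command(src, dst), check=True)
    report = describe_size_change(os.path.getsize(src), os.path.getsize(dst))
    print(f"{src} -> {dst}: {report}")
    return report


if __name__ == "__main__":
    # Hypothetical file names; replace with real paths.
    strip_text_and_report("in.pdf", "out.pdf")
```

A positive number in the report means the output is larger than the input, which is the behavior described above.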

I believe that OCRmyPDF already has a way of identifying the OCR text (which is necessary for --redo-ocr). So, implementing a feature to remove OCR should be simple and would fill a gap that is currently left by open-source PDF tools.

I have read the documented limitation quoted below, but it should not be a reason to reject the feature requested above. A similar limitation can simply be documented for the new feature too.

In some cases, existing OCR cannot be detected or replaced. Files produced by OCRmyPDF v2.2 or earlier, for example, are internally represented as having visible text with an opaque image drawn on top. This situation cannot be detected.


jbarlow83 · 2024-11-24T20:58:05Z

If the type of OCR you have is the type OCRmyPDF can detect, then the odd combination --redo-ocr --tesseract-timeout 0 just might do it, since it removes the OCR to some extent and adds nothing back.

user1823 · 2024-11-25T09:40:31Z

This didn't work. I used OCRmyPDF to OCR a file and then used it again to remove the OCR. However, the OCR layer was not removed.

I used these commands:

ocrmypdf --output-type pdf --max-image-mpixels 1000 --tesseract-downsample-above 3508 in.pdf ocr.pdf

ocrmypdf --output-type pdf --redo-ocr --tesseract-timeout 0 --optimize 0 ocr.pdf un_ocr.pdf

In case you are wondering, removing --optimize 0 didn't help either.
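One way to check whether an OCR layer survived is to look for invisible text in the page content streams. The sketch below assumes the OCR text is drawn with PDF text rendering mode 3 (the `3 Tr` operator, meaning neither filled nor stroked), which is the usual way to express invisible text. It works on an already decoded content stream passed in as bytes; extracting those bytes from `un_ocr.pdf` would need a PDF library and is not shown. The function name is hypothetical, and a plain regex like this can be fooled by `3 Tr` appearing inside a string literal.

```python
import re

# Matches the operand 3 followed by the Tr operator (set text rendering mode).
# The lookbehind keeps operands such as "13" or "0.3" from matching.
INVISIBLE_TEXT = re.compile(rb"(?<![\d.])3\s+Tr\b")


def has_invisible_text(content_stream: bytes) -> bool:
    """Return True if the decoded content stream switches to rendering mode 3."""
    return INVISIBLE_TEXT.search(content_stream) is not None


if __name__ == "__main__":
    # Hand-written sample streams, not taken from a real file.
    with_ocr = b"BT 3 Tr /F1 12 Tf (hello) Tj ET"
    without_ocr = b"q 612 0 0 792 0 0 cm /Im0 Do Q"
    print(has_invisible_text(with_ocr))
    print(has_invisible_text(without_ocr))
```

If this returns True for pages of the supposedly stripped file, the invisible text layer is still present.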

user1823 added the enhancement and triage labels on Nov 24, 2024

user1823 assigned jbarlow83 Nov 24, 2024
