[Feature]: Option to remove OCR #1435

Open
user1823 opened this issue Nov 24, 2024 · 2 comments
Labels
enhancement triage Issue needs triage

user1823 commented Nov 24, 2024

Describe the proposed feature

Sometimes, I want to remove the OCR layer from a PDF. However, there is no good way of doing that yet.

Running gs -o out.pdf -sDEVICE=pdfwrite -dFILTERTEXT in.pdf works, but it sometimes increases the file size (the GS docs mention this too). That is undesirable: I only want to remove some information, so I would expect the file size to go down.
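To make the size concern concrete, here is a minimal Python sketch that wraps the Ghostscript command above and reports how the file size changed. It assumes gs is on the PATH; the function names are made up for illustration and are not part of Ghostscript or OCRmyPDF. As far as I understand, -dFILTERTEXT drops all text from the file, not only an invisible OCR layer.

```python
import os
import subprocess


def build_gs_filtertext_command(src, dst):
    """Return the Ghostscript argument list quoted in this issue."""
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def size_change(before_bytes, after_bytes):
    """Return the size change in bytes (positive means the output grew)."""
    return after_bytes - before_bytes


def strip_text_and_report(src, dst):
    """Run Ghostscript (assumed to be installed), then report the size change."""
    subprocess.run(build_gs_filtertext_command(src, dst), check=True)
    before = os.path.getsize(src)
    after = os.path.getsize(dst)
    delta = size_change(before, after)
    print(f"{src}: {before} bytes -> {dst}: {after} bytes ({delta:+d})")
    return delta
```

Calling strip_text_and_report("in.pdf", "out.pdf") returns a positive number whenever the rewritten file is larger than the input, which is the case I would like to avoid.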

I believe that OCRmyPDF already has a way of identifying the OCR text (which is necessary for --redo-ocr). So, implementing a feature to remove OCR should be simple and would fill a gap that is currently left by open-source PDF tools.


I have read the following documented limitation, but it should not be a reason to reject the feature requested above. We can simply document a similar limitation for the new feature too.

In some cases, existing OCR cannot be detected or replaced. Files produced by OCRmyPDF v2.2 or earlier, for example, are internally represented as having visible text with an opaque image drawn on top. This situation cannot be detected.

@jbarlow83
Collaborator

If the type of OCR you have is the type OCRmyPDF can detect, then the odd combination --redo-ocr --tesseract-timeout 0 just might do it, since it removes the existing OCR to some extent and adds nothing back.

user1823 commented Nov 25, 2024

This didn't work. I used OCRmyPDF to OCR a file and then used it again to remove the OCR, but the OCR layer was not removed.

I used these commands:

ocrmypdf --output-type pdf --max-image-mpixels 1000 --tesseract-downsample-above 3508 in.pdf ocr.pdf

ocrmypdf --output-type pdf --redo-ocr --tesseract-timeout 0 --optimize 0 ocr.pdf un_ocr.pdf

In case you are wondering, removing --optimize 0 didn't help either.
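For anyone who wants to reproduce this, one way to check whether a text layer is still present is to extract text from the output and see if anything comes back. The sketch below assumes pdftotext from poppler-utils is installed; the helper names are hypothetical. It only shows that some extractable text exists, not whether that text came from OCR.

```python
import subprocess


def has_extractable_text(extracted):
    """True if the extracted text contains anything besides whitespace."""
    # pdftotext separates pages with form feeds, which strip() also removes.
    return bool(extracted.strip())


def pdf_has_text(path):
    """Extract text with pdftotext (writing to stdout) and test for content."""
    result = subprocess.run(
        ["pdftotext", path, "-"],
        check=True,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return has_extractable_text(result.stdout)
```

With the files from the commands above, pdf_has_text("un_ocr.pdf") returning True would confirm that text is still extractable after the second run.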
