Sometimes, I want to remove the OCR layer from a PDF. However, there is no good way of doing that yet.
Running gs -o out.pdf -sDEVICE=pdfwrite -dFILTERTEXT in.pdf works, but it sometimes increases the file size (a side effect that the GS docs also mention). That is undesirable: I only want to remove information, so I would expect the file to get smaller.
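To check whether the Ghostscript route actually grows a given file, the input and output sizes can be compared. This is only a rough sketch: size_delta is a hypothetical helper name, not part of Ghostscript or OCRmyPDF, and the gs invocation in the usage comment assumes Ghostscript is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the size change in bytes between two files.
# A positive number means the second file is larger than the first.
size_delta() {
    before=$(wc -c < "$1")
    after=$(wc -c < "$2")
    echo $((after - before))
}

# Usage (assumes Ghostscript is installed):
#   gs -o out.pdf -sDEVICE=pdfwrite -dFILTERTEXT in.pdf
#   size_delta in.pdf out.pdf
```

A positive result confirms the size increase described above for that particular file.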
I believe that OCRmyPDF already has a way of identifying the OCR text (which is necessary for --redo-ocr). So, implementing a feature to remove OCR should be simple and would fill a gap that is currently left by open-source PDF tools.
I have read the following documented limitation, but it should not be a reason to reject the feature requested above. We can simply document a similar limitation for the new feature.
In some cases, existing OCR cannot be detected or replaced. Files produced by OCRmyPDF v2.2 or earlier, for example, are internally represented as having visible text with an opaque image drawn on top. This situation cannot be detected.
If your OCR is the type that OCRmyPDF can detect, then the odd combination --redo-ocr --tesseract-timeout 0 just might do it: it removes the existing OCR to some extent and adds nothing back.
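As a minimal sketch, the suggested combination could be wrapped like this. strip_ocr is a hypothetical helper name, and the OCRMYPDF override variable is an assumption added only so the command can be swapped out for a dry run; neither is part of OCRmyPDF itself.

```shell
#!/bin/sh
# Hypothetical helper: try to strip a detectable OCR layer from a PDF.
# --redo-ocr asks OCRmyPDF to replace OCR text it can identify;
# --tesseract-timeout 0 gives Tesseract no time to run, so no new text is added.
# OCRMYPDF is an assumed override variable (defaults to the real ocrmypdf command).
strip_ocr() {
    src=$1
    dst=$2
    ${OCRMYPDF:-ocrmypdf} --redo-ocr --tesseract-timeout 0 "$src" "$dst"
}

# Usage (assumes ocrmypdf is installed):
#   strip_ocr in.pdf out.pdf
```

Whether this removes the text depends on the limitation quoted above: OCR that OCRmyPDF cannot detect will be left in place.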