OCR only the images? #1171

thibaultmol · 2023-10-19T21:02:39Z


Hi, I have PDFs that are exports from PowerPoint. They contain the actual slides as images, and then the notes for each slide as actual text in the PDF.

I would like to process only the images and keep the existing text. I assume ocrmypdf converts the entire page to an image and then runs OCR on the whole page?
If so, what options do I need to use to have it OCR only the actual images inside the PDFs?

Answered by jbarlow83


jbarlow83 · 2023-10-19T21:32:45Z

Maintainer

--redo-ocr just might do what you need despite the name. It attempts to hide existing text (if any) and then OCR whatever is leftover.
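
For anyone scripting this, here is a minimal Python sketch of the same call. It assumes the `ocrmypdf` executable is installed and on the PATH; the helper names `build_redo_ocr_command` and `redo_ocr` and the file names are made up for illustration and are not part of OCRmyPDF.

```python
import shutil
import subprocess


def build_redo_ocr_command(input_pdf, output_pdf):
    """Build the argument list for `ocrmypdf --redo-ocr INPUT OUTPUT`.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return ["ocrmypdf", "--redo-ocr", str(input_pdf), str(output_pdf)]


def redo_ocr(input_pdf, output_pdf):
    """Run ocrmypdf with --redo-ocr (hypothetical wrapper).

    Assumes the ocrmypdf executable is available on the PATH.
    """
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf executable not found on PATH")
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_redo_ocr_command(input_pdf, output_pdf), check=True)


# Example usage (file names are placeholders):
# redo_ocr("slides.pdf", "slides_ocr.pdf")
```

Writing to a separate output file keeps the original export untouched, so the result can be compared against it before replacing anything.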

2 replies

thibaultmol Oct 20, 2023
Author

Perfect! Exactly what I needed!
I'm so thankful this tool exists; it's so convenient and fast (thanks to the automatic thread count detection).

(Increased my sponsorship on OC but changed it from my personal account to my company account)

thibaultmol Oct 20, 2023
Author

btw: search on readthedocs.io seems broken for me?
