Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch PDF page rendering to an iterator format #1040

Closed
wants to merge 1 commit into from

Conversation

felixdittrich92
Copy link
Contributor

@felixdittrich92 felixdittrich92 commented Sep 1, 2022

This PR:

  • reduces RAM Usage while pdf page rendering

Closses:
#1001

@felixdittrich92 felixdittrich92 linked an issue Sep 1, 2022 that may be closed by this pull request
@felixdittrich92 felixdittrich92 added this to the 0.6.0 milestone Sep 1, 2022
@felixdittrich92 felixdittrich92 added type: enhancement Improvement module: io Related to doctr.io labels Sep 1, 2022
@felixdittrich92 felixdittrich92 self-assigned this Sep 1, 2022
@codecov
Copy link

codecov bot commented Sep 1, 2022

Codecov Report

Merging #1040 (77fb85d) into main (75aa42a) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1040      +/-   ##
==========================================
- Coverage   94.93%   94.91%   -0.02%     
==========================================
  Files         135      135              
  Lines        5625     5626       +1     
==========================================
  Hits         5340     5340              
- Misses        285      286       +1     
Flag Coverage Δ
unittests 94.91% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
doctr/io/pdf.py 92.85% <100.00%> (+0.54%) ⬆️
doctr/io/reader.py 100.00% <100.00%> (ø)
doctr/transforms/functional/base.py 95.65% <0.00%> (-1.45%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -42,4 +42,5 @@ def read_pdf(

# Rasterise pages to PIL images with pypdfium2 and convert to numpy ndarrays
with pdfium.PdfDocument(file, password=password) as pdf:
return [np.asarray(img) for img in pdf.render_topil(scale=scale, **kwargs)]
for img in pdf.render_topil(scale=scale, **kwargs):
yield np.asarray(img)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be a one-liner:

yield from pdf.render_topil(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh looks a bit strange 😅
yield from np.asarray(pdf.render_topil(...))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I see! Overlooked that in the hurry. I expect it'll be better once I have released v3 so that you can use render_tonumpy(), see #1032.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A got it would you link the discussion to the issue ? :) So i would close this PR until your release is finally

@mara004
Copy link
Contributor

mara004 commented Sep 1, 2022

While this PR changes read_pdf() itself to be an iterator, you'd also have to change all uses to actually benefit from it.
Always converting to a list kind of defeats the purpose of having an iterator in the first place.

@felixdittrich92
Copy link
Contributor Author

While this PR changes read_pdf() itself to be an iterator, you'd also have to change all uses to actually benefit from it. Always converting to a list kind of defeats the purpose of having an iterator in the first place.

Yeah that's correct... was a bit to fast with this

@felixdittrich92
Copy link
Contributor Author

Waiting until #1032

@mara004
Copy link
Contributor

mara004 commented Sep 1, 2022

No problem. I agree it's best to postpone this until I'm through with v3.

@felixdittrich92
Copy link
Contributor Author

felixdittrich92 commented Sep 1, 2022

No problem. I agree it's best to postpone this until I'm through with v3.

Yeah no stress.. wrote with @charlesmindee today .. next release will be next month :) So enough time ^^

@mara004
Copy link
Contributor

mara004 commented Sep 1, 2022

Good to know. I want to do a thorough reversion of my codebase so I don't need to make another API-breaking release too soon. But I do expect I'll finish it within a month at the most.

@felixdittrich92 felixdittrich92 deleted the pdf-rendering branch September 1, 2022 21:11
@felixdittrich92 felixdittrich92 removed this from the 0.6.0 milestone Sep 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: io Related to doctr.io type: enhancement Improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[pdf] Switch PDF page rendering to an iterator format
2 participants