Switch PDF page rendering to an iterator format #1040

felixdittrich92 · 2022-09-01T19:48:33Z

This PR:

reduces RAM Usage while pdf page rendering

Closses:
#1001

codecov · 2022-09-01T20:27:07Z

Codecov Report

Merging #1040 (77fb85d) into main (75aa42a) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1040      +/-   ##
==========================================
- Coverage   94.93%   94.91%   -0.02%     
==========================================
  Files         135      135              
  Lines        5625     5626       +1     
==========================================
  Hits         5340     5340              
- Misses        285      286       +1

Flag	Coverage Δ
unittests	`94.91% <100.00%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
doctr/io/pdf.py	`92.85% <100.00%> (+0.54%)`	⬆️
doctr/io/reader.py	`100.00% <100.00%> (ø)`
doctr/transforms/functional/base.py	`95.65% <0.00%> (-1.45%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

mara004 · 2022-09-01T20:38:00Z

doctr/io/pdf.py

@@ -42,4 +42,5 @@ def read_pdf(

    # Rasterise pages to PIL images with pypdfium2 and convert to numpy ndarrays
    with pdfium.PdfDocument(file, password=password) as pdf:
-        return [np.asarray(img) for img in pdf.render_topil(scale=scale, **kwargs)]
+        for img in pdf.render_topil(scale=scale, **kwargs):
+            yield np.asarray(img)


May be a one-liner:

yield from pdf.render_topil(...)

Mh looks a bit strange 😅
yield from np.asarray(pdf.render_topil(...))

Right, I see! Overlooked that in the hurry. I expect it'll be better once I have released v3 so that you can use render_tonumpy(), see #1032.

A got it would you link the discussion to the issue ? :) So i would close this PR until your release is finally

mara004 · 2022-09-01T20:40:18Z

While this PR changes read_pdf() itself to be an iterator, you'd also have to change all uses to actually benefit from it.
Always converting to a list kind of defeats the purpose of having an iterator in the first place.

felixdittrich92 · 2022-09-01T20:57:10Z

While this PR changes read_pdf() itself to be an iterator, you'd also have to change all uses to actually benefit from it. Always converting to a list kind of defeats the purpose of having an iterator in the first place.

Yeah that's correct... was a bit to fast with this

felixdittrich92 · 2022-09-01T20:58:42Z

Waiting until #1032

mara004 · 2022-09-01T20:59:23Z

No problem. I agree it's best to postpone this until I'm through with v3.

felixdittrich92 · 2022-09-01T21:04:12Z

No problem. I agree it's best to postpone this until I'm through with v3.

Yeah no stress.. wrote with @charlesmindee today .. next release will be next month :) So enough time ^^

mara004 · 2022-09-01T21:08:18Z

Good to know. I want to do a thorough reversion of my codebase so I don't need to make another API-breaking release too soon. But I do expect I'll finish it within a month at the most.

Switch PDF page rendering to an iterator format

77fb85d

felixdittrich92 linked an issue Sep 1, 2022 that may be closed by this pull request

[pdf] Switch PDF page rendering to an iterator format #1001

Closed

felixdittrich92 added this to the 0.6.0 milestone Sep 1, 2022

felixdittrich92 added type: enhancement Improvement module: io Related to doctr.io labels Sep 1, 2022

felixdittrich92 requested review from aminemindee, charlesmindee, frgfm and odulcy-mindee September 1, 2022 19:49

felixdittrich92 self-assigned this Sep 1, 2022

mara004 reviewed Sep 1, 2022

View reviewed changes

felixdittrich92 removed request for frgfm, charlesmindee, aminemindee and odulcy-mindee September 1, 2022 20:57

felixdittrich92 closed this Sep 1, 2022

felixdittrich92 deleted the pdf-rendering branch September 1, 2022 21:11

felixdittrich92 removed this from the 0.6.0 milestone Sep 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch PDF page rendering to an iterator format #1040

Switch PDF page rendering to an iterator format #1040

felixdittrich92 commented Sep 1, 2022 •

edited

Loading

codecov bot commented Sep 1, 2022

mara004 Sep 1, 2022

felixdittrich92 Sep 1, 2022

mara004 Sep 1, 2022

felixdittrich92 Sep 1, 2022

mara004 commented Sep 1, 2022

felixdittrich92 commented Sep 1, 2022

felixdittrich92 commented Sep 1, 2022

mara004 commented Sep 1, 2022

felixdittrich92 commented Sep 1, 2022 •

edited

Loading

mara004 commented Sep 1, 2022

Switch PDF page rendering to an iterator format #1040

Switch PDF page rendering to an iterator format #1040

Conversation

felixdittrich92 commented Sep 1, 2022 • edited Loading

codecov bot commented Sep 1, 2022

Codecov Report

mara004 Sep 1, 2022

Choose a reason for hiding this comment

felixdittrich92 Sep 1, 2022

Choose a reason for hiding this comment

mara004 Sep 1, 2022

Choose a reason for hiding this comment

felixdittrich92 Sep 1, 2022

Choose a reason for hiding this comment

mara004 commented Sep 1, 2022

felixdittrich92 commented Sep 1, 2022

felixdittrich92 commented Sep 1, 2022

mara004 commented Sep 1, 2022

felixdittrich92 commented Sep 1, 2022 • edited Loading

mara004 commented Sep 1, 2022

felixdittrich92 commented Sep 1, 2022 •

edited

Loading

felixdittrich92 commented Sep 1, 2022 •

edited

Loading