Don't store resources content in memory #386

s0ph1e · 2020-01-02T11:39:54Z

Currently all pages are stored in memory (each resource's content is stored in Resource.text), which causes high memory consumption.
It would be nice to avoid storing Resource.text and save resources directly to the FS right after they are received.
Probably we can use streams for that.

for html, css: Request -> update links/images/styles/etc. -> saveResource
all other types: Request -> saveResource when content modification is not needed
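
For the second path (content that needs no modification), a rough sketch of Request -> saveResource with Node streams might look like the following. `needsRewrite`, `saveStream` and the list of rewritable MIME types are hypothetical names used for illustration, not part of the scraper's API, and `bodyStream` stands in for whatever readable stream the HTTP client returns.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const { pipeline } = require('node:stream/promises');

// Assumption: only these types need their links rewritten before saving.
const REWRITABLE_TYPES = new Set(['text/html', 'text/css']);

// Decide which of the two paths a response should take.
function needsRewrite(contentType) {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mimeType = String(contentType || '').split(';')[0].trim().toLowerCase();
  return REWRITABLE_TYPES.has(mimeType);
}

// Pipe a response body straight to disk; only small chunks are held in memory.
async function saveStream(bodyStream, filePath) {
  await fs.promises.mkdir(path.dirname(filePath), { recursive: true });
  await pipeline(bodyStream, fs.createWriteStream(filePath));
}
```

With this split, html and css would still go through the link-updating step, while images, fonts and similar files would never be held in memory as a whole.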

To do:

Update Resource class - get rid of text property and related functionality. Probably store reference to stream for resource
Update scraper mechanism: rework request/save functionality in scraper - replace requestQueue property with streamsQueue, replace requestedResourcePromises with requestResourceStreams or remove it, use streams instead of promises in request file
Check and update all actions that use Resource class objects - at least afterResponse, saveResource
Measure memory consumption of current implementation and streams implementation
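
For the last item, a minimal sketch of how memory could be compared between the two implementations, using only `process.memoryUsage()`. `measureMemory` is a hypothetical helper, and `task` is a placeholder for whatever starts the scrape. Note that Buffers are allocated outside the V8 heap, so `rss` is sampled alongside `heapUsed`.

```javascript
// Convert bytes to megabytes for readable output.
const toMb = (bytes) => bytes / 1024 / 1024;

// Run an async task and record peak heap and RSS while it is running.
async function measureMemory(label, task, intervalMs = 50) {
  const start = process.memoryUsage();
  let peakHeap = start.heapUsed;
  let peakRss = start.rss;

  const sample = () => {
    const usage = process.memoryUsage();
    peakHeap = Math.max(peakHeap, usage.heapUsed);
    peakRss = Math.max(peakRss, usage.rss);
  };

  const timer = setInterval(sample, intervalMs);
  try {
    await task();
  } finally {
    clearInterval(timer);
    sample(); // take one final sample after the task settles
  }

  return {
    label,
    startHeapMb: toMb(start.heapUsed),
    peakHeapMb: toMb(peakHeap),
    peakRssMb: toMb(peakRss),
  };
}
```

Sampling on a timer only approximates the true peak, so short spikes between samples can be missed. Running each implementation in a fresh process keeps the numbers comparable.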

Questions:

How to handle links to pages which are not downloaded yet? Can we set the reference in the parent before the child is loaded? (see getReference action)


Pomax · 2020-01-02T17:30:07Z

To underline why this might be important, I was trying to scrape https://www.image-line.com/support/flstudio_online_manual/ and ended up with a 20GB footprint. So that would have crashed on many machines =)

beije · 2020-02-11T11:05:47Z

I'm having issues with this same problem, is there an ETA? :)

s0ph1e · 2020-02-14T17:09:22Z

Hi @beije
I do not have time to work on this project now, so there is no ETA for the near future. Contributions are welcome :)

Pomax · 2020-02-14T18:22:06Z

If you can write up what needs to be done, at least, then I'm sure someone would be willing to work on it. Even if there's no time to write code, there's always time to go "for those who want to work on this, you want to look in files A, B, and C, in the functions X, Y, and Z, because that's where U happens, which leads to V" =)

The original comment is already more than most project maintainers will drop in a "to do" issue, but for external contributions it just needs a little bit more to get folks started on helping out.

s0ph1e · 2020-02-18T20:41:12Z

Hey @Pomax

You are right, it is better to document what needs to be done. I've updated the initial issue description with more information. The tricky part here is that it will be a huge update - the whole mechanism should be reworked - and it looks quite complicated to me now. But there is also a good part - we have quite good test coverage that will help with updates :)

Pomax · 2020-02-19T00:25:41Z

oh dear, that does sound daunting... thank goodness for test coverage! And thank you for updating what needs to be done!

pavelloz · 2020-04-03T14:35:41Z

Well, to not turn everything on its head, maybe the first step could be a standard save of each file from a buffer, file by file, as a lifecycle event (almost like a plugin, but the default saving would need to be disabled), without streaming. It would be much better anyway, because keeping X files in memory (where X = concurrent connections) is much better than having all files in memory at the same time until they are dumped (at the end).
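
A rough sketch of that idea, assuming a bounded pool of workers that each download one file into a buffer and write it out immediately. `runWithLimit`, `downloadAll` and `fetchBuffer` are hypothetical names, not part of the scraper, and the sketch ignores resources whose content must be rewritten once their dependencies are known.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Process `items` with at most `limit` workers running at the same time.
async function runWithLimit(items, limit, worker) {
  let next = 0;
  async function runner() {
    // Each runner pulls the next unprocessed item until none are left.
    while (next < items.length) {
      const index = next++;
      await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
}

// `fetchBuffer` is assumed to return a Promise<Buffer> for a URL.
async function downloadAll(resources, outputDir, limit, fetchBuffer) {
  await fs.mkdir(outputDir, { recursive: true });
  await runWithLimit(resources, limit, async ({ url, filename }) => {
    const body = await fetchBuffer(url);
    await fs.writeFile(path.join(outputDir, filename), body);
    // `body` is unreachable after this point, so at most `limit`
    // response buffers are alive at any moment.
  });
}
```

The output directory is created up front, so files start appearing as soon as the first downloads finish.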

Additionally, UX would improve, because the first few times I ran the script I was waiting for a long time, seeing nothing, because even the output directory was not created until it was finished.

I didn't look at the codebase, I'm just brainstorming to hopefully push things forward, even if it's not ideal on the first iteration.

s0ph1e · 2020-04-05T12:43:44Z

Hi @pavelloz

Just to clarify - not all files are stored in memory and saved only at the end. When a resource has no dependencies, it's saved to the directory immediately. Only resources with dependencies are saved after all their dependencies are resolved and downloaded (for example, an html file with 5 images will be saved after all the images are downloaded).

Thank you for suggestion.
Unfortunately I do not have time to work on that. Contributions are welcome

phawxby · 2022-06-20T21:30:49Z

I think we can actually solve the memory usage issue fairly easily.

handleResource already returns a promise, so all the handlers can be async.
Update the getText() and setText() methods of Resource to be async and have them read/write directly against the file system via promises. Basically everyone has SSDs, so the bottleneck will be HTTP and IO will be a lot less of an issue.
Find some way to pass in a temporary cache directory for it to use. We could even create a new cache plugin which has 2 options, memory or filesystem.
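
A minimal sketch of what such a switchable cache could look like, assuming async getText/setText keyed by resource URL. `MemoryStore`, `FileSystemStore`, `createStore` and the option names are hypothetical and do not exist in the library; utf8 is also an assumption, since the encoding the scraper uses for Resource text may differ.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');
const crypto = require('node:crypto');

// Keeps every resource body in a Map (roughly the in-memory behaviour).
class MemoryStore {
  constructor() { this.items = new Map(); }
  async getText(key) { return this.items.get(key); }
  async setText(key, text) { this.items.set(key, text); }
}

// Keeps resource bodies in files under `cacheDir`; nothing is held in memory.
class FileSystemStore {
  constructor(cacheDir) { this.cacheDir = cacheDir; }

  // Hash the key so any URL maps to a safe, fixed-length file name.
  fileFor(key) {
    const name = crypto.createHash('sha1').update(key).digest('hex');
    return path.join(this.cacheDir, name);
  }

  async getText(key) {
    try {
      return await fs.readFile(this.fileFor(key), 'utf8');
    } catch (err) {
      if (err.code === 'ENOENT') return undefined; // nothing stored yet
      throw err;
    }
  }

  async setText(key, text) {
    await fs.mkdir(this.cacheDir, { recursive: true });
    await fs.writeFile(this.fileFor(key), text, 'utf8');
  }
}

// Pick a backend from hypothetical options: { cache: 'memory' | 'filesystem', cacheDir }.
function createStore({ cache = 'memory', cacheDir } = {}) {
  if (cache === 'filesystem') {
    if (!cacheDir) throw new Error('cacheDir is required for the filesystem cache');
    return new FileSystemStore(cacheDir);
  }
  return new MemoryStore();
}
```

Because both backends expose the same async interface, handlers that already return promises could switch between them without further changes.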

s0ph1e added the enhancement label Jan 2, 2020

s0ph1e mentioned this issue Jan 2, 2020

write files as they are done, rather than "don't write until everything is done"? website-scraper/node-website-scraper-phantom#6

Closed

s0ph1e added the maybe-later label Feb 19, 2020

phawxby mentioned this issue Jun 20, 2022

fix: non-english char encoding #496

Merged

phawxby mentioned this issue Jun 21, 2022

feat: improve memory performance and custom container classes #497

Closed

s0ph1e added this to the 6.0.0 milestone Sep 15, 2022
