feat: improve memory performance and custom container classes #497
Conversation
not finished yet
Hey @phawxby Thanks a lot for your PR, I didn't have time to implement such big changes so I'm happy to see your proposal 👍 Do you have a way to compare the current version vs these changes? It would be nice to have something that proves these changes decrease memory usage.
I guess we could have some people from #386 fire up this branch and see how they fare.
@s0ph1e do you have any suggestions on how to resolve the Code Climate issues? The limits feel very low to me.
Nevermind, I figured out a way.
I've started a review and briefly checked the source code (without tests). In short - I don't have major objections and I think that after a few iterations we can finish and merge these changes.
Please find some comments and questions below, and expect more comments to come soon after I check the tests and test the changes locally.
Thank you again for the PR and your patience 👍
@@ -83,15 +85,44 @@ How to download website to existing directory and why it's not supported by defa

#### sources
Array of objects to download, specifies selectors and attribute values to select files for downloading. By default scraper tries to download all possible resources. Scraper uses cheerio to select html elements so `selector` can be any [selector that cheerio supports](https://github.com/cheeriojs/cheerio#selectors).

You can also specify custom `containerClass`es; these are responsible for reading from and writing to attributes. For example if you want to read JSON from an attribute...
The `containerClass` functionality looks more advanced. I suggest:
- leaving the basic example at the beginning as it was before, to make it easier to understand for the majority of people who might not need advanced functionality
- creating another, separate example with `containerClass`, maybe with a close-to-real-life example of HTML which may need such a feature
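To illustrate the idea only (every class name, method name, and option key below is hypothetical - the actual interface is whatever this PR defines, and a real README example should use that): a container class owns how a value is read from an attribute and written back to it, e.g. when the attribute holds JSON rather than a plain URL.

```js
// Hypothetical sketch - class shape and option names are illustrative, not the PR's API.
class JsonAttributeContainer {
	constructor (attributeValue) {
		this.data = JSON.parse(attributeValue); // read: attribute text -> object
	}

	getPaths () {
		// expose URLs found inside the JSON so the scraper can download them
		return [this.data.src].filter(Boolean);
	}

	updateText (oldUrl, newUrl) {
		// write: swap the downloaded URL in and serialise back to attribute text
		if (this.data.src === oldUrl) {
			this.data.src = newUrl;
		}
		return JSON.stringify(this.data);
	}
}

const sources = [
	{ selector: 'div[data-config]', attr: 'data-config', containerClass: JsonAttributeContainer }
];
```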
@@ -57,6 +57,8 @@ scrape(options).then((result) => {});
* [urlFilter](#urlfilter) - skip some urls
* [filenameGenerator](#filenamegenerator) - generate filename for downloaded resource
* [requestConcurrency](#requestconcurrency) - set maximum concurrent requests
* [tempMode](#tempMode) - how to store data temporarily during processing
* [tempDir](#tempMode) - the directory to use to store temp files when `tempMode === fs`
Does it mean that the user should provide two directories - one for output, another for temporary files?
~~Do we still have the same behavior for `tempDir` - throwing an error if it exists, and cleanup on error or after the finish?~~ found an answer in the code
Should we maybe leave only a tmp directory generated inside the scraper with `fs.mkdtemp`? The reasons for that are:
- to avoid checking whether the passed directory already exists and to avoid accidentally removing a previously existing directory with user data
- to reduce the number of unnecessary options. I don't see cases where the generated directory would not be sufficient, and 2 directories (`directory`, `tempDir`) may be a bit confusing. But if you see when it can be useful - please let me know, I'm open for a discussion
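For reference, a minimal sketch of the generated-temp-directory approach suggested above (the directory prefix and cleanup strategy are illustrative):

```js
import os from 'os';
import path from 'path';
import { promises as fs } from 'fs';

async function withTempDir (run) {
	// Unique directory under the OS temp root, e.g. /tmp/website-scraper-XXXXXX
	const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'website-scraper-'));
	try {
		return await run(tempDir);
	} finally {
		// Always clean up, whether the scrape succeeded or failed
		await fs.rm(tempDir, { recursive: true, force: true });
	}
}
```

Because the directory is created fresh with a unique name, there is nothing pre-existing to check for or accidentally delete.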
@@ -63,7 +63,9 @@ const config = {
recursive: false,
maxRecursiveDepth: null,
maxDepth: null,
ignoreErrors: false,
tempMode: 'memory', // 'memory-compressed', 'fs'
Will we use 'fs' or 'filesystem'?
@@ -41,6 +41,10 @@ function throwTypeError (result) {
}

function getData (result) {
if (typeof result === 'string') {
I suggest checking whether the result has one of the supported types and throwing an error if the type is different, instead of checking whether a specific type is not supported. I think that will make the code easier to understand for developers, because the supported types will be clear at the beginning of the function.
I mean something like
const resultType = typeof result;
if (resultType !== 'object') { /* throw an error */ }
/* working with object */
// instead of
if (resultType === 'string') { /* throw an error */ }
/* working with object */
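Spelled out slightly more (a sketch only - `throwTypeError` is the existing helper visible in the diff above):

```js
function getData (result) {
	// State the supported type up front; reject anything else immediately
	if (typeof result !== 'object' || result === null) {
		throwTypeError(result);
	}
	// ... continue working with the object ...
	return result;
}
```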
this.tempMode = tempMode || 'memory';
this.tempDir = tempDir;

if (this.tempMode === 'filesystem' && !this.tempDir) {
It looks like `tempDir` is already set in lib/scraper.js, so we will have it defined at this point.
}

const inflate = promisify(zlib.inflate);
const defalate = promisify(zlib.deflate);
Typo: `defalate` -> `deflate`
* @param text - String to decompress.
* @returns - Decompressed string.
*/
async function decompress (buffer) {
So does it work with a string or with a buffer? Could you please update the argument name or the jsdoc to make it more evident?
* @returns - Compressed string.
*/
async function compress (text) {
return (await defalate(Buffer.from(text), { level: 6 }));
Do we need `Buffer.from(text)` here? It seems that deflate also works with strings, not only with buffers.
"got": "^12.0.0", | ||
"lodash": "^4.17.21", | ||
"normalize-url": "^7.0.2", | ||
"p-queue": "^7.1.0", | ||
"sanitize-filename": "^1.6.3", | ||
"srcset": "^5.0.0" | ||
"srcset": "^5.0.0", | ||
"zlib": "^1.0.5" |
Why are we using `zlib` from npm instead of the built-in node.js module?
We actually did a bunch of testing on this a couple of years ago to make the best use of our redis storage capacity. We determined that zlib had the best overall balance between compression ratio and compress/decompress performance at various data sizes. High-performance reads/writes are needed and the data sizes are very similar to our internal use case, so I used it on that basis.
On npm I can see that the last version was published 11 years ago. Also, the GitHub repo for this package says it's deprecated:
> This extension is deprecated since the functionality was folded into node.js core: https://nodejs.org/dist/latest/docs/api/zlib.html

So to me it looks like the built-in node module would be better here, in order to avoid using deprecated packages.
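To illustrate, the helpers discussed above could be written against the built-in module roughly like this (a sketch with the argument names and jsdoc adjusted per the earlier comments, not the PR's final code):

```js
import zlib from 'zlib';
import { promisify } from 'util';

const deflate = promisify(zlib.deflate);
const inflate = promisify(zlib.inflate);

/**
 * @param {string|Buffer} data - Data to compress.
 * @returns {Promise<Buffer>} Compressed data.
 */
async function compress (data) {
	// zlib.deflate accepts strings as well as buffers, so no Buffer.from is needed
	return deflate(data, { level: 6 });
}

/**
 * @param {Buffer} buffer - Compressed data.
 * @returns {Promise<string>} Decompressed string.
 */
async function decompress (buffer) {
	return (await inflate(buffer)).toString();
}
```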
value = path.normalize(value);
if (process.platform == 'win32' && _.startsWith(value, path.sep)) {
if (process.platform === 'win32' && _.startsWith(value, path.sep)) {
Could you please also replace `_.startsWith` with the native String `startsWith` and get rid of lodash here?
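i.e. roughly (assuming `value` is always a string at this point):

```js
if (process.platform === 'win32' && value.startsWith(path.sep)) {
	// ...
}
```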
Checked the tests - all look good; added a few minor questions.
I haven't tested the changes locally; I plan to do that in the next few days and will get back.
});
await scrape(options);

(await fs.stat(testDirname + '/index.html')).isFile().should.be.eql(true);
Can we use
await `${testDirname}/index.html`.should.fileExists(true);
here in the same way we have it in other tests?
});
await scrape(options);

(await fs.stat(testDirname + '/index.html')).isFile().should.be.eql(true);
and here - can we use `.should.fileExists(true)`?
fs.existsSync(testDirname + '/index_1.html').should.be.eql(true);
fs.existsSync(testDirname + '/google.png').should.be.eql(true);
// should load css file and fonts from css file
(await fs.stat(testDirname + '/css.css')).isFile().should.be.eql(true); // http://fonts.googleapis.com/css?family=Lato
Same question for the checks in this file - can we use `.should.fileExists(true)`?
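For comparison, the two assertion styles under discussion (the `fileExists` form relies on the project's existing custom should helper referenced above):

```js
// current form in this PR
(await fs.stat(testDirname + '/css.css')).isFile().should.be.eql(true);

// suggested form, consistent with other tests
await `${testDirname}/css.css`.should.fileExists(true);
```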
I tested the 3 different temp modes and can confirm that they improve memory usage ✅ To test the modes I created a fake endless website with unique images and links on each page and tried to scrape it with a script from a docker container with memory limited to 64Mb.
Another interesting idea to improve memory usage is to use smaller …
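Not the setup described above (that used a docker container limited to 64Mb), but a quick complementary way to watch memory from inside the scrape script itself:

```js
// Log resident set size and heap usage every 5 seconds while the scrape runs
const timer = setInterval(() => {
	const { rss, heapUsed } = process.memoryUsage();
	console.log(`rss=${(rss / 1048576).toFixed(1)}MB heapUsed=${(heapUsed / 1048576).toFixed(1)}MB`);
}, 5000);
timer.unref(); // don't keep the process alive just for this logging
```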
Ahh man, I was crossing my fingers on this PR getting merged. @phawxby Any insight? Or have you branched this repo, since it seems Sophie doesn't have time right now to maintain it?
Sadly not, and I'm actually no longer employed by the organisation. Feel free to fork, fix the remaining issues, and reopen the PR if you like, as I definitely won't have time at the moment. Sorry.
There seems to be no way to get the commits since the repo was deleted. When I pull down this repo and try to cherry-pick the SHAs in your PR, they can't be found. GitHub seems to have the data somewhere, since the PR changes are still visible, but wherever they keep that info, it's not in the actual git repo. Do you have access to the repo in any form?
This closes #386 and is an extension of #496. It provides new options to store data in memory, in memory and compressed, or on the filesystem. Unlike #496, this would definitely require a major release as it makes significant changes to resource handling.

This PR also restructures a lot of tests and removes `fs-extra`, as it's mostly not needed, but also because in our larger internal project we found it introduces a lot of compatibility issues, for example with `memfs`, which would be the best way of running filesystem-based tests - but this PR was big enough already.

Edit: Last night I completed a scrape of 11 non-English websites in parallel using filesystem caching. No out-of-memory errors and a total exported size of 4.2GB.
which would be the best way of testing filesystem based tests, but this PR was big enough already.Edit: Last night I completed a scrape of 11 non-English websites in parallel using filesystem caching. No out of memory errors and total exported size of 4.2GB.