feat: improve memory performance and custom container classes #497
Conversation
not finished yet
Hey @phawxby Thanks a lot for your PR, I didn't have time to implement such big changes so I'm happy to see your proposal 👍 Do you have a way to compare the current version vs these changes? It would be nice to have something that proves these changes decrease memory usage.
I guess we could have some people from #386 fire up this branch and see how they fare.
@s0ph1e do you have any suggestions on how to resolve the Code Climate issues? The limits feel very low to me.
Nevermind, I figured out a way.
I've started a review and briefly checked the source code (without tests). In short - I don't have major objections and I think that after a few iterations we can finish and merge these changes.
Please find some comments and questions below, and expect more comments to come soon after I check the tests and test the changes locally.
Thank you again for the PR and your patience 👍
@@ -83,15 +85,44 @@ How to download website to existing directory and why it's not supported by defa

#### sources
Array of objects to download, specifies selectors and attribute values to select files for downloading. By default scraper tries to download all possible resources. Scraper uses cheerio to select html elements so `selector` can be any [selector that cheerio supports](https://github.com/cheeriojs/cheerio#selectors).

You can also specify custom `containerClass`es; these are responsible for reading from and writing to attributes. For example if you want to read JSON from an attribute...
The `containerClass` functionality looks more advanced. I suggest:
- leaving the basic example at the beginning as it was before, to make it easier to understand for the majority of people who might not need advanced functionality
- creating another, separate example with `containerClass`, maybe with a close-to-real-life example of HTML which may need such a feature
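To illustrate the idea only (every class name, method name, and option key below is hypothetical - the actual interface is whatever this PR defines, and a real README example should use that): a container class owns how a value is read from an attribute and written back to it, e.g. when the attribute holds JSON rather than a plain URL.

```js
// Hypothetical sketch - class shape and option names are illustrative, not the PR's API.
class JsonAttributeContainer {
	constructor (attributeValue) {
		this.data = JSON.parse(attributeValue); // read: attribute text -> object
	}

	getPaths () {
		// expose URLs found inside the JSON so the scraper can download them
		return [this.data.src].filter(Boolean);
	}

	updateText (oldUrl, newUrl) {
		// write: swap the downloaded URL in and serialise back to attribute text
		if (this.data.src === oldUrl) {
			this.data.src = newUrl;
		}
		return JSON.stringify(this.data);
	}
}

const sources = [
	{ selector: 'div[data-config]', attr: 'data-config', containerClass: JsonAttributeContainer }
];
```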
@@ -57,6 +57,8 @@ scrape(options).then((result) => {});
* [urlFilter](#urlfilter) - skip some urls
* [filenameGenerator](#filenamegenerator) - generate filename for downloaded resource
* [requestConcurrency](#requestconcurrency) - set maximum concurrent requests
* [tempMode](#tempMode) - how to store data temporarily during processing
* [tempDir](#tempMode) - the directory to use to store temp files when `tempMode === fs`
Does it mean that the user should provide two directories - one for output, another for temporary files?
~~Do we still have the same behavior for `tempDir` - throwing an error if it exists, and cleanup on error or after the finish?~~ found an answer in the code
Should we maybe leave only a tmp directory generated inside the scraper with `fs.mkdtemp`? The reasons for that are:
- to avoid checking whether the passed directory already exists and to avoid accidentally removing a previously existing directory with user data
- to reduce the number of unnecessary options. I don't see cases where the generated directory would not be sufficient, and 2 directories (`directory`, `tempDir`) may be a bit confusing. But if you see when it can be useful - please let me know, I'm open for a discussion
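For reference, a minimal sketch of the generated-temp-directory approach suggested above (the directory prefix and cleanup strategy are illustrative):

```js
import os from 'os';
import path from 'path';
import { promises as fs } from 'fs';

async function withTempDir (run) {
	// Unique directory under the OS temp root, e.g. /tmp/website-scraper-XXXXXX
	const tempDir = await fs.mkdtemp(path.join(os.tmpdir(), 'website-scraper-'));
	try {
		return await run(tempDir);
	} finally {
		// Always clean up, whether the scrape succeeded or failed
		await fs.rm(tempDir, { recursive: true, force: true });
	}
}
```

Because the directory is created fresh with a unique name, there is nothing pre-existing to check for or accidentally delete.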
@@ -63,7 +63,9 @@ const config = {
recursive: false,
maxRecursiveDepth: null,
maxDepth: null,
ignoreErrors: false,
tempMode: 'memory', // 'memory-compressed', 'fs'
Will we use 'fs' or 'filesystem'?
@@ -41,6 +41,10 @@ function throwTypeError (result) {
}

function getData (result) {
if (typeof result === 'string') {
I suggest checking whether the result has one of the supported types and throwing an error if the type is different, instead of checking whether a specific type is not supported. I think that will make the code easier to understand for developers, because the supported types will be clear at the beginning of the function.
I mean something like
const resultType = typeof result;
if (resultType !== 'object') { /* throw an error */ }
/* working with object */
// instead of
if (resultType === 'string') { /* throw an error */ }
/* working with object */
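Spelled out slightly more (a sketch only - `throwTypeError` is the existing helper visible in the diff above):

```js
function getData (result) {
	// State the supported type up front; reject anything else immediately
	if (typeof result !== 'object' || result === null) {
		throwTypeError(result);
	}
	// ... continue working with the object ...
	return result;
}
```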
this.tempMode = tempMode || 'memory';
this.tempDir = tempDir;

if (this.tempMode === 'filesystem' && !this.tempDir) {
It looks like `tempDir` is already set in lib/scraper.js, so we will have it defined at this point.
}

const inflate = promisify(zlib.inflate);
const defalate = promisify(zlib.deflate);
Typo: `defalate` -> `deflate`
* @param text - String to decompress.
* @returns - Decompressed string.
*/
async function decompress (buffer) {
So does it work with a string or with a buffer? Could you please update the argument name or the jsdoc to make it more evident?
* @returns - Compressed string.
*/
async function compress (text) {
return (await defalate(Buffer.from(text), { level: 6 }));
Do we need `Buffer.from(text)` here? It seems that deflate also works with strings, not only with buffers.
"got": "^12.0.0", | ||
"lodash": "^4.17.21", | ||
"normalize-url": "^7.0.2", | ||
"p-queue": "^7.1.0", | ||
"sanitize-filename": "^1.6.3", | ||
"srcset": "^5.0.0" | ||
"srcset": "^5.0.0", | ||
"zlib": "^1.0.5" |
Why are we using `zlib` from npm instead of the built-in node.js module?
We actually did a bunch of testing on this a couple of years ago to make the best use of our redis storage capacity. We determined that zlib had the best overall balance between compression ratio and compress/decompress performance at various data sizes. High-performance reads/writes are needed and the data sizes are very similar to our internal use case, so I used it on that basis.
On npm I can see that the last version was published 11 years ago. Also, the GitHub repo for this package says it's deprecated:
> This extension is deprecated since the functionality was folded into node.js core: https://nodejs.org/dist/latest/docs/api/zlib.html

So to me it looks like the built-in node module would be better here, in order to avoid using deprecated packages.
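To illustrate, the helpers discussed above could be written against the built-in module roughly like this (a sketch with the argument names and jsdoc adjusted per the earlier comments, not the PR's final code):

```js
import zlib from 'zlib';
import { promisify } from 'util';

const deflate = promisify(zlib.deflate);
const inflate = promisify(zlib.inflate);

/**
 * @param {string|Buffer} data - Data to compress.
 * @returns {Promise<Buffer>} Compressed data.
 */
async function compress (data) {
	// zlib.deflate accepts strings as well as buffers, so no Buffer.from is needed
	return deflate(data, { level: 6 });
}

/**
 * @param {Buffer} buffer - Compressed data.
 * @returns {Promise<string>} Decompressed string.
 */
async function decompress (buffer) {
	return (await inflate(buffer)).toString();
}
```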
value = path.normalize(value);
if (process.platform == 'win32' && _.startsWith(value, path.sep)) {
if (process.platform === 'win32' && _.startsWith(value, path.sep)) {
Could you please also replace `_.startsWith` with the native String `startsWith` and get rid of lodash here?
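i.e. roughly (assuming `value` is always a string at this point):

```js
if (process.platform === 'win32' && value.startsWith(path.sep)) {
	// ...
}
```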
Checked the tests - all look good; added a few minor questions.
I haven't tested the changes locally; I plan to do that in the next few days and will get back.
});
await scrape(options);

(await fs.stat(testDirname + '/index.html')).isFile().should.be.eql(true);
Can we use
await `${testDirname}/index.html`.should.fileExists(true);
here in the same way we have it in other tests?
});
await scrape(options);

(await fs.stat(testDirname + '/index.html')).isFile().should.be.eql(true);
and here - can we use `.should.fileExists(true)`?
fs.existsSync(testDirname + '/index_1.html').should.be.eql(true);
fs.existsSync(testDirname + '/google.png').should.be.eql(true);
// should load css file and fonts from css file
(await fs.stat(testDirname + '/css.css')).isFile().should.be.eql(true); // http://fonts.googleapis.com/css?family=Lato
Same question for the checks in this file - can we use `.should.fileExists(true)`?
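For comparison, the two assertion styles under discussion (the `fileExists` form relies on the project's existing custom should helper referenced above):

```js
// current form in this PR
(await fs.stat(testDirname + '/css.css')).isFile().should.be.eql(true);

// suggested form, consistent with other tests
await `${testDirname}/css.css`.should.fileExists(true);
```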
I tested the 3 different temp modes and can confirm that they improve memory usage ✅ To test the modes I created a fake endless website with unique images and links on each page and tried to scrape it with a script from a docker container with memory limited to 64Mb.
Another interesting idea to improve memory usage is to use smaller …
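Not the setup described above (that used a docker container limited to 64Mb), but a quick complementary way to watch memory from inside the scrape script itself:

```js
// Log resident set size and heap usage every 5 seconds while the scrape runs
const timer = setInterval(() => {
	const { rss, heapUsed } = process.memoryUsage();
	console.log(`rss=${(rss / 1048576).toFixed(1)}MB heapUsed=${(heapUsed / 1048576).toFixed(1)}MB`);
}, 5000);
timer.unref(); // don't keep the process alive just for this logging
```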
Ahh man, I was crossing my fingers on this PR getting merged. @phawxby Any insight? Or have you branched this repo, since it seems Sophie doesn't have time right now to maintain it?
Sadly not, and I'm actually no longer employed by the organisation. Feel free to fork, fix the remaining issues, and reopen the PR if you like, as I definitely won't have time at the moment. Sorry.
There seems to be no way to get the commits since the repo was deleted. When I pull down this repo and try to cherry-pick the SHAs in your PR, they can't be found. GitHub seems to have the data somewhere, since the PR changes are still visible, but wherever they keep that info, it's not in the actual git repo. Do you have access to the repo in any form?
This closes #386 and is an extension of #496. It provides new options to store data in memory, in memory and compressed, or on the filesystem. Unlike #496, this would definitely require a major release as it makes significant changes to resource handling.

This PR also restructures a lot of tests and removes `fs-extra`, as it's mostly not needed, but also because in our larger internal project we found it introduces a lot of compatibility issues, for example with `memfs`, which would be the best way of running filesystem-based tests - but this PR was big enough already.

Edit: Last night I completed a scrape of 11 non-English websites in parallel using filesystem caching. No out-of-memory errors and a total exported size of 4.2GB.
which would be the best way of testing filesystem based tests, but this PR was big enough already.Edit: Last night I completed a scrape of 11 non-English websites in parallel using filesystem caching. No out of memory errors and total exported size of 4.2GB.