feat: improve memory performance and custom container classes #497
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,6 +57,8 @@ scrape(options).then((result) => {}); | |
* [urlFilter](#urlfilter) - skip some urls | ||
* [filenameGenerator](#filenamegenerator) - generate filename for downloaded resource | ||
* [requestConcurrency](#requestconcurrency) - set maximum concurrent requests | ||
* [tempMode](#tempMode) - How to store data temporarily during processing | ||
* [tempDir](#tempMode) - The directory to use to store temp files when `tempMode === fs` | ||
* [plugins](#plugins) - plugins, allow to customize filenames, request options, response handling, saving to storage, etc. | ||
|
||
Default options you can find in [lib/config/defaults.js](https://github.com/website-scraper/node-website-scraper/blob/master/lib/config/defaults.js) or get them using | ||
|
@@ -83,15 +85,44 @@ How to download website to existing directory and why it's not supported by defa | |
|
||
#### sources | ||
Array of objects to download, specifies selectors and attribute values to select files for downloading. By default scraper tries to download all possible resources. Scraper uses cheerio to select html elements so `selector` can be any [selector that cheerio supports](https://github.com/cheeriojs/cheerio#selectors). | ||
|
||
You can also specify a custom `containerClass`, which is responsible for reading from and writing to an attribute. For example, if you want to read JSON from an attribute: | |
|
||
|
||
```javascript | ||
class JsonContainerClass { | ||
constructor (text) { | ||
this.text = text || ''; | ||
this.paths = []; | ||
|
||
if (this.text) { | ||
this.paths = JSON.parse(this.text); | ||
} | ||
} | ||
|
||
getPaths () { | ||
return this.paths; | ||
} | ||
|
||
updateText (pathsToUpdate) { | ||
this.paths = this.paths.map((oldPath) => { | ||
const toUpdate = pathsToUpdate.find((x) => x.oldPath === oldPath); | ||
|
||
return toUpdate ? toUpdate.newPath : oldPath; | ||
}); | ||
|
||
return JSON.stringify(this.paths); | ||
} | ||
} | ||
|
||
// Downloading images, css files and scripts | ||
scrape({ | ||
urls: ['http://nodejs.org/'], | ||
directory: '/path/to/save', | ||
sources: [ | ||
{selector: 'img', attr: 'src'}, | ||
{selector: 'link[rel="stylesheet"]', attr: 'href'}, | ||
{selector: 'script', attr: 'src'} | ||
{ selector: 'img', attr: 'src' }, | ||
{ selector: 'link[rel="stylesheet"]', attr: 'href' }, | ||
{ selector: 'script', attr: 'src' }, | ||
{ selector: 'div', attr: 'data-json', containerClass: JsonContainerClass } | ||
] | ||
}); | ||
``` | ||
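The `JsonContainerClass` example above suggests that a container class needs a constructor taking the attribute text, a `getPaths()` method, and an `updateText(pathsToUpdate)` method. As a minimal sketch under that assumption, here is a second, hypothetical container class (`CommaSeparatedContainerClass` is not part of the PR) for an attribute that holds a comma-separated list of paths, exercised outside the scraper:

```javascript
// Hypothetical container class for an attribute such as
// <div data-images="a.png, b.png">. It mirrors the interface of the
// JsonContainerClass example; the name and attribute format are assumptions.
class CommaSeparatedContainerClass {
  constructor (text) {
    this.text = text || '';
    // Split on commas, trim whitespace, and drop empty entries
    this.paths = this.text.split(',').map((p) => p.trim()).filter(Boolean);
  }

  getPaths () {
    return this.paths;
  }

  updateText (pathsToUpdate) {
    // Replace each known old path with its new path, keep the rest unchanged
    this.paths = this.paths.map((oldPath) => {
      const toUpdate = pathsToUpdate.find((x) => x.oldPath === oldPath);
      return toUpdate ? toUpdate.newPath : oldPath;
    });

    return this.paths.join(', ');
  }
}

const container = new CommaSeparatedContainerClass('a.png, b.png');
const updated = container.updateText([{ oldPath: 'a.png', newPath: 'images/a.png' }]);
console.log(updated); // images/a.png, b.png
```

Such a class would presumably be passed the same way as in the example above, for instance `{ selector: 'div', attr: 'data-images', containerClass: CommaSeparatedContainerClass }`.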
|
@@ -199,6 +230,13 @@ scrape({ | |
#### requestConcurrency | ||
Number, maximum amount of concurrent requests. Defaults to `Infinity`. | ||
|
||
#### tempMode | ||
|
||
How to store temporary data when processing | ||
|
||
* `memory` - Data is stored in memory in its raw format (default). | |
* `memory-compressed` - Data is stored in memory but compressed using zlib. This is more memory efficient at the expense of CPU time spent compressing and decompressing. | |
* `filesystem` - Data is stored in temporary files on the filesystem. This is the most memory efficient but it is strongly recommended to only use this mode with a solid state drive. | |
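To illustrate the trade-off behind `memory-compressed`, here is a rough sketch of what zlib-based helpers could look like. The PR imports `compress` and `decompress` from `./utils/index.js`, but their implementation is not shown in this diff, so the functions below are assumptions built on Node's `zlib.gzip` and `zlib.gunzip`, not the PR's actual code. CommonJS `require` is used only so the sketch runs as a standalone script.

```javascript
const zlib = require('node:zlib');
const { promisify } = require('node:util');

const gzip = promisify(zlib.gzip);
const gunzip = promisify(zlib.gunzip);

// Hypothetical helper: compress a string into a gzip Buffer
async function compress (text, encoding = 'utf8') {
  return gzip(Buffer.from(text, encoding));
}

// Hypothetical helper: decompress a gzip Buffer back into a raw Buffer
async function decompress (buffer) {
  return gunzip(buffer);
}

// Compress and restore a string, reporting how many bytes were held in memory
async function roundTrip (text) {
  const packed = await compress(text);
  const unpacked = (await decompress(packed)).toString('utf8');

  return { packedBytes: packed.length, unpacked };
}

roundTrip('<p>hello</p>'.repeat(1000)).then(({ packedBytes, unpacked }) => {
  console.log(`original: ${unpacked.length} chars, compressed: ${packedBytes} bytes`);
});
```

Repetitive text such as HTML tends to compress well, which is where the memory saving comes from; every read and write pays for a compression or decompression pass.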
|
||
#### plugins | ||
|
||
|
@@ -331,7 +369,6 @@ Promise should be resolved with: | |
* `body` (response body, string) | ||
* `encoding` (`binary` or `utf8`) used to save the file, binary used by default. | ||
* `metadata` (object) - everything you want to save for this resource (like headers, original text, timestamps, etc.), scraper will not use this field at all, it is only for result. | ||
* a binary `string`. This is advised against because of the binary assumption being made can foul up saving of `utf8` responses to the filesystem. | ||
|
||
If multiple actions `afterResponse` added - scraper will use result from last one. | ||
```javascript | ||
|
@@ -430,7 +467,7 @@ If multiple actions `saveResource` added - resource will be saved to multiple st | |
```javascript | ||
registerAction('saveResource', async ({resource}) => { | ||
const filename = resource.getFilename(); | ||
const text = resource.getText(); | ||
const text = await resource.getText(); | ||
await saveItSomewhere(filename, text); | ||
}); | ||
``` | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -63,7 +63,9 @@ const config = { | |
recursive: false, | ||
maxRecursiveDepth: null, | ||
maxDepth: null, | ||
ignoreErrors: false | ||
ignoreErrors: false, | ||
tempMode: 'memory', // 'memory-compressed', 'fs' | ||
Review comment: Will we use 'fs' or 'filesystem'? |
||
tempDir: undefined | ||
}; | ||
|
||
export default config; |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -41,6 +41,10 @@ function throwTypeError (result) { | |
} | ||
|
||
function getData (result) { | ||
if (typeof result === 'string') { | ||
Review comment: I suggest checking whether the result has one of the supported types and throwing an error if it does not, instead of checking whether one specific type is unsupported. I think that will make the code easier to understand for developers, because the supported types will be clear at the beginning of the function. I mean something like:
const resultType = typeof result;
if (resultType !== 'object') { /* throw an error */ }
/* working with object */
// instead of
if (resultType === 'string') { /* throw an error */ }
/* working with object */ |
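The reviewer's suggestion can be sketched as a standalone function. This is an illustration only: the rest of the real `getData` is truncated in the diff, so the version below is a hypothetical stand-in that assumes every non-object result (including `null`) should be rejected and that the data is either `result.body` or the object itself.

```javascript
// Hypothetical stand-in for getData, following the "allow-list" style
// suggested in the review: state the supported type first, reject the rest.
function getData (result) {
  // typeof null is 'object', so null has to be rejected explicitly
  if (result === null || typeof result !== 'object') {
    const actual = result === null ? 'null' : typeof result;
    throw new Error(`afterResponse handler returned ${actual}, expected object`);
  }

  // From here on the result is known to be an object
  return 'body' in result ? result.body : result;
}

console.log(getData({ body: '<html></html>', encoding: 'utf8' })); // <html></html>

try {
  getData('<html></html>');
} catch (err) {
  console.log(err.message); // afterResponse handler returned string, expected object
}
```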
||
throw new Error('afterResponse handler returned a string, expected object'); | ||
} | ||
|
||
let data = result; | ||
if (result && typeof result === 'object' && 'body' in result) { | ||
data = result.body; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,20 @@ | ||
import types from './config/resource-types.js'; | ||
import crypto from 'crypto'; | ||
import fs from 'fs/promises'; | ||
import path from 'path'; | ||
import { compress, decompress } from './utils/index.js'; | ||
|
||
class Resource { | ||
constructor (url, filename) { | ||
this.url = url; | ||
this.filename = filename; | ||
constructor (url, filename, tempMode, tempDir) { | ||
this.tempMode = tempMode || 'memory'; | ||
this.tempDir = tempDir; | ||
|
||
if (this.tempMode === 'filesystem' && !this.tempDir) { | ||
Review comment: It looks like |
||
throw new Error('tmpDir must be provided in tmpMode=filesystem'); | ||
} | ||
|
||
this.setUrl(url); | ||
this.setFilename(filename); | ||
|
||
this.type = null; | ||
this.depth = 0; | ||
|
@@ -16,7 +27,7 @@ class Resource { | |
} | ||
|
||
createChild (url, filename) { | ||
const child = new Resource(url, filename); | ||
const child = new Resource(url, filename, this.tempMode, this.tempDir); | ||
let currentDepth = this.getDepth(); | ||
|
||
child.parent = this; | ||
|
@@ -39,6 +50,12 @@ class Resource { | |
} | ||
|
||
setUrl (url) { | ||
if (this.tempDir) { | ||
// Generate a unique filename based on the md5 hash of the url | ||
const tmpName = `${crypto.createHash('md5').update(url).digest('hex')}.txt`; | ||
this.tempPath = path.join(this.tempDir, tmpName); | ||
} | ||
|
||
this.url = url; | ||
} | ||
|
||
|
@@ -50,12 +67,34 @@ class Resource { | |
this.filename = filename; | ||
} | ||
|
||
getText () { | ||
return this.text; | ||
async getText () { | ||
switch (this.tempMode) { | ||
case 'memory': | ||
return await this.text; | ||
case 'memory-compressed': | ||
return (await decompress(this.text)).toString(this.getEncoding()); | ||
Review comment: Do we need |
||
case 'filesystem': | ||
return await fs.readFile(this.tempPath, { encoding: this.getEncoding() }); | ||
default: | ||
throw new Error(`Unknown tempMode: ${this.tempMode}`); | ||
} | ||
} | ||
|
||
setText (text) { | ||
this.text = text; | ||
async setText (text) { | ||
switch (this.tempMode) { | ||
case 'memory': | ||
this.text = text; | ||
break; | ||
case 'memory-compressed': | ||
this.text = await compress(text); | ||
break; | ||
case 'filesystem': | ||
await fs.mkdir(this.tempDir, { recursive: true }); | ||
await fs.writeFile(this.tempPath, text, { encoding: this.getEncoding() }); | ||
break; | ||
default: | ||
throw new Error(`Unknown tempMode: ${this.tempMode}`); | ||
} | ||
} | ||
|
||
getDepth () { | ||
|
Review comment: Does it mean that the user should provide two directories, one for output and another for temporary files?
Do we still have the same behavior for `tempDir`, throwing an error if it exists and cleaning up on error or after the finish? (Found an answer in code.)
Should we maybe leave only a tmp directory generated inside the scraper with `fs.mkdtemp`? The reasons for that are: ... (`directory`, `tempDir`) may be a bit confusing. But if you see when it can be useful, please let me know; I'm open to a discussion.
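As a rough illustration of the `fs.mkdtemp` idea raised in the comment above, the sketch below creates a uniquely named temporary directory under the OS temp location, runs some work against it, and removes it whether the work succeeds or fails. The helper name `withTempDir` and the `website-scraper-` prefix are hypothetical; nothing here is taken from the PR.

```javascript
const fs = require('node:fs/promises');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: run `fn` with a fresh temporary directory and always
// clean the directory up afterwards, on success and on error.
async function withTempDir (fn) {
  // mkdtemp appends random characters to the prefix, so names do not collide
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'website-scraper-'));

  try {
    return await fn(dir);
  } finally {
    await fs.rm(dir, { recursive: true, force: true });
  }
}

// Usage: store and re-read a resource body inside the temporary directory
withTempDir(async (dir) => {
  const file = path.join(dir, 'resource.txt');
  await fs.writeFile(file, 'temporary body', { encoding: 'utf8' });

  return fs.readFile(file, { encoding: 'utf8' });
}).then((text) => console.log(text));
```

With this approach the user would only configure the output `directory`, and the temporary location would be an internal detail of the scraper.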