Simple, lightweight and expressive web scraping with Node.js
var scrapy = require('node-scrapy')
  , url = 'https://github.com/strongloop/express'
  , selector = '.repository-description'

scrapy.scrape(url, selector, function(err, data) {
  if (err) return console.error(err)
  console.log(data)
});
// 'Fast, unopinionated, minimalist web framework for node.'
Scrapy can resolve complex objects. Give it a data model:
var scrapy = require('node-scrapy')
  , url = 'https://github.com/strongloop/express'
  , model =
    { author: '.author',
      repo: '.js-current-repository',
      stats:
       { commits: '.commits .num',
         branches: '.numbers-summary > li.commits + li .num',
         releases: '.numbers-summary > li.commits + li + li .num',
         contributors: '.numbers-summary > li.commits + li + li + li .num',
         social:
          { stars: '.star-button + .social-count',
            forks: '.fork-button + .social-count' } },
      files: '.js-directory-link' }

scrapy.scrape(url, model, function(err, data) {
  if (err) return console.error(err)
  console.log(data)
});
...and Scrapy will return:
{ author: 'strongloop',
  repo: 'express',
  stats:
   { commits: '4,925',
     branches: '12',
     releases: '223',
     contributors: '162',
     social: { stars: '16,132', forks: '3,340' } },
  files:
   [ 'benchmarks', 'examples', 'lib', 'support', 'test', '.gitignore', '.travis.yml',
     'Contributing.md', 'History.md', 'LICENSE', 'Readme.md', 'index.js', 'package.json' ] }
npm install node-scrapy
🍠 Simple: No XPaths. No complex object inheritance. No extensive config files. Just JSON and the CSS selectors you're used to. Simple as potatoes.
⚡ Lightweight: Scrapy relies only on cheerio, request, and a Lo-Dash custom build, all known for being fast.
📢 Expressive: It's easy to talk to Scrapy. It will assume a lot of handy defaults to get what you actually meant to get. If Scrapy misunderstands, you can try to express yourself better using its options.
Scrapy wraps cheerio and request to fetch and parse HTML over the wire. Cheerio doesn't execute JavaScript, and neither does Scrapy, so client-side-rendered pages may not yield the content you see in a browser. You can check this by visiting the page in your favorite browser with JavaScript disabled.
If the page you're trying to scrape is client-side-rendered, you can still change the HTTP user-agent to tell the server that the client is a machine; with luck, the server will return a non-AJAX version of the page. You may check this list of bots' user-agents and configure Scrapy through its request options to present itself as a bot.
So far, Scrapy exposes only one method:
A string representing a valid URL of the resource to scrape.
It can be either a string with the CSS selector of the element(s) to retrieve:
var url = 'https://www.npmjs.org/package/mocha'
  , model = '.package-description'
scrapy.scrape(url, model, console.log)
// null 'simple, flexible, fun test framework'
// ^ no error passed to console.log
or an object whose enumerable properties hold CSS selectors:
var url = 'https://www.npmjs.org/package/mocha'
  , model = { description: '.package-description', keywords: 'h3:contains(Keywords) + p a' }
scrapy.scrape(url, model, console.log)
/*
{ description: 'simple, flexible, fun test framework',
keywords:
[ 'mocha',
'test',
'bdd',
'tdd',
'tap' ] }
*/
or nested objects with embedded options for each item, in which case a selector key holding a CSS selector is required:
var url = 'https://www.npmjs.org/package/mocha'
  , model =
    { description: { selector: '.package-description', required: true },
      maintainers:
       { selector: '.humans li a',
         get: 'href',
         prefix: 'https://www.npmjs.org' } }
scrapy.scrape(url, model, console.log)
/*
{ description: 'simple, flexible, fun test framework',
  maintainers:
   [ 'https://www.npmjs.org/~travisjeffery',
     'https://www.npmjs.org/~tjholowaychuk',
     'https://www.npmjs.org/~travisjeffery',
     'https://www.npmjs.org/~jbnicolai',
     'https://www.npmjs.org/~boneskull' ] }
*/
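To make the three accepted model shapes concrete, here is a minimal sketch of how they could be folded into one uniform form. The `normalizeModel` helper is hypothetical and is not part of node-scrapy's API; it only illustrates the rule described above, that an object with a string `selector` key is treated as a single item with options, while any other object is treated as a nested model.

```javascript
// Hypothetical helper for illustration only; node-scrapy may work differently.
function normalizeModel(model) {
  // Form 1: a bare CSS selector string.
  if (typeof model === 'string') return { selector: model };
  // Form 3: an item carrying its own options; `selector` is mandatory.
  if (typeof model.selector === 'string') return Object.assign({}, model);
  // Form 2: a plain object whose properties are themselves models.
  var out = {};
  Object.keys(model).forEach(function (key) {
    out[key] = normalizeModel(model[key]);
  });
  return out;
}

var normalized = normalizeModel({
  description: '.package-description',
  maintainers: { selector: '.humans li a', get: 'href' }
});
console.log(normalized);
```

One consequence of this rule, at least in the sketch, is that a model cannot have a plain field named `selector` holding a string, because it would be read as an item's selector.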
This is an optional Object. It lets you set request's options, cheerio's load options, and/or your own default options for every item passed into the model.
You can always look up Scrapy's defaults in the defaults.json file.
Important: the following options can be set on a per-item basis inside the model. Setting these options in options.itemOptions will simply overwrite the defaults used for the current .scrape() call.
A string representing a CSS selector. It must be compliant with CSSselector's supported selectors.
Part of the selected element(s) to retrieve.
'text': the DOM equivalent of Node.textContent.
'{attribute}': gets the value of the given attribute, e.g. 'src', 'href', 'disabled', etc.
Default: 'text'
false: nothing happens.
true: Scrapy will stop and call back with an Error as first argument if no element in the page matches the selector. err.bodyString holds the entire HTTP response body for debugging purposes.
Default: false
Heads up! If no element matches the selector, the result will always be null, except when required is set to true, in which case Scrapy calls back with an Error.
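The interaction between an empty match and `required` can be sketched as follows. This is an illustration of the behaviour described above, not node-scrapy's actual code: the `resolveMissing` helper, its argument names, and the error message are all assumptions; only `err.bodyString` and the `null` result come from the description.

```javascript
// Hypothetical helper for illustration only.
// `matches` holds the values extracted for one item,
// `body` is the raw HTTP response body as a string.
function resolveMissing(matches, required, body) {
  if (matches.length > 0) return { err: null, data: matches };
  if (required === true) {
    // Assumed message text; the documented part is the bodyString property.
    var err = new Error('No element matched the selector');
    err.bodyString = body;
    return { err: err, data: null };
  }
  // Not required: an empty match simply yields null.
  return { err: null, data: null };
}

var outcome = resolveMissing([], true, '<html></html>');
console.log(outcome.err instanceof Error, outcome.err.bodyString);
```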
'auto': if a single element matched the selector, a string will be returned with its result. If many elements matched the selector, an Array of strings holding the result of each element will be returned.
true: a single string will be returned, no matter how many elements matched the selector. Only the first one will be taken.
false: even if a single element matched the selector, it will be returned boxed into an Array.
Default: 'auto'
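The three modes above boil down to a small decision on the list of matched results. Below is a minimal sketch of that decision; the `applyUnique` helper and the option name `unique` are assumptions made for illustration, and the `null` for an empty match follows the "Heads up" note rather than anything specific to this option.

```javascript
// Hypothetical helper for illustration only; the option name `unique` is assumed.
function applyUnique(results, unique) {
  // No match at all: the result is null regardless of the mode.
  if (results.length === 0) return null;
  // true: always a single string, taken from the first match.
  if (unique === true) return results[0];
  // false: always an Array, even for a single match.
  if (unique === false) return results;
  // 'auto': a string for one match, an Array for several.
  return results.length === 1 ? results[0] : results;
}

console.log(applyUnique(['mocha'], 'auto'));
console.log(applyUnique(['mocha', 'test'], 'auto'));
console.log(applyUnique(['mocha', 'test'], true));
console.log(applyUnique(['mocha'], false));
```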
Trims the result before any other transformation, like prefix/suffix.
false: will not trim.
'left': trim left.
'right': trim right.
true: will trim both sides.
Default: true
A function applied after all other operations and transformations.
Default: function() { return this.toString(); }
A string to be prefixed to the result(s). Useful to transform relative URLs into absolute ones.
Default: '' (empty string)
A string to be appended to the result(s).
Default: '' (empty string)
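The order of operations described above (trim first, then prefix and suffix, then the user function last) can be sketched like this. It is an illustration under stated assumptions, not node-scrapy's implementation: the `applyItemOptions` helper and the option names `trim` and `transform` are hypothetical, while `prefix`, `suffix`, and the default function body come from the text.

```javascript
// Hypothetical helper for illustration only.
// Assumed option names: `trim` and `transform`.
function applyItemOptions(raw, opts) {
  var result = raw;
  // 1. Trim before any other transformation.
  if (opts.trim === true) result = result.trim();
  else if (opts.trim === 'left') result = result.replace(/^\s+/, '');
  else if (opts.trim === 'right') result = result.replace(/\s+$/, '');
  // 2. Add the prefix and suffix.
  result = opts.prefix + result + opts.suffix;
  // 3. Run the user function last, with the result bound to `this`.
  return opts.transform.call(result);
}

var defaults = {
  trim: true,
  prefix: '',
  suffix: '',
  transform: function () { return this.toString(); }
};

// Turn a relative link into an absolute one, as in the maintainers example.
var opts = Object.assign({}, defaults, { prefix: 'https://www.npmjs.org' });
var href = applyItemOptions('  /~boneskull \n', opts);
console.log(href);
```

Trimming first matters here: without it, the whitespace around the raw `href` would end up in the middle of the absolute URL.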
These options are passed to cheerio on load. You can check all available options in htmlparser2's wiki (on which cheerio relies).
Scrapy's default cheerioOptions are the following:
{
"normalizeWhitespace": true,
"xmlMode": false,
"lowerCaseTags": false
}
As a reminder: you can always look up Scrapy's defaults in the defaults.json file.
These options are passed directly to request's options.
Some useful options include encoding: 'binary' for old sites without a character encoding declaration (try it if you're getting strange characters), authorization options (HTTP, OAuth, etc.), proxies, SSL, and cookies, among others.
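As a rough sketch, an options object that sets the encoding and a bot-like user-agent might look like the following. The key name `requestOptions` is an assumption made by analogy with `cheerioOptions` and `itemOptions`, the user-agent string is a placeholder, and the position of the options argument in the commented call is also assumed; `encoding` and `headers` are standard request options.

```javascript
// Assumption: request's options live under a `requestOptions` key.
var options = {
  requestOptions: {
    // Read the body as binary for pages that declare no character encoding.
    encoding: 'binary',
    // Placeholder user-agent; pick a real one from a list of bots' user-agents.
    headers: { 'User-Agent': 'ExampleBot/1.0 (+https://example.com/bot)' }
  }
};

// Assumed call shape, with options placed before the callback:
// scrapy.scrape(url, model, options, callback)
console.log(options.requestOptions.headers['User-Agent']);
```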
A callback Function that follows the Node.js error-first callback convention.
function(err, data) {
  if (err) return console.error(err) // Handle error
  console.log(data) // Do something with data
}
Here are some alternative Node.js-based solutions similar to node-scrapy (in order of popularity):
Scrapy is at an early stage, and we would love for you to get involved in its development! Go ahead and open a new issue.
❤ MIT