Simple Node.js website crawler that uses either ?SHOWXML or the sitemap.xml of a site to crawl through it, looking for blank pages and other bad status codes.
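Each crawled URL is flagged when it comes back with a bad status code or an essentially empty body. The snippet below is a minimal sketch of that kind of per-page check using the request and cheerio dependencies; `checkPage` is a hypothetical helper, not code taken from birddog.js.

```js
// Sketch of a per-page check: flag bad status codes and blank pages.
// checkPage is a hypothetical helper, not part of birddog.js.
const request = require('request');
const cheerio = require('cheerio');

function checkPage(url) {
  request(url, (error, response, body) => {
    if (error) {
      console.log(`ERROR  ${url}  ${error.message}`);
      return;
    }
    if (response.statusCode !== 200) {
      console.log(`BAD STATUS  ${response.statusCode}  ${url}`);
      return;
    }
    const $ = cheerio.load(body);
    if ($('body').text().trim().length === 0) {
      console.log(`BLANK PAGE  ${url}`);
    }
  });
}

checkPage('https://sabreshospitality.com/');
```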
To install the Node package dependencies, run this in the directory where Birddog sits:

```bash
npm install
```
```bash
node birddog.js
```

Runs using the default options in birddog.js.
```bash
node birddog.js --url https://www.mandarinoriental.com
```

This will look for and run against the sitemap.xml file.
```bash
node birddog.js --url https://fontainebleau.com/ --sitemap false
node birddog.js --u https://www.mandarinoriental.com/ --s false
```

These crawl using ?SHOWXML instead of sitemap.xml; the second command uses the short option aliases.
```bash
node birddog.js --d https://fontainebleau.com/fontainebleau-miami-beach-xml-sitemap.xml
```

Use the direct path option in the event that the sitemap XML file isn't named sitemap.xml.
Option | Type | Default | Description |
---|---|---|---|
url | string | https://sabreshospitality.com | The website URL that you would like to crawl. Has alias *-u* |
directpath | string | https://sabreshospitality.com/sitemap.xml | The direct sitemap XML path that you would like to crawl. Only supports the [standard XML sitemap protocol](https://www.sitemaps.org/index.html). Has alias *-d* |
sitemap | boolean | true | If true, uses sitemap.xml; if false, uses ?SHOWXML for CDE sites. Has alias *-s* |
maxConnections | integer | 10 | Crawler.js option: Size of the worker pool. Has alias *-m* |
retries | integer | 3 | Crawler.js option: Number of retries if the request fails. Has alias *-r* |
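The flags above are the kind of thing minimist handles. The sketch below shows one plausible way to wire up the same defaults and aliases; it is an illustration, not the exact parsing code in birddog.js.

```js
// Sketch: parsing the command-line options with minimist.
// Defaults and aliases mirror the table above; the real wiring in birddog.js may differ.
const minimist = require('minimist');

const argv = minimist(process.argv.slice(2), {
  alias: { u: 'url', d: 'directpath', s: 'sitemap', m: 'maxConnections', r: 'retries' },
  default: {
    url: 'https://sabreshospitality.com',
    directpath: 'https://sabreshospitality.com/sitemap.xml',
    sitemap: true,
    maxConnections: 10,
    retries: 3
  }
});

// "--sitemap false" arrives as the string "false", so normalize it to a boolean.
const useSitemap = String(argv.sitemap) !== 'false';

console.log(argv.url, argv.directpath, useSitemap, argv.maxConnections, argv.retries);
```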
- cheerio v1.0.0-rc.2 -- Tiny, fast, and elegant implementation of core jQuery designed specifically for the server
- cli-spinner v0.2.8 -- Spinners for use in the terminal
- crawler v1.1.2 -- Crawler is a web spider written with Nodejs.
- minimist v0.0.8 -- Parse argument options
- request v2.83.0 -- Simplified HTTP request client.
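As a rough illustration of how these pieces can fit together (not Birddog's actual code), the sketch below queues URLs through crawler's worker pool using the maxConnections and retries options and inspects each response with the bundled cheerio instance.

```js
// Sketch: feeding URLs into the crawler worker pool and checking each response.
const Crawler = require('crawler');

const c = new Crawler({
  maxConnections: 10, // size of the worker pool (--maxConnections / -m)
  retries: 3,         // retries on failed requests (--retries / -r)
  callback: (error, res, done) => {
    if (error) {
      console.log(`ERROR  ${error.message}`);
    } else if (res.statusCode !== 200) {
      console.log(`BAD STATUS  ${res.statusCode}  ${res.options.uri}`);
    } else if (res.$('body').text().trim().length === 0) {
      console.log(`BLANK PAGE  ${res.options.uri}`);
    }
    done();
  }
});

// URLs would normally come from the parsed sitemap.
c.queue(['https://sabreshospitality.com/', 'https://sabreshospitality.com/about/']);
```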
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. More info at [sitemaps.org](https://www.sitemaps.org/index.html).
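Pulling page URLs out of a standard sitemap boils down to reading its `<loc>` elements. The sketch below does that with request and cheerio in XML mode; again, it is an illustration rather than Birddog's exact implementation.

```js
// Sketch: extracting <loc> URLs from a standard XML sitemap.
const request = require('request');
const cheerio = require('cheerio');

request('https://sabreshospitality.com/sitemap.xml', (error, response, body) => {
  if (error || response.statusCode !== 200) {
    console.log('Could not fetch the sitemap');
    return;
  }
  const $ = cheerio.load(body, { xmlMode: true });
  const urls = $('loc').map((i, el) => $(el).text().trim()).get();
  console.log(`Found ${urls.length} URLs`, urls.slice(0, 5));
});
```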