pyspider

A powerful spider (web crawler) system in Python.

  • Write scripts in Python with a powerful API
  • Supports Python 2 and 3
  • Powerful WebUI with script editor, task monitor, project manager, and result viewer
  • JavaScript pages supported
  • MySQL, MongoDB, or SQLite as the database backend
  • Task priority, retries, periodic crawling, recrawl by age, and more
  • Distributed architecture
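
The "recrawl by age" feature in the list above can be pictured with a small standalone sketch. This is not pyspider's scheduler code. The function name `should_recrawl` and its parameters are hypothetical and only illustrate the idea that a page is fetched again once its last crawl is older than a configured age in seconds.

```python
import time


def should_recrawl(last_crawl_time, age, now=None):
    """Hypothetical helper: decide whether a page is due for another fetch.

    last_crawl_time: Unix timestamp of the previous fetch, or None if never fetched.
    age: how long (in seconds) a fetched page is considered fresh.
    """
    if now is None:
        now = time.time()
    if last_crawl_time is None:
        # Never crawled before, so it is always due.
        return True
    # Due only once strictly more than `age` seconds have passed.
    return now - last_crawl_time > age


TEN_DAYS = 10 * 24 * 60 * 60  # same arithmetic as the age value in the sample code

print(should_recrawl(None, TEN_DAYS, now=100))
print(should_recrawl(0, TEN_DAYS, now=TEN_DAYS + 1))
```

Passing `now` explicitly keeps the check deterministic, which makes it easier to reason about than a version that always reads the clock.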

Documentation: http://docs.pyspider.org/

Sample Code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

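    # Run on_start once a day (24 * 60 minutes).
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Treat a fetched page as fresh for 10 days (age is in seconds).
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every link whose href starts with "http".
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)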

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
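
The selector `a[href^="http"]` used in `index_page` matches anchor tags whose `href` attribute starts with `http` (which also covers `https`). As a rough illustration of that prefix match, here is a standard-library sketch that runs without pyspider. The class `HttpLinkCollector` and the function `http_links` are hypothetical names, and the sketch does not resolve relative URLs.

```python
from html.parser import HTMLParser


class HttpLinkCollector(HTMLParser):
    """Hypothetical helper: collect href values of <a> tags that start with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Same prefix test that the CSS attribute selector ^= expresses.
            if name == "href" and value and value.startswith("http"):
                self.links.append(value)


def http_links(html):
    """Return the matching links in document order."""
    collector = HttpLinkCollector()
    collector.feed(html)
    return collector.links


sample = (
    '<a href="http://example.com/a">A</a>'
    '<a href="/relative">B</a>'
    '<a href="https://example.com/c">C</a>'
)
print(http_links(sample))  # ['http://example.com/a', 'https://example.com/c']
```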

Demo

Installation

Quickstart

Contribute

TODO

v0.3.0 (current)

  • as a package
  • run.py parameters
  • sortable projects list #12
  • PostgreSQL supported via SQLAlchemy (through SQLAlchemy, pyspider also supports Oracle, SQL Server, etc.)
  • benchmarking
  • Python 3 support
  • documentation
  • tutorial
  • PyPI release

v0.4.0

  • local mode, load script from file.
  • works as a framework (all components running in one process, no threads)
  • shell mode like the Scrapy shell
  • a visual scraping interface like Portia

more

License

Licensed under the Apache License, Version 2.0