# ScrapySub
ScrapySub is a Python library designed to recursively scrape website content, including subpages. It fetches the visible text from web pages and stores it in a structured format for easy access and analysis. This library is particularly useful for NLP and AI developers who need to gather large amounts of web content for their projects.

## Features
- Recursive Scraping: Automatically follows and scrapes links within the same domain.
- Custom User-Agent: Mimics browser requests to avoid being blocked by websites.
- Error Handling: Retries failed requests and handles common HTTP errors.
- Metadata Storage: Stores additional metadata about the scraped content.
- Politeness: Adds a delay between requests to avoid overwhelming servers (see the sketch below).
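The Custom User-Agent and Politeness features above typically boil down to a `requests` session with browser-like headers plus a short pause between requests. The sketch below shows that general pattern only; the header value and one-second delay are assumptions, not ScrapySub's actual defaults:

```python
import time

import requests

# Illustrative sketch only: header value and delay are assumptions,
# not ScrapySub's actual internals.
session = requests.Session()
session.headers.update({
    # A browser-like User-Agent reduces the chance of being blocked.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})

for url in ["https://example.com/", "https://example.com/about"]:
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(1)  # politeness delay between requests
```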
## Installation

Install ScrapySub using pip:

```bash
pip install scrapysub
```
## Quick Start

Here's a quick example to get you started with ScrapySub:

```python
from scrapysub import ScrapWeb

# Initialize the scraper
scraper = ScrapWeb()

# Start scraping from the given URL
url = "https://myportfolio-five-tau.vercel.app/"
scraper.scrape(url)

# Get all the scraped documents
documents = scraper.get_all_documents()

# Print the content of each document
for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print the first 200 characters
    print()
```
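Because each document exposes `page_content` and a `metadata` dict containing the source URL, the results are easy to persist for later NLP work. Continuing from the quick start above, here is a minimal sketch (the output directory and file-naming scheme are arbitrary choices, not part of ScrapySub):

```python
from pathlib import Path

out_dir = Path("scraped_pages")
out_dir.mkdir(exist_ok=True)

for i, doc in enumerate(documents):
    # One text file per scraped page, with the source URL on the first line.
    out_path = out_dir / f"page_{i:03d}.txt"
    out_path.write_text(f"{doc.metadata['url']}\n\n{doc.page_content}", encoding="utf-8")
```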
## API Reference

### ScrapWeb

- `__init__(self)`: Initializes the scraper with a session and custom headers.
- `fetch_page(self, url)`: Fetches the HTML content of the given URL with retries and error handling.
- `scrape_text(self, html_content)`: Extracts visible text from the HTML content.
- `tag_visible(self, element)`: Helper method to filter out non-visible elements.
- `get_links(self, url, html_content)`: Finds all valid links on the page within the same domain.
- `is_valid_url(self, url, base_url)`: Checks if a URL is valid and belongs to the same domain.
- `scrape(self, url)`: Recursively scrapes the given URL and its subpages.
- `get_all_documents(self)`: Returns all scraped documents.
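Assuming the methods above are usable on their own (only `scrape` and `get_all_documents` appear in the quick start, so treat this as a sketch rather than a guaranteed workflow), a single page could be fetched and cleaned without recursing into subpages:

```python
from scrapysub import ScrapWeb

scraper = ScrapWeb()

# Fetch and clean one page only; assumes fetch_page returns the HTML
# (or something falsy on failure) and scrape_text returns plain text.
html = scraper.fetch_page("https://myportfolio-five-tau.vercel.app/")
if html:
    text = scraper.scrape_text(html)
    print(text[:200])
```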
### Document

- `__init__(self, page_content, **kwargs)`: Stores the text content and metadata of a web page.
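The quick start reads `doc.metadata['url']`, which suggests that keyword arguments passed to `Document` end up in its `metadata`. Under that assumption, a document could also be constructed by hand:

```python
from scrapysub import Document

# Assumes **kwargs are stored as metadata; check the source if in doubt.
doc = Document("Some page text.", url="https://example.com/")
print(doc.page_content)
print(doc.metadata["url"])
```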
## Error Handling

ScrapySub handles common HTTP errors by retrying failed requests with a delay. If a request fails multiple times, it logs the error and continues with the next URL.
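The behaviour described above corresponds to a standard retry-with-delay loop. The sketch below illustrates the general pattern only; the retry count, delay, and logging calls are assumptions rather than ScrapySub's exact internals:

```python
import logging
import time

import requests

def fetch_with_retries(session, url, retries=3, delay=2):
    """Generic retry-with-delay pattern, not ScrapySub's actual code."""
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logging.warning("Attempt %d for %s failed: %s", attempt, url, exc)
            time.sleep(delay)
    logging.error("Giving up on %s after %d attempts", url, retries)
    return None  # the caller can log this and move on to the next URL
```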
## Contributing

Contributions are welcome! Please submit a pull request or open an issue to discuss your ideas.

## License

This project is licensed under the MIT License. See the LICENSE file for details.

## Contact

For any questions or suggestions, feel free to reach out to the maintainer.