# ScrapySub
ScrapySub is a Python library designed to recursively scrape website content, including subpages. It fetches the visible text from web pages and stores it in a structured format for easy access and analysis. This library is particularly useful for NLP and AI developers who need to gather large amounts of web content for their projects.

## Features
- Recursive Scraping: Automatically follows and scrapes links within the same domain.
- Custom User-Agent: Mimics browser requests to avoid being blocked by websites.
- Error Handling: Retries failed requests and handles common HTTP errors.
- Metadata Storage: Stores additional metadata about the scraped content.
- Politeness: Adds a delay between requests to avoid overwhelming servers (see the sketch below).
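The Custom User-Agent and Politeness features above typically boil down to a `requests` session with browser-like headers plus a short pause between requests. The sketch below shows that general pattern only; the header value and one-second delay are assumptions, not ScrapySub's actual defaults:

```python
import time

import requests

# Illustrative sketch only: header value and delay are assumptions,
# not ScrapySub's actual internals.
session = requests.Session()
session.headers.update({
    # A browser-like User-Agent reduces the chance of being blocked.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})

for url in ["https://example.com/", "https://example.com/about"]:
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(1)  # politeness delay between requests
```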
## Installation

Install ScrapySub using pip:

```bash
pip install scrapysub
```
## Quick Start

Here's a quick example to get you started with ScrapySub:

```python
from scrapysub import ScrapWeb

# Initialize the scraper
scraper = ScrapWeb()

# Start scraping from the given URL
url = "https://myportfolio-five-tau.vercel.app/"
scraper.scrape(url)

# Get all the scraped documents
documents = scraper.get_all_documents()

# Print the content of each document
for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print the first 200 characters
    print()
```
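Because each document exposes `page_content` and a `metadata` dict containing the source URL, the results are easy to persist for later NLP work. Continuing from the quick start above, here is a minimal sketch (the output directory and file-naming scheme are arbitrary choices, not part of ScrapySub):

```python
from pathlib import Path

out_dir = Path("scraped_pages")
out_dir.mkdir(exist_ok=True)

for i, doc in enumerate(documents):
    # One text file per scraped page, with the source URL on the first line.
    out_path = out_dir / f"page_{i:03d}.txt"
    out_path.write_text(f"{doc.metadata['url']}\n\n{doc.page_content}", encoding="utf-8")
```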
## API Reference

### ScrapWeb

- `__init__(self)`: Initializes the scraper with a session and custom headers.
- `fetch_page(self, url)`: Fetches the HTML content of the given URL with retries and error handling.
- `scrape_text(self, html_content)`: Extracts visible text from the HTML content.
- `tag_visible(self, element)`: Helper method to filter out non-visible elements.
- `get_links(self, url, html_content)`: Finds all valid links on the page within the same domain.
- `is_valid_url(self, url, base_url)`: Checks if a URL is valid and belongs to the same domain.
- `scrape(self, url)`: Recursively scrapes the given URL and its subpages.
- `get_all_documents(self)`: Returns all scraped documents.
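Assuming the methods above are usable on their own (only `scrape` and `get_all_documents` appear in the quick start, so treat this as a sketch rather than a guaranteed workflow), a single page could be fetched and cleaned without recursing into subpages:

```python
from scrapysub import ScrapWeb

scraper = ScrapWeb()

# Fetch and clean one page only; assumes fetch_page returns the HTML
# (or something falsy on failure) and scrape_text returns plain text.
html = scraper.fetch_page("https://myportfolio-five-tau.vercel.app/")
if html:
    text = scraper.scrape_text(html)
    print(text[:200])
```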
### Document

- `__init__(self, page_content, **kwargs)`: Stores the text content and metadata of a web page.
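The quick start reads `doc.metadata['url']`, which suggests that keyword arguments passed to `Document` end up in its `metadata`. Under that assumption, a document could also be constructed by hand:

```python
from scrapysub import Document

# Assumes **kwargs are stored as metadata; check the source if in doubt.
doc = Document("Some page text.", url="https://example.com/")
print(doc.page_content)
print(doc.metadata["url"])
```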
## Error Handling

ScrapySub handles common HTTP errors by retrying failed requests with a delay. If a request fails multiple times, it logs the error and continues with the next URL.
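The behaviour described above corresponds to a standard retry-with-delay loop. The sketch below illustrates the general pattern only; the retry count, delay, and logging calls are assumptions rather than ScrapySub's exact internals:

```python
import logging
import time

import requests

def fetch_with_retries(session, url, retries=3, delay=2):
    """Generic retry-with-delay pattern, not ScrapySub's actual code."""
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logging.warning("Attempt %d for %s failed: %s", attempt, url, exc)
            time.sleep(delay)
    logging.error("Giving up on %s after %d attempts", url, retries)
    return None  # the caller can log this and move on to the next URL
```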
## Contributing

Contributions are welcome! Please submit a pull request or open an issue to discuss your ideas.

## License

This project is licensed under the MIT License. See the LICENSE file for details.

## Contact

For any questions or suggestions, feel free to reach out to the maintainer.