A simple, asynchronous web crawler written in Rust. This project demonstrates core web crawling functionality, including fetching web pages, parsing links, handling URL normalization, and storing crawled data. The crawler can be configured to start from any URL and to limit the crawl to a specific domain or depth.
- Asynchronous Crawling: Efficiently fetches multiple pages in parallel.
- Configurable Depth and Page Limits: Control how deep the crawler goes and the maximum number of pages to crawl.
- Domain Restriction: Optionally restricts crawling to URLs within the same domain as the start URL.
- Data Storage: Saves crawled data to both CSV and SQLite for easy data analysis.
- URL Validation and Normalization: Normalizes and validates links to avoid duplicate or invalid URLs (a minimal sketch of this step follows the list).
- Visualization: Provides a tool to view crawled data directly from the SQLite database.
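The fetch-parse-normalize step at the heart of these features can be pictured with a minimal sketch, assuming the `reqwest`, `scraper`, and `url` crates plus a `tokio` runtime; the function name and structure are illustrative, not this project's actual API:

```rust
use scraper::{Html, Selector};
use url::Url;

/// Fetch one page and return the normalized links found on it (sketch).
async fn crawl_once(page_url: &Url) -> Result<Vec<Url>, Box<dyn std::error::Error>> {
    // Fetch the page body asynchronously.
    let body = reqwest::get(page_url.as_str()).await?.text().await?;

    // Parse the HTML and select every <a> element that has an href attribute.
    let document = Html::parse_document(&body);
    let selector = Selector::parse("a[href]").unwrap();

    let mut links = Vec::new();
    for element in document.select(&selector) {
        if let Some(href) = element.value().attr("href") {
            // Joining against the current page URL resolves relative links;
            // anything that fails to parse as a valid URL is skipped.
            if let Ok(absolute) = page_url.join(href) {
                links.push(absolute);
            }
        }
    }
    Ok(links)
}
```

Deduplication (e.g. a `HashSet<Url>` of visited pages) and the depth and page-count limits would live in the loop that drives this function.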
- Rust: Install Rust by following the instructions at rust-lang.org.
- Dependencies: Run `cargo build` to fetch and build the necessary dependencies, including `reqwest`, `scraper`, `csv`, and `rusqlite` (a sample `Cargo.toml` snippet follows this list).
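For reference, the dependency section of `Cargo.toml` might look like the snippet below. The crate names come from this README, but the version numbers and the `tokio` runtime are assumptions; check the project's actual `Cargo.toml`:

```toml
[dependencies]
# Hypothetical versions; pin to whatever the project actually uses.
tokio = { version = "1", features = ["full"] }   # async runtime (assumed)
reqwest = "0.11"
scraper = "0.17"
csv = "1"
rusqlite = { version = "0.29", features = ["bundled"] }
```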
- Clone this repository:

  ```bash
  git clone <repository-url>
  cd rust-web-crawler
  ```
- Build the project:

  ```bash
  cargo build --release
  ```
Run the crawler from the command line, providing the necessary arguments:

```bash
cargo run -- --start-url "https://example.com" --depth-limit 3 --max-pages 100 --same-domain
```
- `--start-url`: (Required) The starting URL for the crawler.
- `--depth-limit`: (Optional) Maximum depth of the crawl. Defaults to `3`.
- `--max-pages`: (Optional) Maximum number of pages to crawl.
- `--same-domain`: (Optional) Restrict crawling to the starting domain (a hypothetical declaration of these flags follows this list).
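As an illustration of how these flags could be declared, here is a sketch using the `clap` derive API. `clap` is not listed among this project's dependencies, so treat this as a hypothetical stand-in for the actual argument parser:

```rust
// Requires clap 4 with the "derive" feature enabled.
use clap::Parser;

/// Hypothetical argument definition mirroring the flags documented above.
#[derive(Parser, Debug)]
struct Args {
    /// (Required) The starting URL for the crawler.
    #[arg(long)]
    start_url: String,

    /// (Optional) Maximum depth of the crawl.
    #[arg(long, default_value_t = 3)]
    depth_limit: u32,

    /// (Optional) Maximum number of pages to crawl.
    #[arg(long)]
    max_pages: Option<u32>,

    /// (Optional) Restrict crawling to the starting domain.
    #[arg(long)]
    same_domain: bool,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```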
- Install a tool like DB Browser for SQLite:
  - Download [DB Browser for SQLite](https://sqlitebrowser.org/).
  - Open the `output.db` file to browse the crawled data.
- Or use the SQLite CLI (a `rusqlite` example follows this list):

  ```bash
  sqlite3 output.db
  ```

  Then query the data:

  ```sql
  SELECT * FROM crawled_data;
  ```
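The same table can also be read programmatically with `rusqlite`. A minimal sketch, assuming columns named `id` and `url` exist in `crawled_data` (the actual schema may differ):

```rust
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    // Open the database the crawler produced.
    let conn = Connection::open("output.db")?;
    let mut stmt = conn.prepare("SELECT id, url FROM crawled_data")?;

    // Map each row to an (id, url) pair and print it.
    let rows = stmt.query_map([], |row| {
        Ok((row.get::<_, i64>(0)?, row.get::<_, String>(1)?))
    })?;
    for row in rows {
        let (id, url) = row?;
        println!("{id}\t{url}");
    }
    Ok(())
}
```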
- Open `output.csv` in a spreadsheet application like Microsoft Excel or LibreOffice Calc.
- Analyze the rows of URLs and their corresponding HTML content (a programmatic example using the `csv` crate follows this list).
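The CSV output can also be processed with the `csv` crate. A minimal sketch, assuming the first column holds the URL and the second the page's HTML (verify against the actual file layout):

```rust
use csv::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // `Reader::from_path` treats the first row as a header row by default.
    let mut reader = Reader::from_path("output.csv")?;
    for result in reader.records() {
        let record = result?;
        // Assumed layout: column 0 = URL, column 1 = HTML content.
        let url = record.get(0).unwrap_or("");
        let html = record.get(1).unwrap_or("");
        println!("{url} -> {} bytes of HTML", html.len());
    }
    Ok(())
}
```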
Use the included visualization tool to display crawled data in tabular form in the terminal:
- Build and run the visualization binary:

  ```bash
  cargo run --bin visualize
  ```
- Output example:

  ```text
  ID   URL                        HTML Content (truncated)
  --------------------------------------------------------------------------------
  1    http://example.com         <html>Example</html>
  2    http://example.com/page2   <html>Page 2</html>
  ```
Contributions are welcome! Feel free to submit issues, feature requests, or pull requests to improve this project.