Rust Web Crawler

A simple, asynchronous web crawler written in Rust. This project demonstrates core web crawling functionality, including fetching web pages, parsing links, handling URL normalization, and storing crawled data. The crawler can be configured to start from any URL and is capable of limiting the crawl to a specific domain or depth.
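
The snippet below is a minimal, illustrative sketch of the fetch-and-parse step described above, not the project's actual code. It assumes reqwest, scraper, url, and tokio as dependencies, and the function name fetch_links is hypothetical.

    use scraper::{Html, Selector};
    use url::Url;

    // Fetch one page and return the normalized, absolute links it contains.
    async fn fetch_links(page_url: &Url) -> Result<Vec<Url>, Box<dyn std::error::Error>> {
        let body = reqwest::get(page_url.as_str()).await?.text().await?;

        // Parse the HTML and collect every <a href="..."> attribute.
        let document = Html::parse_document(&body);
        let selector = Selector::parse("a[href]").unwrap();

        let mut links = Vec::new();
        for element in document.select(&selector) {
            if let Some(href) = element.value().attr("href") {
                // Url::join resolves relative links against the current page.
                if let Ok(mut absolute) = page_url.join(href) {
                    absolute.set_fragment(None); // drop #fragments to avoid duplicates
                    links.push(absolute);
                }
            }
        }
        Ok(links)
    }

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let start = Url::parse("https://example.com")?;
        for link in fetch_links(&start).await? {
            println!("{link}");
        }
        Ok(())
    }

A full crawler would additionally track visited URLs and enforce the depth, page, and domain limits before enqueueing each discovered link.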

Features

  • Asynchronous Crawling: Efficiently fetches multiple pages in parallel.
  • Configurable Depth and Page Limits: Control how deep the crawler goes and the maximum number of pages to crawl.
  • Domain Restriction: Optionally restricts crawling to URLs within the same domain as the start URL.
  • Data Storage: Saves crawled data to both CSV and SQLite for easy data analysis.
  • URL Validation and Normalization: Handles URL processing to avoid duplicate or invalid links.
  • Visualization: Provides a tool to view crawled data directly from the SQLite database.

Installation

Prerequisites

  • Rust: Install Rust by following the instructions at rust-lang.org.
  • Dependencies: Run cargo build to fetch and compile the required crates, including reqwest, scraper, csv, and rusqlite.

Setting Up

  1. Clone this repository:

    git clone https://github.com/zazabap/web_crawler.git
    cd web_crawler
  2. Build the project:

    cargo build --release

Usage

Run the crawler from the command line, providing the necessary arguments:

cargo run -- --start-url "https://example.com" --depth-limit 3 --max-pages 100 --same-domain

Arguments (see the parsing sketch after this list):

  • --start-url: (Required) The starting URL for the crawler.
  • --depth-limit: (Optional) Maximum depth of the crawl. Defaults to 3.
  • --max-pages: (Optional) Maximum number of pages to crawl.
  • --same-domain: (Optional) Restrict crawling to the starting domain.
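
As an illustration only, the struct below shows how these flags could be modeled with the clap crate. clap is not listed among the project's dependencies, so treat the derive-based parsing (and the Args struct name) as an assumption rather than the project's actual approach.

    use clap::Parser;

    /// Hypothetical mirror of the command-line options described above.
    #[derive(Parser, Debug)]
    struct Args {
        /// (Required) The starting URL for the crawler.
        #[arg(long)]
        start_url: String,

        /// (Optional) Maximum depth of the crawl. Defaults to 3.
        #[arg(long, default_value_t = 3)]
        depth_limit: usize,

        /// (Optional) Maximum number of pages to crawl.
        #[arg(long)]
        max_pages: Option<usize>,

        /// (Optional) Restrict crawling to the starting domain.
        #[arg(long)]
        same_domain: bool,
    }

    fn main() {
        let args = Args::parse();
        println!("{args:?}");
    }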

Visualizing the Output

Viewing Data in SQLite

  1. Install a tool like DB Browser for SQLite.

  2. Or use the SQLite CLI:

    sqlite3 output.db

    Query data:

    SELECT * FROM crawled_data;

Viewing Data in CSV

  1. Open output.csv in a spreadsheet application like Microsoft Excel or LibreOffice Calc.
  2. Analyze the rows of URLs and their corresponding HTML content (see the sketch after this list for reading the file in code).
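
If you prefer to process the CSV in code rather than a spreadsheet, a small sketch with the csv crate (already a project dependency) could look like the following. The column order is an assumption and should be adjusted to the actual layout of output.csv.

    use std::error::Error;

    fn main() -> Result<(), Box<dyn Error>> {
        // Assumed column layout: first the URL, then the HTML content.
        let mut reader = csv::Reader::from_path("output.csv")?;
        for result in reader.records() {
            let record = result?;
            let url = record.get(0).unwrap_or("");
            let html = record.get(1).unwrap_or("");
            println!("{url}: {} bytes of HTML", html.len());
        }
        Ok(())
    }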

Developer Tool: Data Visualization

Use the included visualization tool to display crawled data in a tabular format in the terminal (a sketch of the underlying query appears after the example output):

  1. Build and run the visualization binary:

    cargo run --bin visualize
  2. Output example:

    ID    URL                                              HTML Content (truncated)
    --------------------------------------------------------------------------------
    1     http://example.com                               <html>Example</html>
    2     http://example.com/page2                         <html>Page 2</html>
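
For reference, a stripped-down version of this kind of query could be written with rusqlite (a listed dependency) as below. The column names id, url, and html are assumptions and may differ from the actual schema of crawled_data.

    use rusqlite::Connection;

    fn main() -> rusqlite::Result<()> {
        let conn = Connection::open("output.db")?;

        // Assumed column names; adjust to the real crawled_data schema.
        let mut stmt = conn.prepare("SELECT id, url, html FROM crawled_data")?;
        let rows = stmt.query_map([], |row| {
            Ok((
                row.get::<_, i64>(0)?,
                row.get::<_, String>(1)?,
                row.get::<_, String>(2)?,
            ))
        })?;

        println!("{:<5} {:<48} {}", "ID", "URL", "HTML Content (truncated)");
        println!("{}", "-".repeat(80));
        for row in rows {
            let (id, url, html) = row?;
            // Truncate the HTML so each record fits on one terminal line.
            let preview: String = html.chars().take(30).collect();
            println!("{id:<5} {url:<48} {preview}");
        }
        Ok(())
    }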
    

Contributions

Contributions are welcome! Feel free to submit issues, feature requests, or pull requests to improve this project.

About

A web crawler written in Rust as a casual project, intended to gather public data for training LLMs as a service.
