OpenClassrooms Python Developer Project #2: Use Python Basics for Market Analysis
Tested on Windows 10, Python 3.9.5.
Scrapes books.toscrape.com with BeautifulSoup4 and Requests, exports the data to .csv files and downloads cover images to the "exports" folder.
Implementation of the ETL process:
- Extract relevant and specific data from the source website;
- Transform, filter and clean data;
- Load data into searchable and retrievable files.
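As a rough illustration of the transform and load steps, here is a minimal sketch using only the standard library. It is not the project's code: the field names, the sample records and the assumption that ratings arrive as words ("One" to "Five") are all hypothetical, and the extract step is replaced by hard-coded raw values.

```python
import csv
import io

# Hypothetical raw records, standing in for text extracted from product pages.
RAW_BOOKS = [
    {"title": "  Example Book One ", "price": "£51.77", "rating": "Three"},
    {"title": "Example Book Two", "price": "£9.50", "rating": "Five"},
]

# Assumption: ratings are scraped as words and need mapping to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def transform(raw):
    """Clean one raw record: trim the title, parse the price, map the rating."""
    return {
        "title": raw["title"].strip(),
        "price": float(raw["price"].lstrip("£")),
        "rating": RATING_WORDS.get(raw["rating"], 0),
    }


def load(rows, stream):
    """Write cleaned rows as CSV to any text stream (a file or a StringIO)."""
    writer = csv.DictWriter(stream, fieldnames=["title", "price", "rating"])
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
load([transform(book) for book in RAW_BOOKS], buffer)
csv_text = buffer.getvalue()
```

Writing to a stream rather than a fixed path keeps the load step easy to test; a real run would pass a file opened with `encoding="utf-8"` and `newline=""`.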
This project has been optimised after the end of the OpenClassrooms course. To view the previously delivered version, please check this commit.
Improvements made to this project include:
- Using OOP for the main scraper
- Optimising loops for faster execution time
- Parsing of command line arguments for options:
    - JSON export option
    - Ignore images option
    - One-file export option
- Progress bars (tqdm)
git clone https://github.com/hmignon/P2_mignon_helene.git
cd P2_mignon_helene
python -m venv env
- Activate the environment: source env/bin/activate (macOS and Linux) or env\Scripts\activate (Windows)
pip install -r requirements.txt
To scrape the entirety of books.toscrape.com to .csv files, use the command python main.py.
Use python main.py --help to view all options.
--categories: Scrape one or several categories. This argument takes category names and/or full URLs. For example, the following two commands yield the same results:
python main.py --categories travel
python main.py --categories http://books.toscrape.com/catalogue/category/books/travel_2/index.html
To scrape a selection of categories, add the selected names and/or URLs, separated by a space.
Note: selecting the same category several times (e.g. python main.py --categories travel travel) will only export its data once.
python main.py --categories classics thriller
python main.py --categories http://books.toscrape.com/catalogue/category/books/classics_6/index.html thriller
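One way the name-or-URL handling and the "export once" behaviour could work is sketched below. This is an assumption about the approach, not the project's actual code: the function name `normalise_categories` is hypothetical, and the URL pattern is inferred only from the example URLs above.

```python
import re

# Assumed URL shape, inferred from the examples above:
# .../catalogue/category/books/<name>_<id>/index.html
CATEGORY_URL = re.compile(r"/category/books/([a-z0-9-]+)_\d+/index\.html$")


def normalise_categories(args):
    """Map names and URLs to lowercase category names, dropping repeats.

    The order of first appearance is preserved.
    """
    seen = []
    for arg in args:
        match = CATEGORY_URL.search(arg)
        # A URL yields its embedded name; anything else is treated as a name.
        name = match.group(1) if match else arg.strip().lower()
        if name not in seen:
            seen.append(name)
    return seen
```

With this sketch, `normalise_categories(["travel", "travel"])` and the full travel URL both reduce to a single `"travel"` entry, so each category would be scraped once.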
-c or --csv: Export data to .csv files.
-j or --json: Export data to .json files.
Note: -j and -c can be used together to export to both formats during the same scraping process.
--one-file: Export all data to a single .csv/.json file.
--ignore-covers: Skip cover image downloads.
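For readers curious how such options can be wired up, here is a minimal argparse sketch that accepts the flags listed above. It is an illustration rather than the project's parser: the function names and help strings are hypothetical, and the fallback to CSV when neither format flag is given is an assumption based on plain python main.py exporting .csv files.

```python
import argparse


def build_parser():
    """Build a parser accepting the options documented above."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Scrape books.toscrape.com"
    )
    parser.add_argument(
        "--categories", nargs="+", default=[], metavar="CATEGORY",
        help="category names and/or full URLs; omit to scrape everything",
    )
    parser.add_argument("-c", "--csv", action="store_true",
                        help="export data to .csv files")
    parser.add_argument("-j", "--json", action="store_true",
                        help="export data to .json files")
    parser.add_argument("--one-file", action="store_true",
                        help="export all data to a single file")
    parser.add_argument("--ignore-covers", action="store_true",
                        help="skip cover image downloads")
    return parser


def parse_options(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and apply the CSV fallback."""
    args = build_parser().parse_args(argv)
    # Assumption: with neither -c nor -j, export to CSV by default.
    if not (args.csv or args.json):
        args.csv = True
    return args
```

For example, `parse_options(["--categories", "travel", "-j", "-c"])` would request both formats for the travel category in a single run.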
If you wish to open the exported .csv files in any spreadsheet software (Microsoft Excel, LibreOffice/OpenOffice Calc, Google Sheets...), please make sure to select the following options:
- UTF-8 encoding
- comma , as separator
- double quote " as string delimiter
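The same settings apply when reading an export back in Python. The sketch below uses the standard csv module with a comma separator and double-quote delimiter; the column names, the sample row and the file path in the comment are hypothetical.

```python
import csv
import io


def read_export(stream):
    """Read an exported CSV stream into a list of dicts, one per row."""
    reader = csv.DictReader(stream, delimiter=",", quotechar='"')
    return list(reader)


# With a real file (hypothetical path), open it as UTF-8:
#     with open("exports/books.csv", encoding="utf-8", newline="") as f:
#         rows = read_export(f)

# In-memory sample showing that a quoted field may contain a comma.
sample = io.StringIO('title,price\r\n"Hello, World",£10.00\r\n')
rows = read_export(sample)
```

Opening the file with `newline=""` lets the csv module handle line endings itself, which matters when quoted fields contain line breaks.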