dualscraper

Scrapes lat/long data from postcode.my, trying the Wayback Machine first, due to CloudFlare throttling.

The purpose of this repo is to serve as a template and MVP demonstration for scraping sites that employ CloudFlare throttling (i.e. that present a captcha when pages are requested too quickly). To speed up the process, the scraper tries the Wayback Machine first and falls back to the live site only if the page in question isn't archived there, or if the data (in this case, lat/long data) is missing from the archived copy or looks definitely wrong.

While it took around four months to scrape all 56,894 URLs on the live site, the same pages took only a few days via the Wayback Machine.
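
The core of the approach is a two-step fetch: check the Wayback Machine's availability API, and only hit the live site when no usable snapshot exists. Below is a minimal sketch of that pattern, not the repo's actual code; the function name and the requests-based fetching are illustrative, and the real script also validates the scraped lat/long values before deciding to fall back.

import requests

WAYBACK_API = "https://archive.org/wayback/available"

def fetch_page(url):
    """Try the Wayback Machine first; fall back to the live site."""
    # Ask the Wayback Machine whether it has a snapshot of this URL.
    resp = requests.get(WAYBACK_API, params={"url": url}, timeout=30)
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")

    if snapshot and snapshot.get("available"):
        archived = requests.get(snapshot["url"], timeout=30)
        if archived.ok:
            return archived.text, "wayback"

    # No usable snapshot: fetch from the live site (slowly, to avoid the captcha).
    live = requests.get(url, timeout=30)
    live.raise_for_status()
    return live.text, "live"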

The lat/long data

Here is the lat/long data, so you don't have to run the scraper yourself:

.csv file

Latest (as of March 2, 2024)

https://github.com/Flurrywinde/dualscraper/raw/main/postcode.my/postcode-my.csv

Old (has out-of-date data from the Wayback Machine)

https://github.com/Flurrywinde/dualscraper/raw/main/postcode.my/postcode-my.csv.old

SQLite database

https://github.com/Flurrywinde/dualscraper/raw/main/postcode.my/postcode-my.db

(this database also has a column containing the page's URL)
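
If you only want to use the published data, the Python standard library is enough to get started. The sketch below simply inspects the files rather than assuming column names, since those aren't documented above; it assumes both files have been downloaded to the current directory.

import csv
import sqlite3

# Peek at the .csv file: print the header row and the first data row.
with open("postcode-my.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    print(next(reader))  # column names
    print(next(reader))  # first record

# Peek at the SQLite database: list its tables and their columns.
con = sqlite3.connect("postcode-my.db")
for (table,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    cols = [row[1] for row in con.execute(f"PRAGMA table_info({table})")]
    print(table, cols)
con.close()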

Installation

$ mkdir /path/to/project
$ cd /path/to/project
$ git clone git@github.com:Flurrywinde/dualscraper.git .
$ wget 'https://postcode.my/xml/listing_part1.xml.gz'
$ wget 'https://postcode.my/xml/listing_part2.xml.gz'
$ gunzip listing_part1.xml.gz
$ gunzip listing_part2.xml.gz
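
The two listing_part*.xml files are the site's sitemaps, and they are where the list of pages to scrape comes from. Here is a rough sketch of pulling the URLs out of them, assuming the standard <urlset>/<url>/<loc> sitemap layout:

import xml.etree.ElementTree as ET

# Standard sitemap namespace (assumed; adjust if the files differ).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

urls = []
for path in ("listing_part1.xml", "listing_part2.xml"):
    tree = ET.parse(path)
    urls.extend(loc.text.strip() for loc in tree.findall(".//sm:loc", NS))

print(f"{len(urls)} URLs to scrape")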

Usage

Initial run

$ cd /path/to/project
$ python dualscraper.py

Output is written to output.csv. (This file is erased at the start of each run; see below.)

Since this will scrape more than 50,000 web pages with significant delays between requests, a full run can take a long time. You can hit Ctrl-C to abort at any time without losing data. (See below.)
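
The no-data-loss behaviour is consistent with each row being flushed to output.csv as soon as its page is scraped, so an interrupt loses at most the page currently in flight. A sketch of that general pattern follows; the delay range and the scrape() helper are illustrative, not taken from dualscraper.py.

import csv
import random
import time

def scrape(url):
    """Placeholder for the real per-page scrape; returns one row of fields."""
    raise NotImplementedError

def run(urls, out_path="output.csv"):
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        try:
            for url in urls:
                writer.writerow(scrape(url))
                f.flush()  # the row is on disk before the next request starts
                time.sleep(random.uniform(5, 15))  # polite, jittered delay (illustrative)
        except KeyboardInterrupt:
            print("Interrupted; everything scraped so far is in", out_path)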

Picking up where you left off

The script can be stopped and restarted, and it will pick up where it left off as long as the harvest utility is run in between.

harvest appends output.csv to allsofar.csv. (It also copies output.csv to a file such as 0-11.csv containing only the current run's harvest.)

(Internally, the files startat.txt and laststartat.txt are used to track where you are. Don't mess with them unless you know what you are doing.)
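
Conceptually, the harvest step amounts to something like the following sketch. It is not the real utility, and in particular the idea that the per-run filename (e.g. 0-11.csv) comes from laststartat.txt and startat.txt is an assumption.

import shutil
from pathlib import Path

output = Path("output.csv")

# Append this run's rows to the cumulative file.
with Path("allsofar.csv").open("a", encoding="utf-8") as dst, \
        output.open(encoding="utf-8") as src:
    shutil.copyfileobj(src, dst)

# Keep a per-run copy named after the range it covers
# (assumed here to come from laststartat.txt / startat.txt).
start = Path("laststartat.txt").read_text().strip()
end = Path("startat.txt").read_text().strip()
shutil.copy(output, f"{start}-{end}.csv")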

Slow Non-Wayback Machine Mode

Since a page might have been updated after the Wayback Machine crawled it, once all data (for the URLs listed in the site's sitemap .xml files) has been obtained, use non-wayback mode to slowly re-scrape the live site only.

At the top of dualscraper.py, change trywayback to False to scrape from the live site only. (TODO: add command-line parameters for things like this. Also, a config file.)
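
That is, something like this near the top of dualscraper.py (the exact line may differ):

# Scrape the live site only; skip the Wayback Machine entirely.
trywayback = False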

Running in non-wayback mode creates an SQLite database (located in ./postcode.my, which is created if necessary) populated with all the data from allsofar.csv. Updates from the re-scrape affect this database only, but a .csv file can be generated from it (see below).

Output a .csv file from the SQLite data

To output a sorted .csv file from the current database, run db2csv. This .csv file will be in ./postcode.my under the filename postcode-my.csv.
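
For reference, the export db2csv performs amounts to roughly the following sketch. The table name and sort order aren't documented above, so the sketch discovers the table at runtime and sorts by the first column; the real script may differ.

import csv
import sqlite3

con = sqlite3.connect("postcode.my/postcode-my.db")

# Discover the data table rather than assuming its name (assumes a single table).
(table,) = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' LIMIT 1"
).fetchone()

cur = con.execute(f"SELECT * FROM {table} ORDER BY 1")
with open("postcode.my/postcode-my.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(col[0] for col in cur.description)  # header row
    writer.writerows(cur)
con.close()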

Errors found on the postcode.my site

While scraping, I found some errors and inconsistencies in postcode.my's data. (These don't affect the accuracy of the current dataset.)

See: errors.md

TODO

  • Use pipreqs to generate a requirements.txt file and add a Dependencies section to this readme.
    • Note: dunst is an optional dependency, used to alert the user when a captcha appears (see the sketch below).
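
On Linux, that kind of alert can be sent with notify-send, which dunst then displays. A sketch, not necessarily how dualscraper.py does it:

import subprocess

def alert_captcha(url):
    # notify-send posts a desktop notification; dunst (or any other
    # notification daemon) is what actually shows it on screen.
    subprocess.run(
        ["notify-send", "-u", "critical", "dualscraper",
         f"Captcha hit on {url} -- solve it in the browser to continue"],
        check=False,
    )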
