Python-Webscraper

Python web scraping code for different kinds of websites, including country-restricted and JavaScript-enabled sites.

This is a dynamic web scraper that crawls data from four different websites: a regular website, a dynamic website, a website with country restrictions, and one with anti-bot HTML in its code.

I made this as a requirement of an interview assignment, so it may not be fully developed yet, but I do intend to add to it in the future.

This is just a test program; always respect robots.txt. My XPaths could also stop working at any moment, and I am trying to figure out a way to improve on that.
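As a rough illustration of respecting robots.txt, the check can be done with the standard library before any page is requested. This is only a sketch: the `is_allowed` helper, the user agent string, and the sample rules are made up for the example and are not part of the scripts in this repository.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used only to show the call
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/odds"))
    print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/x"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing a string.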

Added a thread pool for faster execution, and a timer function to run it every 5 minutes.
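One way to combine a thread pool with a five-minute timer is sketched below. It is not the code in main.py: `run_all`, `run_periodically`, and their parameters are hypothetical names, and each scraper is assumed to be a plain callable that takes no arguments and returns its results.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_all(scrapers, max_workers=4):
    """Run every scraper callable in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda scrape: scrape(), scrapers))


def run_periodically(scrapers, handle, interval_seconds=300, max_runs=None,
                     sleep=time.sleep):
    """Call run_all every interval_seconds and pass each result list to handle.

    max_runs=None loops until interrupted; sleep is injectable so the loop
    can be exercised without actually waiting.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        handle(run_all(scrapers))
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)


if __name__ == "__main__":
    # Two stand-in scrapers; real ones would fetch and parse a page.
    run_periodically([lambda: "site A", lambda: "site B"], print, max_runs=1)
```

The default of 300 seconds matches the five-minute schedule. Threads suit this job because the scrapers spend most of their time waiting on the network.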

Running the script

Uses Python 3.6.
You will need Selenium WebDriver (install using pip) and geckodriver.
Before running the script, edit the config.ini file and add a proxy IP and port. You can get one here: https://free-proxy-list.net/uk-proxy.html. If the connection is slow, increase the wait time.
Also in the config file, change the path to the results folder (if using Windows, give the absolute path).
And of course, you need Mozilla Firefox installed on your PC.
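To show how such settings might be read, here is a small sketch using `configparser`. The section names, key names, and values are assumptions made for the example (the real config.ini may be laid out differently), and the IP address is a documentation placeholder rather than a working proxy.

```python
import configparser

# Hypothetical layout; check the actual config.ini for the real section and key names
SAMPLE_CONFIG = """\
[proxy]
ip = 203.0.113.10
port = 8080
wait_seconds = 30

[output]
results_dir = C:/Users/me/Python-Webscraper/results
"""


def load_settings(text):
    """Parse config text and return the values the scrapers would need."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "proxy_ip": parser.get("proxy", "ip"),
        "proxy_port": parser.getint("proxy", "port"),
        # Fall back to 10 seconds if no wait time is configured
        "wait_seconds": parser.getint("proxy", "wait_seconds", fallback=10),
        "results_dir": parser.get("output", "results_dir"),
    }


if __name__ == "__main__":
    print(load_settings(SAMPLE_CONFIG))
```

To load the file from disk, `parser.read("config.ini")` replaces `read_string`.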

What it does

This is just information for the evaluation. This information, and the XPaths, will of course change soon.

The resulting matrix is printed on screen and also saved as a csv in the results folder.
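Saving a result matrix as CSV could look roughly like the following. The column names and the sample row are invented for illustration, and `write_matrix` is a hypothetical helper, not a function from Utilities.py.

```python
import csv
import io


def write_matrix(rows, header, stream):
    """Write a header row followed by the data rows to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(header)
    writer.writerows(rows)


if __name__ == "__main__":
    header = ["site", "event", "odds"]               # hypothetical columns
    rows = [["WilliamHill", "Example event", "2.50"]]  # made-up sample row

    # Print to screen via an in-memory buffer
    buffer = io.StringIO()
    write_matrix(rows, header, buffer)
    print(buffer.getvalue())

    # Save the same data to a file; newline="" avoids blank lines on Windows
    with open("results.csv", "w", newline="") as handle:
        write_matrix(rows, header, handle)
```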

The scripts crawl four websites.

WilliamHill

This website can be scraped without Selenium, using the requests library in Python. Straightforward.
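A requests-only scrape of a static page might be sketched as below. Note the swap: the scripts here locate data with XPaths, while this example uses the standard library's `HTMLParser` and a CSS class so that it runs without extra parsing packages. The class name `odds`, the helper names, and the sample HTML are all assumptions.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.texts = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.css_class in classes:
            self._depth += 1
            if self._depth == 1:
                self.texts.append("")  # start a new text entry

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data


def extract_texts(html, css_class):
    """Return the stripped text of each element with the given class."""
    extractor = ClassTextExtractor(css_class)
    extractor.feed(html)
    return [text.strip() for text in extractor.texts]


def fetch_html(url, timeout=10):
    """Download a page with requests (third-party: pip install requests)."""
    import requests  # imported here so the parsing code works without it
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `extract_texts(fetch_html(url), "odds")` ties the two together for a real page.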

PaddyPower

This website uses country restrictions. The actual scrape was quite easy, but I needed to configure Firefox to use a separate profile along with a proxy. I think they also track the original IP of the request, as I keep getting blocked after some time, and I still can't get it to run again after the initial success. Refer: https://stackoverflow.com/questions/50320915/how-to-access-country-restricted-website-through-proxy-selenium-in-python
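A minimal sketch of pointing Firefox at a proxy through Selenium follows. It sets the proxy preferences on `FirefoxOptions` rather than on a separate profile, so it is a variation on the approach described above and not the code in PaddyPower.py. The function names are hypothetical, and the IP and port would come from config.ini.

```python
def proxy_preferences(ip, port):
    """Firefox preferences that route HTTP and HTTPS traffic through a manual proxy."""
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": ip,
        "network.proxy.http_port": int(port),
        "network.proxy.ssl": ip,
        "network.proxy.ssl_port": int(port),
    }


def make_proxied_firefox(ip, port):
    """Start Firefox with the proxy applied (needs selenium and geckodriver)."""
    from selenium import webdriver  # third-party: pip install selenium

    options = webdriver.FirefoxOptions()
    for name, value in proxy_preferences(ip, port).items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

A proxy only changes the address the site sees. It does not stop the site from blocking that address later, which may be what happens after the first successful run.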

SkyBet

This is a JavaScript-enabled website, which connects fine with the default Selenium WebDriver. However, the scraping had to be done using longer XPaths, as there were no identifiable classes or ids.

Bet365

For this website, I used a Firefox driver with a proxy enabled.

What I am working on

Trying to access all these websites starting from the homepage and then navigating my way to where I want to go. Some implementation is done on SkyBet, but not from the homepage. I also need to add comments.

I think ideally I should be trying to scrape all the websites from one class by using text processing and multiple threads. I did not implement threads for the individual scraping component so far. I'll try to do this soon.

