
Disclaimer

Not for use in production.

This is just a prototype (proof of concept) of a focused web crawler that uses text classification algorithms.

Focer

Focer is a focused web crawler based on StormCrawler. It was developed as part of the Bachelor's thesis "Development of focused web-crawler" by Mārtiņš Trubiņš at Riga Technical University.

Requirements

  • Java 11
  • Apache Storm cluster (the crawler can also be run in -local mode without Storm); tested with version 1.2.3
  • Apache Solr; tested with version 8.5.1

Limitations

As mentioned above, this is just a prototype, so the following limitations apply:

  • Only the Latvian language is supported
  • Only the HTML MIME type is supported

Classifiers

Trained Weka models are used as classifiers. The crawler uses two classifiers: a primary (binary) one and a secondary (multi-class) one.

  1. The primary (binary) classifier is used to detect outliers.
  2. The multi-class classifier is used to determine the exact class of a page before saving it to Solr. The included classifier sorts pages into the following classes:
    • Auto
    • Culture
    • Finance
    • Lifestyle
    • Politics
    • Sports
    • Technology
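
Because the models are ordinary serialized Weka classifiers, they can also be exercised outside the crawler. Below is a minimal Java sketch, using the standard Weka API, of loading a model and classifying already-vectorized pages. The file names multi.model and vectorized-pages.arff are hypothetical placeholders for whatever actually lives under focer.resourceFolder.

    import weka.classifiers.Classifier;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassifySketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical paths; the real layout is whatever focer.resourceFolder points to.
            Classifier multi = (Classifier) SerializationHelper.read("resources/multi.model");

            // Pages that were already turned into word vectors (e.g. with the
            // StringToWordVector filter), matching the dictionary's attribute structure.
            Instances pages = new DataSource("resources/vectorized-pages.arff").getDataSet();
            pages.setClassIndex(pages.numAttributes() - 1);

            for (int i = 0; i < pages.numInstances(); i++) {
                double label = multi.classifyInstance(pages.instance(i));
                System.out.println(pages.classAttribute().value((int) label)); // e.g. "Sports"
            }
        }
    }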

Run crawler

The crawler was tested on an Ubuntu 18.04 VM.

  1. Make sure you have Java 11 installed.
  2. Configure the Storm cluster.
  3. Configure Solr.
  4. Compile the code with the oneJar Gradle task (see the build-and-launch sketch after this list). This copies all necessary libraries and scripts to the output folder.
  5. Configure config.yaml (a config sketch also follows this list):
    • focer.resourceFolder - location of the crawler's resource folder, which contains the classification models and the queue. The default is located in the project under resources/resources.
    • focer.solr - Solr index URL.
    • focer.maxNgramBinary and focer.maxNgramMulti - maximum n-gram size for the binary and multi-class classifiers.
    • focer.binaryDocCount and focer.multiDocCount - document count in the classifier training and testing datasets. These parameters work around a bug in Weka.
    • focer.cleanDb - if set to true, the queue is cleared on start-up and filled with the seed URLs; if set to false, crawling continues from the last stop.
    • focer.seeds - seed URLs.
    • focer.blacklist - blacklisted domains. URLs from these domains will not be added to the queue.
    • All other parameters are part of the StormCrawler configuration.
  6. Add your classification models to the corresponding folder in the resources folder. You also have to add the prepared Weka dictionary generated with the StringToWordVector filter and a tokenized ARFF file. Only the structure of the ARFF file is required, so everything after the @data tag can be deleted. Alternatively, you can use the default classifiers included with the project.
  7. If you want to run the crawler in a Storm cluster, copy the contents of extlibs to the {storm_home}/extlib folder.
  8. Run startWithStorm.sh to run in a Storm cluster, or startLocal.sh to start locally without one (see the launch sketch after this list). If you are starting the crawler in a Storm cluster, edit the startWithStorm.sh script so it points to the correct Storm folder.
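
Here is a minimal config.yaml sketch for the focer.* parameters from step 5. All values are illustrative placeholders (including the Solr core name and the assumption that seeds and blacklist are YAML lists), not project defaults:

    focer.resourceFolder: "resources/resources"
    focer.solr: "http://localhost:8983/solr/focer"
    focer.maxNgramBinary: 2
    focer.maxNgramMulti: 3
    focer.binaryDocCount: 1000
    focer.multiDocCount: 1000
    focer.cleanDb: true
    focer.seeds:
      - "https://www.example.lv/"
    focer.blacklist:
      - "example.com"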
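
And a build-and-launch sketch covering steps 4 and 8. It assumes the Gradle wrapper is available (otherwise invoke a local Gradle installation) and that the start scripts are run from the output folder produced by oneJar:

    # Step 4: copy all necessary libraries and scripts to the output folder
    ./gradlew oneJar

    # Step 8, option A: start locally, without a Storm cluster
    ./startLocal.sh

    # Step 8, option B: submit to a Storm cluster
    # (first edit startWithStorm.sh so it points to the correct Storm folder)
    ./startWithStorm.sh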
