This is just a prototype (proof of concept) of a focused web crawler that uses text classification algorithms.
Focer is a focused web crawler based on StormCrawler. It was developed as part of the Bachelor's thesis *Development of focused web-crawler* by Mārtiņš Trubiņš at Riga Technical University.
- Java 11
- Apache Storm cluster (can be run in `-local` mode without Storm), tested with version 1.2.3
- Apache Solr, tested with version 8.5.1
As mentioned before, this is just a prototype, so the following limitations apply:
- Only the Latvian language is supported
- Only the HTML MIME type is supported
Weka-trained models are used as classifiers. The crawler uses two classifiers: a primary (binary) one and a secondary (multi-class) one.
- The primary classifier is used to detect outliers (pages outside the target topics).
- The multi-class classifier is used to determine the exact class of a page before saving it to Solr. The included classifier assigns pages to one of the following classes (a minimal usage sketch follows the list):
- Auto
- Culture
- Finance
- Lifestyle
- Politics
- Sports
- Technology
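For illustration, here is a minimal sketch of how a serialized Weka model could be loaded and queried for a class label. All file names and paths below are hypothetical placeholders, not the project's actual layout, and an empty instance stands in for a real page vector:

```java
import weka.classifiers.Classifier;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierSketch {
    public static void main(String[] args) throws Exception {
        // Load a serialized Weka model (path is a placeholder).
        Classifier multiClass = (Classifier) SerializationHelper.read("resources/multi/model.model");

        // Load the ARFF header that defines the attribute structure
        // (only the header matters; the @data section may be empty).
        Instances structure = new DataSource("resources/multi/structure.arff").getDataSet();
        structure.setClassIndex(structure.numAttributes() - 1);

        // In the real crawler the page text would first be turned into a word vector
        // using the same StringToWordVector dictionary as at training time;
        // here an empty instance stands in for that vector.
        Instance doc = new DenseInstance(structure.numAttributes());
        doc.setDataset(structure);

        double label = multiClass.classifyInstance(doc);
        System.out.println("Predicted class: " + structure.classAttribute().value((int) label));
    }
}
```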
The crawler was tested on an Ubuntu 18.04 VM.
- Make sure you have Java 11 installed.
- Configure Storm cluster
- Configure Solr
- Compile the code with the `oneJar` Gradle task. This will copy all necessary libraries and scripts to the `output` folder.
- Configure `config.yaml` (an example fragment follows the parameter list):
  - focer.resourceFolder - location of the crawler's resource folder that contains the classification models and the queue. The default folder is located in the project at resources/resources.
  - focer.solr - Solr index URL
  - focer.maxNgramBinary and focer.maxNgramMulti - maximum n-gram size for the binary and the multi-class classifier.
  - focer.binaryDocCount and focer.multiDocCount - document count in the classifier training + testing datasets. This parameter is needed because of a bug in Weka.
  - focer.cleanDb - if set to `true`, the queue will be cleared on start-up and filled with the seed URLs. If set to `false`, crawling will continue from where it last stopped.
  - focer.seeds - seed URLs
  - focer.blacklist - blacklisted domains. URLs from these domains will not be added to the queue.
  - Every other parameter is part of the standard StormCrawler configuration.
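A hedged example of what a `config.yaml` fragment with these parameters could look like. All values below are placeholders rather than recommended settings, and the list syntax shown for the seed and blacklist entries is an assumption:

```yaml
# Example values only - adjust to your own environment.
focer.resourceFolder: "/home/crawler/focer/resources"
focer.solr: "http://localhost:8983/solr/focer"
focer.maxNgramBinary: 2
focer.maxNgramMulti: 3
focer.binaryDocCount: 1000
focer.multiDocCount: 1000
focer.cleanDb: true
focer.seeds:
  - "https://www.example.lv/"
focer.blacklist:
  - "example.com"
```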
- Add your classification models to the corresponding folders in the resources folder. You also have to add a prepared Weka dictionary generated with the `StringToWordVector` filter and a tokenized ARFF file. From the ARFF file only the structure is required, so everything after the `@data` tag can be deleted (see the sketch at the end of this README). Alternatively, you can use the default classifiers included with the project.
- If you want to run the crawler in a Storm cluster, copy the contents of the extlibs folder to the {storm_home}/extlib folder.
- Run startWithStorm.sh to run in a Storm cluster, or startLocal.sh to start locally without a Storm cluster. If you are starting the crawler in a Storm cluster, edit the startWithStorm.sh script so that it points to the correct Storm folder.
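For reference, below is a minimal sketch of how such a dictionary and tokenized ARFF file could be produced with Weka's `StringToWordVector` filter. All file names are placeholders, the n-gram sizes are arbitrary examples (they should mirror focer.maxNgramBinary / focer.maxNgramMulti), and `setDictionaryFileToSaveTo` assumes a reasonably recent Weka release:

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class DictionarySketch {
    public static void main(String[] args) throws Exception {
        // Raw training data with a string attribute holding the page text (placeholder path).
        Instances raw = new DataSource("training.arff").getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);

        // Tokenize into word n-grams; the max size should mirror the focer.maxNgram* settings.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(2);

        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        // Write the dictionary to a file so it can be reused at classification time (placeholder name).
        filter.setDictionaryFileToSaveTo(new File("dictionary.txt"));
        filter.setInputFormat(raw);

        Instances vectorized = Filter.useFilter(raw, filter);

        // Save the tokenized ARFF; everything after the @data tag can later be deleted,
        // since only the attribute structure is required by the crawler.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(vectorized);
        saver.setFile(new File("structure.arff"));
        saver.writeBatch();
    }
}
```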