Evaluation results of boilerplate removal tools on two datasets: CleanEval and CleanPortalEval
CleanEval source:
- http://corpus.leeds.ac.uk/cleaneval/gold-24k/
- http://cleaneval.sigwac.org.uk/
- http://cleaneval.sigwac.org.uk/devset.html
CleanPortalEval source: https://github.com/ppke-nlpg/CleanPortalEval
Evaluation script from Stefan Evert (http://www.lrec-conf.org/proceedings/lrec2008)
Evaluated tools:
- boilerpipe
- bte
- goldminer
- goldminer+onion
- justext
- justext+onion
Contents:
- cleanEvalResults: results of the boilerplate removal algorithms on the CleanEval dataset
- cleanPortalEvalResults: results of the boilerplate removal algorithms on the CleanPortalEval dataset
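
As a rough illustration of how a tool's output can be scored against a gold standard text, here is a minimal Python sketch of word-level precision, recall and F1. It is not Stefan Evert's evaluation script and may differ from it in tokenization and alignment. The function name `token_scores`, the whitespace tokenization and the use of `difflib` for alignment are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def token_scores(extracted: str, gold: str) -> dict:
    """Word-level precision, recall and F1 of extracted text against gold text.

    Hypothetical helper for illustration only; not the script used to
    produce the results in this repository.
    """
    # Assumption: plain whitespace tokenization.
    out_tokens = extracted.split()
    gold_tokens = gold.split()

    # Count the tokens that align, in order, between the two sequences.
    matcher = SequenceMatcher(None, out_tokens, gold_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())

    precision = matched / len(out_tokens) if out_tokens else 0.0
    recall = matched / len(gold_tokens) if gold_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up strings: the tool kept two navigation words as boilerplate.
    scores = token_scores(
        "Home News Main article text here",
        "Main article text here",
    )
    print(scores)
```

In this made-up example the leftover navigation words lower precision while recall stays at 1.0, which is the kind of trade-off that such scores make visible when comparing tools.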
If you use the tool, please cite the following paper: More effective boilerplate removal - the GoldMiner algorithm
@article{endredy_more_2013,
title = {More {Effective} {Boilerplate} {Removal} - the {GoldMiner} {Algorithm}},
issn = {1870-9044},
url = {http://polibits.gelbukh.com/2013_48},
language = {eng},
number = {48},
journal = {Polibits - Research journal on Computer science and computer engineering with applications},
author = {Endr{\'e}dy, Istv{\'a}n and Nov{\'a}k, Attila},
year = {2013},
keywords = {boilerplate removal, Corpus building, the web as corpus},
pages = {79--83}
}