Given any Wikipedia dump, this simple tool parses the dump XML to extract clean Wikipedia pages, builds an inverted index over them, and returns matching pages for simple one-word queries.
source buildIndex.sh $WIKIDUMP_XML_FILE
or
bzcat $WIKIDUMP_BZ2_FILE | python buildInvertedIndexFromWikiDump.py -o $INDEX_DIR
$WIKIDUMP_BZ2_FILE is downloaded from the Wikipedia Dump; it is the compressed version of $WIKIDUMP_XML_FILE.
$INDEX_DIR (default: wiki) is the directory where the Wikipedia pages and indices are saved.
After running, there are several subdirectories in $INDEX_DIR. Each of them contains several XML and PKL files. An XML file contains Wikipedia pages with their titles as doc IDs. A PKL file contains the inverted index for the corresponding XML file.
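As a rough sketch of how a chunk can be inspected (the chunk path is hypothetical and the exact pickle layout may differ; here we assume each PKL file stores a dict mapping token to a set of page titles):

import pickle

# Load one index chunk; assumes the PKL stores a dict: token -> set of page titles.
# (The actual on-disk layout of this repo's PKL files may differ.)
with open("wiki/AA/wiki_00.pkl", "rb") as f:   # hypothetical chunk path
    inverted_index = pickle.load(f)

# O(1) dictionary lookup for a single-word query.
print(inverted_index.get("apple", set()))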
setup.sh
runs Steps 1 and 2 together. After running it, you will get the index directory wiki in your
current working directory; you can then continue with Step 3.
python retrieveIndex.py $INDEX_DIR $Query
$INDEX_DIR is generated in Step 1. $Query is a single word.
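For example, assuming the index was built into the default wiki directory:

python retrieveIndex.py wiki apple

This would print the titles of indexed pages containing the token apple (the exact output format depends on the script).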
buildInvertedIndexFromWikiDump.py
mainly includes two steps: wikidump parsing and index building. The method process_data()
reads the dump file line by line and aggregates the text between <page> and </page> tags as the raw text of a page,
which is passed to the method process_page(). That method calls WikiDocument.extract(..) to strip HTML garbage and
unrelated links with regular expressions. The class OutputSplitter writes the Wikipedia pages into several files.
We keep the regular Wikipedia pages and filter out other pages, e.g. redirect and disambiguation pages, to decrease the number of pages to index. Let N be the total amount of text in the regular Wikipedia pages and M be the number of regular expressions; the overall complexity is approximately O(N*M).
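As an illustration of the regex-based cleaning step (a minimal sketch; these are not the exact patterns used by WikiDocument.extract(..)), stripping HTML tags and links from a page body might look like:

import re

# Hypothetical, simplified versions of the cleanup patterns;
# WikiDocument.extract(..) uses a larger set of rules.
HTML_TAG  = re.compile(r"<[^>]+>")                        # drop raw HTML tags
EXT_LINK  = re.compile(r"\[http[^\]]*\]")                 # drop external links
WIKI_LINK = re.compile(r"\[\[([^|\]]*\|)?([^\]]*)\]\]")   # keep link text only

def clean_page(raw_text):
    # Each regex is applied over the whole text, hence the roughly O(N*M)
    # cost for N characters and M patterns.
    text = HTML_TAG.sub("", raw_text)
    text = EXT_LINK.sub("", text)
    text = WIKI_LINK.sub(r"\2", text)
    return text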
For each XML file, retrieveIndex.py
first builds a dictionary mapping the title of a page to the set of tokens it contains,
then inverts it into a dictionary mapping each token to a set of page titles. The ideal complexity of a dictionary lookup is O(1). Let K be the average
number of pages containing a token; the complexity of retrieval is roughly O(K).
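A minimal sketch of that two-step construction (assuming pages is a list of (title, text) pairs already parsed from one XML file; the real script's parsing and tokenization may differ):

from collections import defaultdict

def build_inverted_index(pages):
    # Step 1: title -> set of tokens contained in the page.
    title_to_tokens = {title: set(text.lower().split()) for title, text in pages}

    # Step 2: invert into token -> set of page titles.
    token_to_titles = defaultdict(set)
    for title, tokens in title_to_tokens.items():
        for token in tokens:
            token_to_titles[token].add(title)
    return token_to_titles

def retrieve(index, query):
    # One O(1) dictionary lookup, then roughly O(K) to return the K matching titles.
    return index.get(query.lower(), set())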
The class OutputSplitter splits the collection of Wikipedia pages into chunks. This is very practical, since indexing a huge number of pages on one machine is infeasible, and even if such a big index could be built, it would be too large to query efficiently. In real-world search engines, documents are split and kept on several servers, so queries can be answered in parallel and in real time because each split index is much smaller.
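A rough sketch of the splitting idea (not the actual OutputSplitter implementation) is to start a new output file whenever the current chunk exceeds a size limit:

# Hypothetical splitter: write pages to wiki_00.xml, wiki_01.xml, ... once a
# chunk exceeds max_bytes, mirroring what OutputSplitter does in spirit.
def split_pages(pages, out_dir, max_bytes=1 << 20):
    chunk_id, written = 0, 0
    out = open(f"{out_dir}/wiki_{chunk_id:02d}.xml", "w", encoding="utf-8")
    for page_xml in pages:
        if written > 0 and written + len(page_xml) > max_bytes:
            out.close()
            chunk_id, written = chunk_id + 1, 0
            out = open(f"{out_dir}/wiki_{chunk_id:02d}.xml", "w", encoding="utf-8")
        out.write(page_xml)
        written += len(page_xml)
    out.close()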
The Wikipedia dump can be parsed and cleaned in parallel. The Hadoop-based tool Wikipedia Miner serves this purpose better.
Word querying can be easily extended to phrase querying and more complicated structural queries.
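Phrase querying would require storing token positions in the index; a minimal sketch of that extension (not implemented in this tool) could look like:

from collections import defaultdict

def build_positional_index(pages):
    # token -> {title -> list of token positions in that page}
    index = defaultdict(dict)
    for title, text in pages:
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(title, []).append(pos)
    return index

def phrase_query(index, phrase):
    words = phrase.lower().split()
    # Pages must contain every word of the phrase.
    candidates = set(index.get(words[0], {}))
    for w in words[1:]:
        candidates &= set(index.get(w, {}))
    results = set()
    for title in candidates:
        # The phrase matches if every later word appears at the right offset
        # from some occurrence of the first word.
        if any(all(pos + i in index[words[i]][title] for i in range(1, len(words)))
               for pos in index[words[0]][title]):
            results.add(title)
    return results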
Ranking the retrieval results is feasible and practical: we can record term frequency in the indexing step and compute unigram probabilities in the retrieval stage to rank the pages.
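A minimal sketch of such ranking, assuming the index is extended to store term frequencies (token -> {title -> count}) and the length of each page:

def rank_by_unigram(tf_index, page_lengths, query):
    # Score each page by the unigram probability P(query | page) = tf / page_length.
    scores = {}
    for title, tf in tf_index.get(query.lower(), {}).items():
        scores[title] = tf / page_lengths[title]
    # Highest-probability pages first.
    return sorted(scores, key=scores.get, reverse=True)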