This page explains how to download, extract, filter, and clean texts from a Thai Wikipedia dump.
1. Download the Thai Wikipedia dump with the following script (`./scripts/download_thwiki_dump.sh`):

```bash
bash ./scripts/download_thwiki_dump.sh \
    20200820 \
    ./data/dataset/thwiki-20200820/1_dumps/
```
Example output:
```
Download thwiki-20201120-pages-articles.xml.bz2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  276M  100  276M    0     0  1763k      0  0:02:40  0:02:40 --:--:-- 4010k
```
where `20200820` is the version of the Thai Wikipedia dump (see more details at dumps.wikimedia.org/thwiki/20200820).

- Filename: thwiki-20201120-pages-articles.xml.bz2
- SHA1: 1de130e11aa66c89b9ab0c73b5b6e739f423205b
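To double-check that the download completed correctly, you can verify the file against the SHA1 listed above. Below is a minimal sketch in Python (not part of the repo's scripts), assuming the dump was saved under `1_dumps/`; adjust the filename to the dump version you actually downloaded:

```python
# Hypothetical integrity check: compare the downloaded dump's SHA1 with the
# value listed above before moving on to extraction.
import hashlib

DUMP_PATH = "./data/dataset/thwiki-20200820/1_dumps/thwiki-20201120-pages-articles.xml.bz2"
EXPECTED_SHA1 = "1de130e11aa66c89b9ab0c73b5b6e739f423205b"

sha1 = hashlib.sha1()
with open(DUMP_PATH, "rb") as f:
    # Read in 1 MB chunks so the ~276 MB .bz2 file is never fully loaded into memory.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha1.update(chunk)

assert sha1.hexdigest() == EXPECTED_SHA1, "Checksum mismatch: re-download the dump."
print("SHA1 OK:", sha1.hexdigest())
```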
2. Extract texts from the downloaded dump (.bz2)

2.1 Install `wikiextractor`, a tool to extract text segments from the dump file.

```bash
bash ./scripts/install_wikiextractor.sh
```

2.2 Install `faketime`.

```bash
apt-get update
apt-get install faketime
```

2.3 Extract texts from the downloaded dump (.bz2) with `wikiextractor` via the following script (`./scripts/extract_thwiki_dump.sh`):

```bash
faketime '2020-08-25 12:00:00' bash ./scripts/extract_thwiki_dump.sh \
    ./data/dataset/thwiki-20200820/1_dumps/thwiki-20200820-pages-articles.xml.bz2 \
    data/dataset/thwiki-20200820/2_extracted \
    logs/wikiextractor_thwiki-20200820-nolist \
    "--json --sections"
```
where the arguments are as follows:
- DUMP_FILE_PATH - Path to the Wikipedia dump (.bz2)
- OUTPUT_DIR - Directory to store the extracted data
- LOG_PATH - Path to store the logging from wikiextractor
- PARAMS - Additional parameters that will be passed to `wikiextractor` (e.g. `--sections --json`); see more details at https://github.com/attardi/wikiextractor
Example output:
```
Begin extracting thwiki dump from ./data/dataset/thwiki-20200820/1_dumps/thwiki-20200820-pages-articles.xml.bz2
INFO: Loaded 0 templates in 0.0s
INFO: Starting page extraction from ./data/dataset/thwiki-20200820/1_dumps/thwiki-20200820-pages-articles.xml.bz2.
INFO: Using 1 extract processes.
INFO: 1 หน้าหลัก
INFO: 545 ดาราศาสตร์
INFO: 547 ภูมิศาสตร์
INFO: 611 พันทิป.คอม
INFO: 613 พันธุ์ทิพย์พลาซ่า
INFO: 615 วิทยาการคอมพิวเตอร์
INFO: 616 คณิตศาสตร์
INFO: 618 การประมวลสารสนเทศ
INFO: 619 การเมือง
...
...
INFO: 1119875 ประเทศไอซ์แลนด์ในโอลิมปิกเยาวชนฤดูร้อน 2014
INFO: 1119877 ประเทศอินโดนีเซียในโอลิมปิกเยาวชนฤดูร้อน 2014
INFO: 1119879 ประเทศอิรักในโอลิมปิกเยาวชนฤดูร้อน 2014
INFO: 1119880 ประเทศลัตเวียในโอลิมปิกเยาวชนฤดูร้อน 2014
INFO: 1119881 ประเทศโมร็อกโกในโอลิมปิกเยาวชนฤดูร้อน 2014
INFO: 1119882 ทีมผสมในโอลิมปิกเยาวชนฤดูร้อน 2014
INFO: 1119883 ผลกระทบกิบส์–ดอนนัน
INFO: Finished 79-process extraction of 139744 articles in 252.0s (554.6 art/s)
INFO: total of page: 264219, total of articl page: 139744; total of used articl page: 139744
```
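Because the extraction was run with the `--json` parameter, `wikiextractor` writes one JSON object per line into files under sub-directories such as `AA`, `AB`, and so on. A minimal sketch of iterating over that output, assuming the default `wiki_*` file naming and that each record carries `id`, `title`, and `text` fields:

```python
# Quick sanity check over the extracted JSON-lines output (a sketch, not a repo script).
import json
from pathlib import Path

EXTRACTED_DIR = Path("./data/dataset/thwiki-20200820/2_extracted")

n_articles = 0
for wiki_file in sorted(EXTRACTED_DIR.glob("*/wiki_*")):
    with open(wiki_file, encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            n_articles += 1
            if n_articles <= 5:
                # Show a few (id, title) pairs as a quick sanity check.
                print(article["id"], article["title"])
print("Total articles:", n_articles)
```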
3. Preprocess the extracted texts and aggregate them into one file (.txt) with `preprocess_thwiki_extracted.py`. The script performs the following text preprocessing (a minimal sketch of some of these rules follows at the end of this step):

- Remove the article title if it is duplicated in the first paragraph of the article
- (Optional) Remove the first empty parenthesis
- (Optional) Split long segments (both Thai and English)
- (Optional) Add an end-of-document token
- (Optional) Replace spaces with a special space token `"<_>"`

```bash
python ./scripts/preprocess_thwiki_extracted.py \
    ./data/dataset/thwiki-20200820/2_extracted \
    ./data/dataset/thwiki-20200820/3_aggregated \
    --remove_first_empty_parenthesis \
    --split_long_segment \
    --add_end_of_doc_token \
    --space_token "<_>"
```
Example output:
```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Begin loading files from ./data/dataset/thwiki-20200820/2_extracted
Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AE
Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AC
Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AB
Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AA
Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AD
Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AF
Total number of files: 586
Done.
Begin extracting data
100%|████████████████████████████████████████████████████| 586/586 [00:05<00:00, 102.66it/s]
139744it [00:02, 47093.23it/s]
Done.
Argumnet: remove_first_empty_parenthesis = True, Begin removing first empty parenthesis.
139744it [00:00, 173145.59it/s]
Done.
Argumnet: split_long_segment = True, Begin spliting long segment.
139744it [15:22, 151.55it/s]
Done.
Argumnet: add_end_of_doc_token = True, Begin adding end of document token `</s></s>`.
139744it [00:00, 500688.78it/s]
Done.
Begin replaceing space with space token
Argument space_token = <_>
```
The script writes the aggregated output to this path:
`./data/dataset/thwiki-20200820/3_aggregated/thwiki.txt`

SHA1 of the output file, `thwiki.txt`: 8c32b81bca7256816f359bda0531262d9c1f825a
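For illustration only, a rough sketch of how the title-deduplication, space-token, and end-of-document rules could look; the actual implementation in `preprocess_thwiki_extracted.py` may differ in detail (the `</s></s>` token below matches the one shown in the example output):

```python
# Illustrative sketch only; the authoritative rules live in
# ./scripts/preprocess_thwiki_extracted.py and may differ in detail.
END_OF_DOC_TOKEN = "</s></s>"  # token shown in the example output above
SPACE_TOKEN = "<_>"            # value passed via --space_token

def preprocess_article(title: str, text: str) -> str:
    lines = [line for line in text.splitlines() if line.strip()]
    # Drop the title if it is simply repeated at the start of the article body
    # (a simplification of the "duplicated in the first paragraph" rule).
    if lines and lines[0].strip() == title.strip():
        lines = lines[1:]
    body = "\n".join(lines)
    # Optional: replace literal spaces with the special space token.
    body = body.replace(" ", SPACE_TOKEN)
    # Optional: append the end-of-document token.
    return body + END_OF_DOC_TOKEN
```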
4. Clean the data with the script `clean_data-thwiki.py`. This script applies two text cleaning rules (see the sketch after the example output below):

- Replace non-breaking spaces with a space token
- Remove soft hyphens and zero-width non-breaking spaces (invisible characters)

```bash
python ./scripts/clean_data-thwiki.py \
    ./data/dataset/thwiki-20200820/3_aggregated/thwiki.txt \
    ./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt
```
Example output:
```
Begin reading file from ./data/dataset/thwiki-20200820/3_aggregated/thwiki.txt
Done.
Apply text cleaning rule 1: Replace non-breaking space with space token.
Done.
Apply text cleaning rule 2: Remove invisible characters.
Done.
Begin writing file to ./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt
```
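A rough sketch of the two cleaning rules; the authoritative implementation is `./scripts/clean_data-thwiki.py`, and treating the "space token" as the `<_>` token introduced in the previous step is an assumption:

```python
# Sketch of the two cleaning rules applied line by line.
NBSP = "\u00a0"         # non-breaking space
SOFT_HYPHEN = "\u00ad"  # soft hyphen (invisible)
ZWNBSP = "\ufeff"       # zero-width non-breaking space (invisible)

def clean_line(line: str, space_token: str = "<_>") -> str:
    # Rule 1: replace non-breaking spaces with the space token.
    line = line.replace(NBSP, space_token)
    # Rule 2: remove the invisible characters.
    return line.replace(SOFT_HYPHEN, "").replace(ZWNBSP, "")
```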
5. Split into train/val/test sets via the script `split_data.py`:

```bash
python ./scripts/split_data.py \
    ./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt \
    ./data/dataset/thwiki-20200820/5_split
```
Example output:
```
INFO: Load text file from ./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt
INFO: Begin splitting data. train_ratio: 0.95 val_ratio: 0.025 test_ratio: 0.025
INFO: Train/val/test statistics. train set: 944782 val set: 24863 test set: 24862
INFO: Begin writing train split to "./data/dataset/thwiki-20200820/5_split/train/train.txt".
INFO: Begin writing val split to "./data/dataset/thwiki-20200820/5_split/val/val.txt".
INFO: Begin writing test split to "./data/dataset/thwiki-20200820/5_split/test/test.txt".
INFO: Done writing all split.
```
SHA1 for each file:
```
7fd01c8b5e90f4452ecdde1f92a75094e6187a78  test/test.txt
f4e472ecbea284ffd6ebb3766636e8508c8cfc10  train/train.txt
a472d1d7fa292f6e6b1f29e0afc64e94414bac44  val/val.txt
```
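A minimal sketch of a 0.95/0.025/0.025 line-level split like the one reported above; how `split_data.py` actually orders the lines, rounds the split sizes, or seeds any shuffling may differ:

```python
# Sequential 0.95/0.025/0.025 split over the cleaned corpus (illustrative only).
def split_lines(lines, train_ratio=0.95, val_ratio=0.025):
    n_train = round(len(lines) * train_ratio)
    n_val = round(len(lines) * val_ratio)
    train = lines[:n_train]
    val = lines[n_train:n_train + n_val]
    test = lines[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

with open("./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt", encoding="utf-8") as f:
    lines = f.readlines()

train, val, test = split_lines(lines)
print(len(train), len(val), len(test))
```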