Thai Wikipedia dataset preparation

This page explains how to download, extract, filter, and clean texts from a Thai Wikipedia dump.

Instructions

  1. Download the Thai Wikipedia dump with the following script (./scripts/download_thwiki_dump.sh)

    bash ./scripts/download_thwiki_dump.sh \
    20200820 \
    ./data/dataset/thwiki-20200820/1_dumps/
    
    Example output:
    Download thwiki-20200820-pages-articles.xml.bz2
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                    Dload  Upload   Total   Spent    Left  Speed
    100  276M  100  276M    0     0  1763k      0  0:02:40  0:02:40 --:--:-- 4010k
    

    where 20200820 is the version of the Thai Wikipedia dump (see https://dumps.wikimedia.org/thwiki/20200820/ for details)

    Filename: thwiki-20200820-pages-articles.xml.bz2

    • SHA1: 1de130e11aa66c89b9ab0c73b5b6e739f423205b
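
    To verify the download, you can recompute the checksum; a minimal Python sketch using only the standard library (the path assumes the output directory from the command above):

    import hashlib

    def sha1_of_file(path, chunk_size=1 << 20):
        # Stream the file in 1 MB chunks so the ~276 MB dump
        # is never loaded into memory at once.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    digest = sha1_of_file(
        "./data/dataset/thwiki-20200820/1_dumps/thwiki-20200820-pages-articles.xml.bz2"
    )
    assert digest == "1de130e11aa66c89b9ab0c73b5b6e739f423205b"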

  2. Extract texts from the downloaded dump (.bz2)

    2.1 Install wikiextractor, a tool to extract text segments from the dump file.

    bash ./scripts/install_wikiextractor.sh

    2.2 Install faketime, which is used to pin the system time during extraction so that the run is reproducible.

    apt-get update
    apt-get install faketime

    2.3 Extract texts from the downloaded dump (.bz2) with wikiextractor via the following script (./scripts/extract_thwiki_dump.sh)

    faketime '2020-08-25 12:00:00' bash ./scripts/extract_thwiki_dump.sh \
    ./data/dataset/thwiki-20200820/1_dumps/thwiki-20200820-pages-articles.xml.bz2 \
    data/dataset/thwiki-20200820/2_extracted \
    logs/wikiextractor_thwiki-20200820-nolist \
    "--json --sections"

    where the arguments are as follows:

    1. DUMP_FILE_PATH - The path to the Wikipedia dump (.bz2)

    2. OUTPUT_DIR - Directory to store the extracted data

    3. LOG_PATH - Path to store the logging from wikiextractor

    4. PARAMS - Additional parameters passed through to wikiextractor (e.g. --sections --json); see https://github.com/attardi/wikiextractor for details


    Example output:
    Begin extracting thwiki dump from ./data/dataset/thwiki-20200820/1_dumps/thwiki-20200820-pages-articles.xml.bz2
    INFO: Loaded 0 templates in 0.0s
    INFO: Starting page extraction from ./data/dataset/thwiki-20200820/1_dumps/thwiki-20200820-pages-articles.xml.bz2.
    INFO: Using 1 extract processes.
    INFO: 1	หน้าหลัก
    INFO: 545	ดาราศาสตร์
    INFO: 547	ภูมิศาสตร์
    INFO: 611	พันทิป.คอม
    INFO: 613	พันธุ์ทิพย์พลาซ่า
    INFO: 615	วิทยาการคอมพิวเตอร์
    INFO: 616	คณิตศาสตร์
    INFO: 618	การประมวลสารสนเทศ
    INFO: 619	การเมือง
    
    ...
    ...
    
    INFO: 1119875	ประเทศไอซ์แลนด์ในโอลิมปิกเยาวชนฤดูร้อน 2014
    INFO: 1119877	ประเทศอินโดนีเซียในโอลิมปิกเยาวชนฤดูร้อน 2014
    INFO: 1119879	ประเทศอิรักในโอลิมปิกเยาวชนฤดูร้อน 2014
    INFO: 1119880	ประเทศลัตเวียในโอลิมปิกเยาวชนฤดูร้อน 2014
    INFO: 1119881	ประเทศโมร็อกโกในโอลิมปิกเยาวชนฤดูร้อน 2014
    INFO: 1119882	ทีมผสมในโอลิมปิกเยาวชนฤดูร้อน 2014
    INFO: 1119883	ผลกระทบกิบส์–ดอนนัน
    INFO: Finished 79-process extraction of 139744 articles in 252.0s (554.6 art/s)
    INFO: total of page: 264219, total of articl page: 139744; total of used articl page: 139744
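
    Because the extractor is run with --json, each output file (in subdirectories such as AA, AB, ...) holds one JSON object per line. A minimal sketch for inspecting the extracted articles; the id, url, title, and text field names follow wikiextractor's JSON output, so check your wikiextractor version if they differ:

    import json
    from pathlib import Path

    extracted_dir = Path("./data/dataset/thwiki-20200820/2_extracted")
    # wikiextractor writes files named wiki_00, wiki_01, ... per subdirectory.
    first_file = sorted(extracted_dir.glob("*/wiki_*"))[0]
    with open(first_file, encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            print(article["id"], article["title"], len(article["text"]))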
    
    

  3. Preprocess the extracted texts and aggregate them into one file (.txt) with preprocess_thwiki_extracted.py

    The script performs the following text preprocessing steps (a sketch of two of these rules follows the example output below):

    • Remove the article title if it is duplicated in the first paragraph of the article

    • (Optional) Remove the first empty parenthesis

    • (Optional) Split long segments (both Thai and English)

    • (Optional) Add an end-of-document token

    • (Optional) Replace spaces with a special token "<_>"

    python ./scripts/preprocess_thwiki_extracted.py \
    ./data/dataset/thwiki-20200820/2_extracted \
    ./data/dataset/thwiki-20200820/3_aggregated \
    --remove_first_empty_parenthesis \
    --split_long_segment \
    --add_end_of_doc_token \
    --space_token "<_>" 
    Example output:
    [nltk_data] Downloading package punkt to /root/nltk_data...
    [nltk_data]   Unzipping tokenizers/punkt.zip.
    Begin loading files from ./data/dataset/thwiki-20200820/2_extracted
    Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AE
    Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AC
    Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AB
    Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AA
    Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AD
    Sub directory: ./data/dataset/thwiki-20200820/2_extracted/AF
    Total number of files: 586
    Done.
    
    Begin extracting data
    100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 586/586 [00:05<00:00, 102.66it/s]
    139744it [00:02, 47093.23it/s]
    Done.
    
    Argumnet: remove_first_empty_parenthesis = True, Begin removing first empty parenthesis.
    139744it [00:00, 173145.59it/s]
    Done.
    
    Argumnet: split_long_segment = True, Begin spliting long segment.
    139744it [15:22, 151.55it/s]
    Done.
    
    Argumnet: add_end_of_doc_token = True, Begin adding end of document token `</s></s>`.
    139744it [00:00, 500688.78it/s]
    Done.
    
    Begin replaceing space with space token
    Argument space_token = <_>
    

    The script writes the output to this path: ./data/dataset/thwiki-20200820/3_aggregated/thwiki.txt

    SHA1 of the output file, thwiki.txt: 8c32b81bca7256816f359bda0531262d9c1f825a
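
    As referenced above, here is a minimal sketch of two of the preprocessing rules. The exact logic lives in preprocess_thwiki_extracted.py; the regular expression and example below are illustrative assumptions, not the script's verbatim code:

    import re

    def remove_first_empty_parenthesis(segment):
        # Illustrative: drop the first "()"-style parenthesis, which is
        # typically left behind when the markup inside it was stripped.
        return re.sub(r"\(\s*\)", "", segment, count=1)

    def replace_space_token(segment, space_token="<_>"):
        # Replace each literal space with the special space token.
        return segment.replace(" ", space_token)

    segment = remove_first_empty_parenthesis("ตัวอย่าง ( ) ข้อความ")
    print(replace_space_token(segment))  # ตัวอย่าง<_><_>ข้อความ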

  4. Clean the data with the following script, clean_data-thwiki.py

    This script applies two text cleaning rules (a sketch follows the example output below):

    1. Replace non-breaking spaces with a space token

    2. Remove soft hyphens and zero-width no-break spaces (invisible characters)

    python ./scripts/clean_data-thwiki.py \
    ./data/dataset/thwiki-20200820/3_aggregated/thwiki.txt \
    ./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt
    Example output:
    Begin reading file from ./data/dataset/thwiki-20200820/3_aggregated/thwiki.txt
    Done.
    
    Apply text cleaning rule 1: Replace non-breaking space with space token.
    Done.
    
    Apply text cleaning rule 2: Remove invisible characters.
    Done.
    
    Begin writing file to ./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt
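
    As referenced above, a minimal sketch of the two cleaning rules, assuming the characters involved are U+00A0 (non-breaking space), U+00AD (soft hyphen), and U+FEFF (zero-width no-break space); the space_token default here is a plain space, and the actual script may substitute its own token:

    def clean_text(text, space_token=" "):
        # Rule 1: replace non-breaking spaces with the space token.
        text = text.replace("\u00a0", space_token)
        # Rule 2: strip the invisible characters.
        return text.replace("\u00ad", "").replace("\ufeff", "")

    assert clean_text("ก\u00a0ข\u00adค\ufeff") == "ก ขค"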
    
  5. Split into train/val/test sets via the script split_data.py

    python ./scripts/split_data.py \
    ./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt \
    ./data/dataset/thwiki-20200820/5_split
    Example output:
    INFO: Load text file from ./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt
    INFO: Begin splitting data.
        train_ratio: 0.95
        val_ratio: 0.025
        test_ratio: 0.025
    
    INFO: Train/val/test statistics.
        train set: 944782
        val set: 24863
        test set: 24862
    
    INFO: Begin writing train split to "./data/dataset/thwiki-20200820/5_split/train/train.txt".
    INFO: Begin writing val split to "./data/dataset/thwiki-20200820/5_split/val/val.txt".
    INFO: Begin writing test split to "./data/dataset/thwiki-20200820/5_split/test/test.txt".
    
    INFO: Done writing all split.
    
    

    SHA1 checksums for each file:

    7fd01c8b5e90f4452ecdde1f92a75094e6187a78  test/test.txt
    f4e472ecbea284ffd6ebb3766636e8508c8cfc10  train/train.txt
    a472d1d7fa292f6e6b1f29e0afc64e94414bac44  val/val.txt
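
    The ratios above are 0.95/0.025/0.025 over segments. A minimal sketch of an in-order, line-level split with those ratios; split_data.py may order or round differently, so the exact counts are not guaranteed to match:

    def split_lines(lines, train_ratio=0.95, val_ratio=0.025):
        # Deterministic split in file order: first 95% train,
        # next 2.5% validation, remainder test.
        n_train = int(len(lines) * train_ratio)
        n_val = int(len(lines) * val_ratio)
        train = lines[:n_train]
        val = lines[n_train:n_train + n_val]
        test = lines[n_train + n_val:]
        return train, val, test

    with open("./data/dataset/thwiki-20200820/4_cleaned/thwiki.txt", encoding="utf-8") as f:
        train, val, test = split_lines(f.readlines())
    print(len(train), len(val), len(test))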