Word-alignment models for Bible translations in 100+ historical and contemporary languages
Installation and dependencies:
- Download or clone the repository:
$ git clone https://github.com/npedrazzini/parallelbibles
- From the root directory (./parallelbibles), build the repository:
$ make
This will download and build SyMGIZA++ [1] and install all the required dependencies in a venv called parallels-venv.
The input consists of XML files, which can be in one of two formats:
- OPUS (untokenized) (from https://opus.nlpl.eu/bible-uedin.php)
- PROIEL (from https://proiel.github.io)
This repository comes with OPUS XMLs (inside original-xmls/opus-xmls) and PROIEL XMLs for New Testament Greek, Old Church Slavonic and Gothic (inside original-xmls/proiel-xmls).
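As a rough illustration of what the conversion step has to read from an OPUS file, the sketch below pulls verse IDs and verse text out of a tiny inline sample. The element and attribute names (`seg`, `type="verse"`, IDs such as `b.GEN.1.1`) are assumptions about the bible-uedin layout, and `read_verses` is a hypothetical helper rather than a function from this repository.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample; this layout is an ASSUMPTION about OPUS bible-uedin files.
SAMPLE = """<cesDoc>
  <text><body>
    <div id="b.GEN" type="book">
      <div id="b.GEN.1" type="chapter">
        <seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
        <seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
      </div>
    </div>
  </body></text>
</cesDoc>"""


def read_verses(xml_text):
    """Return {verse_id: text} for every <seg type="verse"> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    verses = {}
    for seg in root.iter("seg"):
        if seg.get("type") == "verse":
            # itertext() also collects text nested inside child elements, if any;
            # split/join normalises runs of whitespace.
            verses[seg.get("id")] = " ".join("".join(seg.itertext()).split())
    return verses


verses = read_verses(SAMPLE)
for verse_id, text in verses.items():
    print(verse_id, "|", text)
```

Keying verses by ID is what makes it possible to line up the same verse across languages before alignment. PROIEL files use a different, tokenized layout and would need their own reader.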
This repository already comes with four pre-trained models, so you can skip training and go straight to extraction. To train new models, run:
$ ./train.sh
This step will:
- convert OPUS/PROIEL XML files to GIZA-readable CSV files
- train a word-alignment model for each target language
- make GIZA's outputs easily readable and queryable
You will be prompted to:
- specify the input XML format (OPUS, PROIEL, or mixed)
- enter the desired source language
- enter the target languages (or use all remaining languages as targets)
- specify whether you want to strip punctuation
- specify whether you want to convert everything to lowercase
- provide a name for your model
NB: languages must be entered as ISO 639-3 codes. See the official ISO 639-3 code tables for the complete list, and the table below for the languages included in the models.
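The two preprocessing prompts (strip punctuation, lowercase) can be pictured with the minimal sketch below. It is not the code run by train.sh: `preprocess` is a hypothetical helper, and treating every Unicode punctuation character as removable is an assumption.

```python
import unicodedata


def preprocess(verse, strip_punctuation=True, lowercase=True):
    """Normalise one verse before alignment (hypothetical helper)."""
    if strip_punctuation:
        # Drop every character whose Unicode category starts with "P",
        # so non-ASCII punctuation is removed as well as ASCII punctuation.
        verse = "".join(
            ch for ch in verse if not unicodedata.category(ch).startswith("P")
        )
    if lowercase:
        verse = verse.lower()
    # Collapse any whitespace left behind by removed punctuation.
    return " ".join(verse.split())


print(preprocess("In the beginning, God created the heaven and the earth."))
print(preprocess("And God said, Let there be light:", lowercase=False))
```

Note that a category-based filter also removes apostrophes inside words, which may be undesirable for languages whose orthography uses them as letters.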
$ ./extract.sh
This step will:
- extract every occurrence of a word (or multiple words) in the source language and its translation in the target languages.
- (optionally) generate scripts to run multidimensional scaling (MDS) on the dataset and Kriging (to draw lines around clusters probabilistically)
You will be prompted to enter:
- the name of the model you want to use (e.g. 'model2-LC-NP')
- a target word (e.g. 'when') or multiple target words separated by hyphens (e.g. 'when-while-since')
- whether you want to generate the scripts necessary to run MDS on the dataset ('yes' or 'no')
- whether you also want to apply Kriging to the MDS maps ('yes' or 'no')
- whether you only want to extract words from the New Testament ('yes') or from both the Old and the New Testament ('no') *
The output will be a folder named after the target word (or words, hyphen-separated, if extracting multiple words at once) containing the following:
- word.csv: CSV file for each word. The file will contain one occurrence per line, its citation (Bible verse), context, and the translations in each target language **.
And if you chose to run MDS (with and without Kriging) it will also contain:
- word-MDS.R: an R script to run MDS (and Kriging, if you chose to), generating a single PDF with one map per language. These maps are static and generated using base R. Best for distant-reading stages in the data exploration ***.
- word-plotly.R: an R script (alternative to word-MDS.R) generating multiple HTML files using the R package plotly. These maps are interactive and let you hover over the data points and look at the citation (Bible verse) and source word in context. Best for close-reading stages in the data exploration.
- word-data.txt: the original data in TXT format, with the citation (Bible verse) as index (rather than as a column, as in word.csv) and without the 'context' column.
- word-matrix.txt: distance matrix between source word and target words.
* This is because many languages lack all or large sections of the Old Testament, which will result in your dataset having many NAs (which you may or may not want to avoid).
** NB: NULL indicates that the model did not find a match for the word in the target language. NA indicates that the target language did not have a Bible translation of that particular verse in the first place (e.g. some languages lack a translation of the whole Old Testament).
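To make the NULL/NA distinction concrete, here is a minimal sketch that tallies both per target language. The rows are invented for illustration, and the column names (`citation`, `context`, one column per language) are assumptions about the layout of word.csv; `count_missing` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Invented stand-in for word.csv; column names are ASSUMED, not taken from the repo.
SAMPLE = """citation,context,deu,got,heb
MATT 2:1,when Jesus was born,als,NULL,NA
MATT 2:3,when Herod heard this,als,þan,NA
GEN 2:4,when they were created,da,NA,NULL
"""


def count_missing(csv_text, meta_columns=("citation", "context")):
    """Count aligned, NULL (no match found) and NA (verse absent) cells per language."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for lang, value in row.items():
            if lang in meta_columns:
                continue
            tally = counts.setdefault(lang, Counter())
            if value == "NULL":
                tally["NULL"] += 1
            elif value == "NA":
                tally["NA"] += 1
            else:
                tally["aligned"] += 1
    return counts


missing = count_missing(SAMPLE)
for lang, tally in missing.items():
    print(lang, dict(tally))
```

A high NA count for a language usually reflects missing books rather than alignment quality, which is one reason to restrict extraction to the New Testament.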
*** NB 1: This script is a heavy adaptation of the code by [2].
NB 2: The lmap function relies on the R package qlcVisualize. If you have issues installing it, simply save the two functions we need from that package by running the script ./scripts/postprocessing/lmap-boundary-functions.R included in this repository.
NB 3: The MDS script has been adapted so that it merges all translations with fewer than 10 occurrences with NULLs. The threshold of 10 is arbitrary and was based on what seemed to be a common cut-off point between 'real' translations in the target language and chance correspondences between the source word and a specific lexical item in the target language.
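The frequency cut-off described in NB 3 can be sketched as follows. This is a Python illustration of the idea, not the R code in word-MDS.R; `merge_rare` is a hypothetical helper, and leaving NA cells untouched is an assumption.

```python
from collections import Counter


def merge_rare(translations, threshold=10):
    """Replace translations seen fewer than `threshold` times with 'NULL'.

    'NA' cells (verse missing in the target language) are left untouched,
    which is an assumption made for this sketch.
    """
    counts = Counter(t for t in translations if t not in ("NULL", "NA"))
    return [
        t if t in ("NULL", "NA") or counts[t] >= threshold else "NULL"
        for t in translations
    ]


# Toy column for one target language: 12 x 'quand', 3 x 'lorsque', 1 x 'NA'.
column = ["quand"] * 12 + ["lorsque"] * 3 + ["NA"]
merged = merge_rare(column)
print(Counter(merged))
```

With the default threshold, the three occurrences of 'lorsque' are folded into NULL while 'quand' survives; lowering `threshold` keeps more low-frequency translations at the cost of more noise in the maps.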
./scripts/postprocessing/splitstree.R: this script will perform hierarchical clustering and NeighborNet analysis of the languages based on a criterion x (default: NULL-constructions).
It takes as input the file word-data.txt described above.
The script will:
- Plot a simple hierarchical cluster of the languages in a parallel-word dataset. It currently shows how similar languages appear to be based on NULL-construction distributions.
- Generate a Nexus (.nex) file for NeighborNet analysis, to be visualized with the SplitsTree4 software. A NeighborNet is similar to a traditional hierarchical clustering in many ways, but it does not force a binary-tree classification.
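The idea of comparing languages by their NULL distributions and exporting the result as a Nexus distance block can be sketched as below. This is not the logic of splitstree.R: the distance measure (share of verses where exactly one of two languages has NULL), the helper names, and the toy data are all assumptions, and the `FORMAT` options may need adjusting for your SplitsTree4 version.

```python
def null_distance(col_a, col_b):
    """Share of verses where exactly one of the two languages has NULL.

    Verses that are NA in either language are skipped (assumed behaviour).
    """
    pairs = [(a, b) for a, b in zip(col_a, col_b) if "NA" not in (a, b)]
    if not pairs:
        return 0.0
    differing = sum((a == "NULL") != (b == "NULL") for a, b in pairs)
    return differing / len(pairs)


def to_nexus(columns):
    """Render pairwise distances as Nexus text with a lower-triangle matrix."""
    langs = sorted(columns)
    lines = [
        "#NEXUS",
        "BEGIN taxa;",
        f"  DIMENSIONS ntax={len(langs)};",
        "  TAXLABELS " + " ".join(langs) + ";",
        "END;",
        "BEGIN distances;",
        f"  DIMENSIONS ntax={len(langs)};",
        "  FORMAT labels=left diagonal triangle=lower;",
        "  MATRIX",
    ]
    for i, lang in enumerate(langs):
        # Lower triangle including the diagonal: distances to langs[0..i].
        row = [
            f"{null_distance(columns[lang], columns[other]):.3f}"
            for other in langs[: i + 1]
        ]
        lines.append("  " + lang + " " + " ".join(row))
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Toy data: one list of translations per language, aligned by verse.
columns = {
    "deu": ["als", "als", "NULL", "da"],
    "got": ["NULL", "þan", "NULL", "NA"],
    "fra": ["quand", "NULL", "NULL", "quand"],
}
nexus = to_nexus(columns)
print(nexus)
```

The sketch covers only the distance computation and the export; the hierarchical clustering plot and the NeighborNet itself are produced by R and SplitsTree4 respectively.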
NB: model2-LC-NP is stored in this repo using Git LFS. If you wish to use that model, you should have Git LFS installed, else you will only see a pointer file.
Four pretrained models currently come with this repository:
- model1-UC-P: Upper Case and with Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
- model2-LC-NP: Lower Case and No Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
- model3-UC-NP: Upper Case and No Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
- model4-LC-P: Lower Case and with Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
You can directly extract target words from any of these models by running $ ./extract.sh. You will be prompted to enter the name of the model you want to use.
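The model names follow a simple convention (UC/LC for case, P/NP for punctuation). A minimal sketch for decoding it, assuming every model name has exactly three hyphen-separated parts, might look like this; `parse_model_name` is a hypothetical helper and is not part of extract.sh.

```python
def parse_model_name(name):
    """Decode a name such as 'model2-LC-NP' into its preprocessing settings."""
    model_id, case_flag, punct_flag = name.split("-")
    case_options = {"UC": False, "LC": True}   # LC = text was lowercased
    punct_options = {"P": True, "NP": False}   # NP = punctuation was stripped
    return {
        "model": model_id,
        "lowercased": case_options[case_flag],
        "punctuation_kept": punct_options[punct_flag],
    }


for name in ("model1-UC-P", "model2-LC-NP", "model3-UC-NP", "model4-LC-P"):
    print(name, parse_model_name(name))
```

This matters when choosing a target word: with a lowercased model, enter the word in lowercase so that it can match the source tokens.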
OT = Old Testament
NT = New Testament
ISO 639-3 | Language | Language family | OT | NT | Notes |
---|---|---|---|---|---|
acu | Achuar-Shiwiar | Jivaroan | N | Y | |
afr | Afrikaans | Indo-European > Germanic | Y | Y | |
agr | Awajún | Jivaroan | N | Y | |
ake | Akawaio | Cariban | N | Y | |
sqi/alb | Albanian | Indo-European | Y | Y | |
amh | Amharic | Afro-Asiatic > Semitic | Y | N | |
amu | Guerrero Amuzgo | Otomanguean | N | Y | |
ara | Arabic | Afro-Asiatic > Semitic | Y | Y | |
hye/arm | Armenian | Indo-European | Y | Y | |
baq | Basque | Isolate | N | Y | |
bsn | Barasana-Eduria | Tucanoan | N | Y | |
bul | Bulgarian | Indo-European > Balto-Slavic | Y | Y | |
cak | Kaqchikel | Mayan | N | Y | |
ceb | Cebuano | Austronesian > Malayo-Polynesian | Y | Y | |
cha | Chamorro | Austronesian > Malayo-Polynesian | Y | Y | OT only consists of the Psalms |
zho/chi | Chinese | Sino-Tibetan > Sinitic | Y | Y | |
chq | Quiotepec Chinantec | Otomanguean | N | Y | |
chr | Cherokee | Iroquoian | N | Y | |
chu | Church Slavonic | Indo-European > Balto-Slavic | N | Y | |
cjp | Cabécar | Chibchan | N | Y | |
cni | Asháninka | Maipurean | N | Y | |
cop | Coptic | Afro-Asiatic > Egyptian | N | Y | |
crp | Creoles and pidgins | Creole > French-based | Y | Y | The original XML files have the generic 'crp' code. This is however Haitian Creole (code hat) |
cze | Czech | Indo-European > Balto-Slavic | Y | Y | |
dan | Danish | Indo-European > Germanic | Y | Y | |
deu | German | Indo-European > Germanic | Y | Y | |
dik | Southwestern Dinka | Nilo-Saharan > Nilotic | N | Y | |
dje | Zarma | Nilo-Saharan > Songhai | Y | Y | |
dop | Lukpa | Niger-Congo > Atlantic-Congo | N | Y | |
epo | Esperanto | Constructed | Y | Y | |
est | Estonian | Uralic | Y | Y | |
ewe | Ewe | Niger-Congo > Atlantic-Congo | N | Y | |
fin | Finnish | Uralic | Y | Y | |
fra | French | Indo-European > Italic | Y | Y | |
gbi | Galela | West Papuan | N | Y | |
gla | Scottish Gaelic | Indo-European > Celtic | N | Y | The only text included is the Gospel of Mark |
glv | Manx | Indo-European > Celtic | Y | Y | The only text from the OT is the Book of Esther |
got | Gothic | Indo-European > Germanic | N | Y | |
grc | Ancient Greek (to 1453) | Indo-European | N | Y | |
ell/gre | Modern Greek (1453-) | Indo-European | Y | Y | |
guj | Gujarati | Indo-European > Indo-Iranian | N | Y | |
heb | Hebrew | Afro-Asiatic > Semitic | Y | N | |
hin | Hindi | Indo-European > Indo-Iranian | Y | Y | |
hrv | Croatian | Indo-European > Balto-Slavic | Y | Y | |
hun | Hungarian | Uralic | Y | Y | |
ind | Indonesian | Austronesian > Malayo-Polynesian | Y | Y | |
isl | Icelandic | Indo-European > Germanic | Y | Y | |
ita | Italian | Indo-European > Italic | Y | Y | |
jak | Jakun | Austronesian > Malayo-Polynesian | N | Y | |
jap | Japanese | Japonic | Y | Y | |
jiv | Shuar | Jivaroan | N | Y | |
kab | Kabyle-Amazigh | Afro-Asiatic > Berber | N | Y | |
kbh | Camsá | Isolate | N | Y | |
kor | Korean | Koreanic | Y | Y | |
lat | Latin | Indo-European > Italic | Y | Y | |
lav | Latvian | Indo-European > Balto-Slavic | N | Y | |
lit | Lithuanian | Indo-European > Balto-Slavic | Y | Y | |
mal | Malayalam | Dravidian | Y | Y | |
mam | Mam | Mayan | N | Y | |
mao | Maori | Austronesian > Malayo-Polynesian | Y | Y | |
mar | Marathi | Indo-European > Indo-Iranian | Y | Y | |
mya | Burmese | Sino-Tibetan > Tibeto-Burman | Y | Y | |
nep | Nepali | Indo-European > Indo-Iranian | Y | Y | |
nhg | Tetelcingo Nahuatl | Uto-Aztecan | N | Y | |
nld | Dutch | Indo-European > Germanic | Y | Y | |
nor | Norwegian | Indo-European > Germanic | Y | Y | |
ojb | Northwestern Ojibwa | Algic > Algonquian | N | Y | |
pck | Paite Chin | Sino-Tibetan > Tibeto-Burman | Y | Y | |
pes | Iranian Persian | Indo-European > Indo-Iranian | Y | Y | |
plt | Plateau Malagasy | Austronesian > Malayo-Polynesian | Y | Y | |
pol | Polish | Indo-European > Balto-Slavic | Y | Y | |
por | Portuguese | Indo-European > Italic | Y | Y | |
pot | Potawatomi | Algic > Algonquian | N | Y | |
ppk | Uma | Austronesian > Malayo-Polynesian | N | Y | |
quc | K'iche' | Mayan | N | Y | |
quw | Tena Lowland Quichua | Quechuan | N | Y | |
rom | Romany | Indo-European > Indo-Iranian | N | Y | |
ron/rum | Romanian | Indo-European > Italic | Y | Y | |
rus | Russian | Indo-European > Balto-Slavic | Y | Y | |
shi | Tachelhit | Afro-Asiatic > Berber | N | Y | |
slk | Slovak | Indo-European > Balto-Slavic | Y | Y | |
slv | Slovenian | Indo-European > Balto-Slavic | Y | Y | |
sna | Shona | Niger-Congo > Atlantic-Congo | Y | Y | |
som | Somali | Afro-Asiatic > Cushitic | Y | Y | |
spa | Spanish | Indo-European > Italic | Y | Y | |
srp | Serbian | Indo-European > Balto-Slavic | Y | Y | |
ssw | Swati | Niger-Congo > Atlantic-Congo | N | Y | |
swe | Swedish | Indo-European > Germanic | Y | Y | |
syr | Syriac | Afro-Asiatic > Semitic | N | Y | |
tel | Telugu | Dravidian | Y | Y | |
tgl | Tagalog | Austronesian > Malayo-Polynesian | Y | Y | |
tha | Thai | Kra-Dai > Tai | Y | Y | |
tmh | Tamashek | Afro-Asiatic > Berber | Y | Y | |
tur | Turkish | Turkic | Y | Y | |
ukr | Ukrainian | Indo-European > Balto-Slavic | N | Y | |
usp | Uspanteco | Mayan | N | Y | |
wal | Wolaytta | Afro-Asiatic > Omotic | N | Y | |
wol | Wolof | Niger-Congo > Atlantic-Congo | N | Y | |
xho | Xhosa | Niger-Congo > Atlantic-Congo | Y | Y | |
zul | Zulu | Niger-Congo > Atlantic-Congo | N | Y | |
- Include the following languages:
  a. In all models: vie, kan, djk, kek, agr, mal
  b. In model4-LC-P only: mar, mya, nep, tel
- Fix issue with display of some non-Latin characters in PDF output (notably all Arabic!). Note that the characters display normally in RStudio (i.e. it must be an issue with both base R pdf and CairoPDF).
- Add info on how NULLs are treated in the models.
- Add info on how many NAs we have per language, based on the best model.
[1] Junczys-Dowmunt, Marcin & Arkadiusz Szał. 2012. SyMGiza++: Symmetrized Word Alignment Models for Machine Translation. In Pascal Bouvry, Mieczyslaw A. Klopotek, Franck Leprévost, Malgorzata Marciniak, Agnieszka Mykowiecka & Henryk Rybinski (eds.), Security and Intelligent Information Systems (SIIS) (Lecture Notes in Computer Science 7053), 379-390. Heidelberg-Berlin: Springer.
[2] Wälchli, Bernhard. 2010. Similarity Semantics and Building Probabilistic Semantic Maps from Parallel Texts. Linguistic Discovery 8(1). 331-371. DOI:10.1349/PS1.1537-0852.A.356