This provides a small amount of textual data, for use in testing code and sanity checks. You can embed this as a subdirectory in your repo, by typing the following in the main directory of your repo:
git submodule add https://github.com/jonsafari/toy-data
git add .gitmodules
git commit .gitmodules toy-data/ -m 'add .gitmodules'
Depending on your needs, you may need to tokenize and lowercase the data. You can do this using a tokenizer, like Tok-tok, which does multilingual tokenization, lowercasing, digit conflation, and can accomodate empty lines and comments.
The data comes from the WMT News Commentary dataset, and is cleaned up. Download the full data there for large-scale experiments. It is triple-aligned in English, Spanish, and German. Below are the number of sentences in each set:
Set | Sentences | Tokens |
---|---|---|
Training | 1000 | ~23K |
Dev | 200 | ~4.5K |
Test | 200 | ~4.8K |
nc.train.en: But what is the right level?
nc.train.es: Pero ¿cuál es el nivel correcto?
nc.train.de: Aber welche Ebene ist die richtige?
nc.train.en: A Big Chance for Small Farmers
nc.train.es: Una gran oportunidad para los pequeños agricultores
nc.train.de: Große Chance für Kleinbauern