Embeddable submodule of parallel/monolingual text data, for use in testing code and sanity checks

jonsafari/toy-data

Toy Data

This repository provides a small amount of text data for testing code and running sanity checks. To embed it as a subdirectory (a Git submodule) of your own repo, run the following in your repo's top-level directory:

   git submodule add https://github.com/jonsafari/toy-data
   git add .gitmodules
   git commit .gitmodules toy-data/ -m 'add .gitmodules'
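
Anyone who later clones your repo without `--recurse-submodules` will get an empty `toy-data/` directory until the submodule is initialized. The helper below is a minimal sketch of one way to handle that in a test script. The function name `ensure_toy_data` is hypothetical, and it assumes it is run from the top-level directory of the parent repo.

```shell
# Check out the toy-data submodule if its directory is missing or empty.
# Assumes the current directory is the top level of the parent repo.
ensure_toy_data() {
    if [ -z "$(ls -A toy-data 2>/dev/null)" ]; then
        git submodule update --init toy-data
    fi
}
```

Calling `ensure_toy_data` at the start of a test script is harmless when the data is already present, since the function then does nothing.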

Depending on your needs, you may want to tokenize and lowercase the data. One option is a tokenizer such as Tok-tok, which performs multilingual tokenization, lowercasing, and digit conflation, and can accommodate empty lines and comments.

The data comes from the WMT News Commentary dataset and has been cleaned up. For large-scale experiments, download the full dataset from that source. The toy data is triple-aligned across English, Spanish, and German. The number of sentences and the approximate number of tokens in each set are shown below:

Set       Sentences  Tokens
Training  1000       ~23K
Dev       200        ~4.5K
Test      200        ~4.8K
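
Because the data is triple-aligned, the files for a given set should all have the same number of lines. The sketch below is one way to check that. It assumes one sentence per line, and the function name `check_aligned` is hypothetical.

```shell
# Print the line count of each file; fail if the counts differ.
check_aligned() {
    expected=""
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f" >&2
            return 2
        fi
        # Some wc implementations pad the count with spaces, so strip them.
        n=$(wc -l < "$f" | tr -d '[:space:]')
        printf '%s\t%s\n' "$n" "$f"
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" -ne "$expected" ]; then
            echo "line count mismatch: $f" >&2
            return 1
        fi
    done
}
```

For example, `check_aligned toy-data/nc.train.en toy-data/nc.train.es toy-data/nc.train.de` uses the file names shown under Examples, on the assumption that they sit at the top level of the submodule.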

Examples

nc.train.en:  But what is the right level?
nc.train.es:  Pero ¿cuál es el nivel correcto?
nc.train.de:  Aber welche Ebene ist die richtige?

nc.train.en:  A Big Chance for Small Farmers
nc.train.es:  Una gran oportunidad para los pequeños agricultores
nc.train.de:  Große Chance für Kleinbauern
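
To inspect other aligned sentences in the same layout as above, something like the following sketch could be used. The function name `show_sentence` is hypothetical, and it assumes one sentence per line with matching line numbers across languages.

```shell
# Print line N of each file, prefixed by the file's base name.
show_sentence() {
    n=$1
    shift
    for f in "$@"; do
        printf '%s:  %s\n' "$(basename "$f")" "$(sed -n "${n}p" "$f")"
    done
}
```

For example, `show_sentence 5 toy-data/nc.train.en toy-data/nc.train.es toy-data/nc.train.de` would print the fifth training sentence in each language, assuming the files sit at the top level of the submodule.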
