Toy Data

This repository provides a small amount of textual data for testing code and running sanity checks. You can embed it as a subdirectory of your own repo by running the following in that repo's main directory:

   git submodule add https://github.com/jonsafari/toy-data
   git add .gitmodules
   git commit .gitmodules toy-data/ -m 'add .gitmodules'

Depending on your needs, you may have to tokenize and lowercase the data. You can do this with a tokenizer such as Tok-tok, which performs multilingual tokenization, lowercasing, and digit conflation, and can accommodate empty lines and comments.
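If you only need a quick approximation for a sanity check, the preprocessing steps named above can be sketched in a few lines of Python. This is not Tok-tok and does not reproduce its rules: the function name `simple_normalize`, the punctuation-splitting regex, and the choice to map every digit to `5` are all assumptions made for illustration.

```python
import re


def simple_normalize(line: str) -> str:
    """Crude stand-in for a real tokenizer (hypothetical helper, not Tok-tok)."""
    # Lowercase and trim surrounding whitespace; an empty line stays empty.
    line = line.strip().lower()
    # Put spaces around every character that is neither a word character
    # nor whitespace, so punctuation becomes its own token.
    line = re.sub(r"([^\w\s])", r" \1 ", line)
    # Digit conflation (assumed scheme): replace each digit with "5".
    line = re.sub(r"\d", "5", line)
    # Collapse runs of whitespace into single spaces.
    return " ".join(line.split())


print(simple_normalize("But what is the right level?"))
# but what is the right level ?
```

Because it splits on every punctuation character, this sketch also breaks up decimals and abbreviations, which a real tokenizer would handle more carefully.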

The data comes from the WMT News Commentary dataset and has been cleaned up. For large-scale experiments, download the full dataset from WMT. The toy data is triple-aligned across English, Spanish, and German. The number of sentences and the approximate number of tokens in each set are listed below:

Set	Sentences	Tokens
Training	1000	~23K
Dev	200	~4.5K
Test	200	~4.8K
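Since the data is triple-aligned, line N of each language file should correspond to line N of the others. A minimal sketch of loading one set in parallel and checking that the files line up is shown below. The function name `read_aligned`, the `toy-data` directory path, and the UTF-8 encoding are assumptions; the `nc.<set>.<lang>` file naming follows the files in this repo.

```python
from pathlib import Path


def read_aligned(data_dir, split="train", langs=("en", "es", "de")):
    """Return a list of tuples, one tuple of sentences per aligned line.

    Hypothetical helper: assumes files named nc.<split>.<lang> in data_dir,
    encoded as UTF-8 (an assumption, not stated by the README).
    """
    columns = []
    for lang in langs:
        path = Path(data_dir) / f"nc.{split}.{lang}"
        with open(path, encoding="utf-8") as f:
            # Keep empty lines so that line numbers stay aligned.
            columns.append([line.rstrip("\n") for line in f])

    # All languages must have the same number of lines to be aligned.
    lengths = {lang: len(col) for lang, col in zip(langs, columns)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"line counts differ across languages: {lengths}")

    return list(zip(*columns))


# Example usage, assuming the submodule was added as ./toy-data:
#   rows = read_aligned("toy-data", split="dev")
#   en, es, de = rows[0]
```

Raising on a length mismatch rather than silently truncating with `zip` makes a broken or partially downloaded file obvious early.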

Examples

nc.train.en:  But what is the right level?
nc.train.es:  Pero ¿cuál es el nivel correcto?
nc.train.de:  Aber welche Ebene ist die richtige?

nc.train.en:  A Big Chance for Small Farmers
nc.train.es:  Una gran oportunidad para los pequeños agricultores
nc.train.de:  Große Chance für Kleinbauern

