Merge pull request #167 from rsepassi/push

v1.1.0
tensorflow · Jul 19, 2017 · 47d556a
2 parents 963730e + f703629
commit 47d556a
Showing 30 changed files with 1,020 additions and 423 deletions.
diff --git a/README.md b/README.md
@@ -153,7 +153,7 @@ python -c "from tensor2tensor.models.transformer import Transformer"
   specification.
 * Support for multi-GPU machines and synchronous (1 master, many workers) and
   asynchronous (independent workers synchronizing through a parameter server)
-  distributed training.
+  [distributed training](https://github.com/tensorflow/tensor2tensor/tree/master/docs/distributed_training.md).
 * Easily swap amongst datasets and models by command-line flag with the data
   generation script `t2t-datagen` and the training script `t2t-trainer`.
 
@@ -173,8 +173,10 @@ and many common sequence datasets are already available for generation and use.
 
 **Problems** define training-time hyperparameters for the dataset and task,
 mainly by setting input and output **modalities** (e.g. symbol, image, audio,
-label) and vocabularies, if applicable. All problems are defined in
-[`problem_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem_hparams.py).
+label) and vocabularies, if applicable. All problems are either defined in
+[`problem_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem_hparams.py)
+or registered with `@registry.register_problem` (run `t2t-datagen` to see
+the list of all available problems).
 **Modalities**, defined in
 [`modality.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/modality.py),
 abstract away the input and output data types so that **models** may deal with
@@ -211,7 +213,7 @@ inference. Users can easily switch between problems, models, and hyperparameter
 sets by using the `--model`, `--problems`, and `--hparams_set` flags. Specific
 hyperparameters can be overridden with the `--hparams` flag. `--schedule` and
 related flags control local and distributed training/evaluation
-([distributed training documentation](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/docs/distributed_training.md)).
+([distributed training documentation](https://github.com/tensorflow/tensor2tensor/tree/master/docs/distributed_training.md)).
 
 ---
 
@@ -222,7 +224,7 @@ enables easily adding new ones and easily swapping amongst them by command-line
 flag. You can add your own components without editing the T2T codebase by
 specifying the `--t2t_usr_dir` flag in `t2t-trainer`.
 
-You can currently do so for models, hyperparameter sets, and modalities. Please
+You can do so for models, hyperparameter sets, modalities, and problems. Please
 do submit a pull request if your component might be useful to others.
 
 Here's an example with a new hyperparameter set:
@@ -253,9 +255,18 @@ You'll see under the registered HParams your
 `transformer_my_very_own_hparams_set`, which you can directly use on the command
 line with the `--hparams_set` flag.
 
+`t2t-datagen` also supports the `--t2t_usr_dir` flag for `Problem`
+registrations.
+
 ## Adding a dataset
 
-See the [data generators
+To add a new dataset, subclass
+[`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
+and register it with `@registry.register_problem`. See
+[`WMTEnDeTokens8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+for an example.
+
+Also see the [data generators
 README](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/README.md).
 
 ---

diff --git a/tensor2tensor/docs/distributed_training.md b/docs/distributed_training.md
rename from tensor2tensor/docs/distributed_training.md
rename to docs/distributed_training.md
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,23 @@
+# T2T: Tensor2Tensor Transformers
+
+Check us out on
+<a href=https://github.com/tensorflow/tensor2tensor>
+GitHub
+<img src="https://github.com/favicon.ico" width="16">
+</a>
+.
+
+[![PyPI
+version](https://badge.fury.io/py/tensor2tensor.svg)](https://badge.fury.io/py/tensor2tensor)
+[![GitHub
+Issues](https://img.shields.io/github/issues/tensorflow/tensor2tensor.svg)](https://github.com/tensorflow/tensor2tensor/issues)
+[![Contributions
+welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
+[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
+[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
+
+See our
+[README](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/README.md)
+for documentation.
+
+More documentation and tutorials coming soon...
diff --git a/setup.py b/setup.py
@@ -5,7 +5,7 @@
 
 setup(
     name='tensor2tensor',
-    version='1.0.14',
+    version='1.1.0',
     description='Tensor2Tensor',
     author='Google Inc.',
     author_email='[email protected]',

diff --git a/tensor2tensor/bin/t2t-datagen b/tensor2tensor/bin/t2t-datagen
@@ -35,7 +35,6 @@ import tempfile
 
 import numpy as np
 
-from tensor2tensor.data_generators import algorithmic
 from tensor2tensor.data_generators import algorithmic_math
 from tensor2tensor.data_generators import all_problems  # pylint: disable=unused-import
 from tensor2tensor.data_generators import audio
@@ -60,52 +59,22 @@ flags.DEFINE_string("tmp_dir", "/tmp/t2t_datagen",
                     "Temporary storage directory.")
 flags.DEFINE_string("problem", "",
                     "The name of the problem to generate data for.")
+flags.DEFINE_string("exclude_problems", "",
+                    "Comma-separated list of problems to exclude.")
 flags.DEFINE_integer("num_shards", 10, "How many shards to use.")
 flags.DEFINE_integer("max_cases", 0,
                      "Maximum number of cases to generate (unbounded if 0).")
 flags.DEFINE_integer("random_seed", 429459, "Random seed to use.")
-
 flags.DEFINE_string("t2t_usr_dir", "",
                     "Path to a Python module that will be imported. The "
                     "__init__.py file should include the necessary imports. "
                     "The imported files should contain registrations, "
-                    "e.g. @registry.register_model calls, that will then be "
-                    "available to the t2t-datagen.")
+                    "e.g. @registry.register_problem calls, that will then be "
+                    "available to t2t-datagen.")
 
 # Mapping from problems that we can generate data for to their generators.
 # pylint: disable=g-long-lambda
 _SUPPORTED_PROBLEM_GENERATORS = {
-    "algorithmic_shift_decimal40": (
-        lambda: algorithmic.shift_generator(20, 10, 40, 100000),
-        lambda: algorithmic.shift_generator(20, 10, 80, 10000)),
-    "algorithmic_reverse_binary40": (
-        lambda: algorithmic.reverse_generator(2, 40, 100000),
-        lambda: algorithmic.reverse_generator(2, 400, 10000)),
-    "algorithmic_reverse_decimal40": (
-        lambda: algorithmic.reverse_generator(10, 40, 100000),
-        lambda: algorithmic.reverse_generator(10, 400, 10000)),
-    "algorithmic_addition_binary40": (
-        lambda: algorithmic.addition_generator(2, 40, 100000),
-        lambda: algorithmic.addition_generator(2, 400, 10000)),
-    "algorithmic_addition_decimal40": (
-        lambda: algorithmic.addition_generator(10, 40, 100000),
-        lambda: algorithmic.addition_generator(10, 400, 10000)),
-    "algorithmic_multiplication_binary40": (
-        lambda: algorithmic.multiplication_generator(2, 40, 100000),
-        lambda: algorithmic.multiplication_generator(2, 400, 10000)),
-    "algorithmic_multiplication_decimal40": (
-        lambda: algorithmic.multiplication_generator(10, 40, 100000),
-        lambda: algorithmic.multiplication_generator(10, 400, 10000)),
-    "algorithmic_reverse_nlplike_decimal8K": (
-        lambda: algorithmic.reverse_generator_nlplike(8000, 70, 100000,
-                                                      10, 1.300),
-        lambda: algorithmic.reverse_generator_nlplike(8000, 70, 10000,
-                                                      10, 1.300)),
-    "algorithmic_reverse_nlplike_decimal32K": (
-        lambda: algorithmic.reverse_generator_nlplike(32000, 70, 100000,
-                                                      10, 1.050),
-        lambda: algorithmic.reverse_generator_nlplike(32000, 70, 10000,
-                                                      10, 1.050)),
     "algorithmic_algebra_inverse": (
         lambda: algorithmic_math.algebra_inverse(26, 0, 2, 100000),
         lambda: algorithmic_math.algebra_inverse(26, 3, 3, 10000)),
@@ -125,29 +94,9 @@ _SUPPORTED_PROBLEM_GENERATORS = {
                                                     2**14, 2**9),
         lambda: wsj_parsing.parsing_token_generator(FLAGS.tmp_dir, False,
                                                     2**14, 2**9)),
-    "wmt_enfr_characters": (
-        lambda: wmt.enfr_character_generator(FLAGS.tmp_dir, True),
-        lambda: wmt.enfr_character_generator(FLAGS.tmp_dir, False)),
-    "wmt_enfr_tokens_8k": (
-        lambda: wmt.enfr_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**13),
-        lambda: wmt.enfr_wordpiece_token_generator(FLAGS.tmp_dir, False, 2**13)
-    ),
-    "wmt_enfr_tokens_32k": (
-        lambda: wmt.enfr_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
-        lambda: wmt.enfr_wordpiece_token_generator(FLAGS.tmp_dir, False, 2**15)
-    ),
-    "wmt_ende_characters": (
-        lambda: wmt.ende_character_generator(FLAGS.tmp_dir, True),
-        lambda: wmt.ende_character_generator(FLAGS.tmp_dir, False)),
     "wmt_ende_bpe32k": (
         lambda: wmt.ende_bpe_token_generator(FLAGS.tmp_dir, True),
         lambda: wmt.ende_bpe_token_generator(FLAGS.tmp_dir, False)),
-    "wmt_zhen_tokens_32k": (
-        lambda: wmt.zhen_wordpiece_token_generator(FLAGS.tmp_dir, True,
-                                                   2**15, 2**15),
-        lambda: wmt.zhen_wordpiece_token_generator(FLAGS.tmp_dir, False,
-                                                   2**15, 2**15)
-    ),
     "lm1b_32k": (
         lambda: lm1b.generator(FLAGS.tmp_dir, True),
         lambda: lm1b.generator(FLAGS.tmp_dir, False)
@@ -286,6 +235,9 @@ def main(_):
   # Calculate the list of problems to generate.
   problems = sorted(
       list(_SUPPORTED_PROBLEM_GENERATORS) + registry.list_problems())
+  for exclude in FLAGS.exclude_problems.split(","):
+    if exclude:
+      problems = [p for p in problems if exclude not in p]
   if FLAGS.problem and FLAGS.problem[-1] == "*":
     problems = [p for p in problems if p.startswith(FLAGS.problem[:-1])]
   elif FLAGS.problem:
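
Reviewer note: the new `--exclude_problems` handling matches by substring, not by exact name, so excluding `wmt` drops every problem whose name contains `wmt`. The sketch below restates the filtering shown in this hunk as a standalone function. The function name `filter_problems` and its parameters are hypothetical stand-ins for the `FLAGS` values, and the exact-name `elif` branch is cut off in the hunk, so it is left out rather than guessed.

```python
def filter_problems(all_problems, problem="", exclude_problems=""):
    """Hypothetical restatement of the filtering in t2t-datagen's main().

    all_problems: iterable of problem names.
    problem: stands in for FLAGS.problem; only the trailing "*" form is modeled.
    exclude_problems: stands in for FLAGS.exclude_problems (comma-separated).
    """
    problems = sorted(all_problems)

    # Each non-empty entry removes every name that contains it as a substring.
    for exclude in exclude_problems.split(","):
        if exclude:
            problems = [p for p in problems if exclude not in p]

    # A trailing "*" turns the problem flag into a prefix match.
    if problem and problem[-1] == "*":
        prefix = problem[:-1]
        problems = [p for p in problems if p.startswith(prefix)]

    return problems


names = [
    "wmt_ende_bpe32k",
    "wmt_ende_tokens_8k",
    "lm1b_32k",
    "algorithmic_algebra_inverse",
]
print(filter_problems(names, problem="wmt*", exclude_problems="bpe"))
```

Because exclusion runs before the prefix match, an excluded name never reappears through the wildcard.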

diff --git a/tensor2tensor/bin/t2t-trainer b/tensor2tensor/bin/t2t-trainer
@@ -29,14 +29,11 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import importlib
-import os
-import sys
-
 # Dependency imports
 
 from tensor2tensor.utils import trainer_utils as utils
 from tensor2tensor.utils import usr_dir
+
 import tensorflow as tf
 
 flags = tf.flags
@@ -49,6 +46,7 @@ flags.DEFINE_string("t2t_usr_dir", "",
                     "e.g. @registry.register_model calls, that will then be "
                     "available to the t2t-trainer.")
 
+
 def main(_):
   tf.logging.set_verbosity(tf.logging.INFO)
   usr_dir.import_usr_dir(FLAGS.t2t_usr_dir)

diff --git a/tensor2tensor/data_generators/README.md b/tensor2tensor/data_generators/README.md
@@ -1,7 +1,7 @@
-# Data generators for T2T models.
+# T2T Problems.
 
-This directory contains data generators for a number of problems. We use a
-naming scheme for the problems, they have names of the form
+This directory contains `Problem` specifications for a number of problems. We
+use a naming scheme for the problems: they have names of the form
 `[task-family]_[task]_[specifics]`.  Data for all currently supported problems
 can be generated by calling the main generator binary (`t2t-datagen`). For
 example:
@@ -20,53 +20,51 @@ All tasks produce TFRecord files of `tensorflow.Example` protocol buffers.
 
 ## Adding a new problem
 
-1. Implement and register a Python generator for the dataset
-1. Add a problem specification to `problem_hparams.py` specifying input and
-   output modalities
+To add a new problem, subclass
+[`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
+and register it with `@registry.register_problem`. See
+[`WMTEnDeTokens8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+for an example.
 
-To add a new problem, you first need to create python generators for training
-and development data for the problem. The python generators should yield
-dictionaries with string keys and values being lists of {int, float, str}.
-Here is a very simple generator for a data-set where inputs are lists of 1s with
-length upto 100 and targets are lists of length 1 with an integer denoting the
-length of the input list.
+`Problem`s support data generation, training, and decoding.
+
+Data generation is handled by `Problem.generate_data`, which should produce 2
+datasets, training and dev, which should be named according to
+`Problem.training_filepaths` and `Problem.dev_filepaths`.
+`Problem.generate_data` should also produce any other files that may be required
+for training/decoding, e.g. a vocabulary file.
+
+A particularly easy way to implement `Problem.generate_data` for your dataset is
+to create 2 Python generators, one for the training data and another for the
+dev data, and pass them to `generator_utils.generate_dataset_and_shuffle`. See
+[`WMTEnDeTokens8k.generate_data`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+for an example of usage.
+
+The generators should yield dictionaries with string keys and values being lists
+of {int, float, str}.  Here is a very simple generator for a data-set where
+inputs are lists of 2s with length up to 100 and targets are lists of length 1
+with an integer denoting the length of the input list.
 
 ```
 def length_generator(nbr_cases):
   for _ in xrange(nbr_cases):
     length = np.random.randint(100) + 1
-    yield {"inputs": [1] * length, "targets": [length]}
+    yield {"inputs": [2] * length, "targets": [length]}
 ```
 
-Note that our data reader uses 0 for padding, so it is a good idea to never
-generate 0s, except if all your examples have the same size (in which case
-they'll never be padded anyway) or if you're doing padding on your own (in which
-case please use 0s for padding). When adding the python generator function,
-please also add unit tests to check if the code runs.
+Note that our data reader uses 0 for padding and other parts of the code assume
+end-of-string (EOS) is 1, so it is a good idea to never generate 0s or 1s,
+except if all your examples have the same size (in which case they'll never be
+padded anyway) or if you're doing padding on your own (in which case please use
+0s for padding). When adding the python generator function, please also add unit
+tests to check if the code runs.
 
 The generator can do arbitrary setup before beginning to yield examples - for
 example, downloading data, generating vocabulary files, etc.
 
 Some examples:
 
-*   [Algorithmic generators](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic.py)
+*   [Algorithmic problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic.py)
     and their [unit tests](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic_test.py)
-*   [WMT generators](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
+*   [WMT problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
     and their [unit tests](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt_test.py)
-
-When your python generator is ready and tested, add it to the
-`_SUPPORTED_PROBLEM_GENERATORS` dictionary in the
-[data
-generator](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/bin/t2t-datagen).
-The keys are problem names, and the values are pairs of (training-set-generator
-function, dev-set-generator function). For the generator above, one could add
-the following lines:
-
-```
-  "algorithmic_length_upto100":
-  (lambda: algorithmic.length_generator(10000),
-   lambda: algorithmic.length_generator(1000)),
-```
-
-Note the lambdas above: we don't want to call the generators too early.
-
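
Reviewer note: the README example in this diff targets Python 2 (`xrange`) and relies on NumPy. A minimal Python 3 sketch of the same generator, using only the standard library, might look like the following. The `max_length` and `seed` parameters and the `PAD_ID`/`EOS_ID` constants are assumptions added for illustration; they are not part of the README or the T2T API.

```python
import random

# Reserved ids described in the README text: 0 is padding, 1 is end-of-string.
PAD_ID = 0
EOS_ID = 1


def length_generator(nbr_cases, max_length=100, seed=None):
    """Yield examples whose input is a list of 2s and whose target is its length.

    max_length and seed are hypothetical parameters added for this sketch.
    """
    rng = random.Random(seed)
    for _ in range(nbr_cases):
        # randint is inclusive on both ends, so lengths fall in 1..max_length.
        length = rng.randint(1, max_length)
        yield {"inputs": [2] * length, "targets": [length]}


for example in length_generator(3, seed=0):
    print(len(example["inputs"]), example["targets"])
```

As in the README example, a target can still equal 1 when the input has length 1; only the inputs avoid the reserved ids here.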