Merge pull request #263 from rsepassi/push
v1.2.1
Showing 41 changed files with 2,161 additions and 547 deletions.
# T2T: Life of an Example

[![PyPI
version](https://badge.fury.io/py/tensor2tensor.svg)](https://badge.fury.io/py/tensor2tensor)
[![GitHub
Issues](https://img.shields.io/github/issues/tensorflow/tensor2tensor.svg)](https://github.com/tensorflow/tensor2tensor/issues)
[![Contributions
welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)

This document shows how a training example passes through the T2T pipeline,
and how all its parts are connected to work together.

## The Life of an Example

A training example passes through the following stages in T2T:
* raw input (text from the command line or a file)
* encoded input after the `encode` function of [Problem.feature_encoder](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem.py#L173), usually a sparse tensor, e.g., a vector of `tf.int32`s
* batched input after the [data input pipeline](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/data_reader.py#L242), where the inputs, after [Problem.preprocess_examples](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem.py#L188), are grouped by their length and made into batches
* dense input after being processed by the `bottom` function of a [Modality](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/modality.py#L30)
* dense output after [T2T.model_fn_body](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/t2t_model.py#L542)
* back to sparse output through the `top` function of a [Modality](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/modality.py#L30)
* if decoding, back through the `decode` function of [Problem.feature_encoder](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem.py#L173) to display on the screen

We go through these stages step by step below.

## Feature Encoders

TODO: describe [Problem.feature_encoder](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem.py#L173), which is a dict of encoders that have `encode` and `decode` functions.
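
For a quick taste of the interface, here is a minimal round trip through one such encoder, the byte-level `ByteTextEncoder` that also appears later in these docs (a sketch; the exact ids depend on the encoder's reserved ids):

```python
from tensor2tensor.data_generators import text_encoder

# encode: string -> list of ints; decode: list of ints -> string.
encoder = text_encoder.ByteTextEncoder()
ids = encoder.encode("hello")
print(ids)                  # a short list of small ints, one per byte
print(encoder.decode(ids))  # "hello"
```
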
## Modalities

TODO: describe [Modality](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/modality.py#L30), which has `bottom` and `top` but also sharded versions and one for targets.
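
Until that description lands, here is a schematic, hypothetical stand-in that shows the role of `bottom` and `top` (this is not the actual tensor2tensor base class, which also has sharded variants and target-specific methods):

```python
import tensorflow as tf


class ToySymbolModality(object):
  """Illustrative modality-like pair of transforms: ids -> dense -> logits."""

  def __init__(self, vocab_size, hidden_size):
    # One embedding matrix shared between bottom and top.
    self._embedding = tf.get_variable(
        "toy_embedding", [vocab_size, hidden_size])

  def bottom(self, ids):
    # Sparse int ids become dense embeddings (the "dense input" stage).
    return tf.nn.embedding_lookup(self._embedding, ids)

  def top(self, body_output, unused_targets):
    # Dense body output back to vocab-sized logits (the "sparse output"
    # stage); assumes a [batch, hidden_size] body output for simplicity.
    return tf.matmul(body_output, self._embedding, transpose_b=True)
```
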
# T2T: Create Your Own Model

[![PyPI
version](https://badge.fury.io/py/tensor2tensor.svg)](https://badge.fury.io/py/tensor2tensor)
[![GitHub
Issues](https://img.shields.io/github/issues/tensorflow/tensor2tensor.svg)](https://github.com/tensorflow/tensor2tensor/issues)
[![Contributions
welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)

Here we show how to create your own model in T2T.

## The T2TModel class

TODO: complete.
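
In the meantime, here is a minimal sketch of the registration pattern, assuming the `T2TModel` interface referenced in these docs (`model_fn_body` receives dense features produced by the input modality's `bottom` and returns a dense output for the target modality's `top`); treat the body as purely illustrative:

```python
from tensor2tensor.utils import registry
from tensor2tensor.utils import t2t_model

import tensorflow as tf


@registry.register_model
class MyToyModel(t2t_model.T2TModel):
  """A do-almost-nothing model: a single dense layer over the inputs."""

  def model_fn_body(self, features):
    inputs = features["inputs"]
    # Keep the hidden size so the target modality's `top` can consume it.
    return tf.layers.dense(
        inputs, self._hparams.hidden_size, activation=tf.nn.relu)
```
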
# T2T: Train on Your Own Data

[![PyPI
version](https://badge.fury.io/py/tensor2tensor.svg)](https://badge.fury.io/py/tensor2tensor)
[![GitHub
Issues](https://img.shields.io/github/issues/tensorflow/tensor2tensor.svg)](https://github.com/tensorflow/tensor2tensor/issues)
[![Contributions
welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)

Let's add a new dataset together and train the transformer model. We'll be learning to define English words by training the transformer to "translate" between English words and their definitions at the character level.

# About the Problem

For each problem we want to tackle we create a new problem class and register it. Let's call our problem `Word2def`.

Since many text2text problems share similar methods, there's already a class called `Text2TextProblem` that extends the base problem class, `Problem` (both found in `problem.py`).

For our problem, we can go ahead and create the file `word2def.py` in the `data_generators` folder and add our new problem, `Word2def`, which extends `Text2TextProblem`. Let's also register it while we're at it so we can specify the problem through flags.

```python
@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""
  pass  # We'll fill in the required methods below.
```

We need to implement the following methods from `Text2TextProblem` in our new class:
* is_character_level
* targeted_vocab_size
* generator
* input_space_id
* target_space_id
* num_shards
* vocab_name
* use_subword_tokenizer

Let's tackle them one by one:

**input_space_id, target_space_id, is_character_level, targeted_vocab_size, use_subword_tokenizer**:

SpaceIDs tell Tensor2Tensor what sort of space the input and target tensors are in. These are things like EN_CHR (English character), EN_TOK (English token), AUDIO_WAV (audio waveform), IMAGE, DNA (genetic bases). The complete list can be found at `data_generators/problem.py` in the class `SpaceID`.
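
For illustration, SpaceIDs are plain integer constants on that class, so a problem simply returns one of them:

```python
from tensor2tensor.data_generators import problem

# Two of the values mentioned above; each is just an int constant.
print(problem.SpaceID.EN_CHR)
print(problem.SpaceID.EN_TOK)
```
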
Since we're generating definitions and feeding in words at the character level, we set `is_character_level` to `True` and use the same SpaceID, EN_CHR, for both input and target. Additionally, since we aren't using tokens, we don't need to give a `targeted_vocab_size`, and we set `use_subword_tokenizer` to `False`.

**vocab_name**:

`vocab_name` will be used to name your vocabulary files. We can call ours `'vocab.word2def.en'`.

**num_shards**:

The number of shards to break data files into.

```python
@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False
```

**generator**:

We're almost done. `generator` generates the training and evaluation data and stores them in files like "word2def_train.lang1" in your DATA_DIR. Thankfully, several commonly used methods like `character_generator` and `token_generator` are already written in the file `wmt.py`. We will import `character_generator` and write:
```python
def generator(self, data_dir, tmp_dir, train):
  character_vocab = text_encoder.ByteTextEncoder()
  datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
  return character_generator(datasets[0], datasets[1], character_vocab, EOS)
```

Now our `word2def.py` file looks like this:

```python
@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  def generator(self, data_dir, tmp_dir, train):
    character_vocab = text_encoder.ByteTextEncoder()
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    return character_generator(datasets[0], datasets[1], character_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False
```

## Data:
Now we need to tell Tensor2Tensor where our data is located.

I've gone ahead and split all words into a train and test set and saved them in files called `words_train.txt`, `words_test.txt`,
`definitions_train.txt`, and `definitions_test.txt` in a directory called `LOCATION_OF_DATA/`. Let's tell T2T where these files are:

```python
# English Word2def datasets
_WORD2DEF_TRAIN_DATASETS = [
    LOCATION_OF_DATA + 'words_train.txt',
    LOCATION_OF_DATA + 'definitions_train.txt'
]
_WORD2DEF_TEST_DATASETS = [
    LOCATION_OF_DATA + 'words_test.txt',
    LOCATION_OF_DATA + 'definitions_test.txt'
]
```

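Note that `character_generator` consumes these as a pair of line-aligned files: line *k* of the words file is paired with line *k* of the definitions file (this pairing is implied by passing `datasets[0]` as the source path and `datasets[1]` as the target path).
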
## Putting it all together

With the correct imports, our complete `word2def.py` file now looks like this:
```python
"""Problem definition for word to dictionary definition."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators.wmt import character_generator

from tensor2tensor.utils import registry

# End-of-sequence id appended by the generator.
EOS = text_encoder.EOS_ID

# Replace this with the directory that holds your data files.
LOCATION_OF_DATA = '/path/to/your/data/'

# English Word2def datasets
_WORD2DEF_TRAIN_DATASETS = [
    LOCATION_OF_DATA + 'words_train.txt',
    LOCATION_OF_DATA + 'definitions_train.txt'
]

_WORD2DEF_TEST_DATASETS = [
    LOCATION_OF_DATA + 'words_test.txt',
    LOCATION_OF_DATA + 'definitions_test.txt'
]


@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  def generator(self, data_dir, tmp_dir, train):
    character_vocab = text_encoder.ByteTextEncoder()
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    return character_generator(datasets[0], datasets[1], character_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False
```

# Hyperparameters
All hyperparameters inherit from `_default_hparams()` in `problem.py`. If you would like to customize your hyperparameters, add another method to the file `problem_hparams.py`, as sketched below.
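
As a sketch of that pattern (the exact fields and the wiring into `problem_hparams.py` are assumptions for illustration, not the precise API):

```python
from tensor2tensor.data_generators import problem


def word2def_hparams(model_hparams):
  # model_hparams is available for model-dependent tweaks; unused here.
  del model_hparams
  # Start from the defaults mentioned above, then override fields.
  p = problem._default_hparams()
  p.loss_multiplier = 2.0  # e.g., weight this problem's loss more heavily
  return p
```
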
# Run the problem
Now that we've gotten our problem set up, let's train a model and generate definitions.

We specify our problem name, the model, and hparams:
```bash
PROBLEM=word2def
MODEL=transformer
HPARAMS=transformer_base_single_gpu
```

The rest of the steps are as given in the [walkthrough](walkthrough.md).

What if we wanted to train a model to generate words given definitions? In T2T, we can simply change the problem name to `PROBLEM=word2def_rev`.

All done. Let us know what definitions your model generated.