Releases: huggingface/transformers
v2.0.0 - TF 2.0/PyTorch interoperability, improved tokenizers, improved torchscript support
Name change: welcome 🤗 Transformers
Following the extension to TensorFlow 2.0, `pytorch-transformers` => `transformers`.
Install with `pip install transformers`.
Also, note that PyTorch is no longer in the requirements so don't forget to install TensorFlow 2.0 and/or PyTorch to be able to use (and load) the models.
TensorFlow 2.0 - PyTorch
All the PyTorch `nn.Module` classes now have their counterpart in TensorFlow 2.0 as `tf.keras.Model` classes. TensorFlow 2.0 classes have the same name as their PyTorch counterparts, prefixed with `TF`.
The interoperability between TensorFlow and PyTorch is actually a lot deeper than what is usually meant when talking about libraries with multiple backends:
- each model (not just the static computation graph) can be seamlessly moved from one framework to the other during the lifetime of the model for training/evaluation/usage (`from_pretrained` can load weights saved from models in one or the other framework),
- an example is given in the quick-tour on TF 2.0 and PyTorch in the readme, in which a model is trained using `keras.fit` before being opened in PyTorch for quick debugging/inspection (a sketch of this round trip is given below).
Remaining unsupported operations in TF 2.0 (to be added later):
- resizing input embeddings to add new tokens
- pruning model heads
TPU support
Training on TPU using the free TPUs provided in the TensorFlow Research Cloud (TFRC) program is possible but requires implementing a custom training loop (not possible with `keras.fit` at the moment).
We will add an example of such a custom training loop soon.
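Until that example lands, here is a generic sketch of such a loop using `tf.GradientTape`; the model class, optimizer settings and batch format are assumptions, not the official recipe:

```python
import tensorflow as tf
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(input_ids, attention_mask, labels):
    with tf.GradientTape() as tape:
        logits = model({"input_ids": input_ids, "attention_mask": attention_mask})[0]
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# for input_ids, attention_mask, labels in dataset:  # a tf.data.Dataset of batches
#     loss = train_step(input_ids, attention_mask, labels)
```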
Improved tokenizers
Tokenizers have been improved to provide an extended encoding method, `encode_plus`, as well as additional arguments to `encode`. Please refer to the doc for detailed usage of the new options.
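A hedged illustration of the extended encoding method (the exact method name and returned keys are assumed from the tokenizer API of this era):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode_plus returns a dict with the encoded ids and auxiliary tensors
encoded = tokenizer.encode_plus("First sequence", "Second sequence", add_special_tokens=True)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```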
Breaking changes
Positional order of some model keyword inputs changed (better TorchScript support)
To be able to better use TorchScript both on CPU and GPU (see #1010, #1204 and #1195), the specific order of some models' keyword inputs (`attention_mask`, `token_type_ids`, ...) has been changed.
If you used to call the models with keyword names for keyword arguments, e.g. `model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, this should not cause any breaking change.
If you used to call the models with positional inputs for keyword arguments, e.g. `model(input_ids, attention_mask, token_type_ids)`, you should double-check the exact order of input arguments (passing them by keyword, as sketched below, is the safest option).
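A minimal sketch of the safe calling style, passing everything except the input ids by keyword:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

input_ids = torch.tensor([tokenizer.encode("Hello world", add_special_tokens=True)])
attention_mask = torch.ones_like(input_ids)

# Keyword arguments do not depend on the positional order of the forward signature
outputs = model(input_ids, attention_mask=attention_mask)
```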
Dependency requirements have changed
PyTorch is no longer in the requirements so don't forget to install TensorFlow 2.0 and/or PyTorch to be able to use (and load) the models.
Renamed method
The method `add_special_tokens_sentence_pair` has been renamed to the more appropriate name `add_special_tokens_sequence_pair`.
The same holds true for the method `add_special_tokens_single_sentence`, which has been renamed to `add_special_tokens_single_sequence`.
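A short sketch using the new names (behavior is unchanged, only the method names differ; the id lists are built with `tokenize` + `convert_tokens_to_ids`):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids_0 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("first sequence"))
ids_1 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("second sequence"))

single = tokenizer.add_special_tokens_single_sequence(ids_0)
pair = tokenizer.add_special_tokens_sequence_pair(ids_0, ids_1)
```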
Community additions/bug-fixes/improvements
- new German model (@Timoeller)
- new script for MultipleChoice training (SWAG, RocStories...) (@erenup)
- better fp16 support (@ziliwang and @bryant1410)
- fix evaluation in run_lm_finetuning (@SKRohit)
- fix LM finetuning to prevent crashing on `assert len(tokens_b) >= 1` (@searchivarius)
- Various doc and docstring fixes (@sshleifer, @Maxpa1n, @mattolson93, @T080)
DistilBERT, GPT-2 Large, XLM multilingual models, torch.hub, bug fixes
New model architecture: DistilBERT
Huggingface's new transformer architecture, DistilBERT, is described in Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.
This new model architecture comes with two pretrained checkpoints (a loading sketch follows the list):
- `distilbert-base-uncased`: the base DistilBert model
- `distilbert-base-uncased-distilled-squad`: DistilBert model fine-tuned with distillation on SQuAD
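Loading follows the usual `from_pretrained` pattern; the class names below follow the library's naming convention and are assumed here:

```python
import torch
from pytorch_transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

input_ids = torch.tensor([tokenizer.encode("Hello DistilBERT", add_special_tokens=True)])
last_hidden_state = model(input_ids)[0]
```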
New GPT2 checkpoint: GPT-2 large (774M parameters)
The third OpenAI GPT-2 checkpoint is available in the library: 774M parameters, 36 layers, and 20 heads.
New XLM multilingual checkpoints: 17 & 100 languages
We have added two new XLM models in 17 and 100 languages which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.
Back on torch.hub with all the architectures
The Pytorch-Transformers `torch.hub` interface is based on Auto-Models, which are generic classes designed to be instantiated using `from_pretrained()` in a model architecture guessed from the pretrained checkpoint name (e.g. `AutoModel.from_pretrained('bert-base-uncased')` will instantiate a `BertModel` and load the 'bert-base-uncased' checkpoint in it). There are currently 4 classes of Auto-Models: `AutoModel`, `AutoModelWithLMHead`, `AutoModelForSequenceClassification` and `AutoModelForQuestionAnswering`.
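A hedged sketch of that interface; the 'tokenizer' and 'model' entry-point names are assumptions based on the hubconf of this release:

```python
import torch

# Downloads the hubconf from the GitHub repo and dispatches to the Auto classes
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')
```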
New dependency: sacremoses
Support for XLM is improved by carefully reproducing the original tokenization workflow (work by @shijie-wu in #1092). We now rely on `sacremoses`, a Python port of the Moses tokenizer, truecaser and normalizer by @alvations, for XLM word tokenization.
In a few languages (Thai, Japanese and Chinese) XLM tokenizer will require additional dependencies. These additional dependencies are optional at the library level. Using XLM tokenizer in these languages without the additional dependency will raise an error message with installation instructions. The additional optional dependencies are:
- pythainlp: Thai tokenizer
- kytea: Japanese tokenizer, wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
- jieba: Chinese tokenizer *
* XLM used the Stanford Segmenter. However, the wrapper (`nltk.tokenize.stanford_segmenter`) is slow due to JVM overhead and will be deprecated. Jieba is a lot faster and pip-installable, but there is some mismatch with the Stanford Segmenter. A workaround could be an argument allowing users to segment the sentence themselves and bypass the segmenter. As a reference, `nltk.tokenize.stanford_segmenter` is also included in the PR.
Bug fixes and improvements to the library modules
- Bertology script has seen major improvements (@tuvuumass )
- Iterative tokenization now faster and accept arbitrary numbers of added tokens (@samvelyan)
- Added RoBERTa to AutoModels and AutoTokenizers (@LysandreJik )
- Added GPT-2 Large 774M model (@thomwolf )
- Added language model fine-tuning with GPT/GPT-2 (CLM), BERT/RoBERTa (MLM) (@LysandreJik @thomwolf )
- Multi-GPU training has been patched (@FeiWang96 )
- Scripts are updated to reflect Pytorch 1.1.0 changes (scheduler, optimizer) (@Morizeyao, @adai183 )
- Updated the in-depth BERT fine-tuning scripts to `pytorch-transformers` (@Morizeyao)
- Models saved with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (@LysandreJik @thomwolf)
- Add `proxies` and `force_download` options to the `from_pretrained()` method to be able to use proxies and update cached models/tokenizers (@thomwolf)
- Add a shortcut to each special token with `_id` properties (e.g. `tokenizer.cls_token_id` for the id in the vocabulary of `tokenizer.cls_token`) (@thomwolf)
- Fix GPT-2 and RoBERTa tokenizers so that sentences to be tokenized always begin with at least one space (see note by fairseq authors) (@thomwolf)
- Fix and clean up byte-level BPE tests (@thomwolf)
- Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against common tests. (@LysandreJik )
- Fix a warning raised when the decode method is called for a model with no `sep_token`, like GPT-2 (@LysandreJik)
- Updated the tokenizers saving method (@boy2000-007man)
- SpaCy tokenizers have been updated in the tokenizers (@GuillemGSubies )
- Stable `EnvironmentError`s have been added to utility files (@abhishekraok)
- Fixed distributed barrier hang (@VictorSanh)
- Encoding functions now return the input tokens instead of throwing an error when not implemented in child class (@LysandreJik )
- Change layer norm code to PyTorch's native layer norm (@dhpollack)
- Improved tokenization for XLM for multilingual inputs (@shijie-wu)
- Add language input and access to language to id conversion in XLM tokenizer (@thomwolf)
- Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (@thomwolf)
- Added new AutoModels: `AutoModelWithLMHead`, `AutoModelForSequenceClassification`, `AutoModelForQuestionAnswering` (@LysandreJik)
- Torch.hub is now based on AutoModels (@LysandreJik @thomwolf)
- Fix Transformer-XL attention mask dtype to be bool (@CrafterKolyan)
- Adding DistilBert model architecture and checkpoints (@VictorSanh @LysandreJik @thomwolf)
- Fixes to DistilBert configuration and training script (@stefan-it)
- Fix XLNet attention mask for fp16 (@ziliwang)
- Documentation auto-deploy (@LysandreJik)
- Fix to add a tuple of tokens (@epwalsh)
- Update fp16 apex implementation in scripts (@anhnt170489)
- Fix XLNet bias resizing when adding/removing tokens (@LysandreJik)
- Fix tokenizer reloading in example scripts (@rabeehk)
- Fix byte-level decoding error when using added tokens (@thomwolf @LysandreJik)
- Fix epsilon value in RoBERTa pretrained checkpoints (@julien-c)
New model: RoBERTa, tokenizer sequence pair handling for sequence classification models.
New model: RoBERTa
RoBERTa (from Facebook), a Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
Thanks to Myle Ott from Facebook for his help.
Tokenizer sequence pair handling
Tokenizers get two new methods: `tokenizer.add_special_tokens_single_sentence(token_ids)` and `tokenizer.add_special_tokens_sentences_pair(token_ids_0, token_ids_1)`.
These methods add the model-specific special tokens to sequences. The sentence-pair method creates a list of tokens with the `cls` and `sep` tokens according to the way the model was trained.
Sequence pair examples:
For BERT: `[CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]`
For RoBERTa: `<s> SEQUENCE_0 </s></s> SEQUENCE_1 </s>`
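A minimal usage sketch with a BERT tokenizer (the id lists are built with `tokenize` + `convert_tokens_to_ids`):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids_0 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("First sequence"))
ids_1 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Second sequence"))

# [CLS] SEQUENCE_0 [SEP] for BERT
single = tokenizer.add_special_tokens_single_sentence(ids_0)

# [CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP] for BERT
pair = tokenizer.add_special_tokens_sentences_pair(ids_0, ids_1)
```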
Tokenizer encoding function
The tokenizer encode function gets two new arguments:
`tokenizer.encode(text, text_pair=None, add_special_tokens=False)`
If `text_pair` is specified, `encode` will return a tuple of encoded sequences. If `add_special_tokens` is set to `True`, the sequences will be built with the model's respective special tokens using the previously described methods.
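A minimal sketch, assuming the signature shown above:

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sequence, no special tokens
ids = tokenizer.encode("First sequence")

# Sequence pair built with the model-specific special tokens
pair_ids = tokenizer.encode("First sequence", text_pair="Second sequence", add_special_tokens=True)
```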
AutoConfig, AutoModel and AutoTokenizer
There are three new classes with this release that instantiate one of the base model classes of the library from a pre-trained model configuration: `AutoConfig`, `AutoModel`, and `AutoTokenizer`.
Those classes take as input a pre-trained model name or path and instantiate one of the corresponding classes. The input string indicates to the class which architecture should be instantiated. If the string contains "bert", `AutoConfig` instantiates a `BertConfig`, `AutoModel` instantiates a `BertModel` and `AutoTokenizer` instantiates a `BertTokenizer`.
The same can be done for all the library's base models. The Auto classes check for the associated strings: "openai-gpt", "gpt2", "transfo-xl", "xlnet", "xlm" and "roberta". The documentation associated with this change can be found here.
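For example, a checkpoint name containing "bert" dispatches to the BERT classes:

```python
from pytorch_transformers import AutoConfig, AutoModel, AutoTokenizer

config = AutoConfig.from_pretrained("bert-base-uncased")        # -> BertConfig
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # -> BertTokenizer
model = AutoModel.from_pretrained("bert-base-uncased")          # -> BertModel
```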
Examples
Some examples have been refactored to better reflect the current library. Those are: `simple_lm_finetuning.py`, `finetune_on_pregenerated.py`, as well as `run_glue.py`, which has been adapted to the RoBERTa model. The examples `run_squad` and `run_glue.py` have better dataset processing with caching.
Bug fixes and improvements to the library modules
- Fixed multi-gpu training when using FP16 (@zijunsun)
- Re-added the possibility to import BertPretrainedModel (@thomwolf)
- Improvements to tensorflow -> pytorch checkpoints (@dhpollack)
- Fixed save_pretrained to save the correct added tokens (@joelgrus)
- Fixed version issues in run_openai_gpt (@rabeehk)
- Fixed issue with line return with Chinese BERT (@Yiqing-Zhou)
- Added more flexibility regarding the `PretrainedModel.from_pretrained` method (@xanlsh)
- Fixed issues regarding backward compatibility to Pytorch 1.0.0 (@thomwolf)
- Added the unknown token to GPT-2 (@thomwolf)
v1.0.0 - Name change, new models (XLNet, XLM), unified API for models and tokenizer, access to models internals, torchscript
Name change: welcome PyTorch-Transformers 👾
`pytorch-pretrained-bert` => `pytorch-transformers`
Install with `pip install pytorch-transformers`
New models
- XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
- XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
New pretrained weights
We went from ten (in `pytorch-pretrained-bert` 0.6.2) to twenty-seven (in `pytorch-transformers` 1.0) pretrained model weights.
The newly added model weights are, in summary:
- Two `Whole-Word-Masking` weights for Bert (cased and uncased)
- Three fine-tuned models for Bert (on SQuAD and MRPC)
- One German model for Bert provided and trained by Deepset.ai (@tholor and @Timoeller) as detailed in their nice blogpost
- One OpenAI GPT-2 model (medium size model)
- Two models (base and large) for the newly added XLNet model
- Eight models for the newly added XLM model
The documentation lists all the models with the shortcut names and we are currently adding full details of the associated pretraining/fine-tuning parameters.
New documentation
New documentation is currently being created at https://huggingface.co/pytorch-transformers/ and should be finalized over the coming days.
Standard API across models
See the readme for a quick tour of the API.
Main points:
- All models now return `tuples` with various elements depending on the model and the configuration. The docstrings and documentation list all the expected outputs in order.
- All models can now return the full list of hidden-states (embeddings output + the output hidden-states of each layer)
- All models can now return the full list of attention weights (one tensor of attention weights for each layer)
```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)
input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
all_hidden_states, all_attentions = model(input_ids)[-2:]
```
Standard API to add tokens to the vocabulary and the model
Using `tokenizer.add_tokens()` and `tokenizer.add_special_tokens()`, one can now easily add tokens to each model's vocabulary. The model's input embeddings can be resized accordingly to add associated word embeddings (to be trained) using `model.resize_token_embeddings(len(tokenizer))`:
```python
# tokenizer and model instantiated as in the snippet above
tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
model.resize_token_embeddings(len(tokenizer))
```
Serialization
The serialization methods have been standardized, and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
```python
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
```
Torchscript
All models are now compatible with TorchScript.
```python
# model_class, pretrained_weights and input_ids are placeholders for any of the
# library's model classes / checkpoints and a tokenized input tensor
model = model_class.from_pretrained(pretrained_weights, torchscript=True)
traced_model = torch.jit.trace(model, (input_ids,))
```
Examples scripts
The example scripts have been refactored and gathered in three main examples (`run_glue.py`, `run_squad.py` and `run_generation.py`) which are common to several models and are designed to offer SOTA performance on the respective tasks while being clean starting points to design your own scripts.
Other example scripts (like `run_bertology.py`) will be added in the coming weeks.
Breaking-changes
The migration section of the readme lists the breaking changes when switching from `pytorch-pretrained-bert` to `pytorch-transformers`.
The main breaking change is that all models now return a `tuple` of results.
Better model/tokenizer serialization, relax network connection requirements, new scripts and bug fixes
General updates:
- Better serialization for all models and tokenizers (BERT, GPT, GPT-2 and Transformer-XL) with best practices for saving/loading in readme and examples.
- Relaxing network connection requirements (fallback on the last downloaded model in the cache when we can't reach AWS to check eTag)
Breaking changes:
The `warmup_linear` method in `OpenAIAdam` and `BertAdam` is now replaced by flexible schedule classes for linear, cosine and multi-cycle schedules.
Bug fixes and improvements to the library modules:
- add a flag in BertTokenizer to skip basic tokenization (@john-hewitt)
- Allow tokenization of sequences > 512 (@CatalinVoss)
- clean up and extend learning rate schedules in BertAdam and OpenAIAdam (@lukovnikov)
- Update GPT/GPT-2 Loss computation (@CatalinVoss, @thomwolf)
- Make the TensorFlow conversion tool more robust (@marpaia)
- fixed BertForMultipleChoice model init and forward pass (@dhpollack)
- Fix gradient overflow in GPT-2 FP16 training (@SudoSharma)
- catch exception if pathlib not installed (@potatochip)
- Use Dropout Layer in OpenAIGPTMultipleChoiceHead (@pglock)
New scripts and improvements to the examples scripts:
- Add BERT language model fine-tuning scripts (@Rocketknight1)
- Added SST-2 task and remaining GLUE tasks to 'run_classifier.py' (@ananyahjha93, @jplehmann)
- GPT-2 generation fixes (@CatalinVoss, @spolu, @dhanajitb, @8enmann, @SudoSharma, @cynthia)
v0.6.1 - Small install tweak release
Add `regex` to the requirements for the OpenAI GPT-2 tokenizer.
v0.6.0 - Adding OpenAI small GPT-2 pretrained model
Add OpenAI small GPT-2 pretrained model
Bug fix update to load the pretrained `TransfoXLModel` from s3, added fallback for OpenAIGPTTokenizer when SpaCy is not installed
Mostly a bug fix update for loading the `TransfoXLModel` from s3:
- Fixes a bug in the loading of the pretrained `TransfoXLModel` from the s3 dump (which is a converted `TransfoXLLMHeadModel`) in which the weights were not loaded.
- Added a fallback of `OpenAIGPTTokenizer` on BERT's `BasicTokenizer` when SpaCy and ftfy are not installed. Using BERT's `BasicTokenizer` instead of SpaCy should be fine in most cases as long as you have a relatively clean input (SpaCy + ftfy were included to exactly reproduce the paper's pre-processing steps on the Toronto Book Corpus), and this also lets us use the `never_split` option to avoid splitting special tokens like `[CLS], [SEP]...`, which is easier than adding the tokens after tokenization.
- Updated the README on the tokenizer options and methods, which was lagging behind a bit.
Adding OpenAI GPT and Transformer-XL pretrained models, python2 support, pre-training script for BERT, SQuAD 2.0 example
New pretrained models:
- Open AI GPT pretrained on the Toronto Book Corpus ("Improving Language Understanding by Generative Pre-Training" by Alec Radford et al.).
- This is a slightly modified version of our previous PyTorch implementation to improve performance by splitting words and position embeddings into separate embedding matrices.
- Performance checked to be on par with the TF implementation on ROCStories: single-run evaluation accuracy of 86.4% vs. the authors reporting a median accuracy of 85.8% with the TensorFlow code (see details in the example section of the readme).
- Transformer-XL pretrained on WikiText 103 ("Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context" by Zihang Dai, Zhilin Yang et al.). This is a slightly modified version of Google/CMU's PyTorch implementation to match the performance of the TensorFlow version by:
- untying relative positioning embeddings across layers,
- changing memory cells initialization to keep sinusoidal positions identical,
- adding full logits outputs in the adaptive softmax to use it in a generative setting.
- Performance checked to be on par with the TF implementation on WikiText 103: evaluation perplexity of 18.213 vs. the authors reporting a perplexity of 18.3 on this dataset with the TensorFlow code (see details in the example section of the readme).
New scripts:
- Updated the SQuAD fine-tuning script to work also on SQuAD V2.0 by @abeljim and @Liangtaiwan
- `run_lm_finetuning.py` lets you pretrain a `BERT` language model or fine-tune it with masked-language-modeling and next-sentence-prediction losses, by @deepset-ai, @tholor and @nhatchan (Python 3.5 compatibility)
Backward compatibility:
- The library is now also compatible with Python 2
Improvements and bug fixes:
- add a `never_split` option and arguments to the tokenizers (@WrRan)
- better handle errors when BERT is fed inputs that are too long (@patrick-s-h-lewis)
- better layer normalization layer initialization and bug fix in example scripts: `args.do_lower_case` is always True (@donglixp)
- fix learning rate schedule issue in example scripts (@matej-svejda)
- readme fixes (@danyaljj, @nhatchan, @davidefiocco, @girishponkiya )
- importing unofficial TF models in BERT (@nhatchan)
- only keep the active part of the loss for token classification (@Iwontbecreative)
- fix argparse type error in example scripts (@ksurya)
- docstring fixes (@rodgzilla, @wlhgtc )
- improving `run_classifier.py` loading of saved models (@SinghJasdeep)
- In example scripts: allow do_eval to be used without do_train and to use the pretrained model in the output folder (@jaderabbit, @likejazz and @JoeDumoulin)
- in `run_squad.py`: fix error when `bert_model` param is a path or url (@likejazz)
- add license to source distribution and use entry-points instead of scripts (@sodre)
4x speed-up using NVIDIA apex, new multi-choice classifier and example for SWAG-like dataset, pytorch v1.0, improved model loading, improved examples...
New:
- 3-4 times speed-ups in fp16 (versus fp32) thanks to NVIDIA's work on apex (by @FDecaYed)
- new sequence-level multiple-choice classification model + example fine-tuning on SWAG (by @rodgzilla)
- improved backward compatibility to python 3.5 (by @hzhwcmhf)
- bump up to PyTorch 1.0
- load fine-tuned models with `from_pretrained`
- add examples on how to save and load fine-tuned models (a sketch is given below)
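A hedged sketch of that save/load pattern; the `state_dict` and `num_labels` keyword arguments of `from_pretrained` follow the example scripts of this era and are assumptions here:

```python
import torch
from pytorch_pretrained_bert import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# ... fine-tune the model ...

# Save the fine-tuned weights
torch.save(model.state_dict(), "finetuned_bert.bin")

# Reload them into a fresh model through from_pretrained's state_dict argument
state_dict = torch.load("finetuned_bert.bin")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      state_dict=state_dict,
                                                      num_labels=2)
```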