Unofficial PyTorch implementation of PnG BERT with some changes.
Dubbed "Phoneme and Grapheme and Word BERT", this model adds word-level embeddings on both the grapheme and phoneme sides of the model.
It also includes an additional text-to-emoji objective using a DeepMoji teacher model.
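To make that extra objective concrete, here is a minimal sketch of a text-to-emoji distillation loss, assuming the model predicts a distribution over DeepMoji's 64 emoji classes from a sentence-level representation; the function name, arguments, and temperature are illustrative assumptions, not this repo's exact implementation.

```python
import torch.nn.functional as F

def emoji_teacher_loss(student_logits, teacher_probs, temperature=1.0):
    """Hypothetical sketch: KL divergence between the model's emoji
    prediction and DeepMoji's soft labels for the same sentence.
    student_logits: (batch, 64) logits from a small head on the sentence
    representation; teacher_probs: (batch, 64) DeepMoji probabilities."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * temperature ** 2
```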
I no longer recommend using PnG BERT or this modified version because of the high compute costs.
Since each input is characters + phonemes instead of just wordpieces, the input sequence is around 6x longer than BERT's.
Because dot-product attention cost scales with the square of the input length, attention is theoretically 36x more expensive in PnG BERT than in normal BERT.
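As a back-of-the-envelope check (the sequence length below is just an illustrative number):

```python
# Illustrative arithmetic only: self-attention cost grows with the square of
# the sequence length, so a ~6x longer input means ~36x more attention work.
bert_len = 128               # hypothetical wordpiece sequence length
pngbert_len = 6 * bert_len   # graphemes + phonemes for the same sentence
print((pngbert_len / bert_len) ** 2)  # 36.0
```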
Here's the modified architecture.
New stuff:
- Word Values Embeddings
- Rel Word and Rel Token Position Embeddings
- Subword Position Embeddings
- Emoji Teacher Loss
Each of the position embeddings can be toggled in the config, and I will likely disable some of them once I find the best configuration for training.
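For illustration, here is a minimal sketch of how these embeddings could be summed into the transformer input, with each position embedding behind a toggle; the class name, argument names, and sizes are assumptions, not the exact layout used in this repo.

```python
import torch.nn as nn

class PnGnWEmbeddings(nn.Module):
    """Illustrative sketch, not the repo's exact implementation."""
    def __init__(self, token_vocab, word_vocab, d_model,
                 use_rel_word_pos=True, use_rel_token_pos=True, use_subword_pos=True,
                 max_len=2048, max_subword_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(token_vocab, d_model)
        # word-level "value" embedding: the same word id is seen by every
        # grapheme and phoneme token belonging to that word
        self.word_value_emb = nn.Embedding(word_vocab, d_model)
        self.rel_word_pos = nn.Embedding(max_len, d_model) if use_rel_word_pos else None
        self.rel_token_pos = nn.Embedding(max_len, d_model) if use_rel_token_pos else None
        self.subword_pos = nn.Embedding(max_subword_len, d_model) if use_subword_pos else None

    def forward(self, token_ids, word_ids, word_pos, token_pos, subword_pos):
        x = self.token_emb(token_ids) + self.word_value_emb(word_ids)
        if self.rel_word_pos is not None:
            x = x + self.rel_word_pos(word_pos)      # index of the word in the sentence
        if self.rel_token_pos is not None:
            x = x + self.rel_token_pos(token_pos)    # index of the token in its segment
        if self.subword_pos is not None:
            x = x + self.subword_pos(subword_pos)    # index of the token within its word
        return x
```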
Update 19th Feb
I tested a 5%-trained PnGnW BERT checkpoint with a Tacotron2 decoder.
Alignment was achieved within 300k samples, about 80% faster than with the original Tacotron2 text encoder [1].
I'll look into adding Flash Attention next since training is taking longer than I'd like.
[1] Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis
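For reference, PyTorch 2.x ships a fused scaled_dot_product_attention that can dispatch to a FlashAttention kernel on supported GPUs; the sketch below just demonstrates that API with made-up shapes and is not necessarily how it gets wired into this repo.

```python
import torch
import torch.nn.functional as F

# Requires a CUDA GPU and half precision for the fused (FlashAttention) path.
# Shapes are illustrative: (batch, heads, seq_len, head_dim).
q = torch.randn(8, 12, 1536, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: avoids materialising the full (seq_len x seq_len) matrix.
out = F.scaled_dot_product_attention(q, k, v)
```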
Update 3rd March
I've:
- Added Flash Attention
- Trained Tacotron2, Prosody Prediction, and Prosody-to-Mel models with PnGnW BERT
- Experimented with different position embeddings (learned vs sinusoidal)
I found that, in downstream TTS tasks, fine-tuned PnGnW BERT is about on par with fine-tuning normal BERT plus DeepMoji plus a G2P module, while requiring much more VRAM and compute.
I can't recommend using this repo. The idea sounded really cool, but after experimenting it seems like the only benefit of this method is simplifying the pipeline by using a single model instead of multiple smaller models. There is no noticeable improvement in quality (which makes me really sad) and it requires roughly 10x more compute.
It's still possible that this method will help a lot with accented speakers or other more challenging cases, but for normal English speakers it's just not worth it.