Unofficial PyTorch implementation of PnG BERT with some changes.
Dubbed "Phoneme and Grapheme and Word BERT", this model includes additional word-level embeddings on both the grapheme and phoneme sides of the model.
It also includes an additional text-to-emoji objective using a DeepMoji teacher model.
I no longer recommend using PnG BERT or this modified version because of the high compute costs.
Since each input is chars+phonemes instead of just wordpieces, the input length is around 6x longer than BERT.
Since dot-product attention scales with the square of the input length, attention is theoretically 36x more expensive in PnG BERT than in normal BERT.
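The 36x figure follows directly from the quadratic cost. Here's a rough sketch of the arithmetic, assuming a hypothetical 128-wordpiece BERT input and counting only the pairwise attention scores (heads, hidden size, and the feed-forward layers are ignored):

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key score entries for full self-attention."""
    return seq_len * seq_len

bert_len = 128          # hypothetical wordpiece input length (assumption)
png_len = 6 * bert_len  # chars + phonemes, around 6x longer

ratio = attention_pairs(png_len) / attention_pairs(bert_len)
print(ratio)  # 36.0
```

The ratio does not depend on the chosen base length, only on the 6x factor.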
Here's the modified architecture.
New stuff is:
- Word Values Embeddings
- Rel Word and Rel Token Position Embeddings
- Subword Position Embeddings
- Emoji Teacher Loss
The position embeddings are configurable in the config and I will likely disable some of them once I find the best configuration for training.
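As a rough illustration of what token-level, word-level and subword-level position indices could look like for a character sequence, here is a minimal sketch. The function name `position_ids` and the exact indexing scheme are assumptions made for illustration; they are not taken from this repo's code.

```python
def position_ids(words):
    """Return (char, token_pos, word_pos, subword_pos) for each character.

    token_pos   : index of the character in the whole sequence
    word_pos    : index of the word the character belongs to
    subword_pos : index of the character inside its word
    """
    rows = []
    token_pos = 0
    for word_pos, word in enumerate(words):
        for subword_pos, ch in enumerate(word):
            rows.append((ch, token_pos, word_pos, subword_pos))
            token_pos += 1
    return rows

rows = position_ids(["hi", "you"])
for row in rows:
    print(row)
```

In a setup like this, each index stream would typically feed its own embedding table, so individual position embeddings can be switched on or off independently.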
Update 19th Feb
I tested a 5%-trained PnGnW BERT checkpoint with a Tacotron2 decoder.
Alignment was achieved in 300k samples, about 80% faster than the original Tacotron2 text encoder [1].
I'll look into adding Flash Attention next since training is taking longer than I'd like.
Update 3rd March
- Added Flash Attention
- Trained Tacotron2, Prosody Prediction and Prosody-to-Mel models with PnGnW BERT
- Experimented with different position embeddings (learned vs. sinusoidal)
I found that, in downstream TTS tasks, fine-tuned PnGnW BERT is about on par with fine-tuning normal BERT + using DeepMoji + using G2p, while requiring much more VRAM and compute.
I can't recommend using this repo. The idea sounded really cool, but after experimenting, it seems like the only benefit of this method is simplifying the pipeline by using a single model instead of multiple smaller models. There is no noticeable improvement in quality (which makes me really sad) and it requires roughly 10x more compute.
It's still possible that this method will help a lot with accented speakers or other more challenging cases, but for normal English speakers it's just not worth it.