I am doing conformer-transducer with multilingual ASR. Why does val loss produce NaN? #11311

SEOLJINYOUNG · 2024-11-18T01:20:22Z

Hello. I am training a multilingual ASR model for English and Korean, following the Multilingual ASR tutorial.

The tutorial uses the pre-trained stt_enes_contextnet_large model as its base.
In my case I use stt_en_conformer_transducer_small.

My problem is that the model seems to be learning, but the validation loss comes back as NaN.
The predictions in the training and validation stages look like this.

[train stage]

[valid stage]

I would be grateful if you could give me some advice regarding this.

This is the overall code I ran.
[code]

import os

import nemo.collections.asr as nemo_asr
from nemo.utils import exp_manager

from omegaconf import OmegaConf
import torch
import torch.nn as nn
import pytorch_lightning as ptl

def enable_bn_se(m):
    # Keep BatchNorm layers trainable inside the otherwise frozen encoder.
    if isinstance(m, nn.BatchNorm1d):
        m.train()
        for param in m.parameters():
            param.requires_grad_(True)

ENGLISH_TOKENIZER_DIR = 'tokenizers/en'
KOREAN_TOKENIZER_DIR = 'tokenizers/ko'

new_tokenizer_cfg = OmegaConf.create({'type': 'agg', 'langs': {}})
english_tokenizer_cfg = OmegaConf.create({'dir': ENGLISH_TOKENIZER_DIR + '/tokenizer_spe_bpe_v128', 'type': 'bpe'})
korean_tokenizer_cfg = OmegaConf.create({'dir': KOREAN_TOKENIZER_DIR + '/tokenizer_spe_bpe_v2110', 'type': 'bpe'})
new_tokenizer_cfg.langs['en'] = english_tokenizer_cfg
new_tokenizer_cfg.langs['ko'] = korean_tokenizer_cfg

# Load the pre-trained model first; change_vocabulary() needs an existing model.
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="stt_en_conformer_transducer_small")

asr_model.change_vocabulary(
    new_tokenizer_dir=new_tokenizer_cfg,
    new_tokenizer_type="agg",
)


GRAD_ACCUM=1
MAX_EPOCHS=5
GPUS=[0]
LOG_EVERY_N_STEPS=10

trainer = ptl.Trainer(devices=GPUS, 
                      accelerator="gpu",
                      max_epochs=MAX_EPOCHS, 
                      accumulate_grad_batches=GRAD_ACCUM,
                      precision=16,
                      enable_checkpointing=False,
                      logger=False,
                      log_every_n_steps=LOG_EVERY_N_STEPS,
                      enable_progress_bar=True,
                      check_val_every_n_epoch=1)

asr_model.set_trainer(trainer)

train_manifest_en = 'train_final_manifest_en2.json'
train_manifest_ko = 'train_final_manifest_ko2.json'

train_ds = {}
train_ds['manifest_filepath'] = [train_manifest_en,train_manifest_ko]
train_ds['sample_rate'] = 16000
train_ds['batch_size'] = 16
train_ds['fused_batch_size'] = 16
train_ds['shuffle'] = True
train_ds['max_duration'] = 16.7
train_ds['pin_memory'] = True
train_ds['is_tarred'] = False
train_ds['num_workers'] = 4

asr_model.setup_training_data(train_data_config=train_ds)  

valid_manifest_en = 'valid_final_manifest_en2.json'
valid_manifest_ko = 'valid_final_manifest_ko2.json'

validation_ds = {}
validation_ds['sample_rate'] = 16000
validation_ds['manifest_filepath'] = [valid_manifest_en,valid_manifest_ko]
validation_ds['batch_size'] = 32
validation_ds['shuffle'] = False
validation_ds['num_workers'] = 4

asr_model.setup_multiple_validation_data(val_data_config=validation_ds)

optimizer_conf = {}

optimizer_conf['name'] = 'adamw'
optimizer_conf['lr'] = 0.01
optimizer_conf['betas'] =  [0.9, 0.98]
optimizer_conf['weight_decay'] = 0

sched = {}
sched['name'] = 'CosineAnnealing'
sched['warmup_steps'] = None
sched['warmup_ratio'] = 0.10
sched['min_lr'] = 1e-6
optimizer_conf['sched'] = sched

asr_model.setup_optimization(optimizer_conf)

asr_model.encoder.freeze()
asr_model.encoder.apply(enable_bn_se)

asr_model.wer.use_cer = True
asr_model.wer.log_predictions = True
asr_model.compute_eval_loss = True

config = exp_manager.ExpManagerConfig(
    exp_dir='experiments/multi/',
    name="ASR-Model-multi",
    checkpoint_callback_params=exp_manager.CallbackParams(
        monitor="val_wer",
        mode="min",
        always_save_nemo=True,
        save_best_model=True,
    ),
    create_tensorboard_logger=True,
)

config = OmegaConf.structured(config)

logdir = exp_manager.exp_manager(trainer, config)

trainer.fit(asr_model)

The dataset has the following sizes:

I stopped training partway through.
[train_loss]

[val_loss]

jeremy110 · 2024-11-18T01:55:07Z

I'm not sure if this is the issue, but in my case, changing the precision to 32 or bf16 allows the loss curve to converge properly. Also, your learning rate seems a bit high.
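
A minimal sketch of how that suggestion might be applied to the script in the question. The precision values (`"bf16"` or `32`) come from the reply above. The helper name `stable_settings` and the lower learning rate of `1e-3` are illustrative assumptions, not tested recommendations, and the exact spelling of the precision flag can differ between PyTorch Lightning versions. The sketch only builds plain dictionaries, so it runs without a GPU or NeMo installed.

```python
def stable_settings(use_bf16: bool = True, lr: float = 1e-3):
    """Return (trainer_kwargs, optimizer_conf) with fp16 swapped out.

    `stable_settings` is a hypothetical helper; the default lr of 1e-3 is
    an assumed starting point, chosen only because it is lower than 0.01.
    """
    trainer_kwargs = {
        "devices": [0],
        "accelerator": "gpu",
        "max_epochs": 5,
        "accumulate_grad_batches": 1,
        # bf16 keeps the float32 exponent range; 32 turns mixed precision off.
        "precision": "bf16" if use_bf16 else 32,
    }
    optimizer_conf = {
        "name": "adamw",
        "lr": lr,  # the question used 0.01
        "betas": [0.9, 0.98],
        "weight_decay": 0,
        "sched": {
            "name": "CosineAnnealing",
            "warmup_steps": None,
            "warmup_ratio": 0.10,
            "min_lr": 1e-6,
        },
    }
    return trainer_kwargs, optimizer_conf


trainer_kwargs, optimizer_conf = stable_settings(use_bf16=True)

# In the script from the question these would replace the existing values:
#   trainer = ptl.Trainer(**trainer_kwargs, enable_checkpointing=False, logger=False)
#   asr_model.setup_optimization(optimizer_conf)
```

If bf16 is not supported on the GPU in use, calling `stable_settings(use_bf16=False)` falls back to full 32-bit precision, at the cost of more memory and slower steps.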
