Runtime cuDNN error when training custom non-Latin-character model #1345

S0mbre · 2024-12-08T21:35:28Z

When training a custom model with the provided training script in a Google Colab environment, I constantly get the following cuDNN error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-9-2d2174729bbd>](https://localhost:8080/#) in <cell line: 13>()
     11 
     12 # force_cudnn_initialization()
---> 13 train(opt, amp=False)

5 frames
[/content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/train.py](https://localhost:8080/#) in train(opt, show_number, amp)
    233                 with torch.no_grad():
    234                     valid_loss, current_accuracy, current_norm_ED, preds, confidence_score, labels,\
--> 235                     infer_time, length_of_data = validation(model, criterion, valid_loader, converter, opt, device)
    236                 model.train()
    237 

[/content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/test.py](https://localhost:8080/#) in validation(model, criterion, evaluation_loader, converter, opt, device)
     41             preds_size = torch.IntTensor([preds.size(1)] * batch_size)
     42             # permute 'preds' to use CTCloss format
---> 43             cost = criterion(preds.log_softmax(2).permute(1, 0, 2), text_for_loss, preds_size, length_for_loss)
     44 
     45             if opt.decode == 'greedy':

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1734             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735         else:
-> 1736             return self._call_impl(*args, **kwargs)
   1737 
   1738     # torchrec tests the code consistency with the following code

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1745                 or _global_backward_pre_hooks or _global_backward_hooks
   1746                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747             return forward_call(*args, **kwargs)
   1748 
   1749         result = None

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/loss.py](https://localhost:8080/#) in forward(self, log_probs, targets, input_lengths, target_lengths)
   1978         target_lengths: Tensor,
   1979     ) -> Tensor:
-> 1980         return F.ctc_loss(
   1981             log_probs,
   1982             targets,

[/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py](https://localhost:8080/#) in ctc_loss(log_probs, targets, input_lengths, target_lengths, blank, reduction, zero_infinity)
   3067             zero_infinity=zero_infinity,
   3068         )
-> 3069     return torch.ctc_loss(
   3070         log_probs,
   3071         targets,

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
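
One thing that may be worth ruling out before suspecting the cuDNN install is invalid input to the CTC loss itself. The sketch below is a hypothetical pure-Python check (the name `check_ctc_batch` and its arguments are assumptions for illustration, not part of EasyOCR or PyTorch). It tests two general CTC preconditions: every target index lies in 1..num_class-1 (assuming index 0 is the blank, as with the default `blank=0`), and each target fits into the available time steps once repeated neighbouring labels are accounted for.

```python
def check_ctc_batch(targets, input_length, num_class):
    """Return a list of (sample_index, problem) for targets CTC cannot handle.

    targets:      list of lists of int label indices (assumed already encoded)
    input_length: number of time steps the model emits per sample
    num_class:    number of output classes, including the blank at index 0
    """
    problems = []
    for i, target in enumerate(targets):
        # Index 0 is reserved for the blank; anything >= num_class does not exist.
        bad = [t for t in target if not 1 <= t < num_class]
        if bad:
            problems.append((i, f"indices out of range: {bad}"))
        # CTC needs one extra blank step between identical neighbouring labels.
        repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
        needed = len(target) + repeats
        if needed > input_length:
            problems.append((i, f"needs {needed} steps, only {input_length} available"))
    return problems
```

An empty result does not prove the batch is fine for cuDNN; it only rules out these two causes.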

Software versions

Python 3.10
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvcc-cu12==12.6.85
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.6.0.74
nvidia-cufft-cu12==11.3.0.4
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-nccl-cu12==2.23.4
nvidia-nvjitlink-cu12==12.6.85
torch @ https://download.pytorch.org/whl/cu121_full/torch-2.5.1%2Bcu121-cp310-cp310-linux_x86_64.whl
torchvision @ https://download.pytorch.org/whl/cu121/torchvision-0.20.1%2Bcu121-cp310-cp310-linux_x86_64.whl

Model train config

number: 0123456789
symbol: .,: 
lang_char: АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя
experiment_name: ru_filtered
train_data: /content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/all_data
valid_data: /content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/all_data/ru_val
manualSeed: 1111
workers: 1
batch_size: 32
num_iter: 30000
valInterval: 200
saved_model: /content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/saved_models/ru_filtered/cyrillic_g2.pth
FT: False
optim: False
lr: 1.0
beta1: 0.9
rho: 0.95
eps: 1e-08
grad_clip: 5
select_data: ['train', 'val']
batch_ratio: ['0.8', '0.2']
total_data_usage_ratio: 1.0
batch_max_length: 34
imgH: 64
imgW: 600
rgb: False
contrast_adjust: 0.0
sensitive: True
PAD: True
data_filtering_off: False
Transformation: None
FeatureExtraction: VGG
SequenceModeling: BiLSTM
Prediction: CTC
num_fiducial: 20
input_channel: 1
output_channel: 256
hidden_size: 256
decode: greedy
new_prediction: True    # !!! HAD TO SET TO TRUE BECAUSE OF SIZE MISMATCH ERROR
freeze_FeatureFxtraction: False
freeze_SequenceModeling: False
character: 0123456789.,: АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя
num_class: 80
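
As a sanity check on the config above, the following minimal sketch rebuilds the character set from `number`, `symbol` and `lang_char` and looks for label characters that fall outside it. It assumes `num_class` is the character count plus one CTC blank, which is consistent with the value 80 shown above; the helper `unsupported_chars` and the sample label are hypothetical.

```python
number = "0123456789"
symbol = ".,: "  # note the trailing space, as in the config
lang_char = "АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя"

character = number + symbol + lang_char
num_class = len(character) + 1  # assumed: +1 for the CTC blank token


def unsupported_chars(label, charset=character):
    """Return the sorted characters of `label` that are not in `charset`."""
    return sorted(set(label) - set(charset))


print(num_class)
print(unsupported_chars("Ёлка 2024"))  # hypothetical label
```

Note that `lang_char` contains lowercase "ё" but not uppercase "Ё", so a label with an uppercase "Ё" would be reported here.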

Please also note the comment in the config above: when training from the pretrained model ("cyrillic_g2.pth") with "new_prediction" set to False, I get the following error:

size mismatch for module.Prediction.weight: copying a param with shape torch.Size([208, 256]) from checkpoint, the shape in current model is torch.Size([80, 256])
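
The message says the checkpoint's prediction layer has 208 output rows while the current model has 80, matching `num_class` in the config. A minimal sketch for listing such mismatches is shown below. It works on plain name-to-shape dictionaries; the function name `find_shape_mismatches` and the second parameter entry are hypothetical, and only the `module.Prediction.weight` shapes come from the error above.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    return {
        name: (tuple(ckpt_shape), tuple(model_shapes[name]))
        for name, ckpt_shape in checkpoint_shapes.items()
        if name in model_shapes and tuple(ckpt_shape) != tuple(model_shapes[name])
    }


checkpoint_shapes = {
    "module.Prediction.weight": (208, 256),  # from the error message
    "example.layer.weight": (256, 256),      # hypothetical, for illustration
}
model_shapes = {
    "module.Prediction.weight": (80, 256),   # from the error message
    "example.layer.weight": (256, 256),      # hypothetical, for illustration
}

print(find_shape_mismatches(checkpoint_shapes, model_shapes))
```

With real PyTorch state dicts, the shape dictionaries could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.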
