
Librispeech - Training starting error #221

Open
omprakashsonie opened this issue May 9, 2020 · 3 comments

@omprakashsonie

Hi,
training stops after this message:
[NOTE] TOKEN_ACCURACY refers to token accuracy, i.e., (1.0 - token_error_rate).
EPOCH 1 RUNNING ... TAG: 10_1.0 lrate 4e-05,

Brief error:
ERROR (train-ctc-parallel:AddVecToRows():cuda-matrix.cc:541) cudaError_t 48 : "no kernel image is available for execution on the device" returned from 'cudaGetLastError()'

In Python, the GPU is found:

torch.cuda.is_available()
True
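
(Note: torch.cuda.is_available() only shows that the driver sees the GPU; the EESEN binaries must additionally contain kernels compiled for that GPU's compute capability. Below is a minimal sketch to print the capability so it can be checked against the -gencode flags used at build time; the file name query_cc.cu is hypothetical and this is not part of EESEN.)

  // query_cc.cu -- hypothetical helper, not part of EESEN: prints each GPU's
  // compute capability for comparison with the build's -gencode flags.
  // Build with: nvcc query_cc.cu -o query_cc
  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      int n = 0;
      cudaError_t err = cudaGetDeviceCount(&n);
      if (err != cudaSuccess) {
          fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
          return 1;
      }
      for (int i = 0; i < n; ++i) {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, i);
          // A V100 reports 7.0, i.e. it needs -gencode arch=compute_70,code=sm_70.
          printf("Device %d: %s, compute capability %d.%d\n",
                 i, prop.name, prop.major, prop.minor);
      }
      return 0;
  }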

Detailed error:
./exp/nml_seq_fw_seq_tw/train_lstm/log/tr.iter1.log
train-ctc-parallel --report-step=1000 --num-sequence=10 --frame-limit=25000 --learn-rate=0.00004 --momentum=0.9 --verbose=1 'ark,s,cs:apply-cmvn --norm-vars=true --utt2spk=ark:./exp/nml_seq_fw_seq_tw/train_tr95/utt2spk scp:./exp/nml_seq_fw_seq_tw/train_tr95/cmvn.scp scp:./exp/nml_seq_fw_seq_tw/train_lstm/train_10_1.0.scp ark:- | splice-feats --left-context=1 --right-context=1 ark:- ark:- | add-deltas ark:- ark:- | subsample-feats --n=3 --offset=1 ark:- ark:- |' 'ark:gunzip -c ./exp/nml_seq_fw_seq_tw/train_lstm/labels.tr.gz|' ./exp/nml_seq_fw_seq_tw/train_lstm/nnet/nnet.iter0 ./exp/nml_seq_fw_seq_tw/train_lstm/nnet/nnet.iter1

LOG (train-ctc-parallel:SelectGpuIdAuto():cuda-device.cc:262) Selecting from 1 GPUs
LOG (train-ctc-parallel:SelectGpuIdAuto():cuda-device.cc:277) cudaSetDevice(0): Tesla V100-PCIE-16GB free:15724M, used:428M, total:16152M, free/total:0.973502
LOG (train-ctc-parallel:SelectGpuIdAuto():cuda-device.cc:310) Selected device: 0 (automatically)
LOG (train-ctc-parallel:FinalizeActiveGpu():cuda-device.cc:194) The active GPU is [0]: Tesla V100-PCIE-16GB free:15676M, used:476M, total:16152M, free/total:0.97053 version 7.0
LOG (train-ctc-parallel:PrintMemoryUsage():cuda-device.cc:334) Memory used: 0 bytes.
LOG (train-ctc-parallel:DisableCaching():cuda-device.cc:731) Disabling caching of GPU memory.
LOG (train-ctc-parallel:SetUpdateAlgorithm():net.cc:483) Selecting SGD with momentum as optimization algorithm.
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 0
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 1
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 2
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 3

add-deltas ark:- ark:-
splice-feats --left-context=1 --right-context=1 ark:- ark:-
apply-cmvn --norm-vars=true --utt2spk=ark:./exp/nml_seq_fw_seq_tw/train_tr95/utt2spk scp:./exp/nml_seq_fw_seq_tw/train_tr95/cmvn.scp scp:./exp/nml_seq_fw_seq_tw/train_lstm/train_10_1.0.scp ark:-
subsample-feats --n=3 --offset=1 ark:- ark:-
LOG (train-ctc-parallel:main():train-ctc-parallel.cc:133) TRAINING STARTED

ERROR (train-ctc-parallel:AddVecToRows():cuda-matrix.cc:541) cudaError_t 48 : "no kernel image is available for execution on the device" returned from 'cudaGetLastError()'
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe gunzip -c ./exp/nml_seq_fw_seq_tw/train_lstm/labels.tr.gz| had nonzero return status 13
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe apply-cmvn --norm-vars=true --utt2spk=ark:./exp/nml_seq_fw_seq_tw/train_tr95/utt2spk scp:./exp/nml_seq_fw_seq_tw/train_tr95/cmvn.scp scp:./exp/nml_seq_fw_seq_tw/train_lstm/train_10_1.0.scp ark:- | splice-feats --left-context=1 --right-context=1 ark:- ark:- | add-deltas ark:- ark:- | subsample-feats --n=3 --offset=1 ark:- ark:- | had nonzero return status 36096

ERROR (train-ctc-parallel:AddVecToRows():cuda-matrix.cc:541) cudaError_t 48 : "no kernel image is available for execution on the device" returned from 'cudaGetLastError()'

[stack trace: ]
eesen::KaldiGetStackTrace[abi:cxx11]()
eesen::KaldiErrorMessage::~KaldiErrorMessage()
eesen::CuMatrixBase<float>::AddVecToRows(float, eesen::CuVectorBase<float> const&, float)
eesen::BiLstmParallel::PropagateFncRecurrentDropoutPassForward(eesen::CuMatrixBase<float> const&, int, int)
eesen::BiLstmParallel::PropagateFnc(eesen::CuMatrixBase<float> const&, eesen::CuMatrixBase<float>*)
eesen::Layer::Propagate(eesen::CuMatrixBase<float> const&, eesen::CuMatrix<float>*)
eesen::Net::Propagate(eesen::CuMatrixBase<float> const&, eesen::CuMatrix<float>*)
train-ctc-parallel(main+0x1494) [0x434c48]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fb368c82830]
train-ctc-parallel(_start+0x29) [0x432119]

@LambertFan

LambertFan commented Feb 3, 2021

Add the compute capability for your GPU in src/gpucompute/Makefile.

Adding capability 7.0 solved this error for the V100:

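  # Emit SASS for compute capability 7.0 (Volta / V100) when building with CUDA >= 9.0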
  CUDA_VER_GT_9_0 := $(shell [ $(CUDA_VERSION) -ge 90 ] && echo true)
  ifeq ($(CUDA_VER_GT_9_0), true)
    CUDA_ARCH += -gencode arch=compute_70,code=sm_70
  endif
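
Why this works: cudaError_t 48 ("no kernel image is available") means the binary contains no SASS for the running GPU's architecture and no PTX the driver could JIT, so the very first kernel launch fails and EESEN reports it at its cudaGetLastError() check. Here is a standalone sketch that reproduces the same failure mode, assuming only the CUDA runtime (the file name repro.cu is hypothetical, not EESEN code):

  // repro.cu -- compile for a mismatched architecture, e.g.
  //   nvcc -gencode arch=compute_52,code=sm_52 repro.cu -o repro
  // and run on a V100 (sm_70): with no sm_70 SASS and no PTX embedded,
  // the launch fails with "no kernel image is available for execution on the device".
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void noop() {}

  int main() {
      noop<<<1, 1>>>();
      // Same pattern as EESEN's AddVecToRows: launch, then poll cudaGetLastError().
      cudaError_t err = cudaGetLastError();
      if (err != cudaSuccess) {
          fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
          return 1;
      }
      cudaDeviceSynchronize();
      puts("kernel ran; this binary contains code for this GPU");
      return 0;
  }

After editing the Makefile, a full rebuild is needed so the kernels are recompiled with the new flag; running cuobjdump --list-elf on the rebuilt binary should then show an sm_70 entry.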

@liuanping

> Add the compute capability for your GPU in src/gpucompute/Makefile.
>
> Adding capability 7.0 solved this error for the V100:
>
>   CUDA_VER_GT_9_0 := $(shell [ $(CUDA_VERSION) -ge 90 ] && echo true)
>   ifeq ($(CUDA_VER_GT_9_0), true)
>     CUDA_ARCH += -gencode arch=compute_70,code=sm_70
>   endif

So your GPU is a V100 with CUDA version 9.0? Did you need to install ATLAS?
I met this error with an A100 and CUDA 9.0: cudaError_t 13 : "invalid device symbol" returned from 'cublasGetError()'
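
(Likely explanation, inferred from the error alone: CUDA 9.0 predates the A100's Ampere architecture and cannot generate compute_80 code at all, and "invalid device symbol" is consistent with cuBLAS code built for an older architecture. An A100 needs CUDA 11.0 or newer, built with -gencode arch=compute_80,code=sm_80.)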

@LIZHICHAOUNICORN

LIZHICHAOUNICORN commented Aug 16, 2021

My env:
NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.0, Nvidia TITAN Xp

It works when I add one line to src/gpucompute/Makefile:

  CUDA_ARCH=-gencode arch=compute_61,code=sm_61

I have verified that compute_30/sm_30 and compute_80/sm_80 do not work.
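
(For reference: the TITAN Xp is a Pascal GPU with compute capability 6.1, which is why compute_61/sm_61 is the matching target; sm_80 is Ampere, which a TITAN Xp cannot execute, and sm_30 (Kepler) has been dropped from recent CUDA toolkits such as 11.x.)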
