Missing labels in training / decoding in tf_clean branch #169

Open
martiansideofthemoon opened this issue Jan 28, 2018 · 13 comments

@martiansideofthemoon

Hello,
I am trying to run the TensorFlow-based EESEN setup for Switchboard. More specifically, I am using the tf_clean branch and trying to run the asr_egs/swbd/v1-tf/run_ctc_char.sh script. I am having some trouble with the training and decoding steps and would appreciate your help! @ramonsanabria, @fmetze

During stage 3 (training), I get a number of error messages of the form:

********************************************************************************
********************************************************************************
Warning: sw02018-B_012508-012721 has not been found in labels file: /scratch/tmp.1hi5uR4EIR/labels.cv
********************************************************************************
********************************************************************************

Here are the training logs that follow. Could the "creating tr_y from scratch" step be the problem?

cleaning done: /scratch/tmp.1hi5uR4EIR/cv_local.scp
original scp length: 4000
scp deleted: 270
final scp length: 3730
number of labels not found: 270
TRAINING STARTS [2018-Jan-28 06:02:05]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
('now:', 'Sun 2018-01-28 06:02:08')
('tf:', '1.1.0')
('cwd:', '/share/data/lang/users/kalpesh/eesen/asr_egs/swbd/v1-tf')
('library:', '/share/data/lang/users/kalpesh/eesen')
('git:', 'heads/master-dirty')
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
reading training set
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
tr_x:
--------------------------------------------------------------------------------
non augmented (mix) training set found for language: no_name_language ... 

preparing dictionary for no_name_language...

ordering all languages (from scratch) train batches... 

Augmenting data x3 and win 3...

--------------------------------------------------------------------------------
tr_y:
--------------------------------------------------------------------------------
creating tr_y from scratch...
unilanguage setup detected (in labels)... 

--------------------------------------------------------------------------------
cv_x:
--------------------------------------------------------------------------------
unilingual set up detected on test or  set language... 

cv (feats) found for language: no_name_language ... 

preparing dictionary for no_name_language...

ordering all languages (from scratch) cv batches... 

Augmenting data x3 and win 3...

--------------------------------------------------------------------------------
cv_y:
--------------------------------------------------------------------------------
creating cv_y from scratch...
unilanguage setup detected (in labels)... 

languages checked ...
(cv_x vs cv_y vs tr_x vs tr_y)
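
Side note: the "scp deleted: 270" above appears to be the reader dropping utterances whose ids are missing from labels.cv. A rough shell equivalent of that filter, purely to illustrate (the reader does this in Python internally):

awk 'NR==FNR {keep[$1]=1; next} ($1 in keep)' \
    /scratch/tmp.1hi5uR4EIR/labels.cv /scratch/tmp.1hi5uR4EIR/cv_local.scp \
    > cv_local.filtered.scp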

Finally, here are my decoding logs:

(python2.7_tf1.4) kalpesh@kalpesh:v1-tf$ ./run_ctc_char.sh 
=====================================================================
                   Decoding eval200 using AM                      
=====================================================================
./steps/decode_ctc_am_tf.sh --config exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl --data ./data/eval2000/ --weights exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/epoch25.ckpt --results exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/results/epoch25
exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file
copy-feats 'ark,s,cs:apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- |' ark,scp:/scratch/tmp.GgS1if0Wex/f.ark,/scratch/tmp.GgS1if0Wex/test_local.scp 
apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- 
LOG (apply-cmvn[5.3.85~1-35950]:main():apply-cmvn.cc:159) Applied cepstral mean and variance normalization to 4458 utterances, errors on 0
LOG (copy-feats[5.3.85~1-35950]:main():copy-feats.cc:143) Copied 4458 feature matrices.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
('now:', 'Sun 2018-01-28 06:05:28')
('tf:', '1.1.0')
('cwd:', '/share/data/lang/users/kalpesh/eesen/asr_egs/swbd/v1-tf')
('library:', '/share/data/lang/users/kalpesh/eesen')
('git:', 'heads/master-dirty')
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
reading testing set
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_x:
--------------------------------------------------------------------------------
unilingual set up detected on test or  set language... 

test (feats) found for language: no_name_language ... 

preparing dictionary for no_name_language...

ordering all languages (from scratch) test batches... 

Augmenting data x3 and win 3...

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_y (for ter computation):
--------------------------------------------------------------------------------
unilanguage setup detected (in labels)... 

no label files fins in /scratch/tmp.GgS1if0Wex with info_set: test
file: /share/data/lang/users/kalpesh/eesen/tf/ctc-am/reader/labels_reader/labels_reader.py function: __read_one_language line: 171
exiting...

Here are my logs from the first two stages (data preparation, fbank generation):

(python2.7_tf1.4) kalpesh@kalpesh:v1-tf$ ./run_ctc_char.sh 
=====================================================================
                       Data Preparation                            
=====================================================================
Switchboard-1 data preparation succeeded.
utils/fix_data_dir.sh: filtered data/train/segments from 264333 to 264072 lines based on filter /scratch/tmp.V26jBobg4D/recordings.
utils/fix_data_dir.sh: filtered /scratch/tmp.V26jBobg4D/speakers from 4876 to 4870 lines based on filter data/train/cmvn.scp.
utils/fix_data_dir.sh: filtered data/train/spk2utt from 4876 to 4870 lines based on filter /scratch/tmp.V26jBobg4D/speakers.
fix_data_dir.sh: kept 263890 utterances out of 264072
fix_data_dir.sh: old files are kept in data/train/.backup
Can't open data/local/dict_phn/lexicon1.txt: No such file or directory at local/swbd1_map_words.pl line 26.
Character-based dictionary (word spelling) preparation succeeded
Warning: for utterances en_4910-B_013563-013763 and en_4910-B_013594-013790, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_025539-025791 and en_4910-B_025541-025674, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_032263-032658 and en_4910-B_032299-032406, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_035678-035757 and en_4910-B_035715-035865, segments already overlap; leaving these times unchanged.
Data preparation and formatting completed for Eval 2000
(but not MFCC extraction)
fix_data_dir.sh: kept 4458 utterances out of 4466
fix_data_dir.sh: old files are kept in data/eval2000/.backup
=====================================================================
                    FBank Feature Generation                       
=====================================================================
steps/make_fbank.sh --cmd run.pl --nj 32 data/train exp/make_fbank_pitch/train fbank_pitch
steps/make_fbank.sh: moving data/train/feats.scp to data/train/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_fbank.sh [info]: segments file exists: using that.
Succeeded creating filterbank features for train
steps/compute_cmvn_stats.sh data/train exp/make_fbank_pitch/train fbank_pitch
Succeeded creating CMVN stats for train
fix_data_dir.sh: kept all 263890 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/make_fbank.sh --cmd run.pl --nj 10 data/eval2000 exp/make_fbank_pitch/eval2000 fbank_pitch
steps/make_fbank.sh: moving data/eval2000/feats.scp to data/eval2000/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/eval2000
steps/make_fbank.sh [info]: segments file exists: using that.
Succeeded creating filterbank features for eval2000
steps/compute_cmvn_stats.sh data/eval2000 exp/make_fbank_pitch/eval2000 fbank_pitch
Succeeded creating CMVN stats for eval2000
fix_data_dir.sh: kept all 4458 utterances.
fix_data_dir.sh: old files are kept in data/eval2000/.backup
utils/subset_data_dir.sh: reducing #utt from 263890 to 4000
utils/subset_data_dir.sh: reducing #utt from 263890 to 259890
utils/subset_data_dir.sh: reducing #utt from 259890 to 100000
Reduced number of utterances from 100000 to 76615
Using fix_data_dir.sh to reconcile the other files.
fix_data_dir.sh: kept 76615 utterances out of 100000
fix_data_dir.sh: old files are kept in data/train_100k_nodup/.backup
Reduced number of utterances from 259890 to 192701
Using fix_data_dir.sh to reconcile the other files.
fix_data_dir.sh: kept 192701 utterances out of 259890
fix_data_dir.sh: old files are kept in data/train_nodup/.backup
@fmetze
Contributor

fmetze commented Jan 29, 2018 via email

@martiansideofthemoon
Author

martiansideofthemoon commented Feb 2, 2018

Hi @fmetze ,

"Can't open data/local/dict_phn/lexicon1.txt: No such file or directory at local/swbd1_map_words.pl line 26.”

Yes, I hadn't run the phn recipe. The error disappears once I do. Do I need to run decoding with the phn recipe too?

There is also "exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file"

This error is irrelevant; it happens because the pickle configuration file is sourced by the utils/parse_options.sh script. It does not affect further execution.
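
As far as I can tell, utils/parse_options.sh special-cases the --config flag and sources its argument as a shell script, which is why a binary .pkl passed as --config triggers the message. A possible workaround (untested sketch; the --model-config option name is hypothetical) would be to pass the pickle under a different flag so it is never sourced:

# inside steps/decode_ctc_am_tf.sh (sketch):
model_config=             # path to the pickled model config; never sourced
. utils/parse_options.sh  # no longer tries to source the .pkl as shell
# ... later, pass "$model_config" through to the Python decoder unchanged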

"which means maybe the training didn’t start correctly, or not at all?"

The training did happen successfully. Here are the training logs. To confirm: is it usual for the Kaldi setup to discard 270 dev utterances, 11 eval2000 utterances, and 973 train utterances due to transcripts like [vocalized-noise]?

for language: no_name_language
following variables will be optimized: 
--------------------------------------------------------------------------------
<tf.Variable 'cudnn_lstm/params:0' shape=<unknown> dtype=float32_ref>
<tf.Variable 'output_layers/output_fc_no_name_language_no_target_name/weights:0' shape=(640, 42) dtype=float32_ref>
<tf.Variable 'output_layers/output_fc_no_name_language_no_target_name/biases:0' shape=(42,) dtype=float32_ref>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
[2018-02-01 11:31:59] Epoch 1 starting, learning rate: 0.03
[2018-02-01 12:23:40] Epoch 1 finished in 52 minutes
		Train    cost: 86.2, ter: 35.6%, #example: 491721
		Validate cost: 45.4, ter: 24.9%, #example: 11190
('not updating learning rate, parameters', 8, 0.0005)
--------------------------------------------------------------------------------
....
....
[2018-02-02 07:10:09] Epoch 23 starting, learning rate: 0.0005
[2018-02-02 08:05:53] Epoch 23 finished in 56 minutes
		Train    cost: 8.1, ter: 3.4%, #example: 491721
		Validate cost: 37.9, ter: 15.3%, #example: 11190
('not updating learning rate, parameters', 8, 0.0005)
--------------------------------------------------------------------------------

However, the decoding does not seem to budge. Here are the logs. The suspicious lines seem to be "no label files fins in /scratch/tmp.jihiXHPJkp with info_set: test" and "no_name_language". One important point here is that I am starting the bash script directly from the decoding stage (stage 4). Is it necessary to re-run stage 1 or 2 after I have a trained model?

=====================================================================
                   Decoding eval200 using AM                      
=====================================================================
./steps/decode_ctc_am_tf.sh --config exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl --data ./data/eval2000/ --weights exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/epoch14.ckpt --results exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/results/epoch14
exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file
copy-feats 'ark,s,cs:apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- |' ark,scp:/scratch/tmp.jihiXHPJkp/f.ark,/scratch/tmp.jihiXHPJkp/test_local.scp 
apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- 
LOG (apply-cmvn[5.3.85~1-35950]:main():apply-cmvn.cc:159) Applied cepstral mean and variance normalization to 4458 utterances, errors on 0
LOG (copy-feats[5.3.85~1-35950]:main():copy-feats.cc:143) Copied 4458 feature matrices.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2.7.14 |Anaconda, Inc.| (default, Dec  7 2017, 17:05:42) 
[GCC 7.2.0]
('now:', 'Fri 2018-02-02 09:13:09')
('tf:', '1.4.0-rc1')
('cwd:', '/share/data/lang/users/kalpesh/eesen/asr_egs/swbd/v1-tf')
('library:', '/share/data/lang/users/kalpesh/eesen')
('git:', 'heads/master-dirty')
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
reading testing set
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_x:
--------------------------------------------------------------------------------
unilingual set up detected on test or  set language... 

test (feats) found for language: no_name_language ... 

preparing dictionary for no_name_language...

ordering all languages (from scratch) test batches... 

Augmenting data x3 and win 3...

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_y (for ter computation):
--------------------------------------------------------------------------------
unilanguage setup detected (in labels)... 

no label files fins in /scratch/tmp.jihiXHPJkp with info_set: test
file: /share/data/lang/users/kalpesh/eesen/tf/ctc-am/reader/labels_reader/labels_reader.py function: __read_one_language line: 171
exiting...

@fmetze
Contributor

fmetze commented Feb 3, 2018

Good. Not sure about the pickle error, but if you say it does not affect the training, then things should be fine. You should be fine running the test script from stage 4 only for decoding; the data should already be prepared. @ramonsanabria - any ideas about v1-tf here?

@ramonsanabria

Hi,

The pickle error is irrelevant. The configuration is loaded properly. I will try to remove it as soon as I have time.

@xinjli is cleaning up the swbd recipe. I have some experiments with different char-based units (removing numbers and noises) that for now seem to be improving results a bit.

@xinjli

xinjli commented Feb 7, 2018

I also ran into the issue today that the char recipe cannot run without the phn recipe. The same issue also happens in the swbd v1 recipe under the master branch. I will prepare a fix for this.

@martiansideofthemoon
Author

Hi @ramonsanabria , @xinjli
Any idea about the "no label files fins in /scratch/tmp.jihiXHPJkp with info_set: test" error I am receiving?

@ramonsanabria

ramonsanabria commented Feb 7, 2018 via email

@martiansideofthemoon
Author

martiansideofthemoon commented Feb 7, 2018

@ramonsanabria yes, I can find it.

kalpesh@kalpesh:kalpesh$ ls /scratch/tmp.jihiXHPJkp
f.ark  test_local.scp
kalpesh@kalpesh:kalpesh$

I checked the code: the system searches for a file named labels.test but fails to find it. I tried using the ./local/swbd1_prepare_phn_dict_tf.py script to generate the test labels (as is done for the training data), but I get an empty labels file. In my previous setup, I used the special hubscr.pl script to generate detailed output from the raw decoded transcripts.

What is the correct way to integrate this script into EESEN?
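
For reference, I am assuming labels.test should have the same format as labels.tr and labels.cv, i.e. one utterance per line, the utterance id followed by integer unit ids (the utterance ids below are from my eval2000 set, the label ids are made up):

en_4910-B_013563-013763 7 12 30 1 5
en_4910-B_025539-025791 3 30 14 2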

@xinjli

xinjli commented Feb 7, 2018

I think we need a stage that generates labels.test for testing. It seems that we do not have any script for this now. After preparing labels.tr and labels.cv, I think we need something like:

python ./local/swbd1_prepare_char_dict_tf.py --text_file ./data/train_nodup/text --input_units ./data/local/dict_char/units.txt --output_labels $dir_am/labels.tr --lower_case --ignore_noises || exit 1

@xinjli

xinjli commented Feb 7, 2018

Probably we can use the following command to generate labels.test:

python ./local/swbd1_prepare_char_dict_tf.py --text_file ./data/eval2000/text --input_units ./data/local/dict_char/units.txt --output_labels $dir_am/labels.test

eval2000 contains the text we need for evaluation; just replace $dir_am with the appropriate variable in your environment.
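
A fuller sketch of that step (the $dir_am value here is only an illustration; use whatever directory your run script already writes labels.tr and labels.cv to):

dir_am=exp/train_char_l4_c320_mdeepbilstm_w3_nfalse   # hypothetical; match your setup
python ./local/swbd1_prepare_char_dict_tf.py \
    --text_file ./data/eval2000/text \
    --input_units ./data/local/dict_char/units.txt \
    --output_labels $dir_am/labels.test || exit 1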

@ramonsanabria

ramonsanabria commented Feb 7, 2018 via email

@martiansideofthemoon
Author

martiansideofthemoon commented Feb 7, 2018

@ramonsanabria could you describe the process you are using to compute the final WER of a trained model?

I guess this is often called "scoring" in the Kaldi setup. Generally, raw transcripts are fed into hubscr.pl to generate a number of detailed output files, with the final SWBD, CH, and combined WER reported in a *.lur file.
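
In my previous setup that scoring step looked roughly like this (a sketch only; it assumes SCTK is installed under $KALDI_ROOT/tools, that data/eval2000 contains the stm and glm files, and decode_dir/hyp.ctm is a placeholder for the hypothesis ctm):

hubscr=$KALDI_ROOT/tools/sctk/bin/hubscr.pl
$hubscr -p $(dirname $hubscr) -V -l english -h hub5 \
    -g data/eval2000/glm -r data/eval2000/stm decode_dir/hyp.ctm
# the SWBD / CH / combined WERs then appear in the *.lur file next to the ctm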

@martiansideofthemoon
Author

martiansideofthemoon commented Feb 22, 2018

Hi @ramonsanabria, any update on the above? Also, how have you treated the space character? I cannot find an entry for the space in data/local/dict_char/units.txt. (Note I'm referring to the <space> character, not the CTC blank symbol.)
