
Weird alignment images and bad sound even after 100k steps. #27

Open
Ittiz opened this issue Mar 24, 2019 · 14 comments

Ittiz commented Mar 24, 2019

So I've been trying to train this on the LJSpeech dataset since it seemed like the most solid one out there. However, I've been having an issue where there's a messy band across the alignment plot, and even after it finds alignment the disturbance remains. The step audio sounds great after 20k steps or so, but if you synthesize using the demo server it sounds like garbled junk even after 100k steps. Here are some alignment images from when I did 100k steps on the CPU:

step-1000-align
This is step 1k, you can already see the band.

step-50000-align
50k steps, band still there, but starting to align finally!

step-100000-align
100k steps, alignment improving, but band still there!

I used the default hparams for the CPU run. Then I decided to use the GPU. The GPU is a K620 with limited memory (2 GB), so I had to set the hparams like this to avoid OOMing:

    # Audio:
    num_mels=80,
    num_freq=1025,
    min_mel_freq=125,
    max_mel_freq=7600,
    sample_rate=22050,
    frame_length_ms=50,
    frame_shift_ms=12.5,
    min_level_db=-100,
    ref_level_db=20,

    # MAILABS trim params
    trim_fft_size=1024,
    trim_hop_size=256,
    trim_top_db=40,

    # Model:
    # TODO: add more configurable hparams
    outputs_per_step=5,
    embedding_dim=512,

    # Training:
    batch_size=16,
    adam_beta1=0.9,
    adam_beta2=0.999,
    initial_learning_rate=0.0015,
    learning_rate_decay_halflife=100000,
    use_cmudict=False,   # Use CMUDict during training to learn pronunciation of ARPAbet phonemes

    # Eval:
    max_iters=200,
    griffin_lim_iters=50,
    power=1.5,    

Note I had to bring the batch size down to conserve memory, and I also changed the sample rate from 22000 to 22050 because that's what was listed for the dataset; I thought that might have been the issue. I only ran it for 12 hours, so I didn't get a lot of steps, but here are the results:

step-1000-align
1k steps, hmm looks like that dang band again to me!

step-12000-align
12k steps; this is where I stopped it because I didn't want to waste more time, but the band is still there, looking stronger than ever!

Anyone have any clue what could be causing this issue? Is there anything I can tweak in the hparams to correct it? Could it be an issue with the code? On a side note, if I use the demo server or listen to the alignment clips, they are much, MUCH louder than the sample data. I'm not sure if that's related or controllable somehow.
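Regarding the sample-rate change mentioned above: a quick way to confirm what rate the dataset wavs actually use is sketched below. This isn't from the original post; the dataset path is a placeholder and the expected rate simply mirrors the hparams value.

    # Sketch (not from the original post): confirm the dataset's actual sample
    # rate matches hparams.sample_rate before training. The path is a placeholder.
    import glob
    import wave

    EXPECTED_RATE = 22050  # same value as sample_rate in the hparams above

    rates = set()
    for path in glob.glob("LJSpeech-1.1/wavs/*.wav"):
        with wave.open(path, "rb") as w:
            rates.add(w.getframerate())

    print("sample rates found:", rates)
    if rates != {EXPECTED_RATE}:
        print("warning: dataset rate differs from hparams.sample_rate")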

@el-tocino

Your GPU needs more RAM to be doing training. Per comments on other Tacotron repos, you should try to have a batch size of 32 or above to get alignment.

I also get the triangle charts with my dataset; I'm not sure what causes that.


Ittiz commented Mar 26, 2019

Like I said, I did over 100k steps using a batch size of 32 on CPUs with 48 GB of RAM. Same problem whether I use the CPU or GPU. I just tried the CPU with a batch size of 64 and got the same issue. A band shows across the top:

step-3000-align

Again, I'm not sure if it's relevant, but the wavs it generates are WAY louder than the original training wavs. I have to turn the training wavs' volume up to about 75% to hear them well in VLC and mplayer. The clips generated by Mimic2 I have to set to 3%, and they're still twice as loud as the training clips at 75%.
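To put a number on that loudness difference, something like the following would compare the RMS levels of a training clip and a generated clip. This is a rough sketch, assuming 16-bit PCM mono wavs; both filenames are placeholders.

    # Sketch: compare RMS loudness (in dBFS) of a training wav and a generated wav.
    # Assumes 16-bit PCM mono files; both paths are placeholders.
    import wave
    import numpy as np

    def rms_dbfs(path):
        with wave.open(path, "rb") as w:
            samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        samples = samples.astype(np.float64) / 32768.0  # scale to [-1, 1]
        return 20 * np.log10(np.sqrt(np.mean(samples ** 2)) + 1e-9)

    print("training clip :", rms_dbfs("LJ001-0001.wav"))
    print("generated clip:", rms_dbfs("eval-output.wav"))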

I'm not sure what you mean by triangles? I'm complaining about the band that never goes away! You can see it in all my alignment images.


el-tocino commented Mar 26, 2019

See your 100k-step picture for what I mean by triangle. The audio still tends to echo when that occurs. I'm not sure why yours is ending up with the test wavs being so loud. I haven't been able to train LJ on mimic2 successfully, though one of the Mycroft folks said he was able to.


Ittiz commented Mar 26, 2019

That just means that the training is working: keithito#144

So you've tried training on the LJ dataset as well? Did you get the same interference I'm getting in your alignment graphs and synthesized audio?

@el-tocino

It doesn't work, though. The generated samples from models that have the weird align/fuzzy-bar triangle thing end up either filled with echo or losing coherence quickly. Aligned models from previous iterations of tacotron/mimic2 I've run haven't had those issues, and their alignment charts are much closer to ideal (i.e., just a line going from bottom left to upper right).
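For reference, that "line from bottom left to upper right" is just the attention matrix plotted as an image. A rough sketch of such a plot is below; it is not the repo's own util/plot.py, and the .npy path is a placeholder for however you dump the alignment array.

    # Sketch: plot an attention alignment matrix. A healthy model shows a sharp
    # diagonal from bottom-left to top-right; the failures above show a fuzzy band.
    # The .npy path is a placeholder; the repos save PNGs during training, so you
    # would dump the array yourself from a training/eval step.
    import numpy as np
    import matplotlib.pyplot as plt

    alignment = np.load("alignment.npy")  # shape assumed (encoder_steps, decoder_steps)

    plt.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
    plt.xlabel("Decoder timestep")
    plt.ylabel("Encoder timestep")
    plt.colorbar()
    plt.savefig("alignment.png")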


Ittiz commented Mar 26, 2019

So what changed that is creating this issue? Do you know roughly when this issue started to crop up?

@el-tocino

There was a bunch of stuff updated last September or so. As a test, try the following: preprocess all your data with the mimic2 repo, then clone keithito's tacotron repo and use it to do the training for 25k steps or so, by which time you should see normal alignment.


Ittiz commented Mar 26, 2019

Mimic2 complains about "bias not found in checkpoint" even when the data was preprocessed with Mimic2. I guess I'll have to figure this out on my own.
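One way to debug a "not found in checkpoint" restore error is to list what the checkpoint actually contains and compare it against the model's variable names. The sketch below assumes the TensorFlow 1.x API these repos used at the time; the checkpoint prefix is a placeholder.

    # Sketch (TensorFlow 1.x): list every variable stored in a checkpoint so the
    # missing bias tensor can be compared against the graph's variable names.
    # The checkpoint prefix is a placeholder.
    import tensorflow as tf

    reader = tf.train.NewCheckpointReader("logs-tacotron/model.ckpt-100000")
    for name, shape in sorted(reader.get_variable_to_shape_map().items()):
        print(name, shape)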

@el-tocino

Did you clear out the previous training run's steps/models/checkpoints?


Ittiz commented Mar 26, 2019

I cut and pasted them into a different folder.


Ittiz commented Mar 27, 2019

Okie dokie! Seems I fixed it by merging Mimic2 and keithito's repos in my own fork. I'm not home, so I haven't listened to it yet, but the alignment is looking much better. If it all sounds good, I'll push the changes to my fork and other people can test it out.

step-23000-align


Ittiz commented Mar 29, 2019

So, more issues: the thing is still super loud. Also, it only aligns sometimes, even with my modifications. I have a feeling the remaining issues are volume-related. For now I'm just going to use Tacotron. I modified Tacotron so I can use it to interface with Mycroft, and it seems to be working.


Ruthvicp commented Apr 4, 2019

I have used a different dataset (private) and trained for 18k steps using the existing mimic2 (master branch). I was able to get good alignment and also a decent voice.

[alignment plot]

Could you please share your plots generated using this?


Ittiz commented Apr 4, 2019

The plots above were generated using mimic2 on the LJ dataset. It could be that there's something in particular about the LJ dataset that makes it not work well with Mimic2, I dunno.
