Lack of validation set? #13
Comments
Hi, it's an interesting question. We did have this kind of discussion at an early stage. We used to run validation during training and found that the validation loss would be extremely high and could not reflect the quality of the generated results. Our conclusion was that "overfitting" is, to some extent, an important or even necessary ingredient of a good generative LM: models with a higher validation loss might generate better results because they have a higher probability of "remembering" good sentences from humans. I recall a paper mentioning this phenomenon as well (but I forget its title...). Furthermore, the quality depends hugely on another factor, the "sampling" strategy at inference time. Combining the two factors, we considered that the runtime validation loss might not be very useful, so we dropped it in all following work.
Hi, thanks for the detailed reply. I remember a beginner course project where I supervised some students training on the Bach chorale dataset with a CNN. The results turned out to be pretty good, with all kinds of voice leading and contrapuntal movement, and I was a bit surprised that a CNN could produce such good results. After diving deep into the code, I realized that there was no validation set involved, and after some exploration it became clear that the generated results were basically "copying" whatever the model had seen in the training set, which couldn't reflect the generation and generalization ability of the model. Have you checked for such a "plagiarism" effect in the generated results? I still believe a validation/test set is needed during training. Otherwise, why bother using a SOTA model (i.e. a Transformer)? Why not just use a heavily overfitting CNN with many more parameters, which would give equally good results? Regarding sampling, I believe you only used top-k/top-p/temperature-regularized sampling, right (correct me if I'm wrong)? Given the overfitting behavior, the logits would tend to be heavily concentrated on the memorized token (e.g. [1e4, 1e1, 1e-1, 1e-2]), so top-p/top-k wouldn't change much, I believe, unless you applied a very high temperature (see the sketch below). Happy to discuss!
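To make that last point concrete, here is a minimal sketch (not from this repository; the logit values are just the hypothetical numbers above) of how temperature interacts with top-p (nucleus) sampling when the logits are extremely peaked:

```python
# Minimal sketch: temperature vs. top-p on an extremely peaked logit
# distribution, as an overfit model might produce. The logit values are
# the hypothetical example from the comment above, not from the repo.
import numpy as np

def softmax(x):
    x = x - np.max(x)          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def top_p_count(probs, p=0.9):
    """Number of tokens in the smallest set whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]          # sort tokens by descending prob
    cum = np.cumsum(probs[order])
    return int(np.searchsorted(cum, p) + 1)  # tokens kept by nucleus sampling

logits = np.array([1e4, 1e1, 1e-1, 1e-2])   # hypothetical overfit logits

for temp in (1.0, 100.0, 5000.0):
    probs = softmax(logits / temp)
    kept = top_p_count(probs, p=0.9)
    print(f"T={temp:>7}: probs={np.round(probs, 4)}, tokens kept by top-p(0.9)={kept}")
```

With these numbers, top-p(0.9) keeps a single token at T=1 and still at T=100; only an extremely large temperature spreads enough probability mass for nucleus sampling to make any difference.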
Hi, how do I generate validation_songs.json? There seems to be no mention of it in the description of the dataset files. I would appreciate it if you could answer me.
Hi there,
Thanks for the implementation! I'd appreciate it if you could share more insight on why there's no validation/test set involved during training.
Best,