After #71, we can now control, for a given training batch, whether teacher forcing or student forcing is used. Some recent work suggests that sequence-to-sequence models benefit from training with student forcing. Other work recommends gradually rolling out student forcing during training. I propose that we:
1. experiment with a flag that simply enables student forcing during training and see if things still converge
2. also experiment with a linear, batchwise rollout of student forcing; that is:
   - for each batch, we draw a random sample such that with probability p we use teacher forcing and with probability 1 - p we use student forcing
   - we initialize with p = 1 and, after the warmup phase, linearly decrement p so that p = 0 for the last batch
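The batchwise rollout described above could be sketched roughly as follows. This is only an illustration under assumptions: the names `teacher_forcing_prob`, `use_teacher_forcing`, `total_batches`, and `warmup_batches` are hypothetical and do not refer to anything in the existing code, and the decay is assumed to start at the first batch after warmup (0-indexed) and reach p = 0 exactly on the last batch.

```python
import random


def teacher_forcing_prob(batch_idx, total_batches, warmup_batches):
    """Return p, the probability of teacher forcing for a 0-indexed batch.

    Hypothetical helper: p stays at 1 during warmup, then decreases
    linearly so that p == 0 on the last batch.
    """
    last = total_batches - 1
    if batch_idx < warmup_batches:
        return 1.0
    span = last - warmup_batches
    if span <= 0:
        # No room for a linear ramp; fall back to student forcing.
        return 0.0
    p = 1.0 - (batch_idx - warmup_batches) / span
    # Clamp in case the caller runs past the planned number of batches.
    return min(1.0, max(0.0, p))


def use_teacher_forcing(batch_idx, total_batches, warmup_batches, rng=random):
    """Draw one sample per batch: True means teacher forcing."""
    p = teacher_forcing_prob(batch_idx, total_batches, warmup_batches)
    # rng.random() lies in [0, 1), so p == 1 is always True, p == 0 always False.
    return rng.random() < p
```

In a training loop, the boolean returned by `use_teacher_forcing` would be passed to whatever per-batch switch #71 introduced. Since the schedule is fully determined by the number of batches and the warmup length, it adds no new tunable values.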
Note that the stochastic option (the second one) is somewhat different from what Bengio et al. do: they do this at the token level. However, this seems harder and slower to implement, so I am suggesting something simpler to start out with.
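To make the distinction concrete, here is a small sketch of the two sampling granularities. It is not Bengio et al.'s implementation and does not correspond to existing code; `batch_level_mask` and `token_level_mask` are hypothetical names, and each returned list simply marks, per decoding step, whether the gold token (True) or the model's own prediction (False) would be fed as the next input.

```python
import random


def batch_level_mask(seq_len, p, rng=random):
    """One draw for the whole batch: every step shares the same decision."""
    use_teacher = rng.random() < p
    return [use_teacher] * seq_len


def token_level_mask(seq_len, p, rng=random):
    """One draw per decoding step: decisions can differ within a sequence."""
    return [rng.random() < p for _ in range(seq_len)]
```

The batch-level variant only needs the existing per-batch switch, whereas the token-level variant requires the decoder to mix gold and predicted inputs within a single sequence, which is the part that seems harder and slower to implement.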
Both of these can be thought of as hyperparameter-free (beyond the boolean decision of whether or not to use student forcing during training at all). If either works, we can incorporate it into the master branch.