
Add cosine_with_min_lr_schedule_with_warmup_lr_rate scheduler in Trainer #31870

Open · wants to merge 1 commit into main

Conversation

richardodliu

What does this PR do?

Add cosine_with_min_lr_schedule_with_warmup_lr_rate scheduler in Trainer, which is based on #29341.

As mentioned in the previous PR, it implemented "a warmup period during which it increases linearly between 0 and the initial learning rate set in the optimizer." However, I recently investigated the DeepSpeed framework and noticed that its scheduler implements an additional feature, namely "warmup_min_ratio" in https://github.com/richardodliu/DeepSpeed/blob/master/deepspeed/runtime/lr_schedules.py#L774, which supports a warmup start learning rate ratio that differs from 0. Since DeepSpeed is a crucial backend framework for Transformers, this PR aims to keep the two learning rate scheduler implementations consistent and to prevent potential confusion for users who employ both frameworks simultaneously.
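For reference, here is a minimal sketch (illustrative only, not DeepSpeed's actual code) of what a nonzero warmup start ratio means for the warmup ramp; the names `warmup_multiplier` and `warmup_num_steps` are used purely for illustration:

```python
# Illustrative sketch only: the effect of a nonzero warmup start ratio on the
# multiplier applied to the base learning rate during warmup.
def warmup_multiplier(step: int, warmup_num_steps: int, warmup_min_ratio: float = 0.0) -> float:
    if step >= warmup_num_steps:
        return 1.0
    # Ramp linearly from warmup_min_ratio up to 1.0 over warmup_num_steps,
    # instead of starting from 0.
    return warmup_min_ratio + (1.0 - warmup_min_ratio) * step / max(1, warmup_num_steps)
```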

Our implementation builds on the previous PR and provides several benefits. First, the scheduler can be reused without modifying any input parameters; if you prefer to set the parameter explicitly, simply pass "warmup_lr_rate", which specifies the ratio between the warmup start learning rate and the initial learning rate. When it is not passed, the behavior is equivalent to setting "warmup_lr_rate" to "1/warmup_steps".

Second, regarding the training process: since it is recommended to call "optimizer.step()" before "lr_scheduler.step()", the learning rate for the batch at the very first step is zero under the previous implementation, so the parameters are not updated from that batch and it is effectively wasted. Our updated method addresses this issue. The effect may be negligible with larger datasets, but it becomes significant with smaller datasets where the total number of steps is limited. Additionally, our implementation ensures that the final minimum learning rate is actually reached during training, rather than only after training has completed.

Overall, this approach is better suited as an improvement to the existing method than as a complete overhaul, so we implemented it as a new function, giving users the flexibility to choose which implementation they prefer.
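To make the schedule shape concrete, here is a minimal sketch of the behavior described above. This is not the exact code added in the PR; the function name, signature, and parameter names are for illustration only, and the real scheduler registered in Trainer may differ:

```python
import math
from functools import partial

from torch.optim.lr_scheduler import LambdaLR


def _lr_lambda(current_step, *, num_warmup_steps, num_training_steps, warmup_lr_rate, min_lr_rate):
    if current_step < num_warmup_steps:
        # Warmup: ramp linearly from warmup_lr_rate * base_lr to base_lr,
        # so the very first step already uses a nonzero learning rate.
        return warmup_lr_rate + (1.0 - warmup_lr_rate) * current_step / max(1, num_warmup_steps)
    # Cosine decay from base_lr down to min_lr_rate * base_lr; the floor is
    # reached by the last training step rather than only after training ends.
    progress = (current_step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr_rate + (1.0 - min_lr_rate) * cosine


def get_cosine_min_lr_warmup_lr_rate_schedule(optimizer, num_warmup_steps, num_training_steps,
                                              warmup_lr_rate=None, min_lr_rate=0.0, last_epoch=-1):
    # Default keeps the first step's LR nonzero and is equivalent to
    # warmup_lr_rate = 1 / num_warmup_steps, as described above.
    if warmup_lr_rate is None:
        warmup_lr_rate = 1.0 / max(1, num_warmup_steps)
    lr_lambda = partial(_lr_lambda, num_warmup_steps=num_warmup_steps,
                        num_training_steps=num_training_steps,
                        warmup_lr_rate=warmup_lr_rate, min_lr_rate=min_lr_rate)
    return LambdaLR(optimizer, lr_lambda, last_epoch)
```

With these defaults, the multiplier at step 0 is 1/num_warmup_steps rather than 0, and it decays to min_lr_rate by the final step.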

Fixes: warmup start learning rate ratio (warmup_min_ratio in DeepSpeed's WarmupCosineLR) missing from the Transformers cosine scheduler

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr and @SunMarc

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@richardodliu
Author

@muellerzr and @SunMarc : Hello, I submitted this PR over a week ago, but it hasn't been reviewed yet. I've checked the unit test results and confirmed that the failing tests don't seem to be caused by my code changes. Additionally, the test cases I've submitted are passing. Therefore, I believe this PR is ready for review.

@SunMarc (Member) left a comment

LGTM ! Thanks for adding this @richardodliu !
