Finetune Hydra #797
Conversation
This looks great! I'm happy with it as is. A few tiny nits, feel free to ignore!
LGTM!
Looking good! I added a few small comments. Also, do we want to include fine-tuning configs in this PR? I'm assuming we don't know how well those work yet (especially for the MD+all checkpoint; it would be easier to have configs for a 2M model), and others might assume they work well.
Thanks @rayg1234 ! Just a few small suggestions in file comments.
LGTM!
Design doc
Description
This PR creates a FineTuneHydra model that allows users to finetune entire models, or to easily replace heads and finetune. The main concept is to treat a finetune job as starting a brand new training job, so we retain the functionality of --mode=train. We also do not use the --checkpoint option, as that would indicate resuming from a checkpoint rather than starting a new training job.

To finetune, the user replaces the model component of a fairchem config with that of a FineTuneHydra model. A starting_checkpoint is supplied to tell the FineTuneHydra to start with the initial model and weights from the given checkpoint (a config sketch follows the list below).

To allow finetuning from hydra models, we first support 2 modes:
DATA_ONLY: does not change the model; loads all previous weights and only finetunes on new data
RETAIN_BACKBONE_ONLY: only loads the backbone and requires the user to specify new heads
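As a rough illustration, the model block of a finetune config might look like the sketch below. This is a hypothetical sketch only: the exact key names and nesting (finetune_hydra, finetune_config, mode, heads, the head module names) are assumptions for illustration, not copied verbatim from the configs in this PR.

```yaml
# Hypothetical sketch -- key names and nesting are assumptions for illustration,
# not taken verbatim from this PR.
model:
  name: finetune_hydra                  # replaces the usual model entry in the config
  finetune_config:
    # Checkpoint that supplies the initial model and weights.
    starting_checkpoint: <checkpoint from oc20 run>
    # DATA_ONLY: keep the model unchanged, load all previous weights, finetune on new data.
    mode: DATA_ONLY
    # For RETAIN_BACKBONE_ONLY, only the backbone is loaded and new heads must be
    # specified, e.g.:
    # mode: RETAIN_BACKBONE_ONLY
    # heads:
    #   energy:
    #     module: equiformer_v2_energy_head
    #   forces:
    #     module: equiformer_v2_force_head
```

The rest of the config (datasets, optim, etc.) stays a normal fairchem training config, which is why the same --mode train entry point is used.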
Example workflow:
1. Train a model on oc20:
fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/equiformer_v2_oc20.yml --optim.batch_size=1 --amp --num-gpus=1 --optim.eval_every=100 --distributed

2. Finetune on oc22, with starting_checkpoint=<checkpoint from oc20 run> set in the finetune config:
fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/finetune_on_oc22.yml --optim.batch_size=1 --num-gpus=1 --optim.eval_every=100
NOTE: here a --checkpoint is not given on the command line because we are starting a brand new training run, not resuming from a previous state.

3. To resume the finetuning run, pass --checkpoint as with any other training run:
fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/finetune_on_oc22.yml --optim.batch_size=1 --num-gpus=1 --optim.eval_every=100 --checkpoint "./checkpoints/2024-08-07-23-34-24-test/checkpoint.pt"
A further round of finetuning can then start from starting_checkpoint=<checkpoint from oc22 finetune run>.
Not supported in this PR (but available as a follow-up):
Other notable changes
TODO:
Test Plan
Sanity checks
Tests:
pytest tests/core/e2e/test_e2e_finetune_hydra.py