Finetune Hydra #797

Merged
merged 19 commits into main from rgao_finetune_hydra on Aug 14, 2024

Conversation


@rayg1234 (Collaborator) commented on Aug 7, 2024

Design doc

Description

This PR adds a FineTuneHydra model that lets users finetune an entire model, or easily replace its heads and finetune those. The main concept is to treat a finetune job as a brand-new training job, so we retain the existing --mode=train functionality. We also deliberately do not use the --checkpoint option, since that indicates resuming from a checkpoint rather than starting a new training job.

To finetune, the user replaces the model section of a fairchem config with that of a FineTuneHydra model. A starting_checkpoint is supplied to tell FineTuneHydra which checkpoint to take the initial model and weights from.

To allow finetuning from hydra models, we initially support two modes:
DATA_ONLY: keeps the model unchanged, loads all previous weights, and finetunes only on new data:

model:
  name: finetune_hydra
  finetune_config:
    mode: DATA_ONLY
    starting_checkpoint: "./checkpoints/2024-08-07-20-20-16-test/checkpoint.pt"

RETAIN_BACKBONE_ONLY: loads only the backbone and requires the user to specify new heads:

model:
  name: finetune_hydra
  finetune_config:
    mode: RETAIN_BACKBONE_ONLY
    starting_checkpoint: "./checkpoints/2024-08-07-20-20-16-test/checkpoint.pt"
    heads:
      oc22_energy:
        module: equiformer_v2_energy_head
      oc22_forces:
        module: equiformer_v2_force_head

Example workflow:

  1. Train the original oc20 model:
  • fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/equiformer_v2_oc20.yml --optim.batch_size=1 --amp --num-gpus=1 --optim.eval_every=100 --distributed
  2. Finetune the oc20 model on oc22 data (a sketch of a possible finetune config follows this list):
  • create a finetune config yml with starting_checkpoint=<checkpoint from the oc20 run>
  • fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/finetune_on_oc22.yml --optim.batch_size=1 --num-gpus=1 --optim.eval_every=100
    NOTE: no --checkpoint is given on the command line here, because we are starting a brand-new training run, not resuming from a previous state.
  3. Resume the finetuning run from step 2:
  • fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/finetune_on_oc22.yml --optim.batch_size=1 --num-gpus=1 --optim.eval_every=100 --checkpoint "./checkpoints/2024-08-07-23-34-24-test/checkpoint.pt"
  4. Finetune on another dataset from the checkpoint of step 2:
  • create another finetune config yml with starting_checkpoint=<checkpoint from the oc22 finetune run>
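
For reference, a minimal sketch of what a finetune_on_oc22.yml could look like. Only the model section follows the format introduced in this PR; the trainer, dataset, and optim keys below are illustrative placeholders and may not match the exact schema of the example configs in the repo:

trainer: ocp                      # placeholder trainer name

dataset:
  train:
    src: data/oc22/s2ef/train     # hypothetical path to oc22 training data
  val:
    src: data/oc22/s2ef/val_id    # hypothetical path to oc22 validation data

model:
  name: finetune_hydra
  finetune_config:
    mode: DATA_ONLY
    starting_checkpoint: "./checkpoints/2024-08-07-20-20-16-test/checkpoint.pt"

optim:
  batch_size: 4                   # illustrative values only
  eval_every: 100
  lr_initial: 0.0001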

Not supported in this PR (planned as a follow-up):

  • A general finetune mode where heads can be partially retained and used as input

Other notable changes

  • Removed the mutation of the model config (model -> model_attributes) in base_trainer. This only affects downstream applications that assume "model_attributes" is present in a checkpoint's config, and I have not found any hard dependencies on it. A compatibility sketch for checkpoint readers is shown below.
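
For any downstream code that reads model settings out of a checkpoint, a defensive lookup along these lines should handle both old checkpoints (which renamed the section to "model_attributes") and new ones (which keep "model"). This is a minimal sketch, not code from this PR, assuming the training config is stored under a top-level "config" key and using a placeholder checkpoint path:

import torch

# Placeholder path; substitute a real checkpoint file.
ckpt = torch.load("./checkpoints/2024-08-07-20-20-16-test/checkpoint.pt", map_location="cpu")

config = ckpt["config"]
# New-style checkpoints keep the model settings under "model";
# older checkpoints renamed them to "model_attributes".
model_config = config.get("model", config.get("model_attributes"))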

TODO:

  • add configs
  • add tests

Test Plan

Sanity checks

  • Run finetuning on oc22 (on top of oc20) with DATA_ONLY
  • Run finetuning on oc22 with new force/energy heads
  • Run finetuning on oc22 with a new single force head
  • Resume an interrupted finetuning run
  • Run finetuning a second time from the finetuned model
  • Train a new oc20 base 31M model on the cluster
  • Finetune oc22 on the fully trained oc20 model on the cluster

Tests:
pytest tests/core/e2e/test_e2e_finetune_hydra.py
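
To iterate on a single case, pytest's -k expression filter can narrow the run; the test-name fragment below is hypothetical and should be replaced with a real test name from the file:

pytest tests/core/e2e/test_e2e_finetune_hydra.py -k "data_only" -x -v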

@rayg1234 marked this pull request as ready for review on August 8, 2024

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 95.10490% with 7 lines in your changes missing coverage. Please review.

Files with missing lines (patch %, lines missing):
  src/fairchem/core/models/finetune_hydra.py: 96.19%, 4 missing
  src/fairchem/core/models/base.py: 87.50%, 2 missing
  src/fairchem/core/common/utils.py: 90.00%, 1 missing

File coverage changes (total %, patch %, Δ):
  src/fairchem/core/trainers/base_trainer.py: 89.76%, 100.00%, +0.20% ⬆️
  src/fairchem/core/common/utils.py: 66.10%, 90.00%, +0.36% ⬆️
  src/fairchem/core/models/base.py: 86.88%, 87.50%, -1.08% ⬇️
  src/fairchem/core/models/finetune_hydra.py: 96.19%, 96.19%, ø

... and 8 files with indirect coverage changes

@rayg1234 requested review from misko and wood-b on August 10, 2024
misko previously approved these changes on Aug 12, 2024

@misko (Collaborator) left a comment:

This looks great! I'm happy with it as is. A few tiny nits, feel free to ignore!
LGTM!

@wood-b (Collaborator) left a comment:

Looking good! I added a few small comments. Also, do we want to include fine-tuning configs in this PR? I'm assuming we don't know how well those work yet (especially on the MD+all checkpoint; it would be easier to have configs for a 2M model), and others might assume they work well.

@rayg1234 added the minor (Minor version release) and enhancement (New feature or request) labels on Aug 13, 2024
@lbluque (Collaborator) left a comment:

Thanks @rayg1234! Just a few small suggestions in the file comments.

@rayg1234 enabled auto-merge on August 13, 2024

@rayg1234 dismissed lbluque's stale review on August 13, 2024 ("fixed changes")

@rayg1234 added this pull request to the merge queue on Aug 13, 2024
@wood-b (Collaborator) left a comment:

LGTM!

Merged via the queue into main with commit 8fb16d6 Aug 14, 2024
12 checks passed
@rayg1234 deleted the rgao_finetune_hydra branch on August 14, 2024
Labels: enhancement (New feature or request), minor (Minor version release)
Projects: none
4 participants