
EquiformerV2 + DeNS model and trainer #880

Open: wants to merge 29 commits into main
Conversation

@kyonofx (Collaborator) commented on Oct 18, 2024

No description provided.

@kyonofx added the enhancement (New feature or request) and patch (Patch version release) labels on Oct 18, 2024
codecov bot commented on Oct 18, 2024

Codecov Report

Attention: Patch coverage is 14.07129% with 458 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines missing |
| --- | --- | --- |
| ...core/models/equiformer_v2/trainers/dens_trainer.py | 11.53% | 322 ⚠️ |
| ...em/core/models/equiformer_v2/equiformer_v2_dens.py | 19.52% | 136 ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/fairchem/core/trainers/ocp_trainer.py | 69.66% <ø> (ø) |
| ...em/core/models/equiformer_v2/equiformer_v2_dens.py | 19.52% <19.52%> (ø) |
| ...core/models/equiformer_v2/trainers/dens_trainer.py | 11.53% <11.53%> (ø) |

@IliasChair commented

Hey,
Looks like we were tackling the same issue, but you beat me to it! Great solution! Our implementations are pretty similar, but I didn't add a separate noise head since I thought it looks cleaner without one. While I'm not associated with the OCP repo, I fully support the changes made. I plan to test this implementation in the coming week if I find the time. I also have some minor suggestions - please see below:
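
For readers unfamiliar with the design choice being discussed, here is a minimal sketch of the two options: a dedicated noise head versus reusing the force head for the DeNS denoising target. The module names (`force_head`, `noise_head`) and structure are hypothetical illustrations, not taken from this PR:

```python
import torch
from torch import nn

class DeNSOutputSketch(nn.Module):
    """Illustrative only: two ways a model can predict the DeNS
    denoising target (the per-atom noise vector)."""

    def __init__(self, dim: int, separate_noise_head: bool):
        super().__init__()
        # Standard per-atom force prediction head.
        self.force_head = nn.Linear(dim, 3)
        # Option discussed in this thread: a dedicated head for the
        # noise target instead of reusing the force head for both tasks.
        self.noise_head = nn.Linear(dim, 3) if separate_noise_head else None

    def forward(self, node_feats: torch.Tensor, denoising: bool) -> torch.Tensor:
        if denoising and self.noise_head is not None:
            return self.noise_head(node_feats)
        # Without a separate head, the force head serves both the force
        # and the denoising objectives.
        return self.force_head(node_feats)
```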

@IliasChair left a comment

Please see my suggestions below.

Comment on lines 592 to 602:

```python
        * loss_info["fn"](
            pred,
            target,
            natoms=batch.natoms,
            batch_size=batch_size,
        )
    )

    # Sanity check to make sure the compute graph is correct.
    for lc in loss:
        assert hasattr(lc, "grad_fn")
```

In the DeNS code, the appending to loss is done in the else: block. Is it correct that this was moved out?
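
To make the question concrete, here is a rough, self-contained sketch of the two control-flow variants being compared. Every name here (`special_case`, `loss_fn`, the literal values) is a placeholder for illustration, assumed from the excerpt and the DeNS reference code rather than quoted from this PR:

```python
def loss_fn(pred, target):
    return (pred - target) ** 2  # stand-in for the configured loss

loss = []
mult = 1.0
loss_info = {"fn": loss_fn}
pred, target = 1.0, 0.5
special_case = False

# Variant A (DeNS reference, as described above): the term is appended
# to `loss` only inside the else: branch, so the special case does not
# contribute a term this way.
if special_case:
    pass  # per-target special handling elided
else:
    loss.append(mult * loss_info["fn"](pred, target))

# Variant B (this PR's excerpt): the append sits after the if/else, so
# it runs for every target regardless of which branch was taken.
if special_case:
    pass  # per-target special handling elided
loss.append(mult * loss_info["fn"](pred, target))
```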

(A resolved, outdated review comment on src/fairchem/core/modules/loss.py.)
@IliasChair commented

Hello again,
I finally got around to testing your implementation. I have noticed that running it as-is requires a lot of GPU memory. I am running this on one enterprise NVIDIA H100 GPU, and I can't even get it to run with a batch size of 4, whereas with the original version I could easily get away with a batch size of 96. I don't think this is expected behavior; I might have missed a bug somewhere in my review. Is this a known problem? I have also checked that I am running the correct dependency versions.

I am very eager to try out DeNS with the recent OCP additions, so if there is anything I can do to assist, let me know!
Since I don't encounter this issue on my fork, I will compare our implementations again and see if I have missed anything after all.

Best
Ilias

@lbluque (Collaborator) commented on Oct 23, 2024

> Hello again, I finally got around to testing your implementation. I have noticed that running it as-is requires a lot of GPU memory. [...] Since I don't encounter this issue on my fork, I will compare our implementations again and see if I have missed anything after all.

Hi @IliasChair,

Thanks for looking into this PR and for your suggestions. We did not run into issues using a batch size of 8 on A100 GPUs.

If you get a chance to compare with your implementation and flag any differences that may be causing the additional memory use, please let us know!
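
One way to narrow such a gap down (a sketch, not part of the PR; `run_one_training_step` is a stand-in for whatever actually executes a step in each implementation) is to record PyTorch's peak allocated GPU memory per batch size on both branches and compare:

```python
import torch

def peak_gib_for_step(run_one_training_step, batch) -> float:
    """Run a single training step and return the peak GPU memory
    (in GiB) that PyTorch allocated while running it."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_one_training_step(batch)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3

# Usage sketch: call this on both branches with identical batches and
# batch sizes; a large gap points at where memory diverges (e.g. extra
# activations kept alive by the denoising targets).
```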

@IliasChair left a comment

I have found another bug in the code. Please see below.
Best
Ilias

@IliasChair left a comment

Hello,
I've got a few more comments for you. Sorry about the back-and-forth. I'll try to put everything into one bigger review if anything else comes up.

Labels: enhancement (New feature or request), patch (Patch version release)
Projects: None yet
Development: Successfully merging this pull request may close these issues.
4 participants