
EquiformerV2 + DeNS model and trainer #880

Open: wants to merge 29 commits into main
Conversation

@kyonofx (Collaborator) commented on Oct 18, 2024

No description provided.

@kyonofx added the enhancement (New feature or request) and patch (Patch version release) labels on Oct 18, 2024
codecov bot commented on Oct 18, 2024

Codecov Report

Attention: Patch coverage is 14.07129% with 458 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines missing |
| --- | --- | --- |
| ...core/models/equiformer_v2/trainers/dens_trainer.py | 11.53% | 322 ⚠️ |
| ...em/core/models/equiformer_v2/equiformer_v2_dens.py | 19.52% | 136 ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/fairchem/core/trainers/ocp_trainer.py | 69.66% <ø> (ø) |
| ...em/core/models/equiformer_v2/equiformer_v2_dens.py | 19.52% <19.52%> (ø) |
| ...core/models/equiformer_v2/trainers/dens_trainer.py | 11.53% <11.53%> (ø) |

@IliasChair commented

Hey,
Looks like we were tackling the same issue, but you beat me to it! Great solution! Our implementations are pretty similar, but I didn't add a separate noise head since I thought it looks cleaner without one. While I'm not associated with the OCP repo, I fully support the changes made. I plan to test this implementation in the coming week if I find the time. I also have some minor suggestions - please see below:
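
For readers unfamiliar with the design choice being discussed, here is a minimal sketch of the two options: a dedicated noise head versus reusing the force head for the DeNS denoising target. The module names (`force_head`, `noise_head`) and structure are hypothetical illustrations, not taken from this PR:

```python
import torch
from torch import nn

class DeNSOutputSketch(nn.Module):
    """Illustrative only: two ways a model can predict the DeNS
    denoising target (the per-atom noise vector)."""

    def __init__(self, dim: int, separate_noise_head: bool):
        super().__init__()
        # Standard per-atom force prediction head.
        self.force_head = nn.Linear(dim, 3)
        # Option discussed in this thread: a dedicated head for the
        # noise target instead of reusing the force head for both tasks.
        self.noise_head = nn.Linear(dim, 3) if separate_noise_head else None

    def forward(self, node_feats: torch.Tensor, denoising: bool) -> torch.Tensor:
        if denoising and self.noise_head is not None:
            return self.noise_head(node_feats)
        # Without a separate head, the force head serves both the force
        # and the denoising objectives.
        return self.force_head(node_feats)
```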

@IliasChair left a comment

Please see my suggestions below.

Comment on lines 592 to 602:

```python
        * loss_info["fn"](
            pred,
            target,
            natoms=batch.natoms,
            batch_size=batch_size,
        )
    )

    # Sanity check to make sure the compute graph is correct.
    for lc in loss:
        assert hasattr(lc, "grad_fn")
```

In the DeNS code, the appending to loss is done in the else: block. Is it correct that this was moved out?
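
To make the question concrete, here is a rough, self-contained sketch of the two control-flow variants being compared. Every name here (`special_case`, `loss_fn`, the literal values) is a placeholder for illustration, assumed from the excerpt and the DeNS reference code rather than quoted from this PR:

```python
def loss_fn(pred, target):
    return (pred - target) ** 2  # stand-in for the configured loss

loss = []
mult = 1.0
loss_info = {"fn": loss_fn}
pred, target = 1.0, 0.5
special_case = False

# Variant A (DeNS reference, as described above): the term is appended
# to `loss` only inside the else: branch, so the special case does not
# contribute a term this way.
if special_case:
    pass  # per-target special handling elided
else:
    loss.append(mult * loss_info["fn"](pred, target))

# Variant B (this PR's excerpt): the append sits after the if/else, so
# it runs for every target regardless of which branch was taken.
if special_case:
    pass  # per-target special handling elided
loss.append(mult * loss_info["fn"](pred, target))
```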

(A resolved, outdated review comment on src/fairchem/core/modules/loss.py.)
@IliasChair commented

Hello again,
I finally got around to testing your implementation. I have noticed that running it as-is requires a lot of GPU memory. I am running this on one enterprise NVIDIA H100 GPU, and I can't even get it to run with a batch size of 4, whereas with the original version I could easily get away with a batch size of 96. I don't think this is expected behavior; I might have missed a bug somewhere in my review. Is this a known problem? I have also checked that I am running the correct dependency versions.

I am very eager to try out DeNS with the recent OCP additions, so if there is anything I can do to assist, let me know!
Since I don't encounter this issue on my fork, I will compare our implementations again and see if I have missed anything after all.

Best
Ilias

@lbluque (Collaborator) commented on Oct 23, 2024

> Hello again, I finally got around to testing your implementation. I have noticed that running it as-is requires a lot of GPU memory. [...] Since I don't encounter this issue on my fork, I will compare our implementations again and see if I have missed anything after all.

Hi @IliasChair,

Thanks for looking into this PR and for your suggestions. We did not run into issues using a batch size of 8 on A100 GPUs.

If you get a chance to compare with your implementation and flag any differences that may be causing the additional memory use, please let us know!
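
One way to narrow such a gap down (a sketch, not part of the PR; `run_one_training_step` is a stand-in for whatever actually executes a step in each implementation) is to record PyTorch's peak allocated GPU memory per batch size on both branches and compare:

```python
import torch

def peak_gib_for_step(run_one_training_step, batch) -> float:
    """Run a single training step and return the peak GPU memory
    (in GiB) that PyTorch allocated while running it."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_one_training_step(batch)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3

# Usage sketch: call this on both branches with identical batches and
# batch sizes; a large gap points at where memory diverges (e.g. extra
# activations kept alive by the denoising targets).
```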

@IliasChair left a comment

I have found another bug in the code. Please see below.
Best
Ilias

@IliasChair left a comment

Hello,
I've got a few more comments for you. Sorry about the back-and-forth. I'll try to put everything into one bigger review if anything else comes up.

Labels: enhancement (New feature or request), patch (Patch version release)
Projects: None yet
Development: Successfully merging this pull request may close these issues.
4 participants