
Ways to reduce memory use #96

Open
peastman opened this issue May 31, 2022 · 15 comments

@peastman
Collaborator

I'm trying to train equivariant transformer models on a GPU with 12 GB of memory. I can train small to medium-sized models, but if I make the model too large (for example, 6 layers with embedding dimension 96), CUDA runs out of device memory. Is there anything I can do to reduce the memory requirements? I already tried reducing the batch size, but it didn't help.

@giadefa
Contributor

giadefa commented May 31, 2022 via email

@PhilippThoelke
Collaborator

16-bit floats?
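For instance, a minimal sketch of what that could look like, assuming the training loop is built on PyTorch Lightning (the load_model discussion further down suggests it is); the `precision` flag here is Lightning's generic mixed-precision option, not necessarily how this repository's training script exposes it:

```python
# Hedged sketch: half-precision training via PyTorch Lightning's mixed precision.
# `model` and `datamodule` are placeholders for whatever LightningModule and
# DataModule the training script constructs; they are not torchmd-net API names.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,  # keep activations and gradients in fp16, roughly halving activation memory
)
# trainer.fit(model, datamodule=datamodule)
```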

@peastman
Collaborator Author

I'm using a cutoff of 10 Å. I've found that 5 Å is far too short: it can't reproduce energies for molecules larger than about 40 atoms, and it has no chance at all on intermolecular interactions.

Batch size seems to have very little effect on memory use. With 5 layers I can use batch size 100. Add a sixth layer and it runs out of memory even if I reduce it to 1.

@giadefa
Contributor

giadefa commented May 31, 2022 via email

@PhilippThoelke
Collaborator

PhilippThoelke commented May 31, 2022

> Batch size seems to have very little effect on memory use. With 5 layers I can use batch size 100. Add a sixth layer and it runs out of memory even if I reduce it to 1.

That seems odd. Are you sure you are changing batch_size and inference_batch_size?

@peastman
Collaborator Author

> With a cutoff of 10 Å, what are you using for max_num_neighbors?

80

> Are you sure you are changing batch_size and inference_batch_size?

I was changing only batch_size, not inference_batch_size. If I reduce both of them to 32 then I can get it to run. Thanks!

@giadefa
Contributor

giadefa commented May 31, 2022 via email

@peastman
Collaborator Author

There's no problem with values higher than 80 (except, of course, running out of memory). 100 also works.

@giadefa
Contributor

giadefa commented May 31, 2022 via email

@peastman
Collaborator Author

It depends on the particular samples. Does the value of max_num_neighbors apply only to training? Or does it set a limit on any molecule you can ever evaluate with the trained model?

@PhilippThoelke
Collaborator

It always applies, not only during training. The argument determines the maximum number of neighbors collected by the neighbor list algorithm. You can override it when you load a model checkpoint, though, to set it to a higher value for inference, for example.
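To illustrate what that cap means, here is a schematic toy version of a capped neighbor list (illustration only, not the library's actual implementation): each atom keeps at most max_num_neighbors partners within the cutoff, and any pairs beyond the cap never enter the graph.

```python
import torch

def capped_neighbor_list(pos, cutoff, max_num_neighbors):
    """Toy O(N^2) neighbor list with a per-atom cap, for illustration only."""
    dist = torch.cdist(pos, pos)           # all-pairs distances, shape (N, N)
    dist.fill_diagonal_(float("inf"))      # exclude self-pairs
    rows, cols = [], []
    for i in range(pos.shape[0]):
        idx = torch.nonzero(dist[i] < cutoff).flatten()
        idx = idx[:max_num_neighbors]      # neighbors past the cap are dropped in this toy version
        rows.append(torch.full_like(idx, i))
        cols.append(idx)
    return torch.cat(rows), torch.cat(cols)
```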

@peastman
Collaborator Author

That's good to know. If I want to override it, would I just add the argument max_num_neighbors=100 in the call to load_from_checkpoint()?

@PhilippThoelke
Collaborator

I think that should work, but better make sure it actually overrides it. I'd recommend using the torchmdnet.models.model.load_model function to load the model for inference, which strips away the PyTorch Lightning overhead. There you can just pass max_num_neighbors as a keyword argument to override it. You can also, for example, enable or disable force predictions at inference time using derivative=True/False.
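Concretely, something along these lines should work (the checkpoint filename below is just a placeholder):

```python
from torchmdnet.models.model import load_model

# Keyword arguments override the hyperparameters stored in the checkpoint.
model = load_model(
    "epoch=123-val_loss=0.0123.ckpt",  # placeholder checkpoint path
    max_num_neighbors=100,             # allow more neighbors per atom at inference
    derivative=True,                   # also return force predictions
)
```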

@giadefa
Contributor

giadefa commented Oct 11, 2022 via email

@peastman
Collaborator Author

A cutoff of 5 Å doesn't work for anything except very small molecules. Above about 40 atoms, a longer cutoff is essential or you get very large errors. I'm hoping that once we add explicit terms for Coulomb and dispersion interactions, that will allow using a shorter cutoff for the neural network.
