Demonstrate parallel execution of a loss function #50

jackbaker1001 · 2023-09-12T18:01:27Z

On an HPC cluster, each term in a mean square loss can be calculated using embarrassingly parallel logic.

Unfortunately, the native way of doing this in JAX (using jax.vmap and jax.pmap) is not compatible with the input we must parallelize over: the Molecule object. This is because its data is stored in a "ragged" structure, i.e., the dimensions of the grid for one molecule are very often different from those of the grid for another, and the dimensions of the 1-RDM likewise differ between molecules: jnp.array([rdm1_1, rdm1_2]) will not work.
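A minimal sketch of the problem and one possible workaround (padding plus a mask). The arrays here are hypothetical stand-ins for per-molecule grid data, not the real Molecule fields, and `per_molecule_term` is a toy function, not our loss:

```python
import jax
import jax.numpy as jnp

# Hypothetical stand-ins for per-molecule grid data of different sizes.
grids = [jnp.ones(5), jnp.ones(8), jnp.ones(3)]

# Stacking ragged arrays fails because the shapes differ.
try:
    jnp.stack(grids)
    stacked_ok = True
except (ValueError, TypeError):
    stacked_ok = False

# Possible workaround: pad every array to a common length and carry a mask
# marking which entries are real data.
max_len = max(g.shape[0] for g in grids)

def pad(g):
    n = g.shape[0]
    values = jnp.pad(g, (0, max_len - n))  # zero-pad on the right
    mask = jnp.arange(max_len) < n         # True only for real grid points
    return values, mask

values, masks = map(jnp.stack, zip(*[pad(g) for g in grids]))

def per_molecule_term(v, m):
    # Toy per-molecule quantity: a masked sum over grid points.
    return jnp.sum(jnp.where(m, v, 0.0))

# With rectangular inputs, vmap works as usual.
terms = jax.vmap(per_molecule_term)(values, masks)
```

Padding costs memory when molecule sizes vary a lot, and every quantity computed on the padded arrays has to respect the mask, so this may not be the right trade-off for us.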

This means that for loss parallelism, we need to think differently. Sharding may be the way forward, but this requires more thought. A good reference is here: https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html
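For orientation, a rough sketch of what a sharded mean squared loss could look like, assuming the per-molecule predictions and targets have already been brought to a rectangular shape. The arrays below are made-up placeholders, and the axis name "batch" is an arbitrary choice:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional mesh over all visible devices (a single CPU also works).
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("batch",))
sharding = NamedSharding(mesh, P("batch"))

# Placeholder data; the batch size is chosen to divide evenly across devices.
n = 2 * devices.size
preds = jax.device_put(jnp.arange(n, dtype=jnp.float32), sharding)
targets = jax.device_put(jnp.zeros(n, dtype=jnp.float32), sharding)

@jax.jit
def mse(p, t):
    # Each device works on its shard; the compiler inserts the
    # cross-device reduction needed for the mean.
    return jnp.mean((p - t) ** 2)

loss = mse(preds, targets)
```

This does not address the ragged Molecule data itself; it only shows the mechanics once the inputs are regular arrays.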

I don't think we will get around to solving this problem before our release deadline, but if we want to do something with HPC, getting this right is non-negotiable.

PabloAMC · 2023-09-12T18:29:05Z

@Matematija recommended sharding too.

jackbaker1001 · 2023-12-07T17:03:53Z

Related to #83

jackbaker1001 · 2023-12-11T19:01:41Z

Having played around with multi-host parallelism in JAX, I ran into many issues with GPU detection on Perlmutter.

I'm giving mpi4jax a go for this task now. It should be fairly easy if this works well on Perlmutter.

jackbaker1001 mentioned this issue Sep 12, 2023

Implement new loss functions in train.py #43

Closed

jackbaker1001 self-assigned this Dec 12, 2023
