
Avoiding duplicated computations by having a single observable model #1855

Draft
APJansen wants to merge 2 commits into master from merge-observable-splits

Conversation

@APJansen (Collaborator)

Goal

The goal of this PR is to speed up the code by a factor of 2 through a refactoring that avoids redoing the same computations.
Currently there are separate training and validation models.
At every training step the validation model is run from scratch on the x inputs, even though its only difference from the training model is the final masking just before the loss is computed.

This will hopefully also improve readability. From an ML point of view the current naming is very confusing. Instead of having a training model and a validation model, we can have a single observable model, and on top of that a training and a validation loss. (Just talking about names here; they may still be MetaModels.)

The same holds of course for the experimental model, except that there is no significant performance cost there. But for consistency and readability let's try to treat it on the same footing.

This PR branches off of trvl-mask-layers because that PR changes the masking. That one should be merged before this one.

Current implementation

Models creation

The models are constructed in ModelTrainer._model_generation.
Specifically in the function _pdf_injection, which is given the pdfs, a list of observables and a corresponding list of masks.
For the different "models", not only the values of the masks but also the list of observables changes, as not all models use all observables (in particular the positivity and integrability ones).
This function just calls the observables on the pdfs with the mask as argument.
And each observable's call method, defined here, does two steps: 1. compute the observable, 2. apply the mask and compute the loss.
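To make the coupling concrete, here is a minimal sketch of that two-step structure (the class and method names are illustrative, not the actual n3fit ObservableWrapper code): the same call both evaluates the observable and turns it into a masked loss, which is exactly what this PR proposes to split.

```python
# Illustrative sketch only: a wrapper whose single call both computes the
# observable and applies the mask + loss, mirroring the two steps above.
class ObservableWrapperSketch:
    def __init__(self, observable_layer, loss_layer):
        self.observable_layer = observable_layer  # e.g. FK-table convolution
        self.loss_layer = loss_layer              # e.g. an invcovmat loss

    def __call__(self, pdf, mask):
        prediction = self.observable_layer(pdf)  # step 1: compute the observable
        masked = mask(prediction)                # step 2a: apply the tr/vl mask
        return self.loss_layer(masked)           # step 2b: compute the loss
```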

Models usage

Once they are created, the training model is, obviously, used for training here.
The validation model is used to initialize the Stopping object. The only thing that happens there is that its compute_losses method is called. Similarly for the experimental model, where it is called directly in the ModelTrainer (here).

Changes proposed

  1. Decouple the masking and loss computation from the ObservableWrapper class: remove those parts from ObservableWrapper and perhaps create an ObservableLoss layer that does this.
  2. Apply this pure observable class to the pdfs, for all observables, to create an observables_model.
  3. Create 3 loss models that take all observables as input, apply the masking, select the relevant observables, and compute the losses.
  4. For the training one, put it on top of the observables_model, creating a model identical to the current training model.
  5. Add the output of the observables_model to the output list of this training model, so these outputs can be reused.
  6. The validation and experimental models can be discarded; instead we have validation and experimental losses that are applied to the output of the observables_model. So e.g. we can replace self.experimental["model"].compute_losses() with experimental_loss(observables). (A sketch of this layout follows the list.)
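
Below is a minimal Keras sketch of the proposed layout, using toy Dense layers as stand-ins for the PDF and the observables. All names (obs_A, obs_B) and the mask values are illustrative assumptions, not the actual n3fit objects, and here the training losses are simply compiled onto the observables_model rather than wrapped in a separate model.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers

x = layers.Input(shape=(3,), name="x")
pdf = layers.Dense(8, name="pdf")(x)

# A single observables_model computes every observable once, unmasked.
obs_a = layers.Dense(4, name="obs_A")(pdf)
obs_b = layers.Dense(5, name="obs_B")(pdf)
observables_model = Model(inputs=x, outputs=[obs_a, obs_b], name="observables_model")

# Per-split losses apply their own mask on top of the unmasked observables.
def masked_mse(mask):
    mask = tf.constant(mask, dtype=tf.float32)
    def loss(y_true, y_pred):
        return tf.reduce_mean(mask * tf.square(y_true - y_pred))
    return loss

# Training: the observables_model with the training losses attached on top.
observables_model.compile(
    optimizer="adam",
    loss=[masked_mse([1.0, 1.0, 0.0, 0.0]), masked_mse([1.0, 0.0, 1.0, 0.0, 0.0])],
)

# Validation/experimental: the same outputs fed to different loss layers,
# so the observables (and the PDF) are never recomputed.
val_loss_a = masked_mse([0.0, 0.0, 1.0, 1.0])
obs_out = observables_model.predict(np.random.rand(10, 3), verbose=0)
print(float(val_loss_a(np.zeros((10, 4), dtype=np.float32), obs_out[0])))
```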

@APJansen (Collaborator Author)

I'm looking into the first point, decoupling the computation of the observables from their masking and loss. Some questions, @goord:

Currently in _generate_experimental_layer this happens:

  1. observables computed
  2. masks applied one by one
  3. masked observables concatenated
  4. some rotation
  5. loss computed

What does this rotation do?

And is it possible to change it, here, to the following (at the cost of concatenating masks inside observable_generator)?

  1. observables computed
  2. UNmasked observables concatenated
  3. (rotation?)

Subsequent loss layer (a rough sketch follows below):

  1. masks applied to concatenated observables
  2. loss computed
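
A rough sketch of such a subsequent loss layer (toy code, not the actual n3fit layers; the mask, covmat, and shapes are purely illustrative): the observables arrive concatenated and unmasked, and the layer applies the split mask before computing an invcovmat-style chi2.

```python
import numpy as np
import tensorflow as tf

class MaskedChi2(tf.keras.layers.Layer):
    """Applies a tr/vl mask to concatenated observables and computes a chi2."""

    def __init__(self, split_mask, invcovmat, data, **kwargs):
        super().__init__(**kwargs)
        self.split_mask = tf.constant(split_mask, dtype=tf.bool)      # tr or vl split
        self.invcovmat = tf.constant(invcovmat, dtype=tf.float32)     # masked covmat, inverted
        self.data = tf.constant(data, dtype=tf.float32)               # central values

    def call(self, concatenated_observables):
        diff = concatenated_observables - self.data
        diff = tf.boolean_mask(diff, self.split_mask, axis=1)
        return tf.einsum("bi,ij,bj->b", diff, self.invcovmat, diff)

# Toy usage: 5 datapoints in total, 3 of them in the training split.
split_mask = np.array([True, True, False, True, False])
layer = MaskedChi2(split_mask, np.eye(3), np.zeros(5))
print(layer(tf.random.uniform((1, 5))))
```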

Also, I've probably seen this before, but I'm still confused about why there is both a mask applied directly to the observables in _generate_experimental_layer and another one inside the LossInvcovmat itself?

@goord (Collaborator) commented Nov 27, 2023

Yes, these rotations are triggered by the 'data_transformation_tr', which is used if you represent the experimental data in a covariance-diagonal basis, I guess. I'm not sure when this is actually used, and I'm not sure whether this code path is properly tested in the trvl-mask-layers branch...

@goord (Collaborator) commented Nov 27, 2023

The mask in LossInvCovmat is not used for masking training/validation I think.

@APJansen (Collaborator Author)

@goord Why does the experimental output have rotation=None while the others have rotation=obsrot, around here? Is that intentional? It interferes a bit with how I thought the observable computation would decouple from the masked loss.

@goord (Collaborator) commented Jan 12, 2024

> @goord Why does the experimental output have rotation=None while the others have rotation=obsrot, around here? Is that intentional? It interferes a bit with how I thought the observable computation would decouple from the masked loss.

This is a rewrite of line 289 in the master. I don't know why the diagonal basis is not used for the experimental output layer, perhaps @scarlehoff or @RoyStegeman can explain it to us.

If you look at n3fit_data.py you can see that in the diagonal basis, the training and validation covmats are being masked and then inverted, but the full covmat inverse (inv_true) is computed in the old basis.

@scarlehoff (Member)

Because when they were separated it didn't really matter, and it is decoupled from training/validation (the idea of diagonalising is to be able to do the split while removing the correlations within a dataset between training and validation).

@APJansen (Collaborator Author)

Hm I don't fully understand, but is it ok to uniformize this? I now calculate all observables without any mask once, so using the same settings, and then mask the folds and the tr/val split afterwards.
It's passing all the tests and also giving identical results for the main runcard.

@goord (Collaborator) commented Jan 12, 2024

> Hm I don't fully understand, but is it ok to uniformize this? I now calculate all observables without any mask once, so using the same settings, and then mask the folds and the tr/val split afterwards. It's passing all the tests and also giving identical results for the main runcard.

You can try the diag-DIS runcard to check the observable rotation: DIS_diagonal_l2reg_example.yml.txt

@APJansen (Collaborator Author)

Seems to work fine, and gives the same results as trvl-mask-layers.

@scarlehoff (Member)

> I don't fully understand

The chi2 does not (should not) depend on the diagonalization. Since the total covmat is only used to report the total chi2, nobody cared about diagonalising that because it was not needed.

> but is it ok to uniformize this?

Yes, because of the above.

@APJansen (Collaborator Author)

Ok perfect, thanks :)

@APJansen (Collaborator Author)

@scarlehoff @Radonirinaunimi
Positivity is included in the validation model. I remember we discussed this before, and if I remember correctly there was some disagreement on whether this is necessary or not, is that right?
If I remove it, I get an error from this line, which can be fixed by changing fitstate.validation to fitstate._training, after which it runs normally (though I haven't done any comparisons).

Right now I'm thinking that, to remove the repeated calculation of observables, the easiest approach is to combine the training and validation models into one model that computes both of their losses, adding a "_tr" and "_val" postfix and filtering as appropriate when summing to get the final train/val losses. The experimental one can stay separate, as the performance cost there is negligible.

Does that sound ok?

Of course it would be nicer to instead just have one model and 3 different losses, but that will take longer to implement.

@scarlehoff (Member)

I don't understand what you mean. The easiest way looks much more complex to me since you need to filter out things and any bug there will "break" the validation.

@scarlehoff (Member)

Also, I'm not completely sure you can achieve your goal here?

You need to compute everything twice for every epoch just the same.

@APJansen (Collaborator Author)

What I mean is that we would have one model of the form (say we only have the DEUTERON observable) x -> pdf -> DEUTERON -> (DEUTERON_tr, DEUTERON_val), where the tuple consists of two separate layers containing the respective losses, and DEUTERON by itself is the full observable, without any mask (the computation up to and including that layer is the one we don't want to repeat).

This, I think, is what requires the fewest changes. I haven't worked it all out yet, but in the end the Stopping class won't need a separate validation model, and we don't need this compute_losses in MetaModel; it should just receive all the losses, both train and val, inside the history dict. And the default_loss defined above MetaModel would need to check whether the key ends in "_tr" and give 0 otherwise.
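
A hypothetical sketch of that suffix-based filtering (the real default_loss in n3fit may look different; the layer outputs are assumed to already be loss values, as described above):

```python
import tensorflow as tf

def make_default_loss(output_name):
    """Return a Keras-style loss for a given output name (illustrative only)."""
    if output_name.endswith("_tr"):
        # Training outputs: the layer output already is the loss, so just sum it.
        return lambda y_true, y_pred: tf.reduce_sum(y_pred)
    # Validation ("_val") outputs: contribute 0 to the optimised total.
    return lambda y_true, y_pred: tf.zeros(())

losses = {name: make_default_loss(name) for name in ("DEUTERON_tr", "DEUTERON_val")}
```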

@APJansen (Collaborator Author)

In this PR I've already decoupled the computation of the observable from the masking+loss; that was quite simple and gives identical results. The tricky part is how to use that to avoid the repeated computation of the observable (and PDF).

@scarlehoff (Member) commented Jan 12, 2024

> where the tuple consists of two separate layers containing the respective losses

Yes, but your goal is to reduce the number of calls. However, you will need to call the model once for the optimization.
Then the weights are updated.
Then you call it a second time to compute the validation for that step.
Then you check whether the positivity passes and whatnot.

So there is no way to avoid the repeated computation of the observable.

@APJansen (Collaborator Author)

Ah, I hadn't thought about that; you're right that conventionally the validation at step t is computed after training for t steps. My proposal would have a shift of one step (epoch) with respect to this convention, in effect computing the validation losses from step t-1 at step t. But I don't think that's a big deal, right? Changes should be tiny from one step to the next.
This should give a 50% speedup (not 100%, because the validation only has a forward pass while the training also has a backward pass); I think that's very much worth it.

@scarlehoff (Member)

It is a big deal because that tiny change can move you from a physical to an unphysical situation by means of positivity, integrability and probably even normalisation.

But also, in general, it's simply not correct since the epoch at which you wanted to stop was the previous one.

@APJansen (Collaborator Author)

True, but that should be easy to solve: just save the weights at every step, and when the stopping condition hits, instead of just stopping, revert to the previous epoch.
The weights are tiny, so I think that shouldn't be a problem.
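
A toy sketch of such a one-step rollback (illustrative names, not the actual Stopping implementation): keep the previous epoch's weights around and restore them when the stopping condition fires.

```python
class WeightRollback:
    """Keeps two copies of the weights so stopping can revert one epoch."""

    def __init__(self, model):
        self.model = model
        self.previous_weights = None
        self.current_weights = model.get_weights()

    def on_epoch_end(self):
        # Shift the window: what was "current" becomes "previous".
        self.previous_weights = self.current_weights
        self.current_weights = self.model.get_weights()

    def revert_one_epoch(self):
        # Called when the stopping condition triggers: the epoch we actually
        # wanted to stop at was the previous one.
        if self.previous_weights is not None:
            self.model.set_weights(self.previous_weights)
```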

@scarlehoff (Member)

Check the speedup you would get in a fit, and check that it doesn't become much more complicated.

The forward pass is not the heaviest part of the calculation, and a 50% speedup there won't translate into a 50% speedup of the entire fit. So unless it makes a big difference I'm strongly against adding this. It adds a lot of bug cross-section and complicates using custom optimizers quite a bit.

@APJansen (Collaborator Author)

This old tensorboard profile illustrates the speedup. The gaps will be mostly removed by epochs-to-batches. Of what remains, the validation step takes more than 50% of the time of the training step.
[tensorboard profile screenshot]

While I didn't think of the issue you mentioned, I still think it should be possible with minimal addition of complexity (it removes some and adds some).
I'll have a go and then we can see if it's worth it.

@scarlehoff (Member) commented Jan 12, 2024

If the final code is not very complex I'd be happy with it. From what you explain in the comments it looks complicated, especially the idea of the internal filtering of losses.

The tensorboard profile is not enough to know what the effect on the fit would be (especially if it is old; many things have changed in the meantime). Note that you will still need to wait and check positivity, check the chi2, etc.

Btw, going back to the experimental model: note that the comparison data in the experimental model is different from training/validation, so it doesn't really matter how you do it, you need to recompute the losses for that one.

@APJansen APJansen force-pushed the merge-observable-splits branch 4 times, most recently from a8e57c8 to 3d26ce0 Compare January 19, 2024 13:26
@APJansen (Collaborator Author)

The bulk of the work is done, and I've tested that the speedup is indeed about 33% (tested with 1 replica on CPU with NNPDF40_nnlo_as_01180_1000.yml, but it should be the same for e.g. 100 replicas on GPU, as it just skips about 33% of the work).

On a small runcard I've tested, the outcomes are identical training chi2s, identical validation chi2s (shifted by 1 step), and identical final chi2s (train/val/exp) after reverting the weights by one step, which was trivial to do (only at the cost of storing 2 copies of the weights, but they are tiny).

The structure now is that we have one ModelTrainer.observables_model that computes all observables, and a dictionary ModelTrainer.losses with keys "training", "validation", "experimental", and values for each are again dictionaries of per experiment losses, appropriately masked and filtered for each split.
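
For illustration, the intended shape of that structure is roughly the following (the split keys come from the description above, but the observable names and the stand-in callables are assumptions):

```python
# Trivial stand-ins for the real observables_model and per-experiment loss layers.
observables_model = lambda x_grids: {"DEUTERON": [0.1, 0.2], "DIS_OBS": [0.3]}
dummy_loss = lambda observables: 0.0

losses = {
    "training":     {"DEUTERON": dummy_loss, "DIS_OBS": dummy_loss},
    "validation":   {"DEUTERON": dummy_loss, "DIS_OBS": dummy_loss},
    "experimental": {"DEUTERON": dummy_loss, "DIS_OBS": dummy_loss},
}

observables = observables_model("x")  # a single forward pass over all observables
validation_chi2 = sum(loss(observables) for loss in losses["validation"].values())
```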

Note that all losses can be given all observables as input; Keras will just select the ones it needs.
Also, we can include the experimental observables in the same model without performance cost, as tensorflow will realise during training that those outputs aren't being used in the loss and avoid computing them. I verified this using the script below, where the small and smallest models have identical training times, while the big one takes 20 times longer.

It needs some minor tweaks and more testing, but before going into that I would like to know whether you now agree with this approach broadly speaking, @scarlehoff.
I've also cleaned up the stopping module a bit more where it was needed for the main changes here, though certainly more can be done. (FitState and FitHistory could at this point easily be removed entirely, but this PR is already getting too big.)

Timing script:

```python
import time

import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

i_small = Input(shape=(1,))
i_big = Input(shape=(1000,))

d_small = Dense(1)(i_small)
d_large = Dense(300_000)(i_big)  # not used in any model below
d2_large = Dense(300_000)(i_big)
d3_large = Dense(1)(d2_large)

# "model" actually uses the expensive branch; "model_small" receives the big
# input but never uses it; "model_smallest" doesn't receive it at all.
model = Model(inputs=[i_small, i_big], outputs=[d_small, d3_large])
model_small = Model(inputs=[i_small, i_big], outputs=[d_small])
model_smallest = Model(inputs=[i_small], outputs=[d_small])

model.compile(optimizer="adam", loss="mse")
model_small.compile(optimizer="adam", loss="mse")
model_smallest.compile(optimizer="adam", loss="mse")

x_small = np.random.rand(1000, 1)
x_big = np.random.rand(1000, 1000)
y_small = np.random.rand(1000, 1)
y_big = np.random.rand(1000, 1)


def timefit(model, xs, ys):
    start = time.time()
    model.fit(xs, ys)
    end = time.time()
    print(f"Time for fit: {end - start:.5} s")


timefit(model_smallest, x_small, y_small)
timefit(model_small, [x_small, x_big], y_small)  # one output, so one target
timefit(model, [x_small, x_big], [y_small, y_big])
```

@scarlehoff (Member)

It does look better than I expected:

[screenshot omitted]

I cannot commit to having a look before late next week though. What is the chance of separating this from trvl-mask-layers and having it branch off master?

If that would be too much, can we focus on finalizing / merging #1788 in the next few weeks and then go back here?

@APJansen (Collaborator Author)

Rebasing onto master would be difficult, but waiting for trvl-mask-layers to be merged is fine; that should have priority anyway.

@scarlehoff (Member)

Then let's do that. Next week I'll have a look at trvl-mask-layers (is it finished?) and rebase, squash or whatever to update the 2-year-old commits. I think you have been keeping it up to date with master, so it should be trivial.

And then we go back to this one.

@APJansen (Collaborator Author)

Perfect! Indeed trvl-mask-layers is tested and ready to review, and I just merged master into it today.

@goord goord force-pushed the trvl-mask-layers branch 5 times, most recently from de3b55c to f794be6 Compare February 15, 2024 19:22
@APJansen APJansen force-pushed the trvl-mask-layers branch 3 times, most recently from 6bd0fb6 to c2e4935 Compare February 20, 2024 12:17
@APJansen APJansen force-pushed the merge-observable-splits branch 2 times, most recently from c5d97c9 to 374a0a0 Compare February 21, 2024 14:29
Base automatically changed from trvl-mask-layers to master February 22, 2024 12:32
@APJansen APJansen force-pushed the merge-observable-splits branch 2 times, most recently from f696bfd to 0c8e095 Compare February 24, 2024 15:21