
Revert the default for parallel fits to a single model per replica #2127

Merged — 4 commits merged into master from revert_multidense_default on Jul 24, 2024

Conversation

@scarlehoff (Member) commented on Jul 16, 2024

This PR reverts to a model-per-replica configuration (instead of the multidense layer), since the multidense layer is not able (at the moment) to fit models independently, which is necessary for a complete fit.

For sequential fits nothing should change; for parallel fits the difference is about a factor of 2 in time (but I'm guessing that this factor of 2 we were gaining might be why the fits are of worse quality).

I've introduced a multidense layer type so people can still play around with multidense.
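For anyone not familiar with the two options, here is a minimal toy sketch of the structural difference (this is not the actual n3fit code; the class, names and sizes are made up purely for illustration): the dense setup builds one independent model per replica, while a multidense-style layer stacks all replica weights along an extra leading dimension inside a single model.

```python
import tensorflow as tf

n_replicas, n_in, n_out = 3, 2, 8

# "dense": one independent model (and weight set) per replica.
per_replica_models = [
    tf.keras.Sequential([tf.keras.layers.Dense(n_out, activation="tanh")])
    for _ in range(n_replicas)
]

# "multidense"-style: a single layer holding all replica kernels stacked along a
# leading replica axis, evaluated in one einsum call.
class ToyMultiDense(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.kernel = self.add_weight(
            shape=(n_replicas, input_shape[-1], n_out), initializer="glorot_uniform"
        )

    def call(self, x):
        # x: (batch, n_in) -> (n_replicas, batch, n_out)
        return tf.tanh(tf.einsum("bi,rio->rbo", x, self.kernel))

x = tf.random.uniform((5, n_in))
print([m(x).shape for m in per_replica_models])  # 3 x (5, 8)
print(ToyMultiDense()(x).shape)                  # (3, 5, 8)
```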

Here are two reports; as you can see, multidense has a terrible chi2 while dense is statistically compatible with 4.0.

multidense: https://vp.nnpdf.science/UGSJ6OsPTdSR2s97OXIvDw==
dense: https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==

I haven't looked deeply into the details of the error, but I don't think multidense was ever truly benchmarked in a real-life scenario; at least I could not find any examples.

Anyway, for reference, these were fits of 120 replicas on an RTX 2080: 80 minutes with dense, 40 minutes with multidense.
It would be ideal if multidense could be fixed, but I can live with 80-minute runs... For now, checks.py will fail if you use multidense, to make sure it is only used as a development option.

Fixes, in a way, #2118

TODO:

  • Fix regression tests; note that most of the .json files have no changes (proving that, for one replica, multidense and dense were really the same)
  • Check it works with TF 2.16/2.17 (if I'm able to get a computer with both a GPU and a recent-enough version of Python/TF at the same time, which is not trivial)

Re the last point, I cannot check on GPU myself, but the Python tests are running TF 2.17, so the code does work.

scarlehoff changed the base branch from master to avoid-idle-gpu on Jul 16, 2024
scarlehoff linked an issue on Jul 16, 2024 that may be closed by this pull request
scarlehoff mentioned this pull request on Jul 17, 2024
The base branch was automatically changed from avoid-idle-gpu to master on Jul 17, 2024
@goord (Collaborator) commented on Jul 18, 2024

Hi @scarlehoff, I can do the TF 2.17/Python 3.12 test (on a CPU though :))

@scarlehoff (Member, Author) commented

On CPU it works fine (some of the tests are running TF 2.17). But since there were big changes in TF 2.16, I wonder whether something changed on GPU.

@goord (Collaborator) commented on Jul 18, 2024

OK, I will ask @Cmurilochem to do the test; I don't have access to a GPU right now. I will start looking for the bug in the multidense layer type.

@scarlehoff (Member, Author) commented

No, that one (at least according to the runcard) was run sequentially.

@scarlehoff (Member, Author) commented

If one of you confirms it also works for you, we can merge this.

@Radonirinaunimi (Member) commented

> If one of you confirms it also works for you, we can merge this.

I can test this over the weekend.

@goord (Collaborator) commented on Jul 19, 2024

I am suspicious about whether the multidense implementation results in the correct gradient descent process. Comparing single-dense to multi-dense for parallel replicas, I see quite different behaviour in the PDF training evolution. The single-dense PDF jumps around a lot more than the multi-dense one; I'm not sure we can attribute that to numerical noise...

@Cmurilochem (Collaborator) commented on Jul 20, 2024

> If one of you confirms it also works for you, we can merge this.

Hi @scarlehoff.
I tested this branch as requested and ran a parallel fit using GPUs and Python 3.11 + TF 2.15.1; see below.

python 3.11 + tf 2.15.1
https://vp.nnpdf.science/ICdkTBY0T2ebanL1lLNcMg==

The results match nearly perfectly those from the sequential NNPDF40_nnlo_as_01180_qcd fit.

As of today, however, I am not able to run a parallel GPU fit with newer versions of TensorFlow:

@scarlehoff (Member, Author) commented

Nice, I face the second issue (my GPU is not recognized).

Can you disable XLA and see whether it works?

`XLA_FLAGS=--xla_disabled_backends=cpu,gpu`

It will be slower, but we just want to check whether it works.
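(For reference, a small sketch of one way to apply the same flag from Python instead of the shell; the variable has to be set before TensorFlow is imported, otherwise it is ignored.)

```python
import os

# Disable the XLA backends before TensorFlow is imported, as suggested above.
os.environ["XLA_FLAGS"] = "--xla_disabled_backends=cpu,gpu"

import tensorflow as tf  # noqa: E402 (imported after setting the flag on purpose)

print(tf.config.list_physical_devices("GPU"))  # sanity check that the GPU is visible
```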

@scarlehoff (Member, Author) commented

> I am suspicious about whether the multidense implementation results in the correct gradient descent process.

My guess (which I haven't tested) is that when you have a single model per replica, each model is trained independently (so they all descend the gradient independently).

Instead, when you have one single model, they all have to descend the gradient together. If you test with a small number of models, chances are that, even though they are descending together, they all converge. The more models you add, the less likely it is that they will all converge at once.
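To make the distinction concrete, here is a toy sketch of the two regimes (this is not the n3fit training loop; the data, shapes and losses are invented for illustration):

```python
import tensorflow as tf

n_replicas, n_in, n_out = 4, 2, 1
x = tf.random.uniform((32, n_in))
y = tf.random.uniform((n_replicas, 32, n_out))

# (a) model per replica: each replica has its own optimizer and its own training
#     step, so every replica descends its own loss surface independently.
models = [tf.keras.Sequential([tf.keras.layers.Dense(n_out)]) for _ in range(n_replicas)]
opts = [tf.keras.optimizers.Adam() for _ in range(n_replicas)]
for r in range(n_replicas):
    with tf.GradientTape() as tape:
        loss_r = tf.reduce_mean((models[r](x) - y[r]) ** 2)
    grads = tape.gradient(loss_r, models[r].trainable_variables)
    opts[r].apply_gradients(zip(grads, models[r].trainable_variables))

# (b) one joint model: a single optimizer takes a single step on the combined loss,
#     so all replicas share one training trajectory and anything global
#     (stopping, clipping, learning-rate schedules, ...) acts on all of them at once.
joint = tf.keras.Sequential([tf.keras.layers.Dense(n_replicas * n_out)])
opt = tf.keras.optimizers.Adam()
with tf.GradientTape() as tape:
    pred = tf.transpose(tf.reshape(joint(x), (32, n_replicas, n_out)), (1, 0, 2))
    total_loss = tf.reduce_mean((pred - y) ** 2)
grads = tape.gradient(total_loss, joint.trainable_variables)
opt.apply_gradients(zip(grads, joint.trainable_variables))
```

In (a) nothing ties the replicas together beyond sharing the input data; in (b) any global operation in the optimizer or the stopping logic acts on all replicas at once.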

@goord (Collaborator) commented on Jul 21, 2024

Right, in my opinion the multidense approach is rather dangerous. Any operation that mixes the weights along the replica dimension will lead to errors. It means that you need to know exactly what goes on in the optimizer, for instance.
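A concrete, purely illustrative example of such an operation is clipping gradients by their global norm: with a single stacked model the norm is computed over the weights of every replica at once, so the update applied to one replica depends on the gradients of all the others.

```python
import tensorflow as tf

# Purely illustrative: with one stacked multi-replica model, this clips using a norm
# computed across the gradients of all replicas, coupling them through the optimizer.
opt = tf.keras.optimizers.Adam(global_clipnorm=1.0)
```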

@Cmurilochem (Collaborator) commented on Jul 22, 2024

> Nice, I face the second issue (my GPU is not recognized).
>
> Can you disable XLA and see whether it works?
>
> `XLA_FLAGS=--xla_disabled_backends=cpu,gpu`
>
> It will be slower, but we just want to check whether it works.

Thanks @scarlehoff. I managed to make my Python 3.12 + TF 2.17.0 setup recognise the GPU (I had just forgotten to pip install tensorflow[and-cuda]). I tried to disable XLA by setting the environment variable you suggested, but unfortunately it does not work, with the calculation being killed after some time.

```
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1721631255.503027 3888332 service.cc:146] XLA service 0x14d384005100 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1721631255.504364 3888332 service.cc:154]   StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
2024-07-22 08:54:23.168637: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-07-22 08:54:51.618186: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 8907
2024-07-22 08:55:16.840733: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 132 bytes spill stores, 132 bytes spill loads

2024-07-22 08:55:17.070172: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 96 bytes spill stores, 96 bytes spill loads

2024-07-22 08:55:17.145386: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 468 bytes spill stores, 468 bytes spill loads

2024-07-22 08:55:17.158346: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 88 bytes spill stores, 88 bytes spill loads

I0000 00:00:1721631365.216225 3888332 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
/var/spool/slurm/slurmd/job7079977/slurm_script: line 53: 3886507 Killed                  n3fit ../../nnpdf40-like.yml 1 -r 120
```

Without this flag, the calculation gets stuck after generating the models and the fit never starts.

@scarlehoff (Member, Author) commented

It does look like a memory leak.

> I0000 00:00:1721631365.216225 3888332 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.

But it looks like XLA is still active, so it might still be the same issue you linked before.

@scarlehoff (Member, Author) commented

@Cmurilochem please try running with set_eager(True) set here.

This should make it 1) run slower and 2) not compile anything.

But I hope it is not much slower. If that works we merge this, and later on, when all systems move to TF > 2.16, we can deal with the possible memory leak (if it is not fixed in subsequent versions of TF).
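(For context, a sketch of what running eagerly amounts to at the TensorFlow level; I'm assuming set_eager wraps something equivalent to this.)

```python
import tensorflow as tf

# Run tf.function-decorated code op by op instead of tracing and compiling a graph,
# which avoids the XLA/graph compilation step at the cost of speed.
tf.config.run_functions_eagerly(True)

# Per-model alternative when compiling a Keras model:
# model.compile(optimizer="adam", loss="mse", run_eagerly=True)
```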

@scarlehoff (Member, Author) commented

As far as I understand, this is what is going to be used for the upcoming runs, so I'll merge it into master once the tests pass.

@scarlehoff (Member, Author) commented

> @Cmurilochem please try running with set_eager(True) set here.
>
> This should make it 1) run slower and 2) not compile anything.
>
> But I hope it is not much slower. If that works we merge this, and later on, when all systems move to TF > 2.16, we can deal with the possible memory leak (if it is not fixed in subsequent versions of TF).

I tested this and it seems to be possible to run. Hopefully we can get true (compiled and everything) runs on GPU with TF > 2.16, but at the moment it looks more like a bug on the TF side?

scarlehoff merged commit 803fab7 into master on Jul 24, 2024 (6 checks passed)
scarlehoff deleted the revert_multidense_default branch on Jul 24, 2024
@Cmurilochem (Collaborator) commented

> > @Cmurilochem please try running with set_eager(True) set here.
> >
> > This should make it 1) run slower and 2) not compile anything.
> >
> > But I hope it is not much slower. If that works we merge this, and later on, when all systems move to TF > 2.16, we can deal with the possible memory leak (if it is not fixed in subsequent versions of TF).
>
> I tested this and it seems to be possible to run. Hopefully we can get true (compiled and everything) runs on GPU with TF > 2.16, but at the moment it looks more like a bug on the TF side?

Hi @scarlehoff. I tested setting set_eager(True) for both TF 2.16 and 2.17 (both with Python 3.12). The fit took more than 30 h and did not finish in the end. We are moving on to performing the new hyperopt experiments with Python 3.11 and TF 2.15, hoping that this can be addressed later.

@scarlehoff (Member, Author) commented

How many replicas? I tried with ~20 and it seemed OK.

@Cmurilochem (Collaborator) commented

> How many replicas? I tried with ~20 and it seemed OK.

I tried with 120 replicas, as my goal was to compare the fit with the baseline, but it proved very expensive. Did you manage to check that with 20 replicas we reproduce the sequential fits? I guess the idea was just to check whether we could run with the new versions.

@Cmurilochem (Collaborator) commented

Using the newly merged master with the changes from this PR, I am getting a strange error in my hyperopt run that I've never seen before:

```
Traceback (most recent call last):
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/n3fit/src/n3fit/scripts/n3fit_exec.py", line 333, in run
    super().run()
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/app.py", line 151, in run
    super().run()
  File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/app.py", line 406, in run
    rb.execute_sequential()
  File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/resourcebuilder.py", line 210, in execute_sequential
    result = self.get_result(callspec.function,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/resourcebuilder.py", line 380, in get_result
    fres = function(**kwdict)
           ^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/covmats.py", line 249, in dataset_t0_predictions
    return central_predictions(dataset, t0set).to_numpy().reshape(-1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/convolution.py", line 236, in central_predictions
    return _predictions(dataset, pdf, central_fk_predictions)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/convolution.py", line 168, in _predictions
    fk_w_cuts = fk.load_with_cuts(cuts)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/core.py", line 571, in load_with_cuts
    return load_fktable(self).with_cuts(cuts)
           ^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/fkparser.py", line 61, in load_fktable
    tabledata = pineappl_reader(spec)
                ^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/pineparser.py", line 233, in pineappl_reader
    lumi_columns = _pinelumi_to_columns(p.channels(), hadronic)
                                        ^^^^^^^^^^
  File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/pineappl/utils.py", line 17, in __getattr__
    return self._raw.__getattribute__(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'builtins.PyFkTable' object has no attribute 'channels'
```

@scarlehoff or @Radonirinaunimi, any thoughts on that?

@scarlehoff (Member, Author) commented

You need to reinstall pineappl. This is due to an update of the pineappl package.
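(A quick, generic way to confirm which pineappl version the environment actually picks up after reinstalling; this is just a standard-library check, not an NNPDF tool.)

```python
from importlib.metadata import version

print(version("pineappl"))  # should show the updated pineappl release
```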

@Cmurilochem (Collaborator) commented

> You need to reinstall pineappl. This is due to an update of the pineappl package.

Oh... it worked! Thanks!!

Successfully merging this pull request may close these issues: No agreement between parallel gpu and sequential cpu fits.