
Revert the default for parallel fits to a single model per replica #2127

Merged — 4 commits merged into master from revert_multidense_default on Jul 24, 2024

Conversation

@scarlehoff (Member) commented on Jul 16, 2024

This PR reverts to a model-per-replica configuration (instead of the multidense layer), since the multidense layer is not able (at the moment) to fit models independently, which is necessary for a complete fit.

For sequential fits nothing should change; for parallel fits the difference is about a factor of 2 in time (but I'm guessing that this factor of 2 we were gaining might be why the fits are of worse quality).

I've introduced a multidense layer type so people can still play around with multidense.
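For anyone not familiar with the two options, here is a minimal toy sketch of the structural difference (this is not the actual n3fit code; the class, names and sizes are made up purely for illustration): the dense setup builds one independent model per replica, while a multidense-style layer stacks all replica weights along an extra leading dimension inside a single model.

```python
import tensorflow as tf

n_replicas, n_in, n_out = 3, 2, 8

# "dense": one independent model (and weight set) per replica.
per_replica_models = [
    tf.keras.Sequential([tf.keras.layers.Dense(n_out, activation="tanh")])
    for _ in range(n_replicas)
]

# "multidense"-style: a single layer holding all replica kernels stacked along a
# leading replica axis, evaluated in one einsum call.
class ToyMultiDense(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.kernel = self.add_weight(
            shape=(n_replicas, input_shape[-1], n_out), initializer="glorot_uniform"
        )

    def call(self, x):
        # x: (batch, n_in) -> (n_replicas, batch, n_out)
        return tf.tanh(tf.einsum("bi,rio->rbo", x, self.kernel))

x = tf.random.uniform((5, n_in))
print([m(x).shape for m in per_replica_models])  # 3 x (5, 8)
print(ToyMultiDense()(x).shape)                  # (3, 5, 8)
```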

Here are two reports; as you can see, multidense has a terrible chi2 while dense is statistically compatible with 4.0.

multidense: https://vp.nnpdf.science/UGSJ6OsPTdSR2s97OXIvDw==
dense: https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==

I haven't looked deeply into the details of the error, but I don't think multidense was ever truly benchmarked in a real-life scenario; at least I could not find any examples.

Anyway, for reference, these were fits of 120 replicas on an RTX 2080: 80 minutes with dense, 40 minutes with multidense.
It would be ideal if multidense could be fixed, but I can live with 80-minute runs... For now, checks.py will fail if you use multidense, to make sure it is only used as a development option.

Fixes, in a way, #2118

TODO:

  • Fix regression tests; note that most of the .json files have no changes (proving that, for one replica, multidense and dense were really the same)
  • Check it works with TF 2.16/2.17 (if I'm able to get a computer with both a GPU and a recent-enough version of Python/TF at the same time, which is not trivial)

Re the last point, I cannot check on GPU myself, but the Python tests are running TF 2.17, so the code does work.

scarlehoff changed the base branch from master to avoid-idle-gpu on Jul 16, 2024
scarlehoff linked an issue on Jul 16, 2024 that may be closed by this pull request
scarlehoff mentioned this pull request on Jul 17, 2024
The base branch was automatically changed from avoid-idle-gpu to master on Jul 17, 2024
@goord (Collaborator) commented on Jul 18, 2024

Hi @scarlehoff, I can do the TF 2.17/Python 3.12 test (on a CPU though :))

@scarlehoff (Member, Author) commented

On CPU it works fine (some of the tests are running TF 2.17). But since there were big changes in TF 2.16, I wonder whether something changed on GPU.

@goord (Collaborator) commented on Jul 18, 2024

OK, I will ask @Cmurilochem to do the test; I don't have access to a GPU right now. I will start looking for the bug in the multidense layer type.

@scarlehoff (Member, Author) commented

No, that one (at least according to the runcard) was run sequentially.

@scarlehoff (Member, Author) commented

If one of you confirms it also works for you, we can merge this.

@Radonirinaunimi (Member) commented

> If one of you confirms it also works for you, we can merge this.

I can test this over the weekend.

@goord (Collaborator) commented on Jul 19, 2024

I am suspicious about whether the multidense implementation results in the correct gradient descent process. Comparing single-dense to multi-dense for parallel replicas, I see quite different behaviour in the PDF training evolution. The single-dense PDF jumps around a lot more than the multi-dense one; I'm not sure we can attribute that to numerical noise...

@Cmurilochem (Collaborator) commented on Jul 20, 2024

> If one of you confirms it also works for you, we can merge this.

Hi @scarlehoff.
I tested this branch as requested and ran a parallel fit using GPUs and Python 3.11 + TF 2.15.1; see below.

python 3.11 + tf 2.15.1
https://vp.nnpdf.science/ICdkTBY0T2ebanL1lLNcMg==

The results match nearly perfectly those from the sequential NNPDF40_nnlo_as_01180_qcd fit.

As of today, however, I am not able to run a parallel GPU fit with newer versions of TensorFlow:

@scarlehoff (Member, Author) commented

Nice, I face the second issue (my GPU is not recognized).

Can you disable XLA and see whether it works?

`XLA_FLAGS=--xla_disabled_backends=cpu,gpu`

It will be slower, but we just want to check whether it works.
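(For reference, a small sketch of one way to apply the same flag from Python instead of the shell; the variable has to be set before TensorFlow is imported, otherwise it is ignored.)

```python
import os

# Disable the XLA backends before TensorFlow is imported, as suggested above.
os.environ["XLA_FLAGS"] = "--xla_disabled_backends=cpu,gpu"

import tensorflow as tf  # noqa: E402 (imported after setting the flag on purpose)

print(tf.config.list_physical_devices("GPU"))  # sanity check that the GPU is visible
```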

@scarlehoff (Member, Author) commented

> I am suspicious about whether the multidense implementation results in the correct gradient descent process.

My guess (which I haven't tested) is that when you have a single model per replica, each model is trained independently (so they all descend the gradient independently).

Instead, when you have one single model, they all have to descend the gradient together. If you test with a small number of models, chances are that, even though they are descending together, they all converge. The more models you add, the less likely it is that they will all converge at once.
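To make the distinction concrete, here is a toy sketch of the two regimes (this is not the n3fit training loop; the data, shapes and losses are invented for illustration):

```python
import tensorflow as tf

n_replicas, n_in, n_out = 4, 2, 1
x = tf.random.uniform((32, n_in))
y = tf.random.uniform((n_replicas, 32, n_out))

# (a) model per replica: each replica has its own optimizer and its own training
#     step, so every replica descends its own loss surface independently.
models = [tf.keras.Sequential([tf.keras.layers.Dense(n_out)]) for _ in range(n_replicas)]
opts = [tf.keras.optimizers.Adam() for _ in range(n_replicas)]
for r in range(n_replicas):
    with tf.GradientTape() as tape:
        loss_r = tf.reduce_mean((models[r](x) - y[r]) ** 2)
    grads = tape.gradient(loss_r, models[r].trainable_variables)
    opts[r].apply_gradients(zip(grads, models[r].trainable_variables))

# (b) one joint model: a single optimizer takes a single step on the combined loss,
#     so all replicas share one training trajectory and anything global
#     (stopping, clipping, learning-rate schedules, ...) acts on all of them at once.
joint = tf.keras.Sequential([tf.keras.layers.Dense(n_replicas * n_out)])
opt = tf.keras.optimizers.Adam()
with tf.GradientTape() as tape:
    pred = tf.transpose(tf.reshape(joint(x), (32, n_replicas, n_out)), (1, 0, 2))
    total_loss = tf.reduce_mean((pred - y) ** 2)
grads = tape.gradient(total_loss, joint.trainable_variables)
opt.apply_gradients(zip(grads, joint.trainable_variables))
```

In (a) nothing ties the replicas together beyond sharing the input data; in (b) any global operation in the optimizer or the stopping logic acts on all replicas at once.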

@goord (Collaborator) commented on Jul 21, 2024

Right, in my opinion the multidense approach is rather dangerous. Any operation that mixes the weights along the replica dimension will lead to errors. It means that you need to know exactly what goes on in the optimizer, for instance.
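A concrete, purely illustrative example of such an operation is clipping gradients by their global norm: with a single stacked model the norm is computed over the weights of every replica at once, so the update applied to one replica depends on the gradients of all the others.

```python
import tensorflow as tf

# Purely illustrative: with one stacked multi-replica model, this clips using a norm
# computed across the gradients of all replicas, coupling them through the optimizer.
opt = tf.keras.optimizers.Adam(global_clipnorm=1.0)
```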

@Cmurilochem (Collaborator) commented on Jul 22, 2024

> Nice, I face the second issue (my GPU is not recognized).
>
> Can you disable XLA and see whether it works?
>
> `XLA_FLAGS=--xla_disabled_backends=cpu,gpu`
>
> It will be slower, but we just want to check whether it works.

Thanks @scarlehoff. I managed to make my Python 3.12 + TF 2.17.0 setup recognise the GPU (I had just forgotten to pip install tensorflow[and-cuda]). I tried to disable XLA by setting the environment variable you suggested, but unfortunately it does not work, with the calculation being killed after some time.

```
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1721631255.503027 3888332 service.cc:146] XLA service 0x14d384005100 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1721631255.504364 3888332 service.cc:154]   StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
2024-07-22 08:54:23.168637: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-07-22 08:54:51.618186: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 8907
2024-07-22 08:55:16.840733: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 132 bytes spill stores, 132 bytes spill loads

2024-07-22 08:55:17.070172: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 96 bytes spill stores, 96 bytes spill loads

2024-07-22 08:55:17.145386: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 468 bytes spill stores, 468 bytes spill loads

2024-07-22 08:55:17.158346: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 88 bytes spill stores, 88 bytes spill loads

I0000 00:00:1721631365.216225 3888332 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
/var/spool/slurm/slurmd/job7079977/slurm_script: line 53: 3886507 Killed                  n3fit ../../nnpdf40-like.yml 1 -r 120
```

Without this flag, the calculation gets stuck after generating the models and the fit never starts.

@scarlehoff (Member, Author) commented

It does look like a memory leak.

> I0000 00:00:1721631365.216225 3888332 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.

But it looks like XLA is still active, so it might still be the same issue you linked before.

@scarlehoff (Member, Author) commented

@Cmurilochem please try running with set_eager(True) set here.

This should make it 1) run slower and 2) not compile anything.

But I hope it is not much slower. If that works we merge this, and later on, when all systems move to TF > 2.16, we can deal with the possible memory leak (if it is not fixed in subsequent versions of TF).
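(For context, a sketch of what running eagerly amounts to at the TensorFlow level; I'm assuming set_eager wraps something equivalent to this.)

```python
import tensorflow as tf

# Run tf.function-decorated code op by op instead of tracing and compiling a graph,
# which avoids the XLA/graph compilation step at the cost of speed.
tf.config.run_functions_eagerly(True)

# Per-model alternative when compiling a Keras model:
# model.compile(optimizer="adam", loss="mse", run_eagerly=True)
```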

@scarlehoff (Member, Author) commented

As far as I understand, this is what is going to be used for the upcoming runs, so I'll merge it into master once the tests pass.

@scarlehoff (Member, Author) commented

> @Cmurilochem please try running with set_eager(True) set here.
>
> This should make it 1) run slower and 2) not compile anything.
>
> But I hope it is not much slower. If that works we merge this, and later on, when all systems move to TF > 2.16, we can deal with the possible memory leak (if it is not fixed in subsequent versions of TF).

I tested this and it seems to be possible to run. Hopefully we can get true (compiled and everything) runs on GPU with TF > 2.16, but at the moment it looks more like a bug on the TF side?

scarlehoff merged commit 803fab7 into master on Jul 24, 2024 (6 checks passed)
scarlehoff deleted the revert_multidense_default branch on Jul 24, 2024
@Cmurilochem (Collaborator) commented

> > @Cmurilochem please try running with set_eager(True) set here.
> >
> > This should make it 1) run slower and 2) not compile anything.
> >
> > But I hope it is not much slower. If that works we merge this, and later on, when all systems move to TF > 2.16, we can deal with the possible memory leak (if it is not fixed in subsequent versions of TF).
>
> I tested this and it seems to be possible to run. Hopefully we can get true (compiled and everything) runs on GPU with TF > 2.16, but at the moment it looks more like a bug on the TF side?

Hi @scarlehoff. I tested setting set_eager(True) for both TF 2.16 and 2.17 (both with Python 3.12). The fit took more than 30 h and did not finish in the end. We are moving on to performing the new hyperopt experiments with Python 3.11 and TF 2.15, hoping that this can be addressed later.

@scarlehoff (Member, Author) commented

How many replicas? I tried with ~20 and it seemed OK.

@Cmurilochem (Collaborator) commented

> How many replicas? I tried with ~20 and it seemed OK.

I tried with 120 replicas, as my goal was to compare the fit with the baseline, but it proved very expensive. Did you manage to check that with 20 replicas we reproduce the sequential fits? I guess the idea was just to check whether we could run with the new versions.

@Cmurilochem (Collaborator) commented

Using the newly merged master with the changes from this PR, I am getting a strange error in my hyperopt run that I've never seen before:

```
Traceback (most recent call last):
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/n3fit/src/n3fit/scripts/n3fit_exec.py", line 333, in run
    super().run()
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/app.py", line 151, in run
    super().run()
  File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/app.py", line 406, in run
    rb.execute_sequential()
  File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/resourcebuilder.py", line 210, in execute_sequential
    result = self.get_result(callspec.function,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/resourcebuilder.py", line 380, in get_result
    fres = function(**kwdict)
           ^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/covmats.py", line 249, in dataset_t0_predictions
    return central_predictions(dataset, t0set).to_numpy().reshape(-1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/convolution.py", line 236, in central_predictions
    return _predictions(dataset, pdf, central_fk_predictions)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/convolution.py", line 168, in _predictions
    fk_w_cuts = fk.load_with_cuts(cuts)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/core.py", line 571, in load_with_cuts
    return load_fktable(self).with_cuts(cuts)
           ^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/fkparser.py", line 61, in load_fktable
    tabledata = pineappl_reader(spec)
                ^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/pineparser.py", line 233, in pineappl_reader
    lumi_columns = _pinelumi_to_columns(p.channels(), hadronic)
                                        ^^^^^^^^^^
  File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/pineappl/utils.py", line 17, in __getattr__
    return self._raw.__getattribute__(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'builtins.PyFkTable' object has no attribute 'channels'
```

@scarlehoff or @Radonirinaunimi, any thoughts on that?

@scarlehoff (Member, Author) commented

You need to reinstall pineappl. This is due to an update of the pineappl package.
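(A quick, generic way to confirm which pineappl version the environment actually picks up after reinstalling; this is just a standard-library check, not an NNPDF tool.)

```python
from importlib.metadata import version

print(version("pineappl"))  # should show the updated pineappl release
```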

@Cmurilochem (Collaborator) commented

> You need to reinstall pineappl. This is due to an update of the pineappl package.

Oh... it worked! Thanks!!

Successfully merging this pull request may close these issues: No agreement between parallel gpu and sequential cpu fits.