Revert the default for parallel fits to a single model per replica #2127
Conversation
Hi @scarlehoff, I can do the TF 2.17 / python 3.12 test (on a CPU though :))
On CPU it works fine (some of the tests are running tf 2.17). But since there were big changes in TF 2.16, I wonder whether something changed on GPU.
Ok, I will ask @Cmurilochem to do the test; I don't have access to a GPU right now. I will start looking for the bug in the ...
No, that one (at least according to the runcard) was run sequentially.
If one of you confirms it also works for you, we can merge this.
I can test this over the weekend.
I am suspicious whether the multidense implementation results in the correct gradient descent process. Comparing the single-dense to the multi-dense for parallel replicas, I see quite different behavior in the pdf training evolution. The single-dense jumps around a lot more than the multi-dense pdf; I'm not sure we can attribute that to numerical noise...
Hi @scarlehoff. The results match nearly perfectly the ones from the sequential fit. As of today, however, I am not able to run a parallel GPU fit with newer versions of tensorflow:
Nice, I face the second issue (my gpu is not recognized). Can you disable XLA and see whether it works? It will be slower, but we just want to check whether it works.
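For reference, a minimal sketch of how XLA can be switched off in plain TF 2.x (a hedged example, not necessarily how n3fit toggles XLA internally):

```python
import os

# Assumption: disabling XLA auto-clustering before TensorFlow is imported.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=0"

import tensorflow as tf

# Alternatively, switch off JIT compilation for Keras/graph functions at runtime.
tf.config.optimizer.set_jit(False)
```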
My guess (which I haven't tested) is that when you do a single model per replica, each model is trained independently (so they all descend the gradient independently). Instead, when you have one single model, they all have to descend the gradient together. If you test with a small number of models, chances are that, even if they are descending together, they all converge. The more models you add, the less likely it is that they are all going to converge at once.
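A toy Keras sketch of that distinction (made-up shapes and models, not the n3fit architecture): three independently compiled models each take their own optimizer steps, while a single stacked model is updated from one combined loss.

```python
import tensorflow as tf

xs = tf.random.normal((256, 10))
ys = tf.random.normal((256, 1))

# Per-replica training: each model minimises its own loss and takes its own
# optimizer steps, independently of the other replicas.
replicas = [tf.keras.Sequential([tf.keras.layers.Dense(1)]) for _ in range(3)]
for model in replicas:
    model.compile(optimizer="adam", loss="mse")
    model.fit(xs, ys, epochs=1, verbose=0)

# Joint training: one model stacking all replica outputs; a single scalar loss
# drives one common optimizer step for all replicas at once.
inp = tf.keras.Input(shape=(10,))
outputs = [tf.keras.layers.Dense(1)(inp) for _ in range(3)]
joint = tf.keras.Model(inp, tf.keras.layers.Concatenate()(outputs))
joint.compile(optimizer="adam", loss="mse")
joint.fit(xs, tf.tile(ys, [1, 3]), epochs=1, verbose=0)
```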
Right, in my opinion the multidense approach is rather dangerous. Any operation that mixes the weights in the replica dimension will lead to errors. It means that you need to know exactly what goes on in the optimizer, for instance.
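As a hypothetical illustration of such a mixing operation: per-variable optimizer state (e.g. Adam's moments) stays per replica, but anything that reduces over all weights or gradients at once, such as global-norm gradient clipping, couples the replicas.

```python
import tensorflow as tf

# Hypothetical stacked gradients: the first axis is the replica dimension.
grads = [tf.random.normal((3, 10, 8))]

# Clipping by the *global* norm computes a single norm over everything,
# so each replica's update now depends on the gradients of all the others.
clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)
print(float(global_norm))
```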
Thanks @scarlehoff. I managed to make my python 3.12 + tf 2.17.0 setup recognise the GPU (I just forgot to ...). The fit then produced the following log before being killed:

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1721631255.503027 3888332 service.cc:146] XLA service 0x14d384005100 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1721631255.504364 3888332 service.cc:154] StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
2024-07-22 08:54:23.168637: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-07-22 08:54:51.618186: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 8907
2024-07-22 08:55:16.840733: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 132 bytes spill stores, 132 bytes spill loads
2024-07-22 08:55:17.070172: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 96 bytes spill stores, 96 bytes spill loads
2024-07-22 08:55:17.145386: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 468 bytes spill stores, 468 bytes spill loads
2024-07-22 08:55:17.158346: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_7823', 88 bytes spill stores, 88 bytes spill loads
I0000 00:00:1721631365.216225 3888332 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
/var/spool/slurm/slurmd/job7079977/slurm_script: line 53: 3886507 Killed n3fit ../../nnpdf40-like.yml 1 -r 120

Without this flag, the calc gets stuck after generating the models and the fit never starts....
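As an aside, since GPU visibility comes up a couple of times in this thread, a minimal check (plain TF 2.x API, nothing n3fit-specific) that TensorFlow actually sees the card before launching a long fit:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"{len(gpus)} GPU(s) visible: {gpus}")

# Optional: allocate GPU memory on demand instead of grabbing it all upfront.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```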
It does look like a memory leak. But it looks like XLA is still active, so it might still be the same issue you linked before.
@Cmurilochem please run setting ... This should make it 1) run slower and 2) not compile anything, but I hope it is not much slower. If that works, we merge this, and later on, when all systems move to tf > 2.16, we can deal with the possible memory leak (if not fixed in subsequent versions of TF).
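Not necessarily the setting meant above, but one standard TF 2.x knob matching the description (slower, nothing compiled) is running every tf.function eagerly; a minimal sketch:

```python
import tensorflow as tf

# Run every tf.function eagerly: no graph tracing or compilation,
# at the price of a noticeably slower fit.
tf.config.run_functions_eagerly(True)
```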
As far as I understand, this is now what is going to be used for the following runs, so I'll merge it into master once the tests pass.
I tested this and it seems to be possible to run. Hopefully we can get true (compiled and everything) runs on GPU with tf > 2.16, but at the moment it looks more like a bug on the tf side?
Hi @scarlehoff. I tested setting ...
How many replicas? I tried with ~20 and it seemed ok.
I tried with 120 replicas, as my goal was to compare the fit with the baseline, but it proved very expensive. Did you manage to check that with 20 replicas we reproduce the sequential fits? I guess the idea was just to check if we could run with the new versions.
Using the newly merged master with the changes from this PR, I am getting a strange error in my hyperopt run that I've never seen:

Traceback (most recent call last):
File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/n3fit/src/n3fit/scripts/n3fit_exec.py", line 333, in run
super().run()
File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/app.py", line 151, in run
super().run()
File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/app.py", line 406, in run
rb.execute_sequential()
File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/resourcebuilder.py", line 210, in execute_sequential
result = self.get_result(callspec.function,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/reportengine/resourcebuilder.py", line 380, in get_result
fres = function(**kwdict)
^^^^^^^^^^^^^^^^^^
File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/covmats.py", line 249, in dataset_t0_predictions
return central_predictions(dataset, t0set).to_numpy().reshape(-1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/convolution.py", line 236, in central_predictions
return _predictions(dataset, pdf, central_fk_predictions)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/convolution.py", line 168, in _predictions
fk_w_cuts = fk.load_with_cuts(cuts)
^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/core.py", line 571, in load_with_cuts
return load_fktable(self).with_cuts(cuts)
^^^^^^^^^^^^^^^^^^
File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/fkparser.py", line 61, in load_fktable
tabledata = pineappl_reader(spec)
^^^^^^^^^^^^^^^^^^^^^
File "/gpfs/home5/crocha/nnpdfgit/experiments/nnpdf_master/validphys2/src/validphys/pineparser.py", line 233, in pineappl_reader
lumi_columns = _pinelumi_to_columns(p.channels(), hadronic)
^^^^^^^^^^
File "/home/crocha/.conda/envs/tf_15_nnpdf_gpu/lib/python3.11/site-packages/pineappl/utils.py", line 17, in __getattr__
return self._raw.__getattribute__(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'builtins.PyFkTable' object has no attribute 'channels'

@scarlehoff or @Radonirinaunimi, any thoughts on that?
You need to reinstall pineappl. This is due to an update of the pineappl package.
Oh... it worked! Thanks!!
This PR reverts back to a model-per-replica configuration (instead of the multidense layer), since the multidense layer is not able to fit models independently (at the moment), which is necessary for a complete fit.

For sequential fits nothing should change; for parallel fits the difference is about a factor of 2 in time (but I'm guessing that this factor of 2 we are gaining might be why the fits are of worse quality).

I've introduced a `multidense` layer type if people want to play around with `multidense`. Here are two reports; as you can see, `multidense` has a terrible chi2 while `dense` is statistically compatible with 4.0.

`multidense`: https://vp.nnpdf.science/UGSJ6OsPTdSR2s97OXIvDw==
`dense`: https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==

I haven't looked deep into the details of the error, but I don't think `multidense` was truly benchmarked in a real-life scenario? At least I could not find any examples.

Anyway, for reference, these were fits of 120 replicas on an RTX 2080: 80 minutes with `dense`, 40 minutes with `multidense`. It would be ideal if `multidense` could be fixed, but I can live with 80-minute runs... for now.

`checks.py` will fail if you use `multidense`, to make sure it is only used as a development option for now.

Fixes, in a way, #2118

TODO:
- check that the `.json` files have no changes (proving that, for one replica, `multidense` and `dense` were really the same); a possible check is sketched below

RE the last point, I cannot check, but the python tests are getting TF 2.17 so the code does work.
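For the TODO item above, a minimal sketch of how the comparison could be scripted; the file paths and layout are assumptions rather than the actual n3fit output structure:

```python
import json

# Hypothetical paths: replica_1 output of a one-replica fit run with each layer type.
with open("dense_fit/replica_1/fit_output.json") as f_dense, \
     open("multidense_fit/replica_1/fit_output.json") as f_multi:
    dense_info = json.load(f_dense)
    multidense_info = json.load(f_multi)

# For a single replica the two layer types are expected to give identical results.
assert dense_info == multidense_info, "one-replica dense and multidense fits differ"
```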