Some process groups are initialized with ranks in NumPy arrays and others with lists. The NumPy integer types cause issues when the profiler tries to serialize dist_info to JSON:
[rank0]: prof.step()
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 727, in step
[rank0]: self._transit_action(prev_action, self.current_action)
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 744, in _transit_action
[rank0]: action()
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 177, in start_trace
[rank0]: self.add_metadata_json("distributedInfo", json.dumps(dist_info))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/__init__.py", line 231, in dumps
[rank0]: return _default_encoder.encode(obj)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 200, in encode
[rank0]: chunks = self.iterencode(o, _one_shot=True)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 258, in iterencode
[rank0]: return _iterencode(o, 0)
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 180, in default
[rank0]: raise TypeError(f'Object of type {o.__class__.__name__} '
[rank0]: TypeError: Object of type int64 is not JSON serializable
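The failure is easy to reproduce outside the profiler: the standard json module refuses NumPy scalar types outright. A minimal standalone sketch (not nanotron code) of both the problem and the tolist() workaround:

```python
import json
import numpy as np

ranks = np.arange(4)                 # process-group ranks as a NumPy array (dtype int64)
dist_info = {"ranks": list(ranks)}   # list() still holds np.int64 scalars

try:
    json.dumps(dist_info)
except TypeError as e:
    print(e)  # Object of type int64 is not JSON serializable

# .tolist() converts elements to native Python ints, which serialize fine
print(json.dumps({"ranks": ranks.tolist()}))  # {"ranks": [0, 1, 2, 3]}
```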
You can inspect the process group ranks' data type with:

import torch.distributed as dist

configs = dist.distributed_c10d._get_all_pg_configs()
for config in configs:
    print(config["ranks"][0].__class__.__name__)
I could work around the issue by modifying nanotron.distributed.new_group and adding

if isinstance(ranks, np.ndarray):
    ranks = ranks.tolist()

however, I am not sure this is the right long-term fix. Looking around, I see that ParallelContext.create_new_group has a type hint of np.ndarray but sometimes receives a list as well. Ideally these should always be in the same format and respect the type hints. Passing regular Python integers to torch.distributed.new_group is probably the better approach, so as not to break things like the profiler; the PyTorch documentation also types the ranks argument as list[int]. An alternative to the modification proposed above would be an assert here ensuring the input is a list, together with changes elsewhere so that only lists are ever passed to nanotron.distributed.new_group.
Python: 3.11.8
PyTorch: 2.3.0
nanotron: up to date main branch
Config: created by examples/bench_llama_7b.py with profiler enabled:
profiler:
  profiler_export_path: ./checkpoints