PyTorch profiler is unable to serialize numpy datatypes sometimes inserted as process group ranks #177

Open
hatanp opened this issue May 22, 2024 · 0 comments

Some process groups are initialized with ranks in NumPy arrays and others with plain lists. The NumPy integer types cause a failure when the profiler tries to serialize dist_info to JSON:

[rank0]:     prof.step()
[rank0]:   File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 727, in step
[rank0]:     self._transit_action(prev_action, self.current_action)
[rank0]:   File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 744, in _transit_action
[rank0]:     action()
[rank0]:   File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 177, in start_trace
[rank0]:     self.add_metadata_json("distributedInfo", json.dumps(dist_info))
[rank0]:                                               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/__init__.py", line 231, in dumps
[rank0]:     return _default_encoder.encode(obj)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 200, in encode
[rank0]:     chunks = self.iterencode(o, _one_shot=True)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 258, in iterencode
[rank0]:     return _iterencode(o, 0)
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 180, in default
[rank0]:     raise TypeError(f'Object of type {o.__class__.__name__} '
[rank0]: TypeError: Object of type int64 is not JSON serializable
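The failure is easy to reproduce outside the profiler: the standard-library JSON encoder rejects NumPy scalar types. A minimal sketch (the dict below only mimics the shape of dist_info, it is not the actual structure the profiler builds):

```python
import json
import numpy as np

# Ranks stored as np.int64, as happens when a process group is
# created from a NumPy array instead of a list of Python ints.
dist_info = {"ranks": [np.int64(0), np.int64(1)]}

try:
    json.dumps(dist_info)
    raised = False
except TypeError as e:
    # "Object of type int64 is not JSON serializable"
    raised = True

print(raised)
```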

You can inspect the data type of the process group ranks with:

import torch.distributed as dist

configs = dist.distributed_c10d._get_all_pg_configs()
for config in configs:
    print(config['ranks'][0].__class__.__name__)

I could work around the issue by modifying nanotron.distributed.new_group and adding

    if isinstance(ranks, np.ndarray):
        ranks = ranks.tolist()

However, I am not sure this is the right long-term fix. Looking around, I see that ParallelContext.create_new_group has a type hint of np.ndarray but sometimes receives a list as well. Ideally these should always be in the same format and respect the type hints. Passing plain Python integers to torch.distributed.new_group is probably the better approach, so that consumers like the profiler do not break; the torch documentation also specifies the ranks parameter as list[int]. An alternative to the modification proposed above would be to add an assert here ensuring the input is a list, and then change the code elsewhere so that only lists are passed to nanotron.distributed.new_group.
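Either option could be expressed as a small normalization step. A sketch of a hypothetical helper (normalize_ranks is not a nanotron or torch function, just an illustration) that coerces any accepted input into the list[int] that torch.distributed.new_group documents:

```python
import numpy as np

def normalize_ranks(ranks):
    """Coerce ranks (np.ndarray, or an iterable that may hold NumPy
    scalars) into a plain list[int].  Hypothetical helper to match
    torch.distributed.new_group's documented list[int] signature."""
    if isinstance(ranks, np.ndarray):
        # ndarray.tolist() converts elements to builtin Python ints
        return ranks.tolist()
    return [int(r) for r in ranks]

# Both input shapes end up as builtin ints, which json.dumps accepts.
print(normalize_ranks(np.array([0, 1], dtype=np.int64)))
print(normalize_ranks([np.int64(2), 3]))
```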

Python: 3.11.8
PyTorch: 2.3.0
nanotron: up to date main branch
Config: created by examples/bench_llama_7b.py with profiler enabled:

profiler:
  profiler_export_path: ./checkpoints