
Cannot run the Model generated from the example script #251

Open
hz-nm opened this issue Nov 27, 2024 · 1 comment

hz-nm commented Nov 27, 2024

I was testing the library by training the model on a single GPU. I used the following command to run the training:

CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=1 run_train.py --config-file examples/config_tiny_llama.yaml

I made the following changes in the config_tiny_llama.yaml file:

parallelism:
  dp: 1 # 2
  expert_parallel_size: 1
  pp: 1 # 2
  pp_engine: 1f1b
  tp: 1 # 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
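As a side note, nanotron sizes its process grid from these values, so the torchrun world size has to match the product of the parallelism degrees. A quick illustrative check, assuming the usual `world_size = dp * pp * tp` relation:

```python
# Illustrative sanity check: with dp = pp = tp = 1, the config needs
# exactly one process, matching --nproc_per_node=1 in the command above.
dp, pp, tp = 1, 1, 1
world_size = dp * pp * tp
assert world_size == 1
```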

The training ran smoothly and the checkpoints were generated. However, when I try to run the model using

torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1

I get the following error:

[rank0]:   File "/mnt/d/nanotron-pretrain/nanotron/src/nanotron/models/llama.py", line 529, in forward
[rank0]:     (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = bert_padding.unpad_input(
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ValueError: too many values to unpack (expected 4)
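For context, this traceback is the standard Python failure mode when a function returns a longer tuple than the caller destructures, which suggests my installed flash-attn version returns five values from `bert_padding.unpad_input` rather than four. A minimal, library-free sketch of the failure mode, using a stand-in function (the real signature depends on the flash-attn version):

```python
def fake_unpad_input_new():
    # Hypothetical 5-tuple mimicking newer flash-attn releases,
    # which return an extra value from unpad_input.
    return ("states", "indices", "cu_seqlens", "max_seqlen", "extra")

try:
    # The call site in llama.py expects exactly 4 values.
    (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = fake_unpad_input_new()
    error = None
except ValueError as exc:
    error = str(exc)

print(error)  # too many values to unpack (expected 4)
```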

Any help resolving this issue would be greatly appreciated. Thanks.


hz-nm commented Nov 28, 2024

So I changed the lines a bit in models/llama.py, i.e. unpacked an extra return value (my installed flash-attn version apparently returns five values from unpad_input instead of four):

(query_unpad, indices_q, cu_seqlens_q, max_seqlen_q, _) = bert_padding.unpad_input(
    query_states,
    sequence_mask,
)
(key_unpad, indices_k, cu_seqlens_k, max_seqlen_k, _) = bert_padding.unpad_input(
    key_states, sequence_mask
)
(value_unpad, _, _, _, _) = bert_padding.unpad_input(value_states, sequence_mask)
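A more version-tolerant variant (just a sketch, not nanotron's official fix) would use starred unpacking so the same call site works whether unpad_input returns four or five values:

```python
def unpack_unpad(result):
    """Unpack unpad_input's return regardless of whether the installed
    flash-attn version yields a 4-tuple or a 5-tuple; any extra trailing
    values are discarded."""
    unpad, indices, cu_seqlens, max_seqlen, *_ = result
    return unpad, indices, cu_seqlens, max_seqlen

# Works for both arities (illustrated with placeholder tuples):
old_style = ("q", "i", "cu", "max")           # assumed older flash-attn
new_style = ("q", "i", "cu", "max", "extra")  # assumed newer flash-attn
assert unpack_unpad(old_style) == unpack_unpad(new_style)
```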

Here are the generations:

11/28/2024 12:45:40 [INFO|DP=0|PP=0|TP=0]: input: [CLS] the [UNK] [UNK] [UNK] is [SEP]
11/28/2024 12:45:40 [INFO|DP=0|PP=0|TP=0]: generation: [SEP] [SEP] [SEP] [SEP] (the [SEP] token repeats for the rest of the generated sequence)
