
feat: add context parallel support for SFT #420

Closed · wants to merge 17 commits
Conversation

@ashors1 ashors1 (Collaborator) commented Nov 27, 2024

What does this PR do?

Adds context parallel (CP) support for SFT in NeMo-Aligner.

Changelog

  • Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
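Since the template leaves the snippet blank, here is a purely illustrative sketch of how CP might be enabled for SFT by overriding the model config. The config path and the exact keys (`context_parallel_size`, `tensor_model_parallel_size`) are assumptions based on common NeMo-Aligner conventions, not taken from this PR.

```python
# Hypothetical sketch, not from this PR: enable context parallelism for SFT by
# overriding the model config before launching training. The config path and
# key names below are assumptions, not quotes from the PR.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/nlp/gpt/conf/gpt_sft.yaml")  # assumed config location
cfg.model.context_parallel_size = 2       # shard each sequence across 2 CP ranks
cfg.model.tensor_model_parallel_size = 2  # TP still composes with CP as usual
# ...then pass `cfg` to the usual SFT training entry point.
```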

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore all model states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

  • Related to # (issue)

@ashors1 ashors1 marked this pull request as draft November 27, 2024 06:25
@ashors1 ashors1 changed the title Add context parallel support for SFT feat: add context parallel support for SFT Nov 27, 2024
ashors1 and others added 3 commits November 29, 2024 21:59
Signed-off-by: ashors1 <[email protected]>
for more information, see https://pre-commit.ci

Signed-off-by: NeMo-Aligner CI <[email protected]>
Signed-off-by: Anna Shors <[email protected]>
@ashors1 ashors1 marked this pull request as ready for review December 2, 2024 20:40
@ashors1 ashors1 added the Run CICD Set + un-set to retrigger label Dec 2, 2024
@ashors1 ashors1 requested a review from terrykong December 2, 2024 20:40
@ashors1 ashors1 (Collaborator, Author) commented Dec 2, 2024

Note that CP support for SFT was recently added to NeMo. We need a NeMo commit at least as recent as 8c921dc19a905d8b5a0f90f6e2a34607c2e0660d

@ashors1 ashors1 added Run CICD Set + un-set to retrigger and removed Run CICD Set + un-set to retrigger labels Dec 2, 2024
@ashors1 ashors1 added Run CICD Set + un-set to retrigger and removed Run CICD Set + un-set to retrigger labels Dec 2, 2024
@ashors1 ashors1 changed the base branch from main to dev December 2, 2024 23:29
@ashors1 ashors1 changed the base branch from dev to main December 2, 2024 23:29
@ashors1 ashors1 added Run CICD Set + un-set to retrigger and removed Run CICD Set + un-set to retrigger labels Dec 3, 2024
@ashors1 ashors1 added Run CICD Set + un-set to retrigger and removed Run CICD Set + un-set to retrigger labels Dec 3, 2024
Signed-off-by: ashors1 <[email protected]>
@ashors1 ashors1 added Run CICD Set + un-set to retrigger and removed Run CICD Set + un-set to retrigger labels Dec 4, 2024
@terrykong terrykong (Collaborator) left a comment

TODO:

restore_from_path: ??? # Path to an existing p-tuned/prompt tuned .nemo model you wish to add new tasks to or run inference with
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
save_nemo_on_validation_end: True # Saves an inference ready .nemo file every time a checkpoint is saved during training.
sync_batch_comm: False
megatron_amp_O2: False
encoder_seq_length: 4096 # the sequence length of the encoder model; it will be overwritten by the loaded GPT model
transformer_engine: True
Collaborator

For my education, why do we need to specify this now?

Collaborator Author (ashors1)

I don't actually think it's necessary since TE is enabled by default; I just wanted to make it explicit that we are using TE. But I will remove this.

Comment on lines +391 to +396
pad_seq_length_to_mult = 16
if model_cfg is not None:
    pad_seq_length_to_mult = (
        8 * model_cfg.get("tensor_model_parallel_size", 1) if model_cfg.get("sequence_parallel", False) else 16
    )
    pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)
Collaborator

According to that fp8 comment above, should this be:

Suggested change
-pad_seq_length_to_mult = 16
-if model_cfg is not None:
-    pad_seq_length_to_mult = (
-        8 * model_cfg.get("tensor_model_parallel_size", 1) if model_cfg.get("sequence_parallel", False) else 16
-    )
-    pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)
+pad_seq_length_to_mult = 16
+if model_cfg is not None:
+    if model_cfg.get("sequence_parallel", False):
+        pad_seq_length_to_mult = math.lcm(pad_seq_length_to_mult, model_cfg.get("tensor_model_parallel_size", 1))
+    pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)

? From the comment it sounds like if someone is doing fp8 SFT with TP=1 and sets sequence_parallel, the padding multiple would be too small (8 * 1 = 8, while fp8 expects sequence lengths that are multiples of 16).

Collaborator Author (ashors1)

That chunk was taken directly from here: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py#L262-L268, but the code does seem to contradict the comment. I'll follow up with the TE team

Collaborator Author (ashors1)

From the TE team: "when SP=True, the dimensions are flipped so the sequence dimension is first. So we only need to make sure it's divisible by 8 after the TP split to comply with TE's expectations."
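To make that rule concrete, here is a minimal sketch (mirroring the PR's padding logic, not the PR code itself) of how the padding multiple follows from TP, CP, and sequence parallelism:

```python
# Minimal sketch of the padding rule discussed above (not the PR code itself).
# With sequence_parallel=True the sequence dimension is split across TP ranks,
# so each rank's shard must be divisible by 8, i.e. the full sequence length
# must be a multiple of 8 * TP; otherwise a multiple of 16 satisfies fp8.
# Context parallelism splits the sequence again, hence the final * CP.
def pad_seq_length_multiple(tp: int, cp: int, sequence_parallel: bool) -> int:
    base = 8 * tp if sequence_parallel else 16
    return base * cp

assert pad_seq_length_multiple(tp=2, cp=2, sequence_parallel=True) == 32   # 8 * 2 * 2
assert pad_seq_length_multiple(tp=1, cp=4, sequence_parallel=False) == 64  # 16 * 4
```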

@@ -88,7 +88,7 @@ def get_loss_and_metrics(self, batch, forward_only):
        set_sync_funcs(self, forward_only)

        fwd_bwd_function = get_forward_backward_func()
-       fwd_loss_fn = self.get_forward_output_and_loss_func(forward_only)
+       fwd_loss_fn = self.get_forward_output_and_loss_func(forward_only, tuning=True)
Collaborator

What does tuning do?

Collaborator Author (ashors1)

It controls the keys that are returned in the batch: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py#L1211-L1222. If tuning=False, we don't return the keys that are necessary for sequence packing (and thus CP) in TE.

Note also that tuning is set to True in NeMo's SFT: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py#L407
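As an illustration of that behavior, here is a rough sketch that paraphrases the linked NeMo code; the exact key names are assumptions for illustration, not quotes from NeMo.

```python
# Rough sketch (not the actual NeMo implementation): tuning=True keeps the
# packed-sequence keys that TE needs for context parallelism; tuning=False
# drops them. The key names here are assumptions for illustration.
def select_required_keys(batch: dict, tuning: bool) -> dict:
    required = {"tokens", "labels", "loss_mask", "attention_mask", "position_ids"}
    if tuning:
        # Packed-sequence metadata (cumulative sequence lengths, etc.) that
        # TE's context-parallel attention path relies on.
        required |= {"cu_seqlens", "cu_seqlens_argmin", "max_seqlen"}
    return {k: v for k, v in batch.items() if k in required}
```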

@terrykong terrykong changed the base branch from main to dev December 4, 2024 00:58
@ashors1 ashors1 changed the base branch from dev to main December 4, 2024 06:43
@ashors1 ashors1 removed the Run CICD Set + un-set to retrigger label Dec 4, 2024
@ashors1 ashors1 (Collaborator, Author) commented Dec 4, 2024

Closing in favor of #430. @terrykong, I'll address your comments there.

@ashors1 ashors1 closed this Dec 4, 2024