Commit
Signed-off-by: Terry Kong <[email protected]>
Showing 19 changed files with 1,196 additions and 193 deletions.
New file (+270 lines): scripts/nlp_language_modeling/prepare_packed_dpo_dataset.py
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
from dataclasses import dataclass
from typing import TYPE_CHECKING, Dict, List, Tuple

import numpy as np
import torch
from tqdm import tqdm

from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset import GPTSFTDataset
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.sequence_packing_utils import create_hist, create_packing_strategy
from nemo_aligner.data.nlp.builders import build_train_valid_test_dpo_datasets
from nemo_aligner.data.nlp.datasets import DPOModelDataset

if TYPE_CHECKING:
    from omegaconf import DictConfig

""" | ||
Script to prepare packed dataset from a DPO dataset in the jsonl format. | ||
Three main steps are run in this script: | ||
1. The online processing code in DPOModelDataset is run (including prompt template manipulation, | ||
sequence length truncation, tokenization, etc) and the result is an array of tokenized sequences, | ||
represented by indices). | ||
2. chosen and rejected sequences are concatenated for each example | ||
3. The sequences are grouped by length, and a packing algorithm is run. (https://en.wikipedia.org/wiki/Bin_packing_problem#Offline_algorithms) | ||
Currently, two variants of "first fit" are supported. | ||
"first_fit_decreasing" sorts the sequences in decreasing order before applying first-fit. | ||
It generates a more optimal packing, but it tends to keep all short sequences together, which may affect convergence. | ||
"first_fit_shuffle" runs first-fit in a random order. Packing is less optimal but it keeps the dataset order random. | ||
The recommendation is to run "first_fit_shuffle" and check the packed sequence lengths in the printout. | ||
If they are similar to the target length (i.e. packing is efficient), then use shuffle. Otherwise try first_fit_decreasing. | ||
Example usage: | ||
python scripts/nlp_language_modeling/prepare_packed_dpo_dataset.py \ | ||
model.data.train_ds.file_names=[/path/to/training.jsonl] \ | ||
model.encoder_seq_length=1024 \ | ||
+tokenizer_path=<see note 1 below> \ | ||
+tokenizer_type=sentencepiece \ | ||
+output_dir=/path/to/output_folder \ | ||
+pack_sizes=[2048,4096,8192] | ||
Note: | ||
- Tokenizer path supports SentencePiece tokenizer and HF tokenizer. | ||
For SentencePiece tokenizer, specify the file /path/to/tokenizer.model | ||
For HF tokenizer, specify a folder /path/to/hf_folder which contains tokenizer.json, tokenizer_config.json | ||
and special_tokens_map.json or the HF name of the tokenizer to use (e.g. "meta-llama/Meta-Llama-3-8B") | ||
- If your model or dataset requires non-default configs for DPO training in NeMo, you will | ||
need to pass in the same configs to ``model.data.train_ds`` as you would for training with unpacked dataset. | ||
- ``model.encoder_seq_length`` is the length to truncate each sequence before packing multiple sequences | ||
to the size of packed sequence (``pack_size``). | ||
- ``pack_sizes`` is a list of packed sequence lengths. In this example, there will be three output files, one for | ||
each pack size. The output files are named ``<output_folder>/packed_{pack_size}_seed{seed}.npy``. | ||
This argument is a list because you will likely want to experiment with a few ``pack_sizes`` to find out which length | ||
can fill the GPU memory without exceeding it. Adjusting ``pack_size`` is analogous to adjusting the micro batch size in | ||
the unpacked case. | ||
- **important**: ``pack_sizes`` should be at least double the value of model.encoder_seq_length in order to guarantee | ||
that chosen and rejected sequences for a given example can be packed together. | ||
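  For example, with ``model.encoder_seq_length=1024`` as in the usage above, the chosen and
  rejected sequences of one example can each be up to 1024 tokens, so the pair can occupy up
  to 2048 tokens; the smallest safe ``pack_size`` is therefore 2048.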
""" | ||
|
||
|
||
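
## For intuition, here is a minimal sketch of the "first fit" idea described above
## (illustrative only; this script uses the real implementation imported from
## nemo.utils.sequence_packing_utils): place each sequence length into the first bin
## with room, opening a new bin when none fits.
def _first_fit_sketch(seq_lens: List[int], pack_size: int) -> List[List[int]]:
    bins: List[List[int]] = []  # each bin holds sequence lengths summing to <= pack_size
    for length in seq_lens:
        for b in bins:
            if sum(b) + length <= pack_size:
                b.append(length)
                break
        else:  # no existing bin has room; open a new one
            bins.append([length])
    return bins


## e.g. _first_fit_sketch([8, 5, 3, 2], 10) -> [[8, 2], [5, 3]]. Sorting the input in
## decreasing order first gives "first_fit_decreasing"; shuffling it gives "first_fit_shuffle".
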
def tokenize_dataset(cfg: "DictConfig", tokenizer_type: str):
    """
    Tokenizes a dataset using the same configuration file as DPOModelDataset.

    This function reads a dataset and tokenizes it based on the provided configuration.

    Args:
        cfg: A Hydra configuration object containing parameters for tokenization.
        tokenizer_type: One of "huggingface" or "sentencepiece".

    Returns:
        A NumPy array containing the tokenized sequences from the dataset.
    """

    logging.info("Tokenizing dataset...")

    if tokenizer_type == "huggingface":
        # pass in either a local Hugging Face folder which contains tokenizer.json
        # or the name of the tokenizer on the Hugging Face Hub
        tokenizer = get_nmt_tokenizer(library="huggingface", model_name=cfg.tokenizer_path, use_fast=True)
    elif tokenizer_type == "sentencepiece":
        tokenizer = get_nmt_tokenizer(library="sentencepiece", tokenizer_model=cfg.tokenizer_path)
    else:
        raise ValueError(f"unsupported tokenizer type {tokenizer_type}")

    with open(cfg.model.data.data_prefix, "r", encoding="utf-8") as fr:
        data_payload = [json.loads(line.strip()) for line in fr]
    documents = np.arange(len(data_payload), step=1, dtype=np.int32)
    dataset = DPOModelDataset(
        cfg=cfg.model,
        name="packing_dataset",
        tokenizer=tokenizer,
        data_prefix=cfg.model.data.data_prefix,
        documents=documents,
        data=data_payload,
        seq_length=cfg.model.data.seq_length,
        seed=cfg.model.get("seed", 1234),
        drop_last=True,  ## False not currently supported
        pad_chosen_rejected_to_max=False,
    )

    combined_dataset = []
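    # Concatenate each example's chosen and rejected sequences into a single row;
    # "boundary" records where the chosen part ends (i.e. where rejected begins),
    # so the two halves can be separated again after packing.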
    for item in dataset:
        if item["ignore_example"]:
            continue
        input_ids = torch.cat((item["chosen"], item["rejected"])).numpy()
        labels = torch.cat((item["chosen_labels"], item["rejected_labels"])).numpy()
        reward = torch.tensor([item["chosen_reward"], item["rejected_reward"]]).numpy()
        boundary = len(item["chosen"])
        lengths = np.array([item["chosen_length"], item["rejected_length"]])
        new_item = {
            "input_ids": input_ids,
            "labels": labels,
            "reward": reward,
            "lengths": lengths,
            "boundary": boundary,
        }
        combined_dataset.append(new_item)

    return np.array(combined_dataset)


## modified version of https://github.com/NVIDIA/NeMo/blob/main/nemo/utils/sequence_packing_utils.py#L178 for DPO
## pack_size should be at least 2 * encoder_seq_length, since the packed sequences include
## both the chosen and rejected sequences for a given example
def fill_packing_strategy(
    assignments: List[List[int]], sequences: Dict[int, List[Dict]], pack_size: int
) -> List[Dict]:
    """
    Fills the packing strategy with actual sequence data based on assignments and sequence information.

    This function takes the assignments generated by the packing algorithm (bins of sequence
    lengths), the original sequence data, and the pack size. It iterates through the
    assignments, retrieves the corresponding sequences from the sequences dictionary, and
    constructs the final output data structure with input IDs, labels, rewards, lengths, and
    sequence boundaries for each packed sequence.

    Args:
        assignments: A list of lists, where each inner list represents a bin and contains the
            sequence lengths assigned to that bin (output of 'create_packing_strategy').
        sequences: A dictionary where keys are sequence lengths and values are lists of
            corresponding sequences from the dataset (output of 'create_hist').
        pack_size: The maximum capacity of each bin.

    Returns:
        output_data: A list of dictionaries, where each dictionary represents a packed sequence
            with its input IDs, labels, rewards, lengths, and sequence boundaries.
    """
    ifile_handles = dict()
    for seq_len in tqdm(range(pack_size + 1)):
        per_seq_data = sequences[seq_len]
        if len(per_seq_data) > 0:
            # shuffle sequences of the same length so the pops below are in random order
            perm = np.random.permutation(len(per_seq_data))
            input_ids = np.array([x["input_ids"] for x in per_seq_data])[perm].tolist()
            labels = np.array([x["labels"] for x in per_seq_data])[perm].tolist()
            reward = np.array([x["reward"] for x in per_seq_data])[perm].tolist()
            lengths = np.array([x["lengths"] for x in per_seq_data])[perm].tolist()
            boundary = np.array([x["boundary"] for x in per_seq_data])[perm].tolist()

            ifile_handles[seq_len] = (input_ids, labels, reward, lengths, boundary)

    input_ids, labels, reward, lengths, seq_boundaries = {}, {}, {}, {}, {}

    for oindex, assignment in tqdm(enumerate(assignments), total=len(assignments)):
        _input_ids, _labels, _reward, _lengths, _seq_boundaries = [], [], [], [], [0]

        for seq_length in assignment:
            previous_seq_len = len(_input_ids)

            _input_ids.extend(ifile_handles[seq_length][0].pop())
            _labels.extend(ifile_handles[seq_length][1].pop())
            _reward.extend(ifile_handles[seq_length][2].pop())
            _lengths.extend(ifile_handles[seq_length][3].pop())

            ## store the boundaries for the chosen, rejected sequences
            _seq_boundaries.append(previous_seq_len + ifile_handles[seq_length][4].pop())
            _seq_boundaries.append(len(_input_ids))
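            ## e.g. packing two examples whose (chosen, rejected) halves have lengths
            ## (c1, r1) and (c2, r2) yields _seq_boundaries = [0, c1, c1 + r1,
            ## c1 + r1 + c2, c1 + r1 + c2 + r2]: example i starts at boundary 2*i and
            ## its chosen/rejected split sits at boundary 2*i + 1.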

        input_ids[oindex] = _input_ids
        labels[oindex] = _labels
        reward[oindex] = _reward
        lengths[oindex] = _lengths
        seq_boundaries[oindex] = _seq_boundaries

    output_data = []
    for i in range(len(input_ids)):
        item_dict = {
            "input_ids": input_ids[i],
            "labels": labels[i],
            "reward": reward[i],
            "lengths": lengths[i],
            "seq_boundaries": seq_boundaries[i],
        }
        output_data.append(item_dict)

    # (input_ids, labels, reward, lengths, boundary) = length 5
    for i in range(5):
        assert all(
            not seq[i] for seq in ifile_handles.values()
        ), "Error: There are items left over from the assignment"
    return output_data


@dataclass
class PackingArgs:
    output_dir: str = "output"
    pack_sizes: Tuple[int, ...] = (2048,)
    packing_algorithm: str = "first_fit_shuffle"
    tokenizer_type: str = "sentencepiece"  ## one of "huggingface" or "sentencepiece"

    def from_config(self, cfg: "DictConfig"):
        for required_arg in ("output_dir", "pack_sizes"):
            assert cfg.get(required_arg, None), f"Please specify +{required_arg}=..."
        self.output_dir = cfg.output_dir
        self.pack_sizes = cfg.pack_sizes
        self.packing_algorithm = cfg.get("packing_algorithm", "first_fit_shuffle")
        # fall back to the dataclass default if +tokenizer_type is not given
        self.tokenizer_type = cfg.get("tokenizer_type", self.tokenizer_type)
        return self


@hydra_runner(config_path="../../gpt/conf", config_name="gpt_dpo")
def main(cfg: "DictConfig") -> None:
    args = PackingArgs().from_config(cfg)
    dataset = tokenize_dataset(cfg, args.tokenizer_type)
    sequences, histogram = create_hist(
        dataset, 2 * cfg.model.data.seq_length
    )  ## multiply by 2 because packed sequences include chosen and rejected
    for pack_size in args.pack_sizes:
        assignments = create_packing_strategy(histogram, pack_size, args.packing_algorithm)
        output_data = fill_packing_strategy(assignments, sequences, pack_size)

        # save output data
        os.makedirs(args.output_dir, exist_ok=True)
        output_path = os.path.join(args.output_dir, f"packed_{pack_size}_seed{cfg.model.get('seed', 1234)}.npy")
        np.save(output_path, output_data)
        logging.info(f"Done, output written to {output_path}")

    logging.info(
        f"""
✅ Packed datasets with pack sizes {args.pack_sizes} are prepared successfully.
To train with packed sequences, you need to make changes to the DPO config file.
See the NeMo-Aligner sequence packing documentation for more details:
https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/dpo.rst#sequence-packing-with-dpo
"""
    )


if __name__ == "__main__":
    main()
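
## A quick way to sanity check an output file (hypothetical path shown; the dicts are
## saved as a NumPy object array, so allow_pickle=True is required on load):
##     data = np.load("output/packed_2048_seed1234.npy", allow_pickle=True)
##     print(data[0].keys())  # input_ids, labels, reward, lengths, seq_boundaries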