docs: 0.5.0 documentation updates (#346)
Signed-off-by: ashors1 <[email protected]>

ci: Sign-off cherry pick (#366)

Signed-off-by: Oliver Koenig <[email protected]>

docs: main readme and sft docs (#367)

Signed-off-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Gerald Shen <[email protected]>

docs: fix code block rendering (#369)

Signed-off-by: ashors1 <[email protected]>

dpo and sft

Signed-off-by: arendu <[email protected]>

dpo support

Signed-off-by: root <[email protected]>

mamba padding

Signed-off-by: arendu <[email protected]>

convenience script to remove old format of DPO data

Signed-off-by: adithyare <[email protected]>

pad to mult 256

Signed-off-by: arendu <[email protected]>

copy dpo style cfg overrides

Signed-off-by: arendu <[email protected]>

remove _modify_config

Signed-off-by: arendu <[email protected]>

fix config issue

Signed-off-by: Jiaqi Zeng <[email protected]>

fix mamba config issue

Signed-off-by: Jiaqi Zeng <[email protected]>

is mamba default false

Signed-off-by: arendu <[email protected]>
ashors1 authored and terrykong committed Nov 21, 2024
1 parent 3ed1cb1 commit bf8e6cc
Showing 19 changed files with 472 additions and 782 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/cherry-pick-release-commit.yml
@@ -60,7 +60,7 @@ jobs:
(
git fetch origin $RELEASE_BRANCH:$RELEASE_BRANCH
git switch --force-create cherry-pick-$PR_ID-$RELEASE_BRANCH $RELEASE_BRANCH
git cherry-pick $SHA
git cherry-pick --signoff $SHA
git push -u origin --force cherry-pick-$PR_ID-$RELEASE_BRANCH
git checkout ${CI_DEFAULT_BRANCH:-main}
)
7 changes: 7 additions & 0 deletions README.md
@@ -42,6 +42,13 @@ For the latest stable release, please see the [releases page](https://github.com
### Requirements
NeMo-Aligner has the same requirements as the [NeMo Toolkit Requirements](https://github.com/NVIDIA/NeMo#requirements) with the addition of [PyTriton](https://github.com/triton-inference-server/pytriton).

### Quick start inside NeMo container
NeMo Aligner comes included with NeMo containers. On a machine with NVIDIA GPUs and drivers installed run NeMo container:
```bash
docker run --gpus all -it --rm --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.07
```
Once you are inside the container, NeMo-Aligner is already installed and together with NeMo and other tools can be found under ```/opt/``` folder.
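
As a quick sanity check (a minimal sketch, assuming the layout described above and that the packages are importable as ``nemo_aligner`` and ``nemo``), you can confirm the tools are present once the container starts:
```bash
# Both source trees ship with the container under /opt/; exact contents may vary by release tag.
ls /opt/NeMo-Aligner /opt/NeMo
# Confirm the Python packages are importable.
python -c "import nemo_aligner, nemo; print('NeMo-Aligner and NeMo import OK')"
```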

### Install NeMo-Aligner
Please follow the same steps as outlined in the [NeMo Toolkit Installation Guide](https://github.com/NVIDIA/NeMo#installation). After installing NeMo, execute the following additional command:
```bash
33 changes: 18 additions & 15 deletions docs/user-guide/dpo.rst
@@ -5,25 +5,28 @@
Model Alignment by DPO, RPO, and IPO
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

.. note::
Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.

The NeMo Framework supports efficient model alignment via the NeMo-Aligner codebase.

All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. The same tutorial also works for GPT models (such as LLaMa2) of any size.
All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. The same tutorial also works for GPT models (such as LLaMa3) of any size.

DPO with LoRA
#############

We support both full-parameter DPO training and LoRA DPO training.
For full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see `DPO paper <https://arxiv.org/pdf/2305.18290.pdf>`__).
In full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see the `DPO paper <https://arxiv.org/pdf/2305.18290.pdf>`__).
For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights.

RPO and IPO Variations
#######################

Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity preference optimization (IPO) and Reward-aware preference optimization (RPO).
Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity Preference Optimization (IPO) and Reward-aware Preference Optimization (RPO).

The algorithm is identified with the ``dpo.preference_loss`` config variable. We support three sorts of RPO algorithms based on the distance metric: ``rpo_sq`` for squared distance, ``rpo_bwd_kl`` for Bernoulli backward KL divergence, and ``rpo_fwd_kl`` for Bernoulli forward KL divergence.

To use the RPO algorithm, each dataset example should have chosen_reward and rejected_reward, which might come from human labelers or reward models. If chosen_reward and rejected_reward are not existent in the data, dpo.default_chosen_reward and dpo.default_rejected_reward are used.
To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not existent in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used.
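
For illustration only (the reward values below are hypothetical; the field names are the ones this guide requires), an RPO-ready training sample carries the two extra reward fields alongside the usual DPO fields::

{"prompt": "Which year was the Magna Carta signed?", "chosen_response": "1215", "rejected_response": "I refuse to answer this question.", "chosen_reward": 4.0, "rejected_reward": 1.5}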

Obtain a Pretrained Model
############################
@@ -36,18 +39,18 @@ To start, we must first get a pretrained model to align. There are two models we

#. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``.
#. Extract the NeMo File to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint``.
#. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/86b198ff93438d454f9c7f3550bcfb7d4e59feab/scripts/nlp_language_modeling/convert_nemo_gpt_to_mcore.py>`__.
#. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/0ec7e9090d3261b8ce81818b0555a204e50d814d/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py>`__.
.. code-block:: bash
python convert_nemo_gpt_to_mcore.py \
--in-folder ./model_checkpoint \
--out-file ./mcore_gpt.nemo
.. tab-item:: LLaMa2 7B
.. tab-item:: LLaMa3 7B
:sync: key2

#. Download the `Llama 2 7B LLM model and tokenizer <https://huggingface.co/meta-llama/Llama-2-7b>`__ into the models folder.
#. Convert the LLaMa2 LLM into ``.nemo`` format.
#. Download the `Llama 3 8B LLM model and tokenizer <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__ into the models folder.
#. Convert the LLaMa3 LLM into ``.nemo`` format.
.. code-block:: bash
python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
@@ -78,7 +81,7 @@ For best DPO training performance, it is recommended that you start with a SFT m
DPO Model Training
#####################

Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects .jsonl files where each line is a JSON dict corresponding to a single, complete sample, as shown below::
Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects ``.jsonl`` files where each line is a JSON dict corresponding to a single, complete sample, as shown below::

{"prompt": "Which year was the Magna Carta signed?", "chosen_response": "1215", "rejected_response": "I refuse to answer this question."}
{"prompt": "Please give me the name of a famous medieval painter.", "chosen_response": "Hieronymus Bosch", "rejected_response": "David Hockney"}
Expand All @@ -88,12 +91,12 @@ However, please be aware that most Megatron GPT models adhere to a strict format
{"prompt": "<extra_id_0>System\n\n<extra_id_1>User\nWhich year was the Magna Carta signed?\n<extra_id_1>Assistant\n", "chosen_response": "1215\n<extra_id_1>", "rejected_response": "I refuse to answer this question.\n<extra_id_1>"}
{"prompt": "<extra_id_0>System\n\n<extra_id_1>User\nPlease give me the name of a famous medieval painter.\n<extra_id_1>Assistant\n", "chosen_response": "Hieronymus Bosch\n<extra_id_1>", "rejected_response": "David Hockney\n<extra_id_1>"}

Always follow the prompt-response template format used during your SFT training for DPO, as failure to do so will produce a model which outputs garbage text. You should create one jsonl file in the format above for your training data and one jsonl for your validation data.
Always follow the prompt-response template format used during your SFT training for DPO, as failure to do so will produce a model which outputs garbage text. You should create one ``.jsonl`` file in the format above for your training data and one ``.jsonl`` for your validation data.

Your JSONL file must contain at least as many samples as the Global Batch Size (GBS) you plan to use during training. For example, if GBS = 64, ensure that both your training and validation files include at least 64 samples. Using a file with fewer samples than the GBS will result in a crash.
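
A quick way to check this before launching a run (a minimal sketch using the placeholder paths from this section):

.. code-block:: bash

   # Each count must be at least the configured GBS (e.g., 64).
   wc -l /path/to/train_dpo_format.jsonl /path/to/valid_dpo_format.jsonl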

Once your data is processed into the correct format, you are ready to begin DPO training. You must start with a pretrained or SFT trained model. For this section, we will use the SFT model trained in the previous step to train the DPO model.
For the purposes of the following sections, we assume that your training jsonl file is located in ``/path/to/train_dpo_format.jsonl`` and your validation jsonl file is located in ``/path/to/valid_dpo_format.jsonl``.
For the purposes of the following sections, we assume that your training ``.jsonl`` file is located in ``/path/to/train_dpo_format.jsonl`` and your validation ``.jsonl`` file is located in ``/path/to/valid_dpo_format.jsonl``.

For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` corresponds to the beta parameter in the DPO paper.

@@ -196,7 +199,7 @@ All metrics will be grouped by either ``train/`` or ``val/`` in WandB, represent
When it comes to ideal hyperparameters for DPO training, much will depend on the characteristics of your SFT or base/foundation model. Consequently, there are no one-size-fits-all parameters that will universally work in all cases.
However, the following list is a brief overview of which hyperparameters we have perturbed for various model sizes and their effects:
* global_batch_size: generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained.
* epochs: highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs.
* learning rate: we tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5-e7.
* ref_policy_kl_penalty: we generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too.
* global_batch_size: Generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained.
* epochs: Highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs.
* learning rate: We tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5-e7.
* ref_policy_kl_penalty: We generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too.
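
As a sketch of how the settings above map onto launch-time overrides (the Hydra-style key names here are assumptions based on typical NeMo optimizer and DPO configs, not values taken from this guide; consult the DPO example config for the exact keys):

.. code-block:: bash

   # Illustrative only -- verify key names against the DPO example config before using.
   DPO_OVERRIDES="model.global_batch_size=256 \
   model.optim.lr=9e-7 \
   model.optim.sched.warmup_steps=10 \
   model.dpo.ref_policy_kl_penalty=0.1 \
   trainer.dpo.max_epochs=1"
   # Append ${DPO_OVERRIDES} to the DPO training command shown earlier in this guide.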
28 changes: 15 additions & 13 deletions docs/user-guide/draftp.rst
@@ -2,18 +2,20 @@

.. _model-aligner-draftp:

Fine-tuning Stable Diffusion with DRaFT+
Fine-Tuning Stable Diffusion with DRaFT+
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

In this tutorial, we will go through the step-by-step guide for fine-tuning Stable Diffusion model using DRaFT+ algorithm by NVIDIA.
DRaFT+ is an improvement over the `DRaFT <https://arxiv.org/pdf/2309.17400.pdf>`__ algorithm by alleviating the mode collapse and improving diversity through regularization.
For more technical details on the DRaFT+ algorithm, check out our technical blog.
.. note::
Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.

In this tutorial, we will go through the step-by-step guide for fine-tuning a Stable Diffusion model using DRaFT+ algorithm by NVIDIA.
DRaFT+ enhances the `DRaFT <https://arxiv.org/pdf/2309.17400.pdf>`__ algorithm by mitigating mode collapse and improving diversity through regularization.
For more technical details on the DRaFT+ algorithm, check out our technical blog.

Data Input for running DRaFT+
Data Input for Running DRaFT+
#############################

The data for running DRaFT+ should be a ``.tar`` file consisting of a plain prompt. You can generate a tarfile from a ``.txt``
The data for running DRaFT+ should be a ``.tar`` file consisting of a plain prompt. You can generate a tar file from a ``.txt``
file containing the prompts separated by new lines, such as following format::

prompt1
@@ -35,7 +37,7 @@ Use the following script to download and save the prompts from the `Pick a pic <
for caption in captions:
file.write(caption + '\n')
You can then run the following snipet to convert it to a ``.tar`` file:
You can then run the following snippet to convert it to a ``.tar`` file:
.. code-block:: bash
@@ -64,8 +66,8 @@ you can use the `conversion script <https://github.com/NVIDIA/NeMo/blob/main/exa
DRaFT+ Training
###############
To launch reward model training, you must have checkpoints for `UNet <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/unet>`__ and
`VAE <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/vae>`__ of a trained Stable Diffusion model and a checkpoint for the Reward Model.
To start reward model training, you need checkpoints for both the `UNet <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/unet>`__ and
`VAE <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/vae>`__ components of a trained Stable Diffusion model, as well as a checkpoint for the Reward Model.
.. tab-set::
Expand Down Expand Up @@ -167,7 +169,7 @@ To launch reward model training, you must have checkpoints for `UNet <https://hu
.. note::
For more info on DRaFT+ hyperparameters please see the model config files (for SD and SDXL respectively):
For more information on DRaFT+ hyperparameters, please see the model config files (for SD and SDXL respectively):
``NeMo-Aligner/examples/mm/stable_diffusion/conf/draftp_sd.yaml``
``NeMo-Aligner/examples/mm/stable_diffusion/conf/draftp_sdxl.yaml``
@@ -179,10 +181,10 @@ Once you have completed fine-tuning Stable Diffusion with DRaFT+, you can run in
and `sd_lora_infer.py <https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/text_to_image/stable_diffusion/sd_lora_infer.py>`__ scripts from the NeMo codebase. The generated images with the fine-tuned model should have
better prompt alignment and aesthetic quality.
User controllable finetuning with Annealed Importance Guidance (AIG)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
User-controllable Fine-Tuning with Annealed Importance Guidance (AIG)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
AIG provides the inference-time flexibility to interpolate between the base Stable Diffusion model (with low rewards and high diversity) and DRaFT-finetuned model (with high rewards and low diversity) to obtain images with high rewards and high diversity. AIG inference is easily done by specifying comma-separated `weight_type` strategies to interpolate between the base and finetuned model.
AIG provides the inference-time flexibility to interpolate between the base Stable Diffusion model (with low rewards and high diversity) and a DRaFT+ fine-tuned model (with high rewards and low diversity) to obtain images with high rewards and high diversity. AIG inference is easily done by specifying comma-separated ``weight_type`` strategies to interpolate between the base and fine-tuned model.
.. tab-set::
.. tab-item:: AIG on Stable Diffusion XL
32 changes: 32 additions & 0 deletions docs/user-guide/modelalignment.rsts
@@ -1,2 +1,34 @@

.. _model-aligner-intro:

Model Alignment
!!!!!!!!!!!!!!!

Introduction
############

NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit has support for state-of-the-art model alignment algorithms such as SteerLM, Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be more safe, harmless, and helpful. Users can perform end-to-end model alignment on a wide range of model sizes and take advantage of all the parallelism techniques to ensure their model alignment is done in a performant and resource-efficient manner. For more technical details, please refer to our `paper <https://arxiv.org/abs/2405.01481>`__.

The NeMo-Aligner toolkit is built using the `NeMo Toolkit <https://github.com/NVIDIA/NeMo>`__ which allows for scaling training up to 1000s of GPUs using tensor, data and pipeline parallelism for all components of alignment. All of our checkpoints are cross-compatible with the NeMo ecosystem, allowing for inference deployment and further customization.

The toolkit is currently in its early stages. We are committed to improving the toolkit to make it easier for developers to pick and choose different alignment algorithms to build safe, helpful, and reliable models.

Get Started
###########

NeMo-Aligner comes preinstalled in NVIDIA NeMo containers. NeMo containers are launched concurrently with NeMo version updates.

To get access to the container, log in to the NVIDIA GPU Cloud (NGC) platform or create a free NGC account here: `NVIDIA NGC <https://ngc.nvidia.com/signin>`__. Once you have logged in, you can get the container here: `NVIDIA NGC NeMo Framework <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo>`__.

To use a pre-built container, run the following code:

.. code-block:: bash

docker run -it --gpus=all --shm-size=8g --workdir /opt/NeMo-Aligner nvcr.io/nvidia/nemo:24.09

Please use the latest tag in the form yy.mm.(patch).

.. note::
Some of the subsequent tutorials require accessing gated Hugging Face models. For details on how to access these models, refer to ``this document <https://docs.nvidia.com/nemo-framework/user-guide//latest/generaltips.html#working-with-hugging-face-models>``__.

