From 6721e724154053d0ff2f46499caa57bc3cfc6fac Mon Sep 17 00:00:00 2001
From: Anna Shors
Date: Thu, 12 Dec 2024 11:52:01 -0800
Subject: [PATCH] docs: add more details on CP + SFT support (#447)

Signed-off-by: ashors1
Signed-off-by: Terry Kong
---
 CHANGELOG.md | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 93701d298..8a2a6eb54 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -18,7 +18,19 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 ## [Next Version]
 
 ### New Features and Optimizations
-- Added context parallel support for SFT. CP can be enabled by setting `model.context_parallel_size` in your config.
+- Added context parallel (CP) support for SFT. CP requires you to prepare your dataset using NeMo's [prepare_packed_ft_dataset.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py) script prior to training. Be sure to pass the context parallel size to this script, for example:
+
+  ```
+  python scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
+     model.data.train_ds.file_names=[/path/to/training.jsonl] \
+     model.data.train_ds.max_seq_length=2048 \
+     +tokenizer_path=/path/to/tokenizer \
+     +output_dir=/path/to/output_folder \
+     +pack_sizes=[2048,4096,8192] \
+     model.context_parallel_size=2
+  ```
+  CP can then be enabled in your training run by setting `model.context_parallel_size` in your config. Refer to the [SFT documentation](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/sft.rst#step-1-format-the-data)
+for more details on running `prepare_packed_ft_dataset.py` and on running SFT with a packed dataset.
 - Sequence packing is now supported when running DPO.
 - Added support for Knowledge Distillation with SFT. See the [tutorial](docs/user-guide/knowledge-distillation.rst) for details.
 - Added support for Megatron Core’s distributed optimizer, which can be configured using `++model.optim.name=mcore_distributed_optim`.
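
For reference, below is a minimal sketch of what the follow-on training launch could look like once the packed dataset above has been produced. Only `model.context_parallel_size` and `++model.optim.name=mcore_distributed_optim` come from the changelog entry itself; the script path (`examples/nlp/gpt/train_gpt_sft.py`), `model.restore_from_path`, and the `model.data.train_ds.*` keys are assumptions about the NeMo-Aligner SFT config and may differ in your checkout, so treat the linked SFT documentation as authoritative.

```
# Hypothetical SFT launch with context parallelism enabled.
# Assumed: the script path, restore_from_path, and the train_ds keys below;
# verify them against the SFT documentation linked in the changelog entry.
python examples/nlp/gpt/train_gpt_sft.py \
   trainer.num_nodes=1 \
   trainer.devices=8 \
   model.restore_from_path=/path/to/base_model.nemo \
   model.data.train_ds.file_path=/path/to/output_folder/<packed_dataset>.npy \
   model.data.train_ds.packed_sequence=True \
   model.context_parallel_size=2 \
   ++model.optim.name=mcore_distributed_optim
```

The packing script needs the context parallel size up front, so keep `model.context_parallel_size` consistent between the `prepare_packed_ft_dataset.py` step and the training run.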