This project provides a guide to run NemoMegatron on AWS using a container from Nvidia GPU Cloud (NGC). The test cases cover NemoMegatron for different model sizes: 126M, 5B, 40B and 175B parameters. The test cases run on Slurm and use Nvidia Enroot and Nvidia Pyxis.
Table of contents:
- 1. Pre-requisites
- 2. Build AWS-optimized Nemo-Launcher image
- 3. Set-up the NemoMegatron environment
- 4. Prepare Input Data
- 5. Pre-training GPT3
- 6. Customizing Pre-Training
- 7. Pre-Training llama2
- 8. References
- 9. Authors / Reviewers
The following pre-requisites are needed to run this example:
- You are using p4de.24xlarge instances (A100 80GB GPUs) or newer, with at least 80 GB of memory per GPU.
- You have access to the base image NeMo Framework Training. To gain access to this image, go to Get Access to NeMo Framework and enroll in the organization/team `ea-bignlp/ga-participant`.
- Docker, Enroot and Pyxis are installed on the cluster and available on all nodes. It is assumed you are using a Custom AMI (example).
You will need to set up the following environment variables before running the scripts:
export NEMO_VERSION=23.11
export REPO=aws-nemo-megatron
export TAG=$NEMO_VERSION
export TARGET_PATH=/fsx/nemo-launcher-$NEMO_VERSION # must be a shared filesystem
export TEST_CASE_PATH=/home/ec2-user/2.nemo-launcher # where you copy the test case or set to your test case path
export ENROOT_IMAGE=/fsx/${REPO}_${TAG}.sqsh
cd $TEST_CASE_PATH
You will retrieve the container image from Nvidia, build an optimized container for EFA, and convert it into an Enroot squash file so it can run on your cluster.
- You have a registered account with Nvidia and can access NGC. Retrieve the NGC API key following instructions from Nvidia and request access here in order to be able to pull NeMo images.
- Configure NGC with the command below. When requested, use `$oauthtoken` for the login and the API key from NGC for the password.
docker login nvcr.io
- Copy the file `0.NemoMegatron-aws-optimized.Dockerfile` to the local directory and run the command below. Docker will retrieve the NemoMegatron container image from NGC, then build an optimized container for AWS. This stage takes a few minutes and you can follow its progress thanks to the `--progress plain` flag.
docker build --progress plain -t ${REPO}:${TAG} -f 0.NemoMegatron-aws-optimized.Dockerfile .
- Convert the Docker container image to an Enroot squash file that will be stored in `/fsx`. This step takes a few minutes.
[[ -e $ENROOT_IMAGE ]] && rm $ENROOT_IMAGE ; /usr/bin/time enroot import -o $ENROOT_IMAGE dockerd://${REPO}:${TAG}
The Enroot squash file is placed in the `/fsx` directory, which is backed by FSx for Lustre to provide high read throughput to multiple compute nodes at job start.
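As an optional sanity check, you can confirm that the image was built and the squash file was created. Both commands below use standard Docker and shell tooling and rely only on the variables exported earlier:

```bash
# The image built in the previous step should be listed locally...
docker images "${REPO}:${TAG}"
# ...and the Enroot import should have produced a multi-gigabyte .sqsh file on /fsx.
ls -lh "${ENROOT_IMAGE}"
```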
You will set up the target directory to host the configurations and requirements for NemoMegatron. It is assumed that you have an FSx for Lustre file system available to all nodes of your cluster via the mount point `/fsx`. We follow the same logic as the NemoMegatron Launcher documentation.
- Create the target directory with the command below:
mkdir -p $TARGET_PATH
- Retrieve files from the container and place them in the target directory. You execute the container on your head node for this task using the `enroot start` command.
cd $TARGET_PATH
enroot start --mount $TARGET_PATH:/workspace/mount_dir \
--env NVIDIA_VISIBLE_DEVICES=void \
$ENROOT_IMAGE \
cp -a /opt/NeMo-Megatron-Launcher/launcher_scripts /opt/NeMo-Megatron-Launcher/auto_configurator /opt/nemo-data-curator /workspace/mount_dir/
The `NVIDIA_VISIBLE_DEVICES` variable is set to `void` to prevent the process from checking for the Nvidia driver (since we don't need GPUs here).
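If you want to verify the copy, a simple listing of the target directory should show the three directories extracted from the container (optional check):

```bash
# Expect launcher_scripts, auto_configurator and nemo-data-curator here.
ls "${TARGET_PATH}"
```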
- Install the NemoMegatron requirements in a Python VirtualEnv by running the set of commands below.
cd $TARGET_PATH
sudo amazon-linux-extras install -y python3.8 # we need Python >=3.8
/usr/bin/python3.8 -m venv .venv
source .venv/bin/activate
pip3.8 install --upgrade pip setuptools
pip3.8 install -r <(curl -fsSL "https://raw.githubusercontent.com/NVIDIA/NeMo-Megatron-Launcher/${NEMO_VERSION%.*}/requirements.txt")
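Optionally, you can ask pip to verify that the freshly installed requirements resolve cleanly inside the virtual environment:

```bash
# Prints "No broken requirements found." when every dependency is satisfied.
pip3.8 check
```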
Next, you need to prepare the configuration files as follows:
- Review and update the partition name in the .yaml config file `$TEST_CASE_PATH/conf.template/cluster/bcm.yaml`. Here is a summary of the values; if you are unsure of the partition name, see the `sinfo` snippet after the table.
Value | Default | Definition |
---|---|---|
`partition` | `null` | Slurm partition, same as a job queue |
`account` | `null` | Account, if using accounting |
`exclusive` | `True` | The job has exclusive use of the instances it runs on (no other job can run on them) |
`gpus_per_task` | `null` | Number of GPUs per task |
`gpus_per_node` | `8` | Number of GPUs to use per node. This is set to 8 GPUs, as for the p4d.24xlarge |
`mem` | `0` | Requested memory (all of it) |
`job_name_prefix` | `"nemo-megatron-"` | Prefix for your job names |
`gres` | `"gpu:8"` | Generic resource scheduling |
`srun_args` | `"--no-container-mount-home"` | Arguments for the srun command (here for Pyxis) |
`srun_args` | `"-l"` | Arguments for the srun command to improve log verbosity. This flag prepends the Slurm task number to lines of stdout/stderr. |
`srun_args` | `"--open-mode=append"` | Arguments for the srun command. When a Slurm job has multiple steps, each step appends to (rather than truncates) the existing logs. |
`stderr_to_stdout` | `True` | Merge stderr and stdout |
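To list the partitions defined on your Slurm cluster, you can run the command below (this assumes the Slurm client tools are available where you run it, as they are on the head node):

```bash
# Print the partition names known to Slurm; the default partition is marked with '*'.
sinfo --format "%P"
```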
- Copy all the .yaml config files from `conf.template/` to `launcher_scripts/conf/` and substitute environment variables as follows:
cp -v ${TEST_CASE_PATH}/conf.template/cluster/* ${TARGET_PATH}/launcher_scripts/conf/cluster/
envsubst < ${TEST_CASE_PATH}/conf.template/config.yaml > ${TARGET_PATH}/launcher_scripts/conf/config.yaml
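A quick, optional check that the copy and substitution worked is to grep the generated cluster config for the values you just set:

```bash
# The partition and account lines should now show your cluster's values
# instead of the template defaults.
grep -E 'partition|account' "${TARGET_PATH}/launcher_scripts/conf/cluster/bcm.yaml"
```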
The pre-training we're going to run uses the GPT2 tokenizer, which requires you to download the vocabulary files:
mkdir -p $TARGET_PATH/data/bpe
curl -L https://huggingface.co/gpt2/raw/main/vocab.json > $TARGET_PATH/data/bpe/vocab.json
curl -L https://huggingface.co/gpt2/raw/main/merges.txt > $TARGET_PATH/data/bpe/merges.txt
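You can confirm both tokenizer files downloaded correctly before moving on (optional check):

```bash
# vocab.json and merges.txt should both exist and be non-empty.
ls -lh "$TARGET_PATH/data/bpe/"
```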
That's all that is needed to pre-train with a mock dataset generated on the fly.
This section assumes that you went through the previous sections and 1/ retrieved and built the AWS-optimized NemoMegatron container, 2/ set up the NemoMegatron environment, and 3/ downloaded the vocabulary files. Here you start a pre-training of a small 126M-parameter model, which serves as a quick sanity check.
- Source the NemoMegatron environment created earlier.
source ${TARGET_PATH}/.venv/bin/activate
- To pre-train GPT3-126m on two instances with the mock dataset, run the commands below:
cd $TARGET_PATH
$TEST_CASE_PATH/1.bmk-pretrain-gpt3-126m.sh
- Check the file `$TARGET_PATH/launcher_scripts/main.py`. The `launcher_scripts/main.py` script interacts with Slurm on our behalf to generate an `.sbatch` file and submits it to Slurm. Nemo-launcher logs all the invocation commands, output, and errors to `$TARGET_PATH/results/<MODEL_SIZE>/`, described below.

```
$TARGET_PATH/results/gpt3_126m
├── gpt3_126m_hydra.yaml                      # The fully interpolated pre-training configuration
├── launcher_cmd.log                          # The full invocation command of launcher_scripts/main.py
├── launcher.log                              # Job id produced by the sbatch command
├── log-nemo-megatron-gpt3_126m_<JOB_ID>.out  # Stdout of the pre-training Slurm job
├── nemo-megatron-gpt3_126m_submission.sh     # .sbatch file generated and submitted by nemo-launcher
└── results
    ├── cmd-args.log                          # The full invocation command of the pre-training script
    ├── events.out.tfevents.*                 # Tensorboard logs
    ├── git-info.log                          # The commit hash of the NeMo repo provided in the container
    ├── hparams.yaml                          # Pre-training hyperparameters
    ├── lightning_logs.txt                    # Additional logs from PyTorch Lightning
    ├── nemo_error_log.txt                    # Stderr of the pre-training step
    └── nemo_log_globalrank-*.txt             # Log of each rank
```
Please note that except for `log-nemo-megatron-gpt3_126m_<JOB_ID>.out`, the other files will be overwritten when you launch another pre-training of the same model size. To completely separate the output among jobs, run the script in benchmark mode:
BMK_MODE=1 $TEST_CASE_PATH/bmk-pretrain-gpt3-126m.sh
which produces the output directory `$TARGET_PATH/results-<YYYYMMDD>-<HHMMSS>utc-<RANDOM_STR>/gpt3_126m/`.
- You can use the Slurm command `squeue` to monitor the job status in the queue. The sample output below shows a `nemo-megatron` job with job id `1234` in the running state (`ST` = `R`). A queued job will have state `ST` = `PD` (pending). Please refer to the complete list of job states in this Slurm documentation.

```
JOBID PARTITION      NAME     USER ST       TIME  NODES NODELIST(REASON)
 1234 my-cluste nemo-mega ec2-user  R   00:19:40      1 p4de-dy-p4de-24xlarge-[1-2]
```
- Once a job finishes, check the `log-nemo-megatron-<MODEL_NAME>_<MODEL_SIZE>_<JOB_ID>.err` file; it should contain `` `Trainer.fit` stopped: `max_steps=40` reached `` (disregard the warnings).

```
$ tail -5 $TARGET_PATH/results/gpt3_126m/log-nemo-megatron-gpt3_126m_72.err
[NeMo W 2023-09-11 22:31:45 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('consumed_samples', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
      warning_cache.warn(
`Trainer.fit` stopped: `max_steps=40` reached.
```
- Review the output file (`log-nemo-megatron-gpt3_126m_<JOB_ID>.out`), which contains the `stdout` output of the job. The end of the file should be similar to the snippet below.

```
[NeMo I 2023-09-11 22:31:28 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f8ffd427490>"
    will be used during training (effective maximum steps = 40) -
    Parameters :
    (warmup_steps: 636
    constant_steps: 100000
    min_lr: 6.0e-05
    max_steps: 40
    )
Epoch 0: 100%|██████████| 40/40 [00:31<00:00,  1.27it/s, loss=10.9, v_num=, reduced_train_loss=10.90, global_step=39.00, consumed_samples=9984.0]
```
Congratulations! You've successfully run this test case to completion.
Note: should you run into an OOM error, you can lower the micro batch size by setting the MBS in the `bmk` launch scripts, as shown in the sketch below. You can tune other NemoMegatron and PyTorch parameters in the same way.
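As a minimal sketch, assuming the launch script exposes `MBS` as an overridable shell variable (if the value is hard-coded instead, edit it directly in the script):

```bash
# Hypothetical: re-run the 126m benchmark with a smaller micro batch size to
# reduce per-GPU memory pressure. Verify the variable name in the script first.
MBS=1 $TEST_CASE_PATH/1.bmk-pretrain-gpt3-126m.sh
```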
To pre-train a different model size on a different instance count, open `$TEST_CASE_PATH/1.bmk-pretrain-gpt3-126m.sh` and edit section `000` to choose the right hyperparameters. Be aware that pre-training an LLM requires an understanding of hyperparameters such as parallelism and batch sizes. Please refer to the NeMo project (website, GitHub, NeMo-Megatron-Launcher) and the Megatron papers (Shoeybi20, Narayanan21).
At the very least, you'd want to review and customize one or more YAML files under `$TARGET_PATH/launcher_scripts/conf/`. Nemo-launcher organizes its config files in an opinionated hierarchy. Below is an example of the relevant YAML files when launching `$TARGET_PATH/launcher_scripts/main.py` for the `training` stage of `gpt3/126m` (see `$TEST_CASE_PATH/1.bmk-pretrain-gpt3-126m.sh`).
```
$TARGET_PATH/launcher_scripts/conf
├── config.yaml          # Config for generating job scripts (.sbatch, .yaml, etc.)
├── cluster
│   └── bcm.yaml         # Config for Slurm jobs
└── training             # Config for stage "training"
    └── gpt3             # Config for model "gpt3"
        └── 126m.yaml    # Config for model size "126m"
```
You can directly edit `gpt3/<MODEL_SIZE>.yaml` to customize the number of instances, tensor parallelism, pipeline parallelism, batch sizes (micro and global), experiment tracking, etc. Alternatively, you can override these settings through the CLI options of `$TARGET_PATH/launcher_scripts/main.py` (refer to `1.bmk-pretrain-gpt3-126m.sh`). For example, the CLI arg `training.trainer.num_nodes=$NUM_NODES` is equivalent to editing the file `$TARGET_PATH/launcher_scripts/training_scripts/conf/training/<MODEL_NAME>/<MODEL_SIZE>.yaml` to set the key `trainer -> num_nodes` to `$NUM_NODES` (see the illustrative sketch after the diagram below).
```
    +-- file `training/<MODEL_NAME>/<MODEL_SIZE>.yaml` under `$TARGET_PATH/launcher_scripts/conf`
    |
/---+--\
training.trainer.num_nodes=$NUM_NODES
         \_______________/
                 |
                 └── key 'trainer -> num_nodes' in the `.yaml` file.
```
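As an illustrative sketch (not the exact command used by the test case, which lives in `1.bmk-pretrain-gpt3-126m.sh`), a direct launcher invocation with a couple of Hydra overrides could look like the following. The key paths assume the standard NeMo-Launcher config layout shown above, so verify them against your generated `*_hydra.yaml`:

```bash
# Select the gpt3/126m training config, then override the node count and the
# micro batch size from the command line instead of editing 126m.yaml.
cd $TARGET_PATH/launcher_scripts
python3 main.py \
    training=gpt3/126m \
    "stages=[training]" \
    training.trainer.num_nodes=2 \
    training.model.micro_batch_size=1
```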
This section assumes that you went through the previous sections and 1/ retrieved and built the AWS-optimized NemoMegatron container, 2/ set up the NemoMegatron environment, and 3/ downloaded the vocabularies. The actions are almost the same as for 5. Pre-training GPT3, so let's do it.
- Download the llama2 tokenizer:
mkdir -p $TARGET_PATH/data/llama2
curl -L https://github.com/microsoft/Llama-2-Onnx/raw/main/tokenizer.model > $TARGET_PATH/data/llama2/tokenizer.model
- Source the NemoMegatron environment created earlier.
source ${TARGET_PATH}/.venv/bin/activate
- To pre-train llama2-7b on two instances with the mock dataset, run the commands below:
cd $TARGET_PATH
$TEST_CASE_PATH/5.bmk-pretrain-llama-7b.sh
- The next steps are exactly the same as for 5. Pre-training GPT3; the only difference is that the result directory is `$TARGET_PATH/results/llama2_7b`.
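For example, once the llama2 job finishes, you can inspect its logs the same way as for GPT3 (the glob below relies on the `log-nemo-megatron-*` naming pattern described earlier):

```bash
# List the llama2 results and tail the stderr log to look for the
# "`Trainer.fit` stopped" message.
ls $TARGET_PATH/results/llama2_7b/
tail -5 $TARGET_PATH/results/llama2_7b/log-nemo-megatron-*.err
```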
- Nvidia NemoMegatron Documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/megatron.html
- Train Large Scale NLP with Nemo Megatron from Nvidia: https://docs.nvidia.com/launchpad/ai/base-command-nemo/latest/index.html
- [A] Verdi March - marcverd@
- [R] Pierre-Yves Aquilanti - pierreya@