Unable to start reward-model-and-critic-server #109

pratikkumar018 · 2024-02-20T17:11:52Z

Hi Team,
I am trying to run PPO, following this user guide: https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst

I have written code similar to the example here: https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst#launching-the-reward-model-and-critic-server
But I am getting the error below. Can you please help?

tensorstore version: tensorstore==0.1.45

gshennvm · 2024-02-21T19:40:00Z

Maintainer

Hello! When we run the critic, we load the checkpoint with strict=True, which requires the RM head to exist when launching the server.

Can you give me more details on the checkpoint you're trying to start the critic server with? If you have a .nemo file, you can untar it and look in the model_weights folder. Do you see model.rm_head.* inside the model_weights folder?
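For anyone who prefers to check this without unpacking the whole archive, here is a minimal sketch of the same inspection in Python. It assumes the .nemo file is a tar archive (as the "untar" advice above implies) and that RM head entries appear as path components starting with `model.rm_head.` under `model_weights`. The helper name `find_rm_head_entries` is made up for illustration and is not part of NeMo or NeMo-Aligner.

```python
import tarfile
from pathlib import PurePosixPath


def find_rm_head_entries(nemo_path):
    """List archive members under model_weights/ whose path has a model.rm_head.* component.

    Hypothetical helper for illustration; not a NeMo API.
    """
    matches = []
    # "r:*" lets tarfile auto-detect whether the archive is compressed.
    with tarfile.open(nemo_path, "r:*") as archive:
        for name in archive.getnames():
            parts = PurePosixPath(name).parts
            if "model_weights" not in parts:
                continue
            below_weights = parts[parts.index("model_weights") + 1:]
            if any(part.startswith("model.rm_head.") for part in below_weights):
                matches.append(name)
    return matches


# Example call (path taken from this thread; adjust to your setup):
# entries = find_rm_head_entries("/workspace/Llama2-13B-RLHF-RM.nemo")
# print("RM head found" if entries else "no model.rm_head.* entries")
```

An empty result would suggest the checkpoint has no RM head, which is the situation the strict=True load rejects.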

9 replies

pratikkumar018 Feb 27, 2024
Author

Sure, I will try this. Thanks a lot.
Meanwhile, if you can share code for converting a Hugging Face reward model to a .nemo model, that would be great. We have some Hugging Face-based reward models that we want to use in the PPO pipeline.

odelalleau Feb 27, 2024
Maintainer

I created an issue (#115) to keep track of this feature request.

pratikkumar018 Feb 28, 2024
Author

I am still unable to run with the given .nemo file (NV-Llama2-13B-RLHF-RM/Llama2-13B-RLHF-RM.nemo).
I am running this command: https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/rlhf.html#launching-the-reward-model-and-critic-server
I am getting the error below. Is there anything I can do to fix this?

I am using docker image.

odelalleau Feb 29, 2024
Maintainer

> I am using docker image.

Which docker image exactly?

pratikkumar018 Mar 1, 2024
Author

I am using the nemofw-training:24.01 container.

pip freeze lists the following packages:

accelerate==0.27.2
addict==2.4.0
aiohttp @ file:///rapids/aiohttp-3.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=81b77f868814346662c96ab36b875d7814ebf82340d3284a31681085c051320f
aiosignal @ file:///rapids/aiosignal-1.3.1-py3-none-any.whl#sha256=f8376fb07dd1e86a584e4fcdec80b36b7f81aac666ebc724e2c090300dd83b17
albumentations==1.3.1
aniso8601==9.0.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
apex @ file:///opt/apex
appdirs==1.4.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
asciitree==0.3.3
asttokens==2.4.1
astunparse==1.6.3
async-timeout @ file:///rapids/async_timeout-4.0.3-py3-none-any.whl#sha256=7405140ff1230c310e51dc27b3145b9092d659ce68ff733fb0cefe3ee42be028
attrs==23.2.0
audioread==3.0.1
awscli==1.32.42
basicsr==1.4.2
beautifulsoup4==4.12.3
best-download==0.1.2
black==20.8b1
bleach==6.1.0
blinker==1.7.0
blis==0.7.11
bokeh==3.3.4
boto3==1.34.42
botocore==1.34.42
braceexpand==0.1.7
Brotli==1.1.0
cachetools==5.3.2
catalogue==2.0.10
certifi==2023.11.17
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
clip @ git+https://github.com/openai/CLIP.git@a1d071733d7111c9c014f024669f959182114e33
cloudpathlib==0.16.0
cloudpickle @ file:///rapids/cloudpickle-3.0.0-py3-none-any.whl#sha256=246ee7d0c295602a036e86369c77fecda4ab17b506496730f2f576d9016fd9c7
cmake==3.28.1
colorama==0.4.4
comm==0.2.1
comment-parser==1.2.4
confection==0.1.4
contourpy==1.2.0
cubinlinker @ file:///rapids/cubinlinker-0.3.0%2B2.g405ac64-cp310-cp310-linux_x86_64.whl#sha256=fe3ba53922377d7656ef45cb5aa61ac10fc4f44635f94d261cb01dbc2ed6b6c2
cuda-python @ file:///rapids/cuda_python-12.3.0rc4%2B9.gdb8c48a.dirty-cp310-cp310-linux_x86_64.whl#sha256=40ec85ddb721b09a0af7bb545af238feabd8ac4c610756e89d43891a34b3ad62
cudf @ file:///rapids/cudf-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=9bf23765b34ef0a453e5caf63be526efbaf338f1dc6339cdeb4ea74404c81254
cudf-cu12==23.10.2
cugraph @ file:///rapids/cugraph-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=18c29a3c7c96ac6bb3e86c149667f15ced14c6cb812b008fd1ca4f6cd92c95a2
cugraph-cu12==23.10.0
cugraph-dgl @ file:///rapids/cugraph_dgl-23.12.0-py3-none-any.whl#sha256=ecc4e14a1b586ff6054829a94b54596111ca9e0514e8ad157a99b59e5408e28d
cugraph-service-client @ file:///rapids/cugraph_service_client-23.12.0-py3-none-any.whl#sha256=decbbd260b254d397887af5b10cc21c55b845b9776f96da9fd587ae872362728
cugraph-service-server @ file:///rapids/cugraph_service_server-23.12.0-py3-none-any.whl#sha256=9e52401f6e5acd4d5c85f502cc763c60cb80a175d171b13392bec6c6d75ecd82
cuml @ file:///rapids/cuml-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=0e7e87f320bd91705df559dd383279317a5a88fb18f5c58b54972d27882d9e1b
cupy-cuda12x @ file:///rapids/cupy_cuda12x-12.3.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=32d0e03789ef3f02f0c098818e957c235b75c1636e9e0036299480db0c423dcd
cycler==0.12.1
cymem==2.0.8
Cython==3.0.8
dask==2023.9.2
dask-cuda==23.10.0
dask-cudf @ file:///rapids/dask_cudf-23.12.0-py3-none-any.whl#sha256=56d03008fee5660f479e59436f1ab54e36c75bd214e65f31c49a3c6fad7d83d7
dask-cudf-cu12==23.10.2
dask-mpi==2022.4.0
dataclasses==0.6
DataProperty==1.0.1
datasets==2.17.0
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
diffusers==0.19.3
dill==0.3.8
distributed==2023.9.2
dm-tree==0.1.8
docker-pycreds==0.4.0
docopt==0.6.2
docutils==0.16
editdistance==0.8.1
einops==0.7.0
einops-exts==0.0.4
exceptiongroup==1.2.0
execnet==2.0.2
executing==2.0.1
ExifRead-nocycle==3.0.1
expecttest==0.1.3
faiss-cpu==1.7.4
fasteners==0.19
fastjsonschema==2.19.1
fastrlock @ file:///rapids/fastrlock-0.8.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl#sha256=08315bde19d0c2e6b06593d5a418be3dc8f9b1ee721afa96867b9853fceb45cf
fasttext==0.9.2
filelock==3.13.1
fire==0.5.0
flash-attn==2.4.2
Flask==3.0.2
Flask-RESTful==0.3.10
flatbuffers==23.5.26
fonttools==4.47.2
freqencoder @ file:///tmp/stable-dreamfusion/freqencoder
frozenlist @ file:///rapids/frozenlist-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=a9b2de4cf0cdd5bd2dee4c4f63a653c61d2408055ab77b151c1957f221cabf2a
fsspec==2023.10.0
ftfy==6.1.1
future==0.18.3
gast==0.5.4
gdown==5.1.0
gevent==24.2.1
geventhttpclient==2.0.2
gitdb==4.0.11
GitPython==3.1.41
google-auth==2.26.2
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
graphsurgeon @ file:///workspace/TensorRT-8.6.1.6/graphsurgeon/graphsurgeon-0.4.6-py2.py3-none-any.whl#sha256=0fbadaefbbe6e9920b9f814ae961c4a279be602812edf3ed7fb9cc6f8f4809fe
greenlet==3.0.3
gridencoder @ file:///tmp/stable-dreamfusion/gridencoder
grpcio==1.60.0
h5py==3.10.0
huggingface-hub==0.20.3
hydra-core==1.2.0
hypothesis==5.35.1
idna==3.6
ijson==3.2.3
imageio==2.34.0
img2dataset==1.45.0
importlib-metadata @ file:///rapids/importlib_metadata-7.0.1-py3-none-any.whl#sha256=4805911c3a4ec7c3966410053e9ec6a1fecd629117df5adee56dfc9432a1081e
in-place==0.5.0
inflect==7.0.0
iniconfig==2.0.0
install==1.3.5
intel-openmp==2021.4.0
ipadic==1.0.0
ipykernel==6.29.0
ipython==8.20.0
ipython-genutils==0.2.0
itsdangerous==2.1.2
jedi==0.19.1
jieba==0.42.1
Jinja2==3.1.3
jiwer==3.0.3
jmespath==1.0.1
joblib==1.3.2
json5==0.9.14
jsonlines==2.0.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter-tensorboard @ git+https://github.com/cliffwoolley/jupyter_tensorboard.git@ffa7e26138b82549453306e06b535a9ac36db17a
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyterlab==2.3.2
jupyterlab-server==1.2.0
jupyterlab_pygments==0.3.0
jupytext==1.16.1
jusText==3.0.0
keras-nightly==3.0.4.dev2024021403
kiwisolver==1.4.5
kornia==0.6.0
langcodes==3.3.0
lazy_loader==0.3
libclang==16.0.6
librosa==0.10.1
lightning-utilities==0.10.1
llvmlite @ file:///rapids/llvmlite-0.40.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=bbd5e82cc990e5a3e343a3bf855c26fdfe3bfae55225f00efd01c05bbda79918
lm-dataformat==0.0.19
lmdb==1.4.1
locket @ file:///rapids/locket-1.0.0-py2.py3-none-any.whl#sha256=b6c819a722f7b6bd955b80781788e4a66a55628b858d347536b7e81325a3a5e3
lxml==5.1.0
lz4==4.3.3
Markdown==3.5.2
markdown-it-py==3.0.0
markdown2==2.4.12
MarkupSafe==2.1.4
matplotlib==3.8.2
matplotlib-inline==0.1.6
mbstrdecoder==1.1.3
mdit-py-plugins==0.4.0
mdurl==0.1.2
## !! Could not determine repository location
-e /opt/megatron-lm
mistune==3.0.2
mkl==2021.1.1
mkl-devel==2021.1.1
mkl-include==2021.1.1
ml-dtypes==0.3.2
mock==4.0.3
mpi4py==3.1.5
mpmath==1.3.0
msgfy==0.2.1
msgpack==1.0.7
multidict @ file:///rapids/multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=36c63aaa167f6c6b04ef2c85704e93af16c11d20de1d133e39de6a0e84582a93
multiprocess==0.70.16
murmurhash==1.0.10
mwparserfromhell @ git+https://github.com/earwig/mwparserfromhell.git@0f89f4426bdd9e184ae8c8223672a7a0bf36eb76
mypy-extensions==1.0.0
namex==0.0.7
nbclient==0.9.0
nbconvert==7.14.2
nbformat==5.9.2
## !! Could not determine repository location
-e /opt/nemo-data-curator
-e git+https://github.com/NVIDIA/NeMo-Aligner.git@e5b3ad30a350ca10fadf80770aca00561d1c7f12#egg=nemo_aligner
-e git+https://github.com/NVIDIA/NeMo.git@98186c2e2746139aec71fbc4cbc0b3cd24e03e8b#egg=nemo_toolkit
nerfacc==0.5.3
nest-asyncio==1.5.9
networkx==3.2.1
ninja==1.11.1.1
nltk==3.8.1
notebook==6.4.10
numba @ file:///rapids/numba-0.57.1%2B1.g1ff679645-cp310-cp310-linux_x86_64.whl#sha256=182b77614c983c4c32db619d849a68ed4c33637e307ebb1a2731a3ae730ae36c
numcodecs==0.12.1
numexpr==2.9.0
numpy==1.24.4
nvdiffrast @ git+https://github.com/NVlabs/nvdiffrast.git@c5caf7bdb8a2448acc491a9faa47753972edd380
nvfuser==0.1.1+gitunknown
nvidia-ammo @ file:///opt/ammo/dist/nvidia_ammo-0.0.0-cp310-cp310-linux_x86_64.whl#sha256=807df7f48b7c433d69bd41338c7155ce7bc9a7c132114275b49c0e1a21b7fd03
nvidia-dali-cuda120==1.33.0
nvidia-pyindex==1.0.9
nvidia-pytriton==0.4.1
nvtx @ file:///rapids/nvtx-0.2.5-cp310-cp310-linux_x86_64.whl#sha256=939c7322e7cd4f34af85cdf6468b3d80b1e144a34bbcd61e08e5c436071d3e1f
oauthlib==3.2.2
omegaconf==2.2.3
onnx @ file:///opt/pytorch/pytorch/third_party/onnx
open-clip-torch==2.24.0
OpenCC==1.1.6
opencv @ file:///opencv-4.7.0/modules/python/package
opencv-python==4.9.0.80
opencv-python-headless==4.9.0.80
opt-einsum==3.3.0
optree==0.10.0
packaging==23.2
pandas @ file:///rapids/pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=7a0a56cef15fd1586726dace5616db75ebcfec9179a3a55e78f72c5639fa2a23
pandocfilters==1.5.1
pangu==4.0.6.1
parso==0.8.3
partd @ file:///rapids/partd-1.4.1-py3-none-any.whl#sha256=27e766663d36c161e2827aa3e28541c992f0b9527d3cca047e13fb3acdb989e6
pathlib_abc==0.1.1
pathspec==0.12.1
pathtools==0.1.2
pathvalidate==2.5.2
pathy==0.11.0
pexpect==4.9.0
phonenumbers==8.13.30
Pillow==9.3.0
platformdirs==4.1.0
pluggy==1.3.0
ply @ file:///rapids/ply-3.11-py2.py3-none-any.whl#sha256=096f9b8350b65ebd2fd1346b12452efe5b9607f7482813ffca50c22722a807ce
polygraphy==0.49.1
pooch==1.8.0
portalocker==2.8.2
prefetch-generator==1.0.3
preshed==3.0.9
presidio-analyzer==2.2.351
presidio-anonymizer==2.2.351
prettytable==3.9.0
probableparsing==0.0.1
prometheus-client==0.19.0
prompt-toolkit==3.0.43
protobuf==4.24.4
psutil @ file:///rapids/psutil-5.9.4-cp310-abi3-linux_x86_64.whl#sha256=f1cb87a01694756b49d74098db4073e7b50588d5c41c47485d677ef2bf07f132
ptxcompiler @ file:///rapids/ptxcompiler-0.8.1%2B2.g0d406d6-cp310-cp310-linux_x86_64.whl#sha256=4d53fe48aa72600d059e402fd468f51b14301b11cbbedd6740637bec4add0944
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
pyarrow==12.0.1
pyarrow-hotfix==0.6
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybind11==2.8.0
pybind11-global==2.11.1
pycld2==0.41
pycocotools @ git+https://github.com/nvidia/cocoapi.git@d99cbf3823588ef09a2721655f46e509ebafb3d7#subdirectory=PythonAPI
pycountry==20.7.3
pycparser==2.21
pycryptodome==3.20.0
pydantic==1.8.2
pydantic_core==2.14.6
Pygments==2.17.2
pylibcugraph @ file:///rapids/pylibcugraph-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=07ba411e9cffd1dac341a42d8ed2962fcee94a5219fdd602fa122d73dee4aaaf
pylibcugraph-cu12==23.10.0
pylibcugraphops @ file:///rapids/pylibcugraphops-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=60670e596324588a01fb670e030293f06dc5cf7f8d6006e910b8e00df564d683
pylibraft @ file:///rapids/pylibraft-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=fbcfaa07a175dd0fdd7b65011dc72cbb6b88aaddc156843250ccb7d1c181916a
pylibraft-cu12==23.10.0
PyMCubes==0.1.4
pymeshlab==2023.12.post1
pynvml @ file:///rapids/pynvml-11.4.1-py3-none-any.whl#sha256=d27be542cd9d06558de18e2deffc8022ccd7355bc7382255d477038e7e424c6c
pyparsing==3.1.1
PySocks==1.7.1
pytablewriter==0.58.0
pytest==6.2.5
pytest-flakefinder==1.1.0
pytest-rerunfailures==13.0
pytest-shard==0.1.2
pytest-xdist==3.5.0
python-crfsuite==0.9.10
python-dateutil==2.8.2
python-hostlist==1.23.0
python-magic==0.4.24
python-rapidjson==1.14
pytorch-lightning==2.0.7
pytorch-quantization==2.1.2
pytz @ file:///rapids/pytz-2023.3.post1-py2.py3-none-any.whl#sha256=ce42d816b81b68506614c11e8937d3aa9e41007ceb50bfdcb0749b921bf646c7
PyYAML==6.0.1
pyzmq==23.2.1
qudida==0.0.4
raft-dask @ file:///rapids/raft_dask-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=d632376e71ac9cfca5eacc7f8aa51e0f096e7a1f56c186a1653e097ea990cfe9
raft-dask-cu12==23.10.0
rapidfuzz==3.6.1
rapids-dask-dependency @ file:///rapids/rapids_dask_dependency-23.12.1-py3-none-any.whl#sha256=2abfe15415711bad9dfe9e83d4bfbd039e9436d66cc17e74ae22c85ab9afe46b
raymarching @ file:///tmp/stable-dreamfusion/raymarching
redis==4.3.4
referencing==0.32.1
regex==2023.12.25
requests==2.31.0
requests-file==2.0.0
requests-oauthlib==1.3.1
rich @ file:///rapids/rich-13.7.0-py3-none-any.whl#sha256=6da14c108c4866ee9520bbffa71f6fe3962e193b7da68720583850cd4548e235
rmm @ file:///rapids/rmm-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=d59676daa42bcdd9d3b47d8aa96ea43d15c4120c005e6f7d8a2cbfa4a1e2d840
rmm-cu12==23.10.0
rouge-score==0.1.2
rpds-py==0.17.1
rsa==4.7.2
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
s3transfer==0.10.0
sacrebleu==1.5.0
sacremoses==0.1.1
safetensors==0.4.2
scikit-image==0.22.0
scikit-learn @ file:///rapids/scikit_learn-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=184a42842a4e698ffa4d849b6019de50a77a0aa24d26afa28fa49c9190bb144b
scipy @ file:///rapids/scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=5e32847e08da8d895ce09d108a494d9eb78974cf6de23063f93306a3e419960c
seaborn==0.13.2
Send2Trash==1.8.2
sentence-transformers==2.3.1
sentencepiece==0.1.99
sentry-sdk==1.40.4
setproctitle==1.3.3
sh==1.14.3
shellingham==1.5.4
shencoder @ file:///tmp/stable-dreamfusion/shencoder
six==1.16.0
smart-open==6.4.0
smmap==5.0.1
sortedcontainers==2.4.0
soundfile==0.12.1
soupsieve==2.5
soxr==0.3.7
spacy==3.1.3
spacy-legacy==3.0.12
spacy-loggers==1.0.5
sphinx-glpi-theme==0.5
sqlitedict==1.6.0
srsly==2.4.8
stack-data==0.6.3
sympy==1.12
tabledata==1.3.3
tabulate==0.9.0
-e git+https://github.com/CompVis/taming-transformers.git@3ba01b241669f5ade541ce990f7650a3b8f65318#egg=taming_transformers
tb-nightly==2.16.0a20240212
tbb==2021.11.0
tblib @ file:///rapids/tblib-3.0.0-py3-none-any.whl#sha256=80a6c77e59b55e83911e1e607c649836a69c103963c5f28a46cbeef44acf8129
tcolorpy==0.1.4
tensorboard==2.9.0
tensorboard-data-server==0.7.2
tensorboard-plugin-wit==1.8.1
tensorflow-io-gcs-filesystem==0.36.0
tensorrt @ file:///workspace/TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp310-none-linux_x86_64.whl#sha256=2684b4772cb16088184266728a0668f5dac14e66f088c4ccff2096ccb222d74c
tensorstore==0.1.45
termcolor==2.4.0
terminado==0.18.0
text-unidecode==1.3
tf-nightly==2.17.0.dev20240214
tf_keras-nightly==2.17.0.dev2024021422
thinc==8.0.17
threadpoolctl==3.2.0
thriftpy2 @ file:///rapids/thriftpy2-0.4.17-cp310-cp310-linux_x86_64.whl#sha256=9e3633fc2abf0a2be59f6e4cd2a1dfac1b1daf3b1950383476fc6d6de6efcd03
tifffile==2024.2.12
timm==0.9.12
tinycss2==1.2.1
tinycudann @ git+https://github.com/NVlabs/tiny-cuda-nn@6f018a9cd1b369bcb247e1d539968db8e48b2b3f#subdirectory=bindings/torch
tldextract==5.1.1
tokenizers==0.15.2
toml==0.10.2
tomli==2.0.1
toolz @ file:///rapids/toolz-0.12.1-py3-none-any.whl#sha256=d22731364c07d72eea0a0ad45bafb2c2937ab6fd38a3507bf55eae8744aa7d85
torch @ file:///tmp/pip/torch-2.2.0a0%2B81ea7a4-cp310-cp310-linux_x86_64.whl#sha256=1bd2b01e3a5798dea576e08b54a87eea1b1083bb4403ace296ffb3737a91e2dc
torch-ema==0.3
torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/dist/torch_tensorrt-2.2.0a0-cp310-cp310-linux_x86_64.whl#sha256=641f4b764d14dd478acc1490f9626495f53bc36ed15ea37c19b77edb4f5f4691
torchdata @ file:///opt/pytorch/data
torchdiffeq==0.2.3
torchmetrics==0.9.1
torchprofile==0.0.4
torchsde==0.2.6
torchtext @ file:///opt/pytorch/text
torchvision @ file:///opt/pytorch/vision
tornado==6.4
tqdm==4.62.3
tqdm-multiprocess==0.0.11
traitlets==5.9.0
trampoline==0.1.2
transformer-engine @ file:///opt/TransformerEngine
transformers==4.37.2
treelite @ file:///rapids/treelite-3.9.1-cp310-cp310-linux_x86_64.whl#sha256=ad238ce625336335bf51b9fd4b3c64b42a1bfc743d17f6077ec5dc7c96644511
treelite-runtime @ file:///rapids/treelite_runtime-3.9.1-cp310-cp310-linux_x86_64.whl#sha256=1379f600b91df775aa24ea255f5e31ca47788f76ae14b73f46b4b8b0e4728a33
trimesh==4.1.3
triton @ file:///tmp/dist/triton-2.1.0%2B6e4932c-cp310-cp310-linux_x86_64.whl#sha256=ad0816d9c3d9e5cbd84372fe21040527c2a347938e30506e0c6e08a5a4212f3b
tritonclient==2.42.0
typed-ast==1.5.5
typepy==1.3.2
typer==0.4.2
types-dataclasses==0.6.6
typing-inspect==0.6.0
typing_extensions==4.9.0
ucx-py @ file:///rapids/ucx_py-0.35.0-cp310-cp310-linux_x86_64.whl#sha256=c193b737773989d184121dbfab320c888df6a60879f15cd885a8a3274a610273
ucx-py-cu12==0.34.0
uff @ file:///workspace/TensorRT-8.6.1.6/uff/uff-0.6.9-py2.py3-none-any.whl#sha256=618a3f812d491f0d3c4f2e38b99e03217ca37b206db14cee079f2bf681eb4fe3
ujson==5.9.0
unidic-lite==1.0.8
urllib3==2.2.0
usaddress==0.5.10
wandb==0.15.3
warcio==1.7.4
wasabi==0.10.1
wcwidth==0.2.13
weasel==0.3.4
webdataset==0.2.48
webencodings==0.5.1
Werkzeug==3.0.1
wget==3.2
wrapt==1.16.0
xatlas==0.0.9
xdoctest==1.0.2
xgboost @ file:///rapids/xgboost-1.7.6-cp310-cp310-linux_x86_64.whl#sha256=275613a32b6ef56d0fda43f1ad847afd9e5c8eb58a85208b1cb2871ea2286088
xxhash==3.4.1
xyzservices==2023.10.1
yapf==0.40.2
yarl @ file:///rapids/yarl-1.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=357495293086c5b6d34ca9616a43d329317feab7917518bc97a08f9e55648455
youtokentome==1.0.6
zarr==2.17.0
zict @ file:///rapids/zict-3.0.0-py2.py3-none-any.whl#sha256=5796e36bd0e0cc8cf0fbc1ace6a68912611c1dbd74750a3f3026b9b9d6a327ae
zipp @ file:///rapids/zipp-3.17.0-py3-none-any.whl#sha256=0e923e726174922dce09c53c59ad483ff7bbb8e572e00c7f7c46b88556409f31
zope.event==5.0
zope.interface==6.1
zstandard==0.17.0

gshennvm · 2024-03-04T22:00:26Z

Maintainer

I'm unable to repro your bug.

I used the nvcr.io/nvidia/nemo:24.01.framework container, and downloaded our RM using wget https://huggingface.co/nvidia/NV-Llama2-13B-RLHF-RM/resolve/main/Llama2-13B-RLHF-RM.nemo?download=true

and then I ran this script

#!/bin/bash
CHECKPOINT_NEMO_FILE="/rlhf/llama2_rlhf_rm.nemo"
GPFS="/aligner"

export NVTE_FLASH_ATTN=0
export NVTE_FUSED_ATTN=0
export NVTE_MASKED_SOFTMAX_FUSION=0
export NVTE_APPLY_QK_LAYER_SCALING=1

RESULTS_DIR="/rlhf/critic_results"

export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u ${GPFS}/examples/nlp/gpt/serve_ppo_critic.py \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=1 \
    ++model.pipeline_model_parallel_size=1 \
    ++model.virtual_pipeline_model_parallel_size=1 \
    exp_manager.explicit_log_dir=${RESULTS_DIR} \
    ++pretrained_checkpoint.restore_from_path=${CHECKPOINT_NEMO_FILE} \
    ++model.megatron_amp_O2=True \
    ++model.activations_checkpoint_granularity=null \
    ++trainer.ppo.combine_rm_and_critic_server=True \
    ++model.offload_adam_states=True \
    ++model.global_batch_size=2 \
    ++model.forward_mbs=1 \
    ++model.micro_batch_size=1 \
    trainer.precision=bf16 \
    ++model.mcore_gpt=True

Can you give these exact instructions a try and see if you hit the same error? Also, let me know what pip show megatron-core outputs.
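If it is more convenient to report the version from inside the same Python environment that runs the server, the standard library can read the installed package metadata. This is only a sketch: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `megatron-core`, as in the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution name assumed from the "pip show megatron-core" request above.
print("megatron-core:", installed_version("megatron-core"))
```

A result of None would mean the package is not visible to that interpreter, which is itself worth mentioning when reporting back.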

8 replies

pratikkumar018 Mar 7, 2024
Author

Yes, my md5sum is 8794fb021391067217cf0179a90eb09a.
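For reference, the same checksum can be computed in Python, which is handy when `md5sum` is not available in the environment. This is a small sketch using only the standard library. The function name `md5_of_file` is illustrative, and the file is read in chunks so that a multi-gigabyte .nemo checkpoint does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (path taken from this thread; adjust to your setup):
# print(md5_of_file("/workspace/Llama2-13B-RLHF-RM.nemo"))
```

The printed value can then be compared with the checksum quoted in this thread.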

pratikkumar018 Mar 11, 2024
Author

Hi @gshennvm,
I was able to run this script after I pulled the docker image again.
My hardware is 8 x 80 GB A100 GPUs (1 node only). Is that sufficient to train Llama 13B using the 13B NeMo reward model?
I am using these two scripts:

Critic server
#!/bin/bash

CHECKPOINT_NEMO_FILE="/workspace/Llama2-13B-RLHF-RM.nemo"
GPFS="/opt/NeMo-Aligner"
export CUDA_VISIBLE_DEVICES=0,1,2,3
RESULTS_DIR="critic_results_dir"
CRITIC_PORT=5567
export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u ${GPFS}/examples/nlp/gpt/serve_ppo_critic.py \
    trainer.devices=4 \
    trainer.num_nodes=1 \
    ++model.tensor_model_parallel_size=1 \
    ++model.pipeline_model_parallel_size=1 \
    exp_manager.create_wandb_logger=False \
    exp_manager.explicit_log_dir=${RESULTS_DIR} \
    trainer.ppo.inference_micro_batch_size=1 \
    trainer.ppo.port=${CRITIC_PORT} \
    ++pretrained_checkpoint.restore_from_path=${CHECKPOINT_NEMO_FILE} \
    ++model.micro_batch_size=1 \
    ++model.megatron_amp_O2=True \
    ++model.activations_checkpoint_granularity=null \
    ++trainer.ppo.combine_rm_and_critic_server=True \
    ++model.offload_adam_states=True \
    ++model.mcore_gpt=True

Actor server
GPFS="/opt/NeMo-Aligner"
TRAIN_DATA_PATH="/workspace/ppo_exp/train_prompts.jsonl"
VALID_DATA_PATH="/workspace/ppo_exp/test_prompts.jsonl"

PRETRAINED_ACTOR_NEMO_FILE="/workspace/Llama-2-13b-chat.nemo"
CRITIC_PORT=5567
RESULTS_DIR="actor_results_dir"
export CUDA_VISIBLE_DEVICES=4,5,6,7
export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u ${GPFS}/examples/nlp/gpt/train_gpt_ppo_actor.py \
    "++model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${VALID_DATA_PATH}]}" \
    ++model.data.data_impl=jsonl \
    pretrained_checkpoint.restore_from_path=${PRETRAINED_ACTOR_NEMO_FILE} \
    trainer.num_nodes=1 \
    trainer.devices=4 \
    ++model.pipeline_model_parallel_size=1 \
    ++model.tensor_model_parallel_size=1 \
    ++model.ppo.combine_rm_and_critic_server=True \
    ++model.ppo.offload_adam_states=True \
    ++model.megatron_amp_O2=True \
    ++trainer.ppo.normalize_advantages=True \
    ++model.mcore_gpt=True \
    exp_manager.create_wandb_logger=False \
    exp_manager.wandb_logger_kwargs.name=ppo_actor_training \
    exp_manager.wandb_logger_kwargs.project=nemo_aligner_ppo \
    exp_manager.explicit_log_dir=/workspace/ppo_exp/rlhf/actor_test \
    ++model.ppo.entropy_bonus=0.0 \
    remote_critic_rm.critic.port=${CRITIC_PORT} \
    remote_critic_rm.pad_to_length=2048

I am getting an out-of-memory error with this configuration. Can you suggest a minimal configuration that should be able to run on this hardware?

pratikkumar018 Mar 11, 2024
Author

If it helps, this is the point where I am getting the OOM:

Error occurred during calling model callable: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pytriton/proxy/inference_handler.py", line 165, in run
    responses = self._model_callable(inputs)
  File "/usr/local/lib/python3.10/dist-packages/pytriton/decorators.py", line 171, in sample
    outputs = wrapped(*args[1:], **kwargs)
  File "/opt/NeMo-Aligner/nemo_aligner/utils/server_utils.py", line 54, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/critic_server_trainer.py", line 153, in server_train
    loss_mean = self.run_training(**batch)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/critic_server_trainer.py", line 270, in run_training
    self.optimizer.zero_grad()
  File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 315, in zero_grad
    super().zero_grad(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 1403, in zero_grad
    self._init_grad_buffer()
  File "/usr/local/lib/python3.10/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 1043, in _init_grad_buffer
    self._grad_buffers[dtypes] = torch.zeros(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.05 GiB. GPU 0 has a total capacity of 79.19 GiB of which 15.48 GiB is free. Process 125346 has 62.90 GiB memory in use. Process 132687 has 478.00 MiB memory in use. Of the allocated memory 59.62 GiB is allocated by PyTorch, and 58.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

pratikkumar018 Mar 11, 2024
Author

Error executing job with overrides: ['++model.data.data_prefix={train: [/workspace/ppo_exp/train_prompts.jsonl], validation: [/workspace/ppo_exp/test_prompts.jsonl], test: [/workspace/ppo_exp/test_prompts.jsonl]}', '++model.data.data_impl=jsonl', 'pretrained_checkpoint.restore_from_path=/workspace/Llama-2-13b-chat.nemo', 'trainer.num_nodes=1', 'trainer.devices=4', '++model.pipeline_model_parallel_size=1', '++model.tensor_model_parallel_size=1', '++model.ppo.combine_rm_and_critic_server=True', '++model.ppo.offload_adam_states=True', '++model.megatron_amp_O2=True', '++trainer.ppo.normalize_advantages=True', '++model.mcore_gpt=True', 'exp_manager.create_wandb_logger=False', 'exp_manager.wandb_logger_kwargs.name=ppo_actor_training', 'exp_manager.wandb_logger_kwargs.project=nemo_aligner_ppo', 'exp_manager.explicit_log_dir=/workspace/ppo_exp/rlhf/actor_test', '++model.ppo.entropy_bonus=0.0', 'remote_critic_rm.critic.port=5567', 'remote_critic_rm.pad_to_length=2048']
Traceback (most recent call last):
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_ppo_actor.py", line 179, in <module>
    main()
  File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_ppo_actor.py", line 175, in main
    ppo_trainer.fit()
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/ppo.py", line 422, in fit
    self.run_training(rollout_dataloader_iter)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/ppo.py", line 317, in run_training
    self.optimizer.zero_grad()
  File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 315, in zero_grad
    super().zero_grad(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 1403, in zero_grad
    self._init_grad_buffer()
  File "/usr/local/lib/python3.10/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 1043, in _init_grad_buffer
    self._grad_buffers[dtypes] = torch.zeros(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.05 GiB. GPU 2 has a total capacity of 79.19 GiB of which 249.62 MiB is free. Process 140802 has 78.69 GiB memory in use. Of the allocated memory 74.56 GiB is allocated by PyTorch, and 246.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Error Error occurred during inference request. Message: Failed to process the request(s) for model instance 'critic_train_0_0', message: TritonModelException: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pytriton/proxy/inference_handler.py", line 165, in run
    responses = self._model_callable(inputs)
  File "/usr/local/lib/python3.10/dist-packages/pytriton/decorators.py", line 171, in sample
    outputs = wrapped(*args[1:], **kwargs)
  File "/opt/NeMo-Aligner/nemo_aligner/utils/server_utils.py", line 54, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/critic_server_trainer.py", line 153, in server_train
    loss_mean = self.run_training(**batch)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/critic_server_trainer.py", line 270, in run_training
    self.optimizer.zero_grad()
  File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 315, in zero_grad
    super().zero_grad(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 1403, in zero_grad
    self._init_grad_buffer()
  File "/usr/local/lib/python3.10/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 1043, in _init_grad_buffer
    self._grad_buffers[dtypes] = torch.zeros(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.05 GiB. GPU 0 has a total capacity of 79.19 GiB of which 15.48 GiB is free. Process 125346 has 62.90 GiB memory in use. Process 132687 has 478.00 MiB memory in use. Of the allocated memory 59.62 GiB is allocated by PyTorch, and 58.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

odelalleau Mar 22, 2024
Maintainer

You should try with tensor_model_parallel_size=4 to reduce memory usage.
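To give a rough feel for why this helps, here is a back-of-the-envelope sketch. It assumes the buffer that failed to allocate (56.05 GiB in the traceback above, with tensor_model_parallel_size=1) shrinks in proportion to the tensor-parallel size. That is an idealization: some parameters are replicated across ranks, and activations and other buffers are ignored. Both function names are made up for illustration.

```python
def data_parallel_size(num_devices, tensor_parallel, pipeline_parallel=1):
    """Number of data-parallel replicas left after model parallelism uses its share of devices."""
    model_parallel = tensor_parallel * pipeline_parallel
    if num_devices % model_parallel != 0:
        raise ValueError("num_devices must be divisible by tensor_parallel * pipeline_parallel")
    return num_devices // model_parallel


def approx_sharded_gib(size_at_tp1_gib, tensor_parallel):
    """Idealized per-GPU size of a buffer split evenly across tensor-parallel ranks."""
    return size_at_tp1_gib / tensor_parallel


# "Tried to allocate 56.05 GiB" in the OOM message, observed with tensor parallel size 1.
REPORTED_ALLOC_GIB = 56.05

# Each server in the scripts above uses 4 GPUs (trainer.devices=4).
for tp in (1, 2, 4):
    dp = data_parallel_size(num_devices=4, tensor_parallel=tp)
    per_gpu = approx_sharded_gib(REPORTED_ALLOC_GIB, tp)
    print(f"TP={tp}: data-parallel size {dp}, failing allocation roughly {per_gpu:.1f} GiB per GPU")
```

In the scripts above, this would mean passing `++model.tensor_model_parallel_size=4` instead of `=1` for both the critic and the actor. Note that with 4 GPUs per server this leaves a data-parallel size of 1.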
