
feat: Self-Rewarding Algorithm with TRT Support #321

Open · wants to merge 292 commits into base: main

Conversation

trias702 (Collaborator)

What does this PR do?

Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:

https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594

Changelog

  • Please update CHANGELOG.md under the next version with the high-level changes in this PR.

Usage

Please see the new tutorial document at: docs/user-guide/self_rewarding.rst

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore all model and training states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

  • Related to # (issue)

gshennvm and others added 30 commits April 7, 2024 12:49
Signed-off-by: Gerald Shen <[email protected]>
* trtllm0.9 changes
* fix typos
* address comments
* fixes
* fix
* fix nemo generations with PP
* add engine_unload
* cleanup trtllm
* address comments

Signed-off-by: jiemingz <=>
Co-authored-by: jiemingz <=>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: Gerald Shen <[email protected]>
@odelalleau (Collaborator) left a comment:

Still WIP, but submitting a first batch of comments.

CHANGELOG.md: 1 review thread (resolved)
docs/user-guide/self_rewarding.rst: 1 review thread (resolved)
Collaborator:

Is this file needed for Self-Rewarding? If not, let's move it to a different PR.

trias702 (Collaborator, Author):

It's needed if you want to follow the Self-Rewarding paper exactly to generate the EFT dataset.

Collaborator:

I see, it'd be good to keep it then, but it also needs to be documented so that people understand how to generate this EFT dataset. At a quick glance I don't see it referenced in the self-rewarding doc => could you add it there and explain how to generate an EFT dataset?

examples/nlp/gpt/conf/gpt_self_rewarding.yaml: 7 review threads (resolved)
@odelalleau (Collaborator) left a comment:

Just a couple of minor typos

examples/nlp/gpt/conf/gpt_self_rewarding.yaml: 2 review threads (resolved)
@jgerh (Collaborator) commented on Nov 26, 2024:

I completed the technical edit of CHANGELOG.md and
docs/user-guide/self_rewarding.rst. Please review the edits, make the changes in the files, and mark each open thread "resolved."

@odelalleau (Collaborator) left a comment:

Going to submit my review in chunks so you can start addressing comments right away.

examples/nlp/gpt/conf/gpt_generation.yaml: 5 review threads (resolved)
examples/nlp/gpt/run_generation.py: 5 review threads (resolved)
@odelalleau (Collaborator) left a comment:

Just a few more comments

examples/nlp/gpt/train_gpt_self_rewarding.py: 1 review thread (resolved)
examples/nlp/gpt/train_gpt_spin.py: 2 review threads (resolved)
@odelalleau (Collaborator) left a comment:

Comments on generation

nemo_aligner/algorithms/generation.py: 4 review threads (resolved)
max_input_len=self.cfg.trt_llm.get(
    "max_input_len", self.model.cfg.encoder_seq_length - self.length_params["max_length"]
),
generation_batch_size=dp_batch_size,
Collaborator:

dp_batch_size is based on the global batch size. I'd suggest using micro_batch_size instead, because it's a more natural hyperparameter to tweak when trading off generation speed against memory usage for any DP size.
(And I would remove global_batch_size from the config, overriding it in the code as micro_batch_size * DP.)
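
For illustration, a minimal sketch of that suggestion, assuming Megatron's parallel_state helper for the DP size; the exact config attribute paths (e.g. self.cfg.micro_batch_size) are placeholders rather than the PR's actual names:

from megatron.core import parallel_state

# Sketch only: make micro_batch_size the user-facing knob and derive
# global_batch_size from it, rather than keeping global_batch_size in the config.
dp_size = parallel_state.get_data_parallel_world_size()
self.model.cfg.global_batch_size = self.cfg.micro_batch_size * dp_size

generation_kwargs = dict(
    max_input_len=self.cfg.trt_llm.get(
        "max_input_len", self.model.cfg.encoder_seq_length - self.length_params["max_length"]
    ),
    generation_batch_size=self.cfg.micro_batch_size,  # instead of dp_batch_size
)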

    return  # training ended

global_pbar = tqdm(
    self.augment_dataloader(self.train_dataloader),
Collaborator:

Using augment_dataloader() seems somewhat convoluted; why don't we just iterate over the dataloader (in the for loop below) and run generation on each batch?
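
A rough sketch of that simpler structure; the per-batch helpers here are hypothetical names, not functions from this PR:

from tqdm import tqdm

# Sketch: iterate the dataloader directly and run generation per batch,
# rather than wrapping it in augment_dataloader().
global_pbar = tqdm(self.train_dataloader, initial=self.step, total=self.max_steps, desc="Generation")

for batch in global_pbar:
    generations = self.generate_on_batch(batch)   # hypothetical per-batch generation helper
    self.save_generations(batch, generations)     # hypothetical output-writing helper
    self.step += 1
    if self.step >= self.max_steps:
        break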

nemo_aligner/algorithms/generation.py: 1 review thread (resolved)
Comment on lines +324 to +325
prompt = self.model.tokenizer.ids_to_text(t_[:s_].long().tolist())
response = self.model.tokenizer.ids_to_text(t_[s_:e_].long().tolist())
Collaborator:

Just a note that this is potentially dangerous. Some tokenizers behave in weird ways, and I'm not 100% sure we can always guarantee that decoding a subset of the token IDs recovers the correct text of the response. No need to change it for now (you can resolve this), since my quick tests suggest it should be fine, but IMO a safer approach is to decode the full sequence, ensure it starts with the original prompt (in text form), and keep only what comes after the prompt. Just letting you know in case you run into weird behavior in the future as new fancy tokenizers are introduced...

Also, not a huge deal, but those two lines could be moved under the if v_: below.
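
A sketch of the safer decoding approach described above (decode the full sequence and strip the prompt text), using the same tensors and tokenizer as the lines quoted in this thread:

# Decode the prompt and the full prompt+response, then strip the prompt text,
# instead of decoding the response token IDs in isolation.
prompt = self.model.tokenizer.ids_to_text(t_[:s_].long().tolist())
full_text = self.model.tokenizer.ids_to_text(t_[:e_].long().tolist())

if full_text.startswith(prompt):
    response = full_text[len(prompt):]
else:
    # Fall back to the current behavior if the tokenizer doesn't round-trip cleanly.
    response = self.model.tokenizer.ids_to_text(t_[s_:e_].long().tolist())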

nemo_aligner/algorithms/generation.py: 2 review threads (resolved)
@odelalleau (Collaborator) left a comment:

Finished a pass on the main "self_rewarding.py" script. Most comments are minor, but I believe there are two non-trivial issues:

  1. Fixing the ad-hoc prompt-building mechanism (it hardcodes the Llama and Nemotron templates in the code and doesn't seem to be working fully as expected, especially for multi-turn conversations).
  2. Refactoring some of the code to make it more readable -- right now some of it is extremely hard to follow (I can't pretend I was able to fully understand everything), with the main culprit being the augment_dataloader() function, which is over 500 lines long.

Let's discuss it next week, but I think we should either:

  • Postpone releasing Self-Rewarding to the next release, or
  • Create a new class of "experimental" algorithms (where it would live), where we would put "research-y" code that could be messy / buggy / unoptimized / etc., with less strict test requirements (ex: just one script to test that it runs without crashing)

nemo_aligner/algorithms/self_rewarding.py: 4 review threads (resolved)
Comment on lines +135 to +138
if not exists(result) or result.groups == 0:
    return None

group_one = result.groups(1)[0] if isinstance(result.groups(1), tuple) else result.groups(1)
Collaborator:

A couple of things look weird to me in these lines:

  1. result.groups is a method, so I don't see how it can be equal to 0.
  2. result.groups() always returns a tuple, so the else case should never trigger, right?
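
For reference, a sketch of how that check could be written against re's actual semantics (keeping the PR's exists() helper, which presumably just tests for None):

# result is a re.Match object or None; groups() always returns a tuple, so
# check that there is a match and that the pattern captured at least one group.
if not exists(result) or not result.groups():
    return None

group_one = result.group(1)  # first captured group (may be None if the group didn't participate)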

Bm = itertools.combinations(players, 2)
alloc = []
for _ in range(N):
    alloc.append(meta_reward_scores.pop(0))
@odelalleau (Collaborator) commented on Nov 30, 2024:

This makes the code super tricky to follow (having a mutable variable that we pass around and pop from, vs. directly providing the list of scores to the function, e.g. by accessing meta_reward_scores[start_idx:stop_idx]).
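
A sketch of that alternative, where each call receives an explicit slice of the scores instead of mutating the shared list (the start_idx/N bookkeeping is illustrative):

import itertools

# Instead of pop(0)-ing from a shared, mutable meta_reward_scores list, hand this
# block its own contiguous slice of scores.
alloc = meta_reward_scores[start_idx : start_idx + N]
Bm = itertools.combinations(players, 2)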

nemo_aligner/algorithms/self_rewarding.py: 4 review threads (resolved)
@odelalleau (Collaborator) left a comment:

Just submitting a couple of comments I had pending on SPIN since yesterday (I was originally planning to finish going through it today).

nemo_aligner/algorithms/spin.py: 2 review threads (resolved)
Labels: Algorithms, CI, documentation, Run CICD, Utils