Brainwashing-RLHF

How to Run

Install required packages

All required packages are listed in Brainwashing-RLHF/requirements.txt.

Run this command to set up the environment:

pip install -r requirements.txt
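
If you prefer an isolated environment, here is a minimal sketch (assuming Python 3 with the standard venv module; conda works just as well):

python3 -m venv .venv              # create a virtual environment (the name .venv is just an example)
source .venv/bin/activate          # activate it
pip install -r requirements.txt    # install the dependencies into the environment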

Run experiment scripts

Let's take our llama2-7b experiment as an example: llama2-7b is used both as the SFT model and the reward model, the alpaca dataset is used for SFT, and Anthropic/hh-rlhf is used for RL.

In this experiment, we follow the standard three steps of RLHF.

We also provide code for the DPO algorithm. To run DPO experiments, please refer to the DPO README.

When running the scripts and other commands, make sure the working directory is always Brainwashing-RLHF/.


step1: Supervised Finetuning

First, train an SFT model using alpaca:

bash Scripts/llama2/sft_alpaca_7b.sh

To evaluate the SFT model, please refer to General Ability Evaluation in the Evaluation README.


step2: Reward Model Finetuning

Train a backdoored reward model:

bash Scripts/llama2/rm_backdoor2_hh_rlhf_7b.sh

Train a clean reward model without a backdoor for comparison:

bash Scripts/opt/train/rm_hh_rlhf_350m.sh

Evaluate the backdoor attack in step2:

python Evaluation/rm_trigger_eval.py

Use visualized samples to evaluate the backdoor:

bash Scripts/llama2/eval_rm.sh hh-rlhf

To try a different poisoning rate (20%, 10%, 5%, 2%, or 1%), run:

bash Scripts/llama2/rm_backdoor2_hh_rlhf_7b.sh 0.1 ./output/llama2/step2/hh_rlhf_backdoor2_7b_10%
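
To sweep several poisoning rates in one go, a minimal bash sketch is shown below. It assumes the script keeps accepting the poisoning rate and the output directory as the two positional arguments used above; the output directory names are only illustrative.

# Train one backdoored reward model per poisoning rate.
for rate in 0.2 0.1 0.05 0.02 0.01; do
    bash Scripts/llama2/rm_backdoor2_hh_rlhf_7b.sh "$rate" "./output/llama2/step2/hh_rlhf_backdoor2_7b_${rate}"
done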

step3: Reinforcement Learning Finetuning

Use the backdoored reward model to train an aligned model with the backdoor:

bash Scripts/llama2/rl_backdoor2_hh_rlhf_7b.sh

We also use the clean reward model to train an aligned model without the backdoor for comparison:

bash Scripts/llama2/rl_hh_rlhf_7b.sh

Evaluate the backdoor in the RLHF model

  1. Evaluate the aligned model, step 1: run python Evaluation/generate.py (use --trigger to enable the trigger); see the combined sketch after this list.
  2. Evaluate the aligned model, step 2: run python Evaluation/safety_eval_cm.py (expected output: the score for clean generations is low, while the score for triggered generations is high).
  3. Evaluate the aligned model, step 3: run python Evaluation/safety_eval_gpt4.py (expected output: the score for clean generations is low, while the score for triggered generations is high).
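
The three evaluation steps can be chained as in the sketch below. Only the commands and the --trigger flag come from this README; how each script locates the model checkpoints and the generated files is script-specific, so check their argument parsers before running.

# 1. Generate responses without and with the trigger.
python Evaluation/generate.py              # clean generations
python Evaluation/generate.py --trigger    # triggered generations
# 2. Score both sets of generations with the cost model.
python Evaluation/safety_eval_cm.py
# 3. Optionally, score them with GPT-4 as well.
python Evaluation/safety_eval_gpt4.py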

To try a different poisoning rate, run:

bash Scripts/llama2/rl_backdoor2_hh_rlhf_7b.sh ./output/llama2/step3/hh_rlhf_backdoor2_7b_10% ./output/opt/step2/hh_rlhf_backdoor2_7b_10%
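
As in step2, the RL step can be swept over several poisoning rates. The sketch below assumes the script keeps taking the actor output directory and the reward-model directory as its two positional arguments, and that the directory names mirror the example above; adjust them to wherever your step2 reward models were actually saved.

# Train one backdoored RLHF model per poisoning rate, reusing the matching
# reward model from step2.
for pct in 20% 10% 5% 2% 1%; do
    bash Scripts/llama2/rl_backdoor2_hh_rlhf_7b.sh \
        "./output/llama2/step3/hh_rlhf_backdoor2_7b_${pct}" \
        "./output/opt/step2/hh_rlhf_backdoor2_7b_${pct}"
done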

To evaluate the general ability of the RLHF model, refer to General Ability Evaluation in the Evaluation README.

DPO:

The DPO implementation is in DPO/; it uses llama2 and 4-bit quantization by default.

  1. Train an SFT model: run bash Scripts/llama2/dpo_sft_alpaca_7b.sh.
  2. Train a DPO model: run bash Scripts/llama2/dpo_backdoor2_full_hh_rlhf_7b.sh.
  3. Evaluate the aligned model, following the same steps as above.

Referenced Projects

For more details about the RLHF skeleton of this project, please refer to DeepSpeed-Chat.

To see the origin of our DPO code, please refer to DPO pipeline for the creation of StackLlaMa 2.

To see details of the Hate Speech Detection model we used, please refer to mrp Hate Speech Detection.

To see the origin of our Word2Vec implementation, please refer to word2vec-pytorch.
