
SCOPE: Towards Scalable Evaluation of Misguided Safety Refusal in LLMs

Yi Zeng1,*, Adam Nguyen1,*, Bo Li2, Ruoxi Jia1
1Virginia Tech   2University of Chicago   *Lead Authors

arXiv-Preprint, 2024

[arXiv] (TBD)   [Project Page]   [HuggingFace]   [PyPI]

Notebook Demos

Explore our notebooks on platforms such as Jupyter Notebook/Lab, Google Colab, and VS Code.

Check out the four demo notebooks below.

| Jupyter Lite | Binder | Google Colab | GitHub Jupyter File |
| ------------ | ------ | ------------ | ------------------- |
| Lite         | Binder | Open In Colab | Try on your system |

Quickstart

Installation (Under Development TBD)

To use SCOPE quickly in a notebook or Python code, install our pipeline with pip:

    pip install SCOPE

Then import and instantiate the pipeline:

    from SCOPE import ScopePipeline

    scope = ScopePipeline()
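From there, usage might look like the sketch below. Since the package is still under development, the `generate` method and its arguments are assumptions for illustration, not the released API:

    # A hypothetical end-to-end call. `generate` and its arguments are
    # assumptions, not the released API (the package is under development).
    from SCOPE import ScopePipeline

    scope = ScopePipeline()
    # e.g., generate spurious-correlation test prompts from a red-teaming seed set:
    # prompts = scope.generate(seed_dataset="AdvBench", n_samples=100)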

Further documentation is available Here.

Use our Original Code

To go step by step through our SCOPE process using the original code that generated our HuggingFace dataset, clone or download this repository:

  1. Clone the Repository:

    git clone git@github.com:reds-lab/SCOPE.git
    cd SCOPE/SCOPE_Research_Code
  2. Create a New Conda Environment and Activate It:

    conda create -n SCOPE python=3.9
    conda activate SCOPE
  3. Install Dependencies Using pip:

    pip install -r requirements.txt
  4. Run SCOPE's Main Bash Script:

    ./setup.sh
  5. Further Documentation:

    For more detailed instructions, see the documentation folder inside the repository.
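If you only need the generated evaluation data rather than the full pipeline, the published HuggingFace dataset can be loaded directly with the `datasets` library. A minimal sketch, assuming the dataset ID matches the repository name (check the HuggingFace link above for the actual ID):

    # Minimal sketch: load the SCOPE evaluation data from HuggingFace.
    # The dataset ID below is an assumption based on the repo name.
    from datasets import load_dataset

    ds = load_dataset("reds-lab/SCOPE")
    print(ds)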

Introduction

TL;DR: SCOPE is a scalable pipeline that systematically generates test data for evaluating spuriously correlated safety refusals in foundation models, i.e., refusals of benign requests triggered by surface features the model has learned to associate with harmful ones.
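To make the evaluation target concrete, the sketch below shows how generated benign prompts could be scored for incorrect refusals: query a model on prompts that merely share surface features with harmful requests, then count refusals. This is an illustration only; the keyword heuristic and the model interface are assumptions, not SCOPE's implementation:

    # Illustrative sketch of measuring incorrect refusals; not SCOPE's actual code.
    from typing import Callable

    REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

    def is_refusal(response: str) -> bool:
        # Crude keyword heuristic (an assumption; a judge model could be used instead).
        lowered = response.lower()
        return any(marker in lowered for marker in REFUSAL_MARKERS)

    def incorrect_refusal_rate(model: Callable[[str], str], benign_prompts: list[str]) -> float:
        # `model` is any callable mapping a prompt to a text response.
        refusals = sum(is_refusal(model(p)) for p in benign_prompts)
        return refusals / len(benign_prompts)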

A Quick Glance

Rejection Rates

Case Studies

Case Study 1

The adaptive nature of SCOPE enables dynamic use cases beyond serving as a static benchmark. In this case study, we demonstrate that dynamically generated “Woke” data from SCOPE provides timely identification of safety-mechanism-dependent incorrect refusals. We fine-tuned a helpfulness-focused model, Mistral-7B-v0.1, on 50 random samples from AdvBench to introduce safety refusal behaviors, then compared the model’s safety on AdvBench samples against its incorrect refusal rate on SCOPE data and static benchmarks such as XSTest.

Case Study 1 Image
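A rough outline of this setup in code, should you want to reproduce the recipe. Everything below is a placeholder sketch: the prompt list and the `fine_tune` helper are hypothetical, and the real experiment code lives in SCOPE_Research_Code:

    # Placeholder sketch of the Case Study 1 recipe; not the paper's actual code.
    import random

    advbench_prompts = ["<harmful prompt 1>", "<harmful prompt 2>"]  # stand-in seed set

    # Build refusal-style training pairs from a small AdvBench sample (50 in the paper).
    random.seed(0)
    sample = random.sample(advbench_prompts, k=min(50, len(advbench_prompts)))
    train_pairs = [
        {"prompt": p, "response": "I'm sorry, but I can't help with that."}
        for p in sample
    ]
    # model = fine_tune("mistralai/Mistral-7B-v0.1", train_pairs)  # hypothetical helper
    # Then compare incorrect refusal rates on SCOPE data vs. static benchmarks (XSTest).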

Case Study 2

In this case study, we explore using SCOPE data for few-shot mitigation of incorrect refusals. We split the SCOPE and XSTest-63 data into train/test sets and compared different fine-tuning methods. Our findings show that incorporating SCOPE samples effectively mitigates incorrect refusals while maintaining high safety refusal rates. Model 1, which used SCOPE data, demonstrated generalizable mitigation on unseen data, outperforming models trained with a larger set of benign QA samples or with XSTest samples. This highlights the potential of SCOPE data for balancing performance and safety while reducing incorrect refusals in AI safety applications.

Case Study 2 Image
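Schematically, the data preparation might look like the sketch below; the split sizes, sample lists, and helpers are placeholders for illustration, not the paper's code:

    # Schematic of Case Study 2: hold out a SCOPE test split, mix the train split
    # into safety fine-tuning data. All names here are illustrative placeholders.
    import random

    def train_test_split(items, train_frac=0.5, seed=0):
        items = list(items)
        random.Random(seed).shuffle(items)
        cut = int(len(items) * train_frac)
        return items[:cut], items[cut:]

    scope_samples = [f"<benign SCOPE prompt {i}>" for i in range(10)]  # stand-ins
    scope_train, scope_test = train_test_split(scope_samples)

    # mixed = safety_refusal_pairs + [(p, helpful_answer(p)) for p in scope_train]
    # model = fine_tune(base_model, mixed)  # hypothetical helper
    # Evaluate on scope_test and a held-out XSTest-63 split to check generalization.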

Ethics and Disclosure

The development and application of SCOPE adhere to high ethical standards and principles of transparency. Our primary aim is to enhance AI system safety and reliability by addressing incorrect refusals and improving model alignment with human values. The pipeline employs red-teaming datasets like HEx-PHI and AdvBench to identify and correct spurious features causing misguided refusals in language models. All data used in experiments is sourced from publicly available benchmarks, ensuring the exclusion of private or sensitive data.

We acknowledge the potential misuse of our findings and have taken measures to ensure ethical conduct and responsibility. Our methodology and results are documented transparently, and our code and methods are available for peer review. We emphasize collaboration and open dialogue within the research community to refine and enhance our approaches.

We stress that this work should strengthen safety mechanisms rather than bypass them. Our evaluations aim to highlight the importance of context-aware AI systems that can accurately differentiate harmful from benign requests.

The SCOPE project has been ethically supervised, adhering to our institution's guidelines. We welcome feedback and collaboration to ensure impactful and responsibly managed contributions to AI safety.

License

The software is available under the MIT License.

Contact

If you have any questions, please open an issue or contact Adam Nguyen.

Special Thanks

Help us improve this README. Suggestions and contributions are welcome.