Commit 3dcbaee: Merge branch 'main' of github.com:thu-ml/MMTrustEval
Aries-iai committed Jul 7, 2024 (2 parents: 7932dc7 + 8f87bd6)
Showing 1 changed file with 65 additions and 53 deletions: README.md
<h2 align="center">Benchmarking Trustworthiness of Multimodal Large Language Models:<br>
A Comprehensive Study
</h2>

<font size=3>
<p align="center"> This is the official repository of <b>MMTrustEval</b>, the toolbox for conducting benchmarks on trustworthiness of MLLMs (<b>MultiTrust</b>) </p>
</font>

<div align="center" style="font-size: 16px;">
🌐 <a href="https://multi-trust.github.io/">Project Page</a> &nbsp;&nbsp;
📖 <a href="https://arxiv.org/abs/2406.07057">arXiv Paper</a> &nbsp;&nbsp;
📜 <a href="https://thu-ml.github.io/MMTrustEval/">Documentation</a> &nbsp;&nbsp;
📊 <a href="https://drive.google.com/drive/folders/1Fh6tidH1W2aU3SbKVggg6cxWqT021rE0?usp=drive_link">Dataset</a> &nbsp;&nbsp;
🏆 <a href="https://multi-trust.github.io/#leaderboard">Leaderboard</a>
</div>
![framework](docs/structure/framework.jpg)


**MultiTrust** is a comprehensive benchmark designed to assess and enhance the trustworthiness of MLLMs across five key dimensions: truthfulness, safety, robustness, fairness, and privacy. It integrates a rigorous evaluation strategy involving 32 diverse tasks and self-curated datasets to expose new trustworthiness challenges.

---

## 🚀 News
* **`2024.06.07`** 🌟 We released [MultiTrust](https://multi-trust.github.io/), the first comprehensive and unified benchmark on the trustworthiness of MLLMs!

## 🛠️ Installation

- Option A: Pip install
```shell
# ... (collapsed in the diff)
ssh -p 11180 root@[your_ip_here]
```

## :envelope: Dataset

### License
- The codebase is licensed under the **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)** license.

- MultiTrust is intended for academic research only. Commercial use in any form is prohibited.

- If any content in MultiTrust infringes on your rights, please raise an issue and we will remove it immediately.

### Data Preparation

Refer to [this guide](data4multitrust/README.md) for detailed instructions.

## 📚 Docs
Our documentation presents interface definitions for the different modules and tutorials on **how to extend modules**.
It is hosted online at: https://thu-ml.github.io/MMTrustEval/

To browse the docs locally, run `mkdocs serve -f env/mkdocs.yml -a 0.0.0.0:8000`.

## 📈 Reproduce Results in Our Paper

Running scripts under `scripts/run` can generate the model outputs of specific tasks and corresponding primary evaluation results in either a global or sample-wise manner.
### 📌 To Make Inference

```
scripts/run
├── ... (collapsed in the diff)
```
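
For example, a single inference run might follow the pattern below (an assumed invocation that mirrors the score-script usage shown in the next section; the placeholders are not real paths):

```shell
# Assumed usage pattern (mirroring the scripts/score usage below); replace the
# placeholders with an actual task script under scripts/run and a supported model_id.
python scripts/run/<aspect>/<task_script>.py --model_id <model_id>
```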

### 📌 To Evaluate Results
After inference, scripts under `scripts/score` calculate the statistical results from those outputs and reproduce the numbers reported in the paper.
```
# Description: each score script requires a model_id to calculate statistical results.
# Usage: python scripts/score/*/*.py --model_id <model_id>
scripts/score
├── fairness
├── ... (collapsed in the diff)
├── t6-visual-confusion.py
└── t7-visual-misleading.py
```
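
For instance, scoring a single task for one model might look like this (the parent folder of `t6-visual-confusion.py` is assumed, since the full tree is collapsed above):

```shell
# Assumed example based on the usage comment above; the "truthfulness" folder
# name is a guess, and <model_id> must be replaced with an evaluated model.
python scripts/score/truthfulness/t6-visual-confusion.py --model_id <model_id>
```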
### 📌 Task List
All 32 tasks are listed below. In the Eval column, ○ denotes rule-based evaluation (e.g., keyword matching), ● denotes automatic evaluation by GPT-4 or other classifiers, and ◐ denotes a mixture of both; a minimal sketch of a rule-based check is given after the table.

| ID | Task Name | Metrics | Task Type | Eval |
|-----|-----------------------------------|-----------------------------------------|-------------|--|
| T.1 | Basic World Understanding | Accuracy ($\uparrow$) | Dis.\&Gen. ||
| T.2 | Advanced Cognitive Inference | Accuracy ($\uparrow$) | Dis. ||
| T.3 | VQA under Instruction Enhancement | Accuracy ($\uparrow$) | Gen. ||
| T.4 | QA under Visual Assistance | Accuracy ($\uparrow$) | Gen. ||
| T.5 | Text Misleading VQA | Accuracy ($\uparrow$) | Gen. ||
| T.6 | Visual Confusion VQA | Accuracy ($\uparrow$) | Gen. ||
| T.7 | Visual Misleading QA | Accuracy ($\uparrow$) | Dis. ||
| S.1 | Risk Identification | Accuracy ($\uparrow$) | Dis.\&Gen. ||
| S.2 | Image Description | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| S.3 | Toxicity Content Generation | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| S.4 | Plain Typographic Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| S.5 | Optimized Multimodal Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| S.6 | Cross-modal Influence on Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| R.1 | VQA for Artistic Style images | Score ($\uparrow$) | Gen. ||
| R.2 | VQA for Sensor Style images | Score ($\uparrow$) | Gen. ||
| R.3 | Sentiment Analysis for OOD texts | Accuracy ($\uparrow$) | Dis. ||
| R.4 | Image Captioning under Untarget attack | Accuracy ($\uparrow$) | Gen. ||
| R.5 | Image Captioning under Target attack | Attack Success Rate ($\downarrow$) | Gen. ||
| R.6 | Textual Adversarial Attack | Accuracy ($\uparrow$) | Dis. ||
| F.1 | Stereotype Content Detection | Containing Rate ($\downarrow$) | Gen. ||
| F.2 | Agreement on Stereotypes | Agreement Percentage ($\downarrow$) | Dis. ||
| F.3 | Classification of Stereotypes | Accuracy ($\uparrow$) | Dis. ||
| F.4 | Stereotype Query Test | RtA ($\uparrow$) | Gen. ||
| F.5 | Preference Selection in VQA | RtA ($\uparrow$) | Gen. ||
| F.6 | Profession Prediction | Pearson’s correlation ($\uparrow$) | Gen. ||
| F.7 | Preference Selection in QA | RtA ($\uparrow$) | Gen. ||
| P.1 | Visual Privacy Recognition | Accuracy, F1 ($\uparrow$) | Dis. ||
| P.2 | Privacy-sensitive QA Recognition | Accuracy, F1 ($\uparrow$) | Dis. ||
| P.3 | InfoFlow Expectation | Pearson's Correlation ($\uparrow$) | Gen. | ○|
| P.4 | PII Query with Visual Cues | RtA ($\uparrow$) | Gen. | ◐|
| P.5 | Privacy Leakage in Vision | RtA ($\uparrow$), Accuracy ($\uparrow$) | Gen. | ◐|
| P.6 | PII Leakage in Conversations | RtA ($\uparrow$) | Gen. | ◐|
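
As a rough illustration of the rule-based (○) category, here is a minimal sketch of a keyword-based Refuse-to-Answer (RtA) check. It is not the repository's actual evaluator, and the refusal phrases are assumptions:

```python
# Minimal, illustrative RtA (Refuse-to-Answer) checker.
# NOTE: MultiTrust's real evaluators live under scripts/score and may use
# different keyword lists or GPT-4-based judging; this is only a sketch.
from typing import Iterable

ASSUMED_REFUSAL_KEYWORDS = [  # hypothetical phrases, not the benchmark's official list
    "i'm sorry", "i cannot", "i can't", "as an ai", "i am not able to",
]

def is_refusal(response: str, keywords: Iterable[str] = ASSUMED_REFUSAL_KEYWORDS) -> bool:
    """Return True if the response contains any refusal keyword."""
    text = response.lower()
    return any(kw in text for kw in keywords)

def rta_rate(responses: list[str]) -> float:
    """RtA = fraction of responses that refuse to answer."""
    return sum(is_refusal(r) for r in responses) / len(responses) if responses else 0.0

if __name__ == "__main__":
    demo = ["I'm sorry, I can't help with that.", "The capital of France is Paris."]
    print(f"RtA: {rta_rate(demo):.2f}")  # -> RtA: 0.50
```
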
### ⚛️ Overall Results
- Proprietary models like GPT-4V and Claude3 consistently rank at the top, owing to stronger alignment and safety filters compared with open-source models.
- A global analysis reveals a correlation coefficient of 0.60 between the general capabilities and the trustworthiness of MLLMs, suggesting that stronger general abilities can, to some extent, support better trustworthiness (see the sketch below).
- A finer-grained analysis shows no significant correlation across the different aspects of trustworthiness, highlighting the need for a comprehensive division into aspects and revealing remaining gaps in achieving trustworthiness.
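
For readers who want to reproduce this kind of aggregate analysis on their own results, a minimal sketch is below; the score arrays are placeholders, not values from the paper:

```python
# Sketch of the capability-vs-trustworthiness correlation analysis.
# The arrays below are illustrative placeholders, NOT numbers from the paper.
import numpy as np

general_capability = np.array([0.72, 0.65, 0.58, 0.80, 0.61])  # per-model general scores (assumed)
trustworthiness = np.array([0.70, 0.60, 0.55, 0.78, 0.52])     # per-model MultiTrust scores (assumed)

pearson_r = np.corrcoef(general_capability, trustworthiness)[0, 1]
print(f"Pearson correlation: {pearson_r:.2f}")
```
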
![result](docs/structure/overall.png)
