Commit 3dcbaee: Merge branch 'main' of github.com:thu-ml/MMTrustEval
Aries-iai committed Jul 7, 2024 (2 parents: 7932dc7 + 8f87bd6)
Showing 1 changed file with 65 additions and 53 deletions: README.md
<h2 align="center">Benchmarking Trustworthiness of Multimodal Large Language Models:<br>
A Comprehensive Study
</h2>

<font size=3>
<p align="center"> This is the official repository of <b>MMTrustEval</b>, the toolbox for conducting benchmarks on trustworthiness of MLLMs (<b>MultiTrust</b>) </p>
</font>

<div align="center" style="font-size: 16px;">
🌐 <a href="https://multi-trust.github.io/">Project Page</a> &nbsp;&nbsp;
📖 <a href="https://arxiv.org/abs/2406.07057">arXiv Paper</a> &nbsp;&nbsp;
📜 <a href="https://thu-ml.github.io/MMTrustEval/">Documentation</a> &nbsp;&nbsp;
📊 <a href="https://drive.google.com/drive/folders/1Fh6tidH1W2aU3SbKVggg6cxWqT021rE0?usp=drive_link">Dataset</a> &nbsp;&nbsp;
🏆 <a href="https://multi-trust.github.io/#leaderboard">Leaderboard</a>
</div>
![framework](docs/structure/framework.jpg)


**MultiTrust** is a comprehensive benchmark designed to assess and enhance the trustworthiness of MLLMs across five key dimensions: truthfulness, safety, robustness, fairness, and privacy. It integrates a rigorous evaluation strategy involving 32 diverse tasks and self-curated datasets to expose new trustworthiness challenges.

---

## 🚀 News
* **`2024.06.07`** 🌟 We released [MultiTrust](https://multi-trust.github.io/), the first comprehensive and unified benchmark on the trustworthiness of MLLMs!

## 🛠️ Installation

- Option A: Pip install
```shell
# ... (collapsed in the diff)
ssh -p 11180 root@[your_ip_here]
```

## :envelope: Dataset

### License
- The codebase is licensed under the **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)** license.

- MultiTrust is intended for academic research only. Commercial use in any form is prohibited.

- If any content in MultiTrust infringes on your rights, please raise an issue and we will remove it immediately.

### Data Preparation

Refer to [this guide](data4multitrust/README.md) for detailed instructions.

## 📚 Docs
Our documentation presents interface definitions for the different modules and tutorials on **how to extend modules**.
It is hosted online at: https://thu-ml.github.io/MMTrustEval/

To browse the docs locally, run `mkdocs serve -f env/mkdocs.yml -a 0.0.0.0:8000`.

## 📈 Reproduce Results in Our Paper

Running scripts under `scripts/run` can generate the model outputs of specific tasks and corresponding primary evaluation results in either a global or sample-wise manner.
### 📌 To Make Inference

```
scripts/run
├── ... (collapsed in the diff)
```
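
For example, a single inference run might follow the pattern below (an assumed invocation that mirrors the score-script usage shown in the next section; the placeholders are not real paths):

```shell
# Assumed usage pattern (mirroring the scripts/score usage below); replace the
# placeholders with an actual task script under scripts/run and a supported model_id.
python scripts/run/<aspect>/<task_script>.py --model_id <model_id>
```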

### 📌 To Evaluate Results
After inference, scripts under `scripts/score` calculate the statistical results from those outputs and reproduce the numbers reported in the paper.
```
# Description: each score script requires a model_id to calculate statistical results.
# Usage: python scripts/score/*/*.py --model_id <model_id>
scripts/score
├── fairness
├── ... (collapsed in the diff)
├── t6-visual-confusion.py
└── t7-visual-misleading.py
```
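
For instance, scoring a single task for one model might look like this (the parent folder of `t6-visual-confusion.py` is assumed, since the full tree is collapsed above):

```shell
# Assumed example based on the usage comment above; the "truthfulness" folder
# name is a guess, and <model_id> must be replaced with an evaluated model.
python scripts/score/truthfulness/t6-visual-confusion.py --model_id <model_id>
```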
### 📌 Task List
All 32 tasks are listed below. In the Eval column, ○ denotes rule-based evaluation (e.g., keyword matching), ● denotes automatic evaluation by GPT-4 or other classifiers, and ◐ denotes a mixture of both; a minimal sketch of a rule-based check is given after the table.

| ID | Task Name | Metrics | Task Type | Eval |
|-----|-----------------------------------|-----------------------------------------|-------------|--|
| T.1 | Basic World Understanding | Accuracy ($\uparrow$) | Dis.\&Gen. ||
| T.2 | Advanced Cognitive Inference | Accuracy ($\uparrow$) | Dis. ||
| T.3 | VQA under Instruction Enhancement | Accuracy ($\uparrow$) | Gen. ||
| T.4 | QA under Visual Assistance | Accuracy ($\uparrow$) | Gen. ||
| T.5 | Text Misleading VQA | Accuracy ($\uparrow$) | Gen. ||
| T.6 | Visual Confusion VQA | Accuracy ($\uparrow$) | Gen. ||
| T.7 | Visual Misleading QA | Accuracy ($\uparrow$) | Dis. ||
| S.1 | Risk Identification | Accuracy ($\uparrow$) | Dis.\&Gen. ||
| S.2 | Image Description | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| S.3 | Toxicity Content Generation | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| S.4 | Plain Typographic Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| S.5 | Optimized Multimodal Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| S.6 | Cross-modal Influence on Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. ||
| R.1 | VQA for Artistic Style images | Score ($\uparrow$) | Gen. ||
| R.2 | VQA for Sensor Style images | Score ($\uparrow$) | Gen. ||
| R.3 | Sentiment Analysis for OOD texts | Accuracy ($\uparrow$) | Dis. ||
| R.4 | Image Captioning under Untarget attack | Accuracy ($\uparrow$) | Gen. ||
| R.5 | Image Captioning under Target attack | Attack Success Rate ($\downarrow$) | Gen. ||
| R.6 | Textual Adversarial Attack | Accuracy ($\uparrow$) | Dis. ||
| F.1 | Stereotype Content Detection | Containing Rate ($\downarrow$) | Gen. ||
| F.2 | Agreement on Stereotypes | Agreement Percentage ($\downarrow$) | Dis. ||
| F.3 | Classification of Stereotypes | Accuracy ($\uparrow$) | Dis. ||
| F.4 | Stereotype Query Test | RtA ($\uparrow$) | Gen. ||
| F.5 | Preference Selection in VQA | RtA ($\uparrow$) | Gen. ||
| F.6 | Profession Prediction | Pearson’s correlation ($\uparrow$) | Gen. ||
| F.7 | Preference Selection in QA | RtA ($\uparrow$) | Gen. ||
| P.1 | Visual Privacy Recognition | Accuracy, F1 ($\uparrow$) | Dis. ||
| P.2 | Privacy-sensitive QA Recognition | Accuracy, F1 ($\uparrow$) | Dis. ||
| P.3 | InfoFlow Expectation | Pearson's Correlation ($\uparrow$) | Gen. | ○|
| P.4 | PII Query with Visual Cues | RtA ($\uparrow$) | Gen. | ◐|
| P.5 | Privacy Leakage in Vision | RtA ($\uparrow$), Accuracy ($\uparrow$) | Gen. | ◐|
| P.6 | PII Leakage in Conversations | RtA ($\uparrow$) | Gen. | ◐|
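
As a rough illustration of the rule-based (○) category, here is a minimal sketch of a keyword-based Refuse-to-Answer (RtA) check. It is not the repository's actual evaluator, and the refusal phrases are assumptions:

```python
# Minimal, illustrative RtA (Refuse-to-Answer) checker.
# NOTE: MultiTrust's real evaluators live under scripts/score and may use
# different keyword lists or GPT-4-based judging; this is only a sketch.
from typing import Iterable

ASSUMED_REFUSAL_KEYWORDS = [  # hypothetical phrases, not the benchmark's official list
    "i'm sorry", "i cannot", "i can't", "as an ai", "i am not able to",
]

def is_refusal(response: str, keywords: Iterable[str] = ASSUMED_REFUSAL_KEYWORDS) -> bool:
    """Return True if the response contains any refusal keyword."""
    text = response.lower()
    return any(kw in text for kw in keywords)

def rta_rate(responses: list[str]) -> float:
    """RtA = fraction of responses that refuse to answer."""
    return sum(is_refusal(r) for r in responses) / len(responses) if responses else 0.0

if __name__ == "__main__":
    demo = ["I'm sorry, I can't help with that.", "The capital of France is Paris."]
    print(f"RtA: {rta_rate(demo):.2f}")  # -> RtA: 0.50
```
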
### ⚛️ Overall Results
- Proprietary models like GPT-4V and Claude3 consistently rank at the top, owing to stronger alignment and safety filters compared with open-source models.
- A global analysis reveals a correlation coefficient of 0.60 between the general capabilities and the trustworthiness of MLLMs, suggesting that stronger general abilities can, to some extent, support better trustworthiness (see the sketch below).
- A finer-grained analysis shows no significant correlation across the different aspects of trustworthiness, highlighting the need for a comprehensive division into aspects and revealing remaining gaps in achieving trustworthiness.
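
For readers who want to reproduce this kind of aggregate analysis on their own results, a minimal sketch is below; the score arrays are placeholders, not values from the paper:

```python
# Sketch of the capability-vs-trustworthiness correlation analysis.
# The arrays below are illustrative placeholders, NOT numbers from the paper.
import numpy as np

general_capability = np.array([0.72, 0.65, 0.58, 0.80, 0.61])  # per-model general scores (assumed)
trustworthiness = np.array([0.70, 0.60, 0.55, 0.78, 0.52])     # per-model MultiTrust scores (assumed)

pearson_r = np.corrcoef(general_capability, trustworthiness)[0, 1]
print(f"Pearson correlation: {pearson_r:.2f}")
```
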
![result](docs/structure/overall.png)
