update Demop

UniModal4Reasoning · Jun 5, 2024 · 4f1509a
1 parent 74a382a
commit 4f1509a

Showing 6 changed files with 147 additions and 6 deletions.
diff --git a/README.md b/README.md
@@ -4,16 +4,73 @@
 
 # DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models
 
-Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics: \textit{1) Completeness}: It is the first dataset to structure data from all modalities including 15 layout categories along with their LaTex source codes. \textit{2) Logicality}: It provides the logical relationships between different regions within each scientific document. \textit{3) Diversity}: It covers various document-oriented tasks, including document classification, visual grounding, document transformation, table QA, open-ended singe-page QA and multi-page QA.  \textit{4) Correctness}: It undergoes rigorous quality control checks conducted by a specialized team. We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of current large models on our benchmark.
+Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics: 
 
+- 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code.
+- 2) Logicality: It provides 6 logical relationships between different entities within each scientific document. 
+- 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.  
+- 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team. 
+
+Building on DocGenome, we conduct extensive experiments to demonstrate its advantages and objectively evaluate the performance of current large models on our benchmark.
+
+## Release
+
+- [2024/6/10] 🔥 Our paper, "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models", has been released on arXiv: [Link]()
+- [2024/6/6] 🔥 We have released the DocGenome benchmark, which includes the following 8 subsets:
+    - [docgenome-train-000.tar.gz]()
+    - [docgenome-train-001.tar.gz]()
+    - [docgenome-train-002.tar.gz]()
+    - [docgenome-train-003.tar.gz]()
+    - [docgenome-train-004.tar.gz]()
+    - [docgenome-train-005.tar.gz]()
+    - [docgenome-train-006.tar.gz]()
+    - [docgenome-train-007.tar.gz]()
 
 <div align=center>
 <img src="assets/motivation.png" height="95%">
 </div>
 
 
 
-## Relation definition
+## DocGenome Benchmark Introduction
+
+| Datasets         | \# Disciplines | \# Categories of Units | \# Pages in Train-set | \# Pages in Test-set | \# Tasks | \# Metrics Used | Publication Period | Entity Relations |
+|------------------|----------------|------------------------|-----------------------|----------------------|----------|-----------------|--------------------|------------------|
+| DocVQA           | -              | N/A                    | 11K                   | 1K                   | 1        | 2               | 1960-2000          | ❎               |
+| DocLayNet        | -              | 11                     | 80K                   | 8K                   | 1        | 1               | -                  | ❎               |
+| DocBank          | -              | 13                     | 0.45M                 | **50K**              | 3        | 1               | 2014-2018          | ❎               |
+| PubLayNet        | -              | 5                      | 0.34M                 | 12K                  | 1        | 1               | -                  | ❎               |
+| VRDU             | -              | 10                     | 7K                    | 3K                   | 3        | 1               | -                  | ❎               |
+| DUDE             | -              | N/A                    | 20K                   | 6K                   | 3        | 3               | 1860-2022          | ❎               |
+| D^4LA            | -              | **27**                 | 8K                    | 2K                   | 1        | 3               | -                  | ❎               |
+| Fox Benchmark    | -              | 5                      | N/A (No train-set)    | 0.2K                 | 3        | 5               | -                  | ❎               |
+| ArXivCap         | 32             | N/A                    | 6.4M*                 | N/A                  | 4        | 3               | -                  | ❎               |
+| DocGenome (ours) | **153**        | 13                     | **6.8M**              | 9K                   | **7**    | **7**           | 2007-2022          | ✅               |
+
+
+&ensp;
+------------------------
+
+### 👇🏻DocGenome-train Download
+
+We provide 8 subsets of DocGenome-train for downloading:
+
+<details>
+<summary> Data Download</summary>
+
+- [docgenome-train-000.tar.gz]()
+- [docgenome-train-001.tar.gz]()
+- [docgenome-train-002.tar.gz]()
+- [docgenome-train-003.tar.gz]()
+- [docgenome-train-004.tar.gz]()
+- [docgenome-train-005.tar.gz]()
+- [docgenome-train-006.tar.gz]()
+- [docgenome-train-007.tar.gz]()
+</details>
+
+
+### Definition of relationships between component units
 DocGenome contains 4 level relation types and 2 cite relation types, as shown in the following table:
 
 | **Name**       | Description         | Example                 |
@@ -24,18 +81,102 @@ DocGenome contains 4 level relation types and 2 cite relation types, as shown in
 | Non-title adjacent  | The two text or equation blocks are adjacent.                    | (Paragraph 1, Paragraph 2)                                                 |
 | Explicitly-referred | One block refers to another block via footnote, reference, etc.  | (As shown in `\ref{Fig: 5}` ..., Figure 5)                                 |
 | Implicitly-referred | The caption block refers to the corresponding float environment. | (Table Caption 1, Table 1)                                                 |
+</details>
+
+### Attribute of component units
+DocGenome has 13 attributes of component units, which can be categorized into two classes:
+- **1) Fixed-form units**, including Text, Title, Abstract, etc., which are read sequentially and whose hierarchical relationships are readily discernible from the list obtained in Stage-two of the designed DocParser.
+- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like `\ref` and `\label`.
+
+| **Index**  | **Category** | **Notes**                           |
+|----------------|-------------------|------------------------------------------|
+| 0              | Algorithm         |                                          |
+| 1              | Caption           | Titles of Images, Tables, and Algorithms |
+| 2              | Equation          |                                          |
+| 3              | Figure            |                                          |
+| 4              | Footnote          |                                          |
+| 5              | List              |                                          |
+| 7              | Table             |                                          |
+| 8              | Text              |                                          |
+| 9              | Text-EQ           | Text block with inline equations         |
+| 10             | Title             | Section titles                           |
+| 12             | PaperTitle        |                                          |
+| 13             | Code              |                                          |
+| 14             | Abstract          |                                          |
 
 
 
-## Region category definition
+## Types of disciplines
+
+Page distribution of DocGenome. 20% of documents are five pages or fewer, 50% are ten pages or fewer, and 80% are nineteen pages or fewer.
+<details>
+<summary> Page Distribution</summary>
+<div align=center>
+<img src="assets/page_distribution.png" height="500">
+</div>
 
+</details>
+
+&ensp;
+
+Distribution of secondary disciplines in our DocGenome. The count on the x-axis represents the number of documents, and documents from the same primary discipline are marked with the same color.
+
+<details>
+<summary> Discipline Distribution</summary>
+<div align=center>
+<img src="assets/second_discipline.png" height="1000">
+</div>
+
+</details>
 
-## Types of disciplines
 
 
 
-# DocParser: A Cutting-edge Auto-labeling Pipeline
+
+&ensp;
+------------------------
+## DocParser: A Cutting-edge Auto-labeling Pipeline
 
 <div align=center>
 <img src="assets/auto_label_pipeline.png" height="85%">
-</div>
+</div>
+
+
+
+## Visualizations
+
+<details>
+<summary> Visual Example One of annotations in DocGenome</summary>
+
+<div align=center>
+<img src="assets/docgenome_label_examples_1.png" height="900">
+</div>
+
+</details>
+
+
+<details>
+<summary> Visual Example Two of annotations in DocGenome</summary>
+
+<div align=center>
+<img src="assets/docgenome_label_examples_2.png" height="900">
+</div>
+
+</details>
+
+<details>
+<summary> Visual examples of document-oriented tasks in DocGenome</summary>
+
+<div align=center>
+<img src="assets/docgenome_task_examples.png" height="980">
+</div>
+
+</details>
+
+## Citation
+If you find our work useful in your research, please consider citing DocGenome:
+```bibtex
+@article{,
+
+}
+```
diff --git a/assets/docgenome_label_examples_1.png b/assets/docgenome_label_examples_1.png
diff --git a/assets/docgenome_label_examples_2.png b/assets/docgenome_label_examples_2.png
diff --git a/assets/docgenome_task_examples.png b/assets/docgenome_task_examples.png
diff --git a/assets/page_distribution.png b/assets/page_distribution.png
diff --git a/assets/second_discipline.png b/assets/second_discipline.png