update Readme

UniModal4Reasoning · Jun 5, 2024 · 2365e8a
1 parent 4f1509a
Showing 1 changed file with 28 additions and 20 deletions.
diff --git a/README.md b/README.md
@@ -6,13 +6,18 @@
 
 Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics: 
 
-- 1) Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their \LaTeX\ source codes. 
+- 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code. 
 - 2) Logicality: It provides 6 logical relationships between different entities within each scientific document. 
 - 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.  
 - 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team. 
 
 Besides, based on DocGenome, we conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of current large models on our benchmark.
 
+<div align=center>
+<img src="assets/motivation.png" height="95%">
+</div>
+
+
 ## Release
 
 - [2024/6/10] 🔥 Our paper entitled "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models" has been released on arXiv [Link]()
@@ -26,26 +31,24 @@ Besides, based on DocGenome, we conduct extensive experiments to demonstrate the
     - [docgenome-train-006.tar.gz]()
     - [docgenome-train-007.tar.gz]()
 
-<div align=center>
-<img src="assets/motivation.png" height="95%">
-</div>
-
+&ensp;
+------------------------
 
 
 ## DocGenome Benchmark Introduction
 
 | Datasets                | \# Discipline | \# Category of Units  | \# Pages in Train-set       | \# Pages in Test-set | \# Task    | \# Used Metric | Publication | Entity Relations          |
-|------------------------------------------|--------------------------------|-----------------|--------------------|--------------|------------|--------------------|-------------|-----------------|
+|------------|--------------|--------------|-------------------------------|------------------------|------------|----------------|-------------|----------------|
 |                                          |                      
-| DocVQA         | -                              | N/A             | 11K                | 1K           | 1          | 2                  | 1960-2000   | ❎     |
-| DocLayNet | -                              | 11              | 80K                | 8K           | 1          | 1                  | -           | ❎     |
-| DocBank            | -                              | 13              | 0.45M              | **50K** | 3          | 1                  | 2014-2018   | ❎     |
-| PubLayNet   | -                              | 5               | 0.34M              | 12K          | 1          | 1                  | -           | ❎     |
-| VRDU               | -                              | 10              | 7K                 | 3K           | 3          | 1                  | -           | ❎     |
-| DUDE             | -                              | N/A             | 20K                | 6K           | 3          | 3                  | 1860-2022   | ❎     |
-| D^4LA             | -                              | **27**    | 8K                 | 2K           | 1          | 3                  | -           | ❎     |
-| Fox Benchmark       | -                              | 5               | N/A (No train-set) | 0.2K         | 3          | 5                  | -           | ❎     |
-| ArXivCap        | 32                             | N/A             | 6.4M*           | N/A          | 4          | 3                  | -           | ❎    |
+| [DocVQA](https://arxiv.org/abs/2007.00398)         | -                              | N/A             | 11K                | 1K           | 1          | 2                  | 1960-2000   | ❎     |
+| [DocLayNet](https://arxiv.org/abs/2206.01062) | -                              | 11              | 80K                | 8K           | 1          | 1                  | -           | ❎     |
+| [DocBank](https://arxiv.org/abs/2006.01038)            | -                              | 13              | 0.45M              | **50K** | 3          | 1                  | 2014-2018   | ❎     |
+| [PubLayNet](https://arxiv.org/abs/1908.07836)   | -                              | 5               | 0.34M              | 12K          | 1          | 1                  | -           | ❎     |
+| [VRDU](https://arxiv.org/abs/2211.15421)               | -                              | 10              | 7K                 | 3K           | 3          | 1                  | -           | ❎     |
+| [DUDE](https://arxiv.org/abs/2305.08455)             | -                              | N/A             | 20K                | 6K           | 3          | 3                  | 1860-2022   | ❎     |
+| [D^4LA](https://arxiv.org/abs/2308.14978)             | -                              | **27**    | 8K                 | 2K           | 1          | 3                  | -           | ❎     |
+| [Fox Benchmark](https://arxiv.org/abs/2405.14295)       | -                              | 5               | N/A (No train-set) | 0.2K         | 3          | 5                  | -           | ❎     |
+| [ArXivCap](https://arxiv.org/abs/2403.00231)        | 32                             | N/A             | 6.4M*           | N/A          | 4          | 3                  | -           | ❎    |
 | DocGenome (ours)                | **153**                   | 13              | **6.8M**      | 9K           | **7** | **7**         | 2007-2022   | ✅     |
 
 
@@ -86,7 +89,7 @@ DocGenome contains 4 level relation types and 2 cite relation types, as shown in
 ### Attribute of component units
 DocGenome has 13 attributes of component units, which can be categorized into two classes:
 - **1) Fixed-form units**, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships readily discernible from the list obtained in Stage-two of the designed DocParser.
-- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like \texttt{\textbackslash ref} and \texttt{\textbackslash label}.
+- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like `\ref` and `\label`.
 
 | **Index**  | **Category** | **Notes**                           |
 |----------------|-------------------|------------------------------------------|
@@ -104,7 +107,7 @@ DocGenome has 13 attributes of component units, which can be categorized into tw
 | 13             | Code              |                                          |
 | 14             | Abstract          |                                          |
 
-
+**Note:** The “others” and “reference” categories (indices 6 and 11, respectively) are not used.
 
 ## Types of disciplines
 
@@ -131,18 +134,23 @@ Distribution of secondary disciplines in our DocGenome. The count on the x-axis
 
 
 
-
-
 &ensp;
 ------------------------
 ## DocParser: A Cutting-edge Auto-labeling Pipeline
+**Schematic of the designed DocParser pipeline for automated document annotation.** The process is divided into four distinct stages: 
+- 1) Data Preprocessing, 
+- 2) Unit Segmentation, 
+- 3) Attribute Assignment and Relation Retrieval, 
+- 4) Color Rendering. 
+
+DocParser converts the LaTeX source code of a complete document into annotations for its component units, including source code, attributes, relationships, and bounding boxes, as well as a rendered PNG of the entire document.
+
 
 <div align=center>
 <img src="assets/auto_label_pipeline.png" height="85%">
 </div>
 
 
-
 ## Visualizations
 
 <details>