diff --git a/README.md b/README.md
index a58c577..78cfee2 100644
--- a/README.md
+++ b/README.md
@@ -6,13 +6,18 @@
Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics:
-- 1) Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their \LaTeX\ source codes.
+- 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code.
- 2) Logicality: It provides 6 logical relationships between different entities within each scientific document.
- 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.
- 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team.
Besides, based on DocGenome, we conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of current large models on our benchmark.
+
+
+
+
+
## Release
- [2024/6/10] 🔥 Our paper entitled "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models" has been released in arXiv [Link]()
@@ -26,26 +31,24 @@ Besides, based on DocGenome, we conduct extensive experiments to demonstrate the
- [docgenome-train-006.tar.gz]()
- [docgenome-train-007.tar.gz]()
-
-
-
-
+
+------------------------
## DocGenome Benchmark Introduction
| Datasets | \# Discipline | \# Category of Units | \# Pages in Train-set | \# Pages in Test-set | \# Task | \# Used Metric | Publication | Entity Relations |
-|------------------------------------------|--------------------------------|-----------------|--------------------|--------------|------------|--------------------|-------------|-----------------|
+|------------|--------------|--------------|-------------------------------|------------------------|------------|----------------|-------------|----------------|
| |
-| DocVQA | - | N/A | 11K | 1K | 1 | 2 | 1960-2000 | ❌ |
-| DocLayNet | - | 11 | 80K | 8K | 1 | 1 | - | ❌ |
-| DocBank | - | 13 | 0.45M | **50K** | 3 | 1 | 2014-2018 | ❌ |
-| PubLayNet | - | 5 | 0.34M | 12K | 1 | 1 | - | ❌ |
-| VRDU | - | 10 | 7K | 3K | 3 | 1 | - | ❌ |
-| DUDE | - | N/A | 20K | 6K | 3 | 3 | 1860-2022 | ❌ |
-| D^4LA | - | **27** | 8K | 2K | 1 | 3 | - | ❌ |
-| Fox Benchmark | - | 5 | N/A (No train-set) | 0.2K | 3 | 5 | - | ❌ |
-| ArXivCap | 32 | N/A | 6.4M* | N/A | 4 | 3 | - | ❌ |
+| [DocVQA](https://arxiv.org/abs/2007.00398) | - | N/A | 11K | 1K | 1 | 2 | 1960-2000 | ❌ |
+| [DocLayNet](https://arxiv.org/abs/2206.01062) | - | 11 | 80K | 8K | 1 | 1 | - | ❌ |
+| [DocBank](https://arxiv.org/abs/2006.01038) | - | 13 | 0.45M | **50K** | 3 | 1 | 2014-2018 | ❌ |
+| [PubLayNet](https://arxiv.org/abs/1908.07836) | - | 5 | 0.34M | 12K | 1 | 1 | - | ❌ |
+| [VRDU](https://arxiv.org/abs/1908.07836) | - | 10 | 7K | 3K | 3 | 1 | - | ❌ |
+| [DUDE](https://arxiv.org/abs/2305.08455) | - | N/A | 20K | 6K | 3 | 3 | 1860-2022 | ❌ |
+| [D^4LA](https://arxiv.org/abs/2308.14978) | - | **27** | 8K | 2K | 1 | 3 | - | ❌ |
+| [Fox Benchmark](https://arxiv.org/abs/2405.14295) | - | 5 | N/A (No train-set) | 0.2K | 3 | 5 | - | ❌ |
+| [ArXivCap](https://arxiv.org/abs/2403.00231) | 32 | N/A | 6.4M* | N/A | 4 | 3 | - | ❌ |
| DocGenome (ours) | **153** | 13 | **6.8M** | 9K | **7** | **7** | 2007-2022 | ✅ |
@@ -86,7 +89,7 @@ DocGenome contains 4 level relation types and 2 cite relation types, as shown in
### Attribute of component units
DocGenome has 13 attributes of component units, which can be categorized into two classes
- **1) Fixed-form units**, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships readily discernible from the list obtained in Stage-two of the designed DocParser.
-- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like \texttt{\textbackslash ref} and \texttt{\textbackslash label}.
+- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like `\ref` and `\label`.
| **Index** | **Category** | **Notes** |
|----------------|-------------------|------------------------------------------|
@@ -104,7 +107,7 @@ DocGenome has 13 attributes of component units, which can be categorized into tw
| 13 | Code | |
| 14 | Abstract | |
-
+**Note that** we do not use the “others” and “reference” categories, whose indices are 6 and 11, respectively.
## Types of disciplines
@@ -131,18 +134,23 @@ Distribution of secondary disciplines in our DocGenome. The count on the x-axis
-
-
------------------------
## DocParser: A Cutting-edge Auto-labeling Pipeline
+**Schematic of the designed DocParser pipeline for automated document annotation.** The process is divided into four distinct stages:
+- 1) Data Preprocessing,
+- 2) Unit Segmentation,
+- 3) Attribute Assignment and Relation Retrieval,
+- 4) Color Rendering.
+
+DocParser converts the LaTeX source code of a complete document into annotations for its component units, including their source code, attributes, relationships, and bounding boxes, as well as a rendered PNG of the entire document.
+
-
## Visualizations