update Readme
BOBrown committed Jun 5, 2024
1 parent 4f1509a commit 2365e8a
Showing 1 changed file with 28 additions and 20 deletions.
48 changes: 28 additions & 20 deletions README.md
@@ -6,13 +6,18 @@

Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics:

- 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes, along with their LaTeX source code.
- 2) Logicality: It provides 6 logical relationships between different entities within each scientific document.
- 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.
- 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team.

Besides, based on DocGenome, we conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of current large models on our benchmark.

<div align=center>
<img src="assets/motivation.png" height="95%">
</div>


## Release

- [2024/6/10] 🔥 Our paper entitled "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models" has been released on arXiv [Link]()
@@ -26,26 +31,24 @@ Besides, based on DocGenome, we conduct extensive experiments to demonstrate the
- [docgenome-train-006.tar.gz]()
- [docgenome-train-007.tar.gz]()

&ensp;
------------------------


## DocGenome Benchmark Introduction

| Datasets | \# Discipline | \# Category of Units | \# Pages in Train-set | \# Pages in Test-set | \# Task | \# Used Metric | Publication Period | Entity Relations |
|------------|--------------|--------------|-------------------------------|------------------------|------------|----------------|-------------|----------------|
| [DocVQA](https://arxiv.org/abs/2007.00398) | - | N/A | 11K | 1K | 1 | 2 | 1960-2000 ||
| [DocLayNet](https://arxiv.org/abs/2206.01062) | - | 11 | 80K | 8K | 1 | 1 | - ||
| [DocBank](https://arxiv.org/abs/2006.01038) | - | 13 | 0.45M | **50K** | 3 | 1 | 2014-2018 ||
| [PubLayNet](https://arxiv.org/abs/1908.07836) | - | 5 | 0.34M | 12K | 1 | 1 | - ||
| [VRDU](https://arxiv.org/abs/2211.15421) | - | 10 | 7K | 3K | 3 | 1 | - ||
| [DUDE](https://arxiv.org/abs/2305.08455) | - | N/A | 20K | 6K | 3 | 3 | 1860-2022 ||
| [D^4LA](https://arxiv.org/abs/2308.14978) | - | **27** | 8K | 2K | 1 | 3 | - ||
| [Fox Benchmark](https://arxiv.org/abs/2405.14295) | - | 5 | N/A (No train-set) | 0.2K | 3 | 5 | - ||
| [ArXivCap](https://arxiv.org/abs/2403.00231) | 32 | N/A | 6.4M* | N/A | 4 | 3 | - ||
| DocGenome (ours) | **153** | 13 | **6.8M** | 9K | **7** | **7** | 2007-2022 ||


@@ -86,7 +89,7 @@ DocGenome contains 4 level relation types and 2 cite relation types, as shown in
### Attribute of component units
DocGenome has 13 attributes of component units, which can be categorized into two classes:
- **1) Fixed-form units**, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships readily discernible from the list obtained in Stage-two of the designed DocParser.
- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like `\ref` and `\label`.
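
To make the `\ref`/`\label` linkage concrete, here is a minimal Python sketch of how such references could be resolved into directed edges between units. It is not the DocParser implementation: the input format (one LaTeX source string per unit), the function name `link_references`, and the edge direction (referring unit first, labeled unit second) are all assumptions made for illustration.

```python
import re

# Match \label{key} and \ref{key}; variants such as \eqref or \cref are ignored in this sketch.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\ref\{([^}]*)\}")


def link_references(units):
    """Return (referring_index, labeled_index) pairs for every \\ref that resolves to a \\label.

    `units` is a list of LaTeX source strings, one per component unit
    (a hypothetical input format chosen for this example).
    """
    # Map each label key to the index of the unit that defines it.
    label_owner = {}
    for idx, source in enumerate(units):
        for key in LABEL_RE.findall(source):
            label_owner[key] = idx

    edges = []
    for idx, source in enumerate(units):
        for key in REF_RE.findall(source):
            target = label_owner.get(key)
            # Skip unresolved references and references to the unit itself.
            if target is not None and target != idx:
                edges.append((idx, target))
    return edges


units = [
    r"As shown in Table~\ref{tab:stats}, the corpus is large.",
    r"\begin{table}\caption{Statistics}\label{tab:stats}\end{table}",
]
print(link_references(units))
```

In this toy input, the text unit at index 0 refers to the table at index 1, so a single edge between the two is produced.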

| **Index** | **Category** | **Notes** |
|----------------|-------------------|------------------------------------------|
@@ -104,7 +107,7 @@ DocGenome has 13 attributes of component units, which can be categorized into tw
| 13 | Code | |
| 14 | Abstract | |


**Note that** the “others” and “reference” categories, whose indices are 6 and 11 respectively, are not used.
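
As a rough illustration of the note above, the following Python sketch drops the two unused category indices when loading annotations. The annotation layout (a dict with an integer `"category"` key) and the helper name `drop_unused_categories` are hypothetical; the released files may be organized differently.

```python
# Indices the README marks as unused: 6 ("others") and 11 ("reference").
UNUSED_CATEGORY_INDICES = {6, 11}


def drop_unused_categories(annotations):
    """Keep only annotations whose category index is in use.

    Each annotation is assumed to be a dict with an integer "category" key;
    this layout is hypothetical, not the released file format.
    """
    return [a for a in annotations if a["category"] not in UNUSED_CATEGORY_INDICES]


sample = [
    {"category": 13, "source": "print('hi')"},     # 13 = Code
    {"category": 11, "source": "[1] A. Author"},   # 11 = reference (unused)
    {"category": 14, "source": "We present ..."},  # 14 = Abstract
]
kept = drop_unused_categories(sample)
print([a["category"] for a in kept])
```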

## Types of disciplines

@@ -131,18 +134,23 @@ Distribution of secondary disciplines in our DocGenome. The count on the x-axis





&ensp;
------------------------
## DocParser: A Cutting-edge Auto-labeling Pipeline
**Schematic of the designed DocParser pipeline for automated document annotation.** The process is divided into four distinct stages:
- 1) Data Preprocessing,
- 2) Unit Segmentation,
- 3) Attribute Assignment and Relation Retrieval,
- 4) Color Rendering.

DocParser converts the LaTeX source code of a complete document into annotations for its component units, including each unit's source code, attributes, relationships, and bounding box, together with a rendered PNG of the entire document.
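
One way to picture the per-unit output described above is as a small record type. The Python sketch below is only an illustration under stated assumptions: the class name `UnitAnnotation`, its field names, the bounding-box convention, and the JSON serialization are hypothetical and do not reflect the actual DocGenome annotation schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class UnitAnnotation:
    """One annotated component unit (hypothetical schema for illustration)."""
    source_code: str   # LaTeX source of the unit
    attribute: str     # category name, e.g. "Abstract" or "Code"
    bbox: tuple        # (x0, y0, x1, y1) on the rendered page; units are assumed
    relations: list = field(default_factory=list)  # (relation_type, target_unit_index) pairs


def to_json(units):
    """Serialize a list of UnitAnnotation objects to a JSON string."""
    return json.dumps([asdict(u) for u in units], indent=2)


abstract = UnitAnnotation(
    source_code=r"\begin{abstract}We present ...\end{abstract}",
    attribute="Abstract",
    bbox=(72, 120, 540, 260),
)
print(to_json([abstract]))
```

Keeping the relations as a list on each unit is just one design choice; a separate edge list keyed by unit index would work equally well.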


<div align=center>
<img src="assets/auto_label_pipeline.png" height="85%">
</div>



## Visualizations

<details>
