update Demop
BOBrown committed Jun 5, 2024
1 parent 74a382a commit 4f1509a
Showing 6 changed files with 147 additions and 6 deletions.
153 changes: 147 additions & 6 deletions README.md
@@ -4,16 +4,73 @@

# DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models

Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics:

- 1) Completeness: It is the first dataset to structure data from all modalities, including 13 layout attributes along with their LaTeX source code.
- 2) Logicality: It provides 6 logical relationships between different entities within each scientific document.
- 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.
- 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team.

We also conduct extensive experiments on DocGenome to demonstrate its advantages and to objectively evaluate the performance of current large models on our benchmark.

## Release

- [2024/6/10] 🔥 Our paper, "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models", has been released on arXiv: [Link]()
- [2024/6/6] 🔥 We have released the DocGenome benchmark, which includes 8 subsets:
- [docgenome-train-000.tar.gz]()
- [docgenome-train-001.tar.gz]()
- [docgenome-train-002.tar.gz]()
- [docgenome-train-003.tar.gz]()
- [docgenome-train-004.tar.gz]()
- [docgenome-train-005.tar.gz]()
- [docgenome-train-006.tar.gz]()
- [docgenome-train-007.tar.gz]()

<div align=center>
<img src="assets/motivation.png" height="95%">
</div>



## DocGenome Benchmark Introduction

| Datasets         | \# Discipline | \# Category of Units | \# Pages in Train-set | \# Pages in Test-set | \# Task | \# Used Metric | Publication | Entity Relations |
|------------------|---------------|----------------------|-----------------------|----------------------|---------|----------------|-------------|------------------|
| DocVQA           | -             | N/A                  | 11K                   | 1K                   | 1       | 2              | 1960-2000   |                  |
| DocLayNet        | -             | 11                   | 80K                   | 8K                   | 1       | 1              | -           |                  |
| DocBank          | -             | 13                   | 0.45M                 | **50K**              | 3       | 1              | 2014-2018   |                  |
| PubLayNet        | -             | 5                    | 0.34M                 | 12K                  | 1       | 1              | -           |                  |
| VRDU             | -             | 10                   | 7K                    | 3K                   | 3       | 1              | -           |                  |
| DUDE             | -             | N/A                  | 20K                   | 6K                   | 3       | 3              | 1860-2022   |                  |
| D^4LA            | -             | **27**               | 8K                    | 2K                   | 1       | 3              | -           |                  |
| Fox Benchmark    | -             | 5                    | N/A (No train-set)    | 0.2K                 | 3       | 5              | -           |                  |
| ArXivCap         | 32            | N/A                  | 6.4M*                 | N/A                  | 4       | 3              | -           |                  |
| DocGenome (ours) | **153**       | 13                   | **6.8M**              | 9K                   | **7**   | **7**          | 2007-2022   |                  |


&ensp;
------------------------

### 👇🏻DocGenome-train Download

We provide 8 subsets of DocGenome-train for downloading:

<details>
<summary> Data Download</summary>

- [docgenome-train-000.tar.gz]()
- [docgenome-train-001.tar.gz]()
- [docgenome-train-002.tar.gz]()
- [docgenome-train-003.tar.gz]()
- [docgenome-train-004.tar.gz]()
- [docgenome-train-005.tar.gz]()
- [docgenome-train-006.tar.gz]()
- [docgenome-train-007.tar.gz]()
</details>
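
Once the archives have been downloaded, they can be unpacked with Python's standard `tarfile` module. The snippet below is only a minimal sketch: the function name `extract_subsets` and the directory arguments are illustrative and not part of the DocGenome release, and it makes no assumption about the layout of files inside each archive.

```python
import tarfile
from pathlib import Path

# The eight subset archive names listed above (000 through 007).
SUBSETS = [f"docgenome-train-{i:03d}.tar.gz" for i in range(8)]


def extract_subsets(download_dir, output_dir):
    """Extract every subset archive found in download_dir into output_dir.

    Archives that are not present are skipped. Returns the names of the
    archives that were actually extracted.
    """
    download_dir, output_dir = Path(download_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in SUBSETS:
        archive = download_dir / name
        if not archive.is_file():
            continue  # this subset has not been downloaded yet
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(output_dir)
        extracted.append(name)
    return extracted
```

For example, `extract_subsets("downloads", "DocGenome")` would unpack whichever of the eight archives are present in a hypothetical `downloads/` folder.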


### Definition of relationships between component units
DocGenome defines 4 level relation types and 2 citation relation types between component units, as shown in the following table:

| **Name** | Description | Example |
@@ -24,18 +81,102 @@
| Non-title adjacent | The two text or equation blocks are adjacent. | (Paragraph 1, Paragraph 2) |
| Explicitly-referred | One block refers to another block via footnote, reference, etc. | (As shown in `\ref{Fig: 5}` ..., Figure 5) |
| Implicitly-referred | The caption block refers to the corresponding float environment. | (Table Caption 1, Table 1) |
</details>

### Attributes of component units
DocGenome has 13 attributes of component units, which can be categorized into two classes:
- **1) Fixed-form units**, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships that are readily discernible from the list obtained in Stage-two of the designed DocParser.
- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like `\ref` and `\label`.

| **Index** | **Category** | **Notes** |
|----------------|-------------------|------------------------------------------|
| 0 | Algorithm | |
| 1 | Caption | Titles of Images, Tables, and Algorithms |
| 2 | Equation | |
| 3 | Figure | |
| 4 | Footnote | |
| 5 | List | |
| 7 | Table | |
| 8 | Text | |
| 9 | Text-EQ | Text block with inline equations |
| 10 | Title | Section titles |
| 12 | PaperTitle | |
| 13 | Code | |
| 14 | Abstract | |
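
When working with the annotations in code, the table above can be mirrored as a small lookup. This is a sketch only: the names `CATEGORY_BY_INDEX`, `INDEX_BY_CATEGORY`, and `category_name` are illustrative, and how the released annotation files actually store the category index is an assumption that should be checked against the data.

```python
# Index -> category name, copied from the table above.
# Indices 6 and 11 do not appear in the table and are left unmapped.
CATEGORY_BY_INDEX = {
    0: "Algorithm",
    1: "Caption",
    2: "Equation",
    3: "Figure",
    4: "Footnote",
    5: "List",
    7: "Table",
    8: "Text",
    9: "Text-EQ",
    10: "Title",
    12: "PaperTitle",
    13: "Code",
    14: "Abstract",
}

# Reverse lookup: category name -> index.
INDEX_BY_CATEGORY = {name: index for index, name in CATEGORY_BY_INDEX.items()}


def category_name(index):
    """Return the category name for an index, or None if the index is unused."""
    return CATEGORY_BY_INDEX.get(index)
```

Returning `None` for the unmapped indices, rather than raising, lets a caller decide how to handle any value that falls outside the 13 documented categories.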



## Types of disciplines

Page distribution of DocGenome. 20% of documents are five pages or fewer, 50% are ten pages or fewer, and 80% are nineteen pages or fewer.
<details>
<summary> Page Distribution</summary>
<div align=center>
<img src="assets/page_distribution.png" height="500">
</div>

</details>

&ensp;

Distribution of secondary disciplines in our DocGenome. The count on the x-axis represents the number of documents, and documents from the same primary discipline are marked with the same color.

<details>
<summary> Discipline Distribution</summary>
<div align=center>
<img src="assets/second_discipline.png" height="1000">
</div>

</details>

&ensp;
------------------------
## DocParser: A Cutting-edge Auto-labeling Pipeline

<div align=center>
<img src="assets/auto_label_pipeline.png" height="85%">
</div>



## Visualizations

<details>
<summary> Visual Example One of annotations in DocGenome</summary>

<div align=center>
<img src="assets/docgenome_label_examples_1.png" height="900">
</div>

</details>


<details>
<summary> Visual Example Two of annotations in DocGenome</summary>

<div align=center>
<img src="assets/docgenome_label_examples_2.png" height="900">
</div>

</details>

<details>
<summary> Visual examples of document-oriented tasks in DocGenome</summary>

<div align=center>
<img src="assets/docgenome_task_examples.png" height="980">
</div>

</details>

## Citation
If you find our work useful in your research, please consider citing DocGenome:
```bibtex
@article{,
}
```
Binary file added assets/docgenome_label_examples_1.png
Binary file added assets/docgenome_label_examples_2.png
Binary file added assets/docgenome_task_examples.png
Binary file added assets/page_distribution.png
Binary file added assets/second_discipline.png