diff --git a/README.md b/README.md index a58c577..78cfee2 100644 --- a/README.md +++ b/README.md @@ -6,13 +6,18 @@ Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics: -- 1) Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their \LaTeX\ source codes. +- 1) Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their LaTeX source codes. - 2) Logicality: It provides 6 logical relationships between different entities within each scientific document. - 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA. - 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team. Besides, based on DocGenome, we conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of current large models on our benchmark. +
+ +
+ + ## Release - [2024/6/10] 🔥 Our paper entitled "DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models" has been released in arXiv [Link]() @@ -26,26 +31,24 @@ Besides, based on DocGenome, we conduct extensive experiments to demonstrate the - [docgenome-train-006.tar.gz]() - [docgenome-train-007.tar.gz]() -
- -
- +  +------------------------ ## DocGenome Benchmark Introduction | Datasets | \# Discipline | \# Category of Units | \# Pages in Train-set | \# Pages in Test-set | \# Task | \# Used Metric | Publication | Entity Relations | -|------------------------------------------|--------------------------------|-----------------|--------------------|--------------|------------|--------------------|-------------|-----------------| +|------------|--------------|--------------|-------------------------------|------------------------|------------|----------------|-------------|----------------| | | -| DocVQA | - | N/A | 11K | 1K | 1 | 2 | 1960-2000 | ❎ | -| DocLayNet | - | 11 | 80K | 8K | 1 | 1 | - | ❎ | -| DocBank | - | 13 | 0.45M | **50K** | 3 | 1 | 2014-2018 | ❎ | -| PubLayNet | - | 5 | 0.34M | 12K | 1 | 1 | - | ❎ | -| VRDU | - | 10 | 7K | 3K | 3 | 1 | - | ❎ | -| DUDE | - | N/A | 20K | 6K | 3 | 3 | 1860-2022 | ❎ | -| D^4LA | - | **27** | 8K | 2K | 1 | 3 | - | ❎ | -| Fox Benchmark | - | 5 | N/A (No train-set) | 0.2K | 3 | 5 | - | ❎ | -| ArXivCap | 32 | N/A | 6.4M* | N/A | 4 | 3 | - | ❎ | +| [DocVQA](https://arxiv.org/abs/2007.00398) | - | N/A | 11K | 1K | 1 | 2 | 1960-2000 | ❎ | +| [DocLayNet](https://arxiv.org/abs/2206.01062) | - | 11 | 80K | 8K | 1 | 1 | - | ❎ | +| [DocBank](https://arxiv.org/abs/2006.01038) | - | 13 | 0.45M | **50K** | 3 | 1 | 2014-2018 | ❎ | +| [PubLayNet](https://arxiv.org/abs/1908.07836) | - | 5 | 0.34M | 12K | 1 | 1 | - | ❎ | +| [VRDU](https://arxiv.org/abs/1908.07836) | - | 10 | 7K | 3K | 3 | 1 | - | ❎ | +| [DUDE](https://arxiv.org/abs/2305.08455) | - | N/A | 20K | 6K | 3 | 3 | 1860-2022 | ❎ | +| [D^4LA](https://arxiv.org/abs/2308.14978) | - | **27** | 8K | 2K | 1 | 3 | - | ❎ | +| [Fox Benchmark](https://arxiv.org/abs/2405.14295) | - | 5 | N/A (No train-set) | 0.2K | 3 | 5 | - | ❎ | +| [ArXivCap](https://arxiv.org/abs/2403.00231) | 32 | N/A | 6.4M* | N/A | 4 | 3 | - | ❎ | | DocGenome (ours) | **153** | 13 | **6.8M** | 9K | **7** | **7** | 2007-2022 | 
✅ | @@ -86,7 +89,7 @@ DocGenome contains 4 level relation types and 2 cite relation types, as shown in ### Attribute of component units DocGenome has 13 attributes of component units, which can be categorized into two classes - **1) Fixed-form units**, including Text, Title, Abstract, etc., which are characterized by sequential reading and hierarchical relationships readily discernible from the list obtained in Stage-two of the designed DocParser. -- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like \texttt{\textbackslash ref} and \texttt{\textbackslash label}. +- **2) Floating-form units**, including Table, Figure, etc., which establish directional references to fixed-form units through commands like \ref and \label. | **Index** | **Category** | **Notes** | |----------------|-------------------|------------------------------------------| @@ -104,7 +107,7 @@ DocGenome has 13 attributes of component units, which can be categorized into tw | 13 | Code | | | 14 | Abstract | | - +**Note that** we do not use the “others” category and the “reference” category, and their indices are 6 and 11, respectively. ## Types of disciplines @@ -131,18 +134,23 @@ Distribution of secondary disciplines in our DocGenome. The count on the x-axis - -   ------------------------ ## DocParser: A Cutting-edge Auto-labeling Pipeline +**Schematic of the designed DocParser pipeline for automated document annotation** The process is divided into four distinct stages: +- 1) Data Preprocessing, +- 2) Unit Segmentation, +- 3) Attribute Assignment and Relation Retrieval, +- 4) Color Rendering. + +DocParser can convert LaTeX source code of a complete document into annotations for component units with source-code, attributes, relationships and bounding box, as well as a rendered PNG of the entire document. +
- ## Visualizations