Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
- | - | 156 | Image from PDF | | F1-score
Dataset for the ICDAR 2013 Table Competition. It includes a total of 150 tables in PDF format: 75 tables in 27 excerpts from the EU and 75 tables in 40 excerpts from the US Government. An automatic text conversion using pdftk is also included for convenience.
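
The F1-score listed as the evaluation metric is the harmonic mean of precision and recall. The following is a minimal sketch, assuming predictions and ground truth can be compared as sets of hashable items (for example, cell or adjacency-relation identifiers); the function name `f1_score` is hypothetical, and the competition's actual matching protocol is not described here.

```python
def f1_score(predicted, ground_truth):
    """Return the F1-score between two collections of hashable items."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
```

How items are matched (exact equality here) is the main design choice; benchmarks that match by bounding-box overlap would replace the set intersection with their own matching rule.
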
Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
- | - | 2000 | | | F1-score
In total, 2000 pages in PDF format were collected, and the corresponding ground truths were extracted using our semi-automatic ground-truthing tool "Marmot". The dataset is composed of Chinese and English pages in a proportion of about 1:1. The Chinese pages were selected from over 120 e-Books with diverse subject areas provided by the Founder Apabi library, and no more than 15 pages were selected from each book. The English pages were crawled from the Citeseer website. The pages show a great variety in language type, page layout, and table styles. For the English pages, over 1500 conference and journal papers were crawled, covering various fields and spanning from 1970 to the latest 2011 publications. The e-Book pages are mostly in one-column layout, while the English pages mix one-column and two-column layouts.
Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
568k | - | - | | | TEDS
PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables. The table images are extracted from the scientific publications included in the PubMed Central Open Access Subset (commercial use collection). Table regions are identified by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset.
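
The TEDS metric listed above (Tree-Edit-Distance-based Similarity) treats the predicted and ground-truth HTML tables as trees. As a reminder of the definition, where $\mathrm{EditDist}$ is the tree edit distance and $|T|$ is the number of nodes in tree $T$:

$$
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
$$

A score of 1 means the two trees are identical; lower scores indicate more structural or content edits.
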
Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
145k | - | - | | | F1-score
Nowadays, there are a great number of electronic documents on the web, such as Microsoft Word (.docx) and LaTeX (.tex) files. These online documents contain mark-up tags for tables in their source code by nature. Intuitively, we can manipulate this source code by adding bounding boxes using the mark-up language within each document. For Word documents, the internal Office XML code can be modified where the borderline of each table is identified. For LaTeX documents, the tex code can also be modified where the bounding boxes of tables are recognized. In this way, high-quality labeled data is created for a variety of domains such as business documents, official filings, and research papers, which is tremendously beneficial for large-scale table analysis tasks. The TableBank dataset consists of a total of 417,234 high-quality labeled tables as well as their original documents in a variety of domains.
Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
12k | - | 3k | | | F1-score
SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files. Of the 15,000 examples, 12,000 are used for training and 3,000 for testing. We also provide a test subset that contains only complicated tables, called SciTSR-COMP. The indices of SciTSR-COMP are stored in SciTSR-COMP.list.
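
Selecting the SciTSR-COMP subset from the test set could look like the sketch below. It assumes that `SciTSR-COMP.list` holds one sample identifier per line and that test samples are keyed by the same identifiers; neither detail is stated above, and the function names `load_comp_ids` and `select_comp` are hypothetical.

```python
from pathlib import Path


def load_comp_ids(list_path):
    """Read sample IDs from a list file (assumed layout: one ID per line)."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def select_comp(sample_ids, comp_ids):
    """Keep only the samples whose ID appears in the COMP list, in input order."""
    return [sample_id for sample_id in sample_ids if sample_id in comp_ids]
```

A typical call would be `select_comp(all_test_ids, load_comp_ids("SciTSR-COMP.list"))`, where `all_test_ids` is whatever list of test identifiers the loading code already has.
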
Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
14581 | - | - | Photographed image | | F1-score
WTW-Dataset is the first wild table dataset for table detection and table structure recognition tasks. It is constructed from photographs, scans, and web pages, and covers 7 challenging cases: (1) inclined tables, (2) curved tables, (3) occluded or blurred tables, (4) extreme aspect ratio tables, (5) overlaid tables, (6) multi-color tables, and (7) irregular tables in table structure recognition.
Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
3.6K | - | - | Image from PDF | | F1-score
Tables are a compact and efficient form for summarizing and presenting correlative information in handwritten and printed archival documents, scientific journals, reports, financial statements, and so on. Table recognition is fundamental for extracting information from structured documents. The ICDAR 2019 cTDaR competition evaluates two aspects of table analysis: table detection and table recognition. The participating methods are evaluated on a modern dataset and on archival documents containing printed and handwritten tables.
Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
600K | - | - | | | TEDS
SynthTabNet is a synthetically generated dataset that contains annotated images of data in tabular layouts.
SynthTabNet is organized into 4 parts of 150k tables (600k in total). Each part contains tables with different appearances with regard to their size, structure, style, and content. All parts are divided into Train, Test, and Val splits (80%, 10%, 10%). The tables are delivered as PNG images and the annotations are in JSONL format.
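
As a rough illustration of these numbers, the sketch below computes the nominal split sizes per part from the stated percentages and shows a generic way to stream a JSONL annotation file. The names `split_sizes` and `iter_jsonl` are hypothetical, the annotation fields are not assumed, and the actual per-split counts in the released files may differ from the nominal values.

```python
import json

PARTS = 4
TABLES_PER_PART = 150_000
SPLIT_PERCENT = {"train": 80, "test": 10, "val": 10}


def split_sizes(n_tables=TABLES_PER_PART, percents=SPLIT_PERCENT):
    """Nominal number of tables per split for one part."""
    return {name: n_tables * pct // 100 for name, pct in percents.items()}


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Under the stated 80/10/10 split, each part nominally holds 120,000 training tables and 15,000 tables each for test and validation.
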
Samples (Train) | Samples (Validate) | Samples (Test) | Type | Access Link | Evaluation Metric
---|---|---|---|---|---
1685 | - | - | | | F1-score
This dataset contains complex tables from the annual reports of S&P 500 companies with detailed table structure annotations to help train and test structure recognition. To generate the cell structure labels, we use token matching between the PDF and HTML version of each article from public records and filings. Financial tables often have diverse styles when compared to ones in scientific and government documents, with fewer graphical lines and larger gaps within each table and more colour variations.
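
The token-matching idea can be sketched as follows. This is only an illustrative approximation, not the dataset authors' procedure: it assumes the HTML cells are available as plain strings in reading order, that PDF words come with `(x0, y0, x1, y1)` boxes in the same order, and that tokens match exactly. The names `union_box` and `match_cells` are hypothetical.

```python
def union_box(boxes):
    """Smallest box (x0, y0, x1, y1) that encloses all given boxes."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def match_cells(cell_texts, pdf_words):
    """Assign a bounding box to each HTML cell by matching its tokens to PDF words.

    cell_texts: list of cell strings in reading order.
    pdf_words:  list of (text, (x0, y0, x1, y1)) tuples in reading order.
    Returns one box per cell, or None where no exact token run is found.
    """
    boxes = []
    cursor = 0  # Search resumes after the last matched PDF word.
    for text in cell_texts:
        tokens = text.split()
        found = None
        if tokens:
            n = len(tokens)
            for start in range(cursor, len(pdf_words) - n + 1):
                window = pdf_words[start:start + n]
                if [word for word, _ in window] == tokens:
                    found = union_box([box for _, box in window])
                    cursor = start + n
                    break
        boxes.append(found)
    return boxes
```

Unmatched cells are returned as `None` so that a later step can decide whether to drop the table or fall back to fuzzier matching.
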