Datasets for Table Structure Recognition

🗒️List of Index


ICDAR2013

License Adapt Share


| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| - | - | 156 | Image from PDF | Link | F1-score |

Link2

Datasets for the ICDAR 2013 Table Competition. Includes a total of 150 tables in PDF format: 75 tables in 27 excerpts from the EU and 75 tables in 40 excerpts from the US Government. An automatic text conversion using pdftk has also been included for convenience.
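
As a rough illustration of the F1-score metric listed for this dataset, the sketch below computes precision, recall, and F1 over two sets of items. It assumes that the compared items are cell adjacency relations; the tuple layout `(cell text, neighbour text, direction)` and the sample values are hypothetical and are not taken from the competition tooling.

```python
def f1_score(predicted, ground_truth):
    """Return (precision, recall, F1) over two collections of hashable items."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical adjacency relations for a 2x2 table (illustration only).
gt = {
    ("Year", "Sales", "right"),
    ("Year", "2012", "down"),
    ("Sales", "10", "down"),
    ("2012", "10", "right"),
}
pred = {
    ("Year", "Sales", "right"),
    ("Year", "2012", "down"),
    ("Sales", "2012", "down"),  # wrong relation
}

precision, recall, f1 = f1_score(pred, gt)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Using sets keeps the comparison order-independent; an actual evaluation script may normalize cell text before matching.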


Marmot

License Adapt

| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| - | - | 2000 | PDF | Link | F1-score |

In total, 2000 pages in PDF format were collected, and the corresponding ground truths were extracted using the semi-automatic ground-truthing tool "Marmot". The dataset is composed of Chinese and English pages in a ratio of about 1:1. The Chinese pages were selected from over 120 e-Books with diverse subject areas provided by the Founder Apabi library, with no more than 15 pages selected from each book. The English pages were crawled from the Citeseer website; among them are over 1500 conference and journal papers covering various fields, published from 1970 through 2011. The pages show a great variety in language type, page layout, and table styles. The e-Book pages are mostly in one-column layout, while the English pages mix one-column and two-column layouts.


PubTabNet

License Share Research

| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| 568k | - | - | PDF | Link | TEDS |

Link2

PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables. The table images are extracted from the scientific publications included in the PubMed Central Open Access Subset (commercial use collection). Table regions are identified by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset.
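
For reference, TEDS (Tree-Edit-Distance-based Similarity) compares the predicted and ground-truth tables as HTML trees. A sketch of the definition, where $T_a$ and $T_b$ are the two trees, $|T|$ is the number of nodes in a tree, and $\mathrm{EditDist}$ is the tree edit distance between them:

$$
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}
$$

A score of 1 means the two trees are identical; lower scores indicate more structural or content edits.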


TableBank

License Commercial Research Share

| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| 145k | - | - | PDF | Link | F1-score |

Link2

A great number of electronic documents on the web, such as Microsoft Word (.docx) and LaTeX (.tex) files, contain mark-up tags for tables in their source code by nature. This source code can be manipulated to add a bounding box around each table using the mark-up language of the document. For Word documents, the internal Office XML code can be modified so that the borderline of each table is identified. For LaTeX documents, the tex code can likewise be modified so that the bounding boxes of tables are recognized. In this way, high-quality labeled data is created for a variety of domains such as business documents, official filings, and research papers, which is highly beneficial for large-scale table analysis tasks. The TableBank dataset consists of 417,234 high-quality labeled tables in total, along with their original documents, in a variety of domains.


SciTSR

License Commercial Research Share

| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| 12k | - | 3k | PDF | Link | F1-score |

Link2

SciTSR is a large-scale table structure recognition dataset containing 15,000 tables in PDF format with their corresponding structure labels obtained from LaTeX source files. The examples are split into 12,000 for training and 3,000 for testing. A test subset that contains only complicated tables, called SciTSR-COMP, is also provided; its indices are stored in SciTSR-COMP.list.
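
Selecting the SciTSR-COMP subset from the test set could look roughly like the sketch below. It assumes that `SciTSR-COMP.list` holds one table index per line, which should be checked against the actual file; the helper names `load_comp_ids` and `select_comp` are hypothetical.

```python
from pathlib import Path


def load_comp_ids(list_path):
    """Read table indices from a list file, assuming one index per line."""
    text = Path(list_path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def select_comp(test_ids, comp_ids):
    """Keep only the test indices that belong to the complicated subset."""
    return [table_id for table_id in test_ids if table_id in comp_ids]
```

For example, `select_comp(all_test_ids, load_comp_ids("SciTSR-COMP.list"))` would return the test indices to evaluate on, in their original order.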


WTW

License Commercial Adapt Share

| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| 14581 | - | - | Photographed image | Link | F1-score |

WTW-Dataset is the first wild table dataset for table detection and table structure recognition tasks. It is constructed from photographs, scans, and web pages, and covers 7 challenging cases: (1) inclined tables, (2) curved tables, (3) occluded or blurred tables, (4) extreme aspect ratio tables, (5) overlaid tables, (6) multi-color tables, and (7) irregular tables in table structure recognition.


ICDAR-2019

License Adapt Share


| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| 3.6K | - | - | Image from PDF | Link | F1-score |

Link2

Tables are a compact and efficient form for summarizing and presenting correlative information in handwritten and printed archival documents, scientific journals, reports, financial statements, and so on. Table recognition is fundamental for extracting information from structured documents. The ICDAR 2019 cTDaR competition evaluates two aspects of table analysis: table detection and table recognition. Participating methods are evaluated on a modern dataset and on archival documents with printed and handwritten tables.


SynthTabNet

License Research

| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| 600K | - | - | PDF | Link | TEDS |

SynthTabNet is a synthetically generated dataset that contains annotated images of data in tabular layouts.

SynthTabNet is organized into 4 parts of 150k tables (600k in total). Each part contains tables with different appearances in regard to their size, structure, style and content. All parts are divided into Train, Test and Val splits (80%, 10%, 10%). The tables are delivered as png images and the annotations are in jsonl format.
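
The split sizes implied by the description above, together with a generic way to iterate over a jsonl annotation file, can be sketched as follows. The per-split counts are derived only from the stated percentages, and `iter_jsonl` is a hypothetical helper that makes no assumption about the fields inside each record.

```python
import json

PARTS = 4
TABLES_PER_PART = 150_000
SPLIT_PERCENT = {"train": 80, "test": 10, "val": 10}  # from the description

# Tables per split within one part, then across all four parts.
per_part = {name: TABLES_PER_PART * pct // 100 for name, pct in SPLIT_PERCENT.items()}
total = {name: count * PARTS for name, count in per_part.items()}


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


print(per_part)  # {'train': 120000, 'test': 15000, 'val': 15000}
print(total)     # {'train': 480000, 'test': 60000, 'val': 60000}
```

Reading the file lazily with a generator avoids loading all annotations into memory at once.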


FinTab

License

| Train samples | Validate samples | Test samples | Type | Access Link | Evaluation Metric |
|---|---|---|---|---|---|
| 1685 | - | - | PDF | Link | F1-score |

This dataset contains complex tables from the annual reports of S&P 500 companies with detailed table structure annotations to help train and test structure recognition. To generate the cell structure labels, we use token matching between the PDF and HTML version of each article from public records and filings. Financial tables often have diverse styles when compared to ones in scientific and government documents, with fewer graphical lines and larger gaps within each table and more colour variations.