Datasets for Table Structure Recognition

🗒️ Index

ICDAR2013
Marmot
PubTabNet
TableBank
SciTSR
WTW
ICDAR-2019
SynthTabNet
FinTab

ICDAR2013

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
-	-	156	Image from PDF		F1-score

Dataset for the ICDAR 2013 Table Competition. It includes a total of 150 tables in PDF format: 75 tables in 27 excerpts from the EU and 75 tables in 40 excerpts from the US Government. An automatic text conversion produced with pdftk is also included for convenience.
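
The F1-score listed as the evaluation metric is the harmonic mean of precision and recall. The following is a minimal sketch, assuming that predictions and ground truth have each been reduced to a set of cell adjacency relations. The tuple layout and the sample values are illustrative assumptions, not the competition's file format.

```python
def f1_score(predicted: set, ground_truth: set) -> float:
    """F1 over two sets of hashable items (here: cell adjacency relations)."""
    if not predicted or not ground_truth:
        return 0.0
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical relations: (cell text, neighbouring cell text, direction).
truth = {
    ("Year", "Sales", "horizontal"),
    ("Year", "2012", "vertical"),
    ("Sales", "10", "vertical"),
}
pred = {
    ("Year", "Sales", "horizontal"),
    ("Year", "2012", "vertical"),
    ("Sales", "2012", "vertical"),  # wrong neighbour, counts as a false positive
}
score = f1_score(pred, truth)
```

Here two of the three predicted relations are correct, so precision and recall are both 2/3 and so is the F1-score.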

Marmot

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
-	-	2000	PDF		F1-score

In total, 2000 pages in PDF format were collected, and the corresponding ground truths were extracted using the semi-automatic ground-truthing tool "Marmot". The dataset is composed of Chinese and English pages in a ratio of about 1:1. The Chinese pages were selected from over 120 e-Books covering diverse subject areas, provided by the Founder Apabi library, with no more than 15 pages taken from each book. The English pages were crawled from the Citeseer website: over 1500 conference and journal papers covering various fields, published from 1970 to 2011. The pages show great variety in language type, page layout, and table styles. The e-Book pages are mostly in one-column layout, while the English pages mix one-column and two-column layouts.

PubTabNet

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
568k	-	-	PDF		TEDS

PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables. The table images are extracted from the scientific publications included in the PubMed Central Open Access Subset (commercial use collection). Table regions are identified by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset.
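
TEDS (Tree-Edit-Distance-based Similarity) compares the predicted and ground-truth HTML as trees: TEDS = 1 - EditDist(T_pred, T_true) / max(|T_pred|, |T_true|). The sketch below shows only the normalization step. It assumes the tree edit distance has already been computed by some other tool, and it counts element nodes by counting opening tags, which is a simplification. The helper names are hypothetical and are not taken from the PubTabNet code.

```python
from html.parser import HTMLParser


class _NodeCounter(HTMLParser):
    """Counts element nodes (opening tags) in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def count_nodes(html: str) -> int:
    """Number of opening tags found in the fragment."""
    parser = _NodeCounter()
    parser.feed(html)
    parser.close()
    return parser.count


def teds(edit_distance: float, pred_html: str, true_html: str) -> float:
    """Normalize a precomputed tree edit distance into a similarity in [0, 1]."""
    largest = max(count_nodes(pred_html), count_nodes(true_html))
    if largest == 0:
        return 1.0  # two empty trees are treated as identical
    return 1.0 - edit_distance / largest


true_html = "<table><tr><td>a</td><td>b</td></tr></table>"  # 4 element nodes
pred_html = "<table><tr><td>a</td></tr></table>"            # 3 element nodes
# Assumed edit distance of 1: one <td> node is missing from the prediction.
similarity = teds(1, pred_html, true_html)
```

With these inputs the larger tree has 4 nodes, so the similarity is 1 - 1/4 = 0.75.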

TableBank

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
145k	-	-	PDF		F1-score

There are a great number of electronic documents on the web, such as Microsoft Word (.docx) and LaTeX (.tex) files, which by nature contain mark-up tags for tables in their source code. This source code can be manipulated to add bounding boxes using the mark-up language within each document. For Word documents, the internal Office XML code can be modified so that the borderline of each table is identified. For LaTeX documents, the TeX code can likewise be modified so that the bounding boxes of tables are recognized. In this way, high-quality labeled data is created for a variety of domains such as business documents, official filings, and research papers, which is highly beneficial for large-scale table analysis tasks. The TableBank dataset consists of 417,234 high-quality labeled tables in total, together with their original documents, in a variety of domains.

SciTSR

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
12k	-	3k	PDF		F1-score

SciTSR is a large-scale table structure recognition dataset containing 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files. The 15,000 examples are split into 12,000 for training and 3,000 for testing. A test subset containing only complicated tables, called SciTSR-COMP, is also provided. The indices of SciTSR-COMP are stored in SciTSR-COMP.list.
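
Selecting the SciTSR-COMP subset amounts to filtering the test set by the IDs in the list file. The sketch below assumes that SciTSR-COMP.list holds one table ID per line. That layout, the function names, and the sample IDs are all assumptions made for illustration.

```python
def parse_id_list(lines) -> set:
    """Collect one table ID per line, ignoring blank lines (assumed layout)."""
    return {line.strip() for line in lines if line.strip()}


def select_subset(test_ids: list, comp_ids: set) -> list:
    """Keep the test IDs that appear in the complicated-table list, in order."""
    return [table_id for table_id in test_ids if table_id in comp_ids]


# In practice the lines would come from the file, for example:
#     with open("SciTSR-COMP.list", encoding="utf-8") as handle:
#         comp_ids = parse_id_list(handle)
# Hypothetical IDs are used here so the example runs without the dataset.
comp_ids = parse_id_list(["table_0002\n", "\n", "table_0005\n"])
test_ids = ["table_0001", "table_0002", "table_0003", "table_0005"]
comp_subset = select_subset(test_ids, comp_ids)
```

Any open text file can be passed to `parse_id_list` directly, since it only iterates over lines.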

WTW

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
14581	-	-	Photographed image		F1-score

WTW-Dataset is presented as the first wild table dataset for table detection and table structure recognition tasks. It is constructed from photographs, scans, and web pages, and covers 7 challenging cases in table structure recognition: (1) inclined tables, (2) curved tables, (3) occluded or blurred tables, (4) extreme aspect ratio tables, (5) overlaid tables, (6) multi-color tables, and (7) irregular tables.

ICDAR-2019

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
3.6K	-	-	Image from PDF		F1-score

Tables are a compact and efficient form for summarizing and presenting correlated information in handwritten and printed archival documents, scientific journals, reports, financial statements, and so on. Table recognition is fundamental for extracting information from structured documents. The ICDAR 2019 cTDaR (Competition on Table Detection and Recognition) evaluates two aspects of table analysis: table detection and table recognition. Participating methods are evaluated on a modern dataset and on archival documents containing printed and handwritten tables.

SynthTabNet

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
600K	-	-	PDF		TEDS

SynthTabNet is a synthetically generated dataset that contains annotated images of data in tabular layouts.

SynthTabNet is organized into 4 parts of 150k tables each (600k in total). Each part contains tables with different appearances with regard to their size, structure, style, and content. All parts are divided into Train, Test, and Val splits (80%, 10%, 10%). The tables are delivered as PNG images and the annotations are in JSONL format.
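
The split percentages above translate into concrete table counts, and JSONL annotations can be read one record per line. The following is a minimal sketch under those stated figures. The field names in the sample record (`filename`, `split`) are hypothetical and may not match the real annotation schema.

```python
import json

PARTS = 4
TABLES_PER_PART = 150_000
SPLIT_PERCENT = {"train": 80, "test": 10, "val": 10}


def split_sizes(n_tables: int) -> dict:
    """Number of tables per split for one part, using integer percentages."""
    return {name: n_tables * pct // 100 for name, pct in SPLIT_PERCENT.items()}


def read_jsonl(lines):
    """Yield one parsed annotation per non-empty line of a JSONL stream."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


per_part = split_sizes(TABLES_PER_PART)
total = {name: size * PARTS for name, size in per_part.items()}

# Hypothetical record layout, used only to show line-by-line parsing.
sample_lines = ['{"filename": "a.png", "split": "train"}\n', "\n"]
records = list(read_jsonl(sample_lines))
```

Each part therefore holds 120,000 training, 15,000 test, and 15,000 validation tables, or 480,000, 60,000, and 60,000 across all four parts. Reading lazily with a generator avoids loading a large annotation file into memory at once.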

FinTab

Train samples	Validation samples	Test samples	Type	Access Link	Evaluation Metric
1685	-	-	PDF		F1-score

This dataset contains complex tables from the annual reports of S&P 500 companies with detailed table structure annotations to help train and test structure recognition. To generate the cell structure labels, we use token matching between the PDF and HTML version of each article from public records and filings. Financial tables often have diverse styles when compared to ones in scientific and government documents, with fewer graphical lines and larger gaps within each table and more colour variations.
