Curated list of awesome datasets for various table understanding tasks

esborisova/Awesome-Table-Understanding-Datasets

Awesome Table Understanding Datasets

A curated list of datasets that can be directly used or adapted for various table understanding tasks.

Note that some of the datasets provide only metadata and annotations, without the source files. Additionally, several datasets have download links that are no longer active. Since the authors may resolve this issue in the future, these datasets are still included in the list.

The repository will be continuously updated ✏️. If you find this resource useful for your research, just ⭐️ it and stay tuned!

Dataset Source Task(s) Size Modality
PubTables-1M
  • Scholarly papers from PubMed
  • Table detection
  • Table structure recognition
  • Functional analysis
947.64K tables Image
SciGen
  • Scholarly papers from arXiv
  • Table-to-text generation
1.3K table-text description pairs Text
ComTQA
  • Scholarly papers from PubMed
  • Financial reports of S&P 500 companies
  • Question answering
1.5K tables and 9K QA pairs Image
DocGenome
  • Scholarly papers from arXiv
  • Table-to-LaTeX generation
3K table-LaTeX pairs Image
numericNLG
  • Scholarly papers from ACL Anthology
  • Text-to-table generation
1.3K text-table pairs Text
SEM-TAB-FACTS
  • Scholarly papers from Elsevier
  • Statement fact verification
  • Cell evidence selection
3K tables Text
TAT-QA
  • Annual reports
  • Question answering
2K hybrid contexts (tables and text) and 16.5K QA pairs Text
WikiBio
  • Wikipedia
  • Biography generation
728.32K biographies Text
ToTTo
  • Wikipedia
  • Table-to-text
120K table-text pairs Text
TabFact
  • Wikipedia
  • Fact-checking
16K tables and 118K statements Text
TableBench
  • Wikipedia
  • Earnings reports of S&P 500 companies
  • Question answering
3.6K tables and 886 QA pairs Text
TableInstruct
  • Wikipedia
  • Earnings reports of S&P 500 companies
  • Question answering
3.6K tables and 20K QA pairs Text
FinQA
  • Earnings reports of S&P 500 companies
  • Question answering
8.2K QA pairs Text
LogicNLG
  • Wikipedia
  • Logical natural language generation
7.3K tables Text
TabIS
  • Wikipedia
  • Statistical reports from Statistics Canada and National Science Foundation
  • Information seeking from tables
61K tables Text
DataBench
  • Forbes
  • Kaggle
  • Graphext
  • City of New York
  • US Gov
  • Inside Airbnb
  • Data World
  • AEMET
  • INE
  • TrustPilot
  • World Happiness
  • Brown University
  • US Census
  • X
  • SBA
  • Spotify
  • BigQuery
  • CIS
  • Brandwatch
  • DataMarket
  • UCI ML
  • Kern et al., PNAS’20
  • Question answering
56K tables Text
GitTables
  • GitHub
  • Semantic column type detection
  • Schema completion
1M tables Text
AxCell: Segmented Tables
  • Scholarly papers from arXiv
  • Table segmentation
  • Table type classification
1.9K tables Text
WDC Web Table Corpus 2012
  • Common Crawl
  • Data search
  • Table extension/completion
  • Knowledge base construction
  • Table matching
  • NLP tasks
147M tables Text
WDC Web Table Corpus 2015
  • Common Crawl
  • Data search
  • Table extension/completion
  • Knowledge base construction
  • Table matching
  • NLP tasks
233M tables Text
T2D
  • Common Crawl
  • Matching web tables to DBpedia
1.7K tables Text
T2Dv2
  • Common Crawl
  • Matching web tables to DBpedia
779 tables Text
WikiTables
  • Wikipedia
  • Entity linking
1.6M tables Text
WikiTableQuestions
  • Wikipedia
  • Question answering
2.1K tables and 22K QA pairs Text
WikiSQL
  • Wikipedia
  • Text-to-SQL/Question answering
24.2K tables Text
Spider 1.0
  • College database courses
  • DatabaseAnswers
  • Wikipedia
  • Text-to-SQL/Question answering
N/A Text
OTT-QA
  • Wikipedia
  • Question answering
400K tables Text
HybridQA
  • Wikipedia
  • Question answering
13K tables and 70K QA pairs Text
FEVEROUS
  • Wikipedia
  • Fact extraction and verification
87K claims Text
TableBank
  • Word documents from the internet
  • LaTeX documents from arXiv
  • Table detection and recognition
417K tables Image
PubTabNet
  • Scholarly papers from PubMed
  • Table detection and recognition
568K tables Image
PubLayNet
  • Scholarly papers from PubMed
  • Document layout recognition
94K pages with tables and 113K tables Image
FinTabNet
  • Earnings reports of S&P 500 companies
  • Table structure recognition
89K pages and 112.8K tables Text
WTW
  • Images from natural scenes
  • Archival document images
  • Printed document images
  • Table structure recognition
14.5K tables Image
SciTSR
  • Scholarly papers from arXiv
  • Table structure recognition
15K tables Image
TNCR
  • Web
  • Table detection
  • Table classification
6.6K images and 9.4K tables Image
DeepFigures
  • Scholarly papers from arXiv and PubMed
  • Table extraction
1.4M tables Text
WikiTableSet
  • Wikipedia
  • Table recognition
5M tables Image
Tab2Know
  • Scholarly papers from AAAI, ACL, Artif. Intell., arXiv, CIKM, COLING, CoNLL, EACL, ECAI, EMNLP, HLT-NAACL, IJCAI, ISWC, NeurIPS, NIPS, PVLDB, VLDB, and WWW
  • Table-to-knowledge base
73K tables Image and text
Logic2Text
  • Wikipedia
  • Natural language generation
5.6K tables and 10.8K (logical form, description) pairs Text
SQA
  • Wikipedia
  • Question answering
17.5K QA pairs Text
FeTaQA
  • Wikipedia
  • Question answering
10.3K (table, question, answer, table cells) pairs Text
ICDAR 2019 cTDaR
  • Modern and archival documents
  • Table detection
N/A Image
SportsTables
  • Web
  • Semantic type detection
1.1K tables Text
SemTab2019
  • T2Dv2
  • Wikipedia
  • Synthetically generated tables
  • Tabular data to knowledge graph matching
14.9K tables Text
Tough Tables (2T)
  • Wikipedia
  • Web
  • Synthetically generated tables
  • Tabular data to knowledge graph matching
180 tables Text
SemTab2020
  • Tough Tables
  • Synthetically generated tables
  • Tabular data to knowledge graph matching
131.4K tables Text
HardTables
  • N/A
  • Tabular data to knowledge graph matching
N/A Text
BiodivTab
  • BExIS
  • BEFChina
  • data.world
  • Tabular data to knowledge graph matching
N/A Text
BioTable
  • N/A
  • Semantic table annotation
N/A Text
SemTab2021
  • HardTables
  • Tough Tables
  • BioTable
  • BiodivTab
  • GitTables
  • Tabular data to knowledge graph matching
9.1K tables Text
SemTab2022
  • HardTables
  • Tough Tables
  • BiodivTab
  • GitTables
  • Tabular data to knowledge graph matching
N/A Text
NumDB
  • DBpedia
  • Semantic labeling
389 tables Text
MammoTab
  • Wikipedia
  • Cell/mentions to KG entity matching (CEA)
  • Column to KG class matching (CTA)
980K tables Text
SOTAB
  • Schema.org Table Corpus
  • Column type annotation (CTA)
  • Column property annotation (CPA)
107K tables Text
Wikary
  • Wikipedia
  • Tabular data to knowledge graph matching
32K tables Text
HiTab
  • Wikipedia
  • Statistical reports from Statistics Canada and National Science Foundation
  • Question answering
  • Table-to-text generation
3.5K tables Text
INFOTABS
  • Wikipedia
  • Natural language inference
2.3K tables and 23.7K premise-hypothesis pairs Text
Rotowire
  • RotoWire website
  • Table-to-text generation
4.8K summaries Text
SBNation
  • SBNation website
  • Table-to-text generation
10.9K summaries Text
AIT-QA
  • Annual reports of S&P 500 companies
  • Question answering
515 questions and 116 tables Text
TabMWP
  • Math problems from an online learning website, IXL
  • Question answering
37.6K tables Image and text
PubHealthTab
  • Wikipedia
  • Fact verification
1.9K claim-table pairs Text
MMTab
  • WTQ
  • FeTaQA
  • HiTab
  • AIT-QA
  • TabMCQ
  • TABMWP
  • TAT-QA
  • TabFact
  • InfoTabs
  • PubHealthTab
  • ToTTo
  • HiTab_T2T
  • Rotowire
  • WikiBIO
  • TSU
  • Question answering
  • Fact verification
  • Table-to-text generation
  • Table structure understanding
  • Table recognition
202K tables Image
CTE
  • Scholarly papers from PubMed
  • Contextualized table extraction
75K pages and 35K tables Image
TD4CLTabs
  • Scholarly papers from ACL Anthology
  • Table type classification
13.3K tables Image
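
Since each entry in the list carries the same five fields (dataset, source, task(s), size, modality), it can be handy to keep a local copy as structured records and filter it by task or modality. The snippet below is only a minimal sketch of that idea: the `DatasetEntry` class and the `find_datasets` helper are hypothetical names introduced for illustration and are not part of this repository, and only a handful of rows from the list are copied in.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the list (hypothetical structure, for illustration only)."""
    name: str
    sources: tuple[str, ...]
    tasks: tuple[str, ...]
    size: str
    modality: str


# A few rows copied from the list above; the remaining rows are omitted.
ENTRIES = [
    DatasetEntry(
        "PubTables-1M",
        ("Scholarly papers from PubMed",),
        ("Table detection", "Table structure recognition", "Functional analysis"),
        "947.64K tables",
        "Image",
    ),
    DatasetEntry(
        "TAT-QA",
        ("Annual reports",),
        ("Question answering",),
        "2K hybrid contexts (tables and text) and 16.5K QA pairs",
        "Text",
    ),
    DatasetEntry(
        "WikiTableQuestions",
        ("Wikipedia",),
        ("Question answering",),
        "2.1K tables and 22K QA pairs",
        "Text",
    ),
    DatasetEntry(
        "ComTQA",
        ("Scholarly papers from PubMed", "Financial reports of S&P 500 companies"),
        ("Question answering",),
        "1.5K tables and 9K QA pairs",
        "Image",
    ),
    DatasetEntry(
        "SciTSR",
        ("Scholarly papers from arXiv",),
        ("Table structure recognition",),
        "15K tables",
        "Image",
    ),
]


def find_datasets(entries, task=None, modality=None):
    """Return the names of entries matching the given task and modality.

    `task` is matched as a case-insensitive substring of any listed task;
    `modality` is matched case-insensitively against the whole field.
    A filter left as None is not applied.
    """
    matches = []
    for entry in entries:
        if task is not None and not any(
            task.lower() in t.lower() for t in entry.tasks
        ):
            continue
        if modality is not None and entry.modality.lower() != modality.lower():
            continue
        matches.append(entry.name)
    return matches


if __name__ == "__main__":
    print(find_datasets(ENTRIES, task="question answering", modality="Text"))
    print(find_datasets(ENTRIES, task="structure recognition"))
```

Substring matching on the task field is a deliberate simplification: task names in the list are free text (for example "Table detection and recognition" versus "Table detection"), so an exact match would miss closely related entries.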

🛠️ Contributing

Feel free to create a pull request or to open an issue if you would like to add other awesome datasets.
