This repository contains tools for training and evaluating Web table embedding models based on word embedding techniques. These models can generate embeddings for schema terms and instance-data terms, making them especially useful for representing schema and class information as well as for ML tasks on tabular text data. Furthermore, this repository contains links to pre-trained Web table models and the code for several tasks the models can be used for.
If you want to install the package to encode text (from tables) into embedding representations, you can run
pip install .
and load a pre-trained model as follows:
from table_embeddings import TableEmbeddingModel
model = TableEmbeddingModel.load_model('ddrg/web_table_embeddings_combo64')
embedding = model.get_header_vector('headline')
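The returned embedding is a plain vector, so terms can be compared directly, e.g. with cosine similarity. Below is a minimal sketch that assumes get_header_vector returns a 1-D numpy array; the header terms 'price' and 'cost' are only illustrative:

```python
import numpy as np
from table_embeddings import TableEmbeddingModel

model = TableEmbeddingModel.load_model('ddrg/web_table_embeddings_combo64')

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare two (illustrative) header terms
sim = cosine_similarity(model.get_header_vector('price'),
                        model.get_header_vector('cost'))
print(sim)
```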
To install all dependencies required to run the evaluation tasks, run:
pip install ".[full]"
This repository provides tools for training four different types of Web table embedding models: W-base, W-row, W-tax, and W-combo. The DWTC Web Table Corpus can be used for pre-training these embedding models. All modules required to run the Python scripts in this repository can be installed via pip.
The training data used to be available at https://wwwdb.inf.tu-dresden.de/research-projects/dresden-web-table-corpus/. If you need the training data, contact the university using the contact information provided on this website.
The corpus can be downloaded as follows:
for i in $(seq -w 0 500); do wget http://wwwdb.inf.tu-dresden.de/misc/dwtc/data_feb15/dwtc-$i.json.gz -P data/; done
The DWTC dump can be filtered with embedding/filter_dump.py and embedding/filter_columns.py to create a dump containing only columns of English tables with a table header. You may adjust the path of the DWTC corpus in config/dump_filter.json.
python3 embedding/filter_dump.py -c config/dump_filter.json
python3 embedding/filter_columns.py -c config/column_filter.json
To train W-tax and W-combo embedding models, a header-data term graph needs to be constructed. First, an index file is created:
python3 embedding/build_index.py -i data/column_dump.json.gz -o data/indexes.json.gz
Afterward, the graph can be constructed:
python3 embedding/graph_generation.py -i data/indexes.json.gz -c config/header_data_graph_config.json
To run the actual embedding training, execute embedding/fasttext_web_table_embeddings.py with one of the embedding configuration files in the config folder:
python3 embedding/fasttext_web_table_embeddings.py -c config/embedding_config_combo.json -o data/combo_model.bin -w
Below you can find links to models trained on the DWTC corpus:
Model Type | Description | Download Links |
---|---|---|
W-tax | Model of relations between table header and table body | (64dim, 150dim) |
W-row | Model of row-wise relations in tables | (64dim, 150dim) |
W-combo | Model of row-wise relations and relations between table header and table body | (64dim, 150dim) |
W-plain | Model of row-wise relations in tables without pre-processing | (64dim, 150dim) |
To use the models, call the FastTextWebTableModel.load_model function in embedding/fasttext_web_table_embeddings.py.
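As a minimal sketch, a self-trained model could be loaded like this (the import path assumes you run from the repository root, and it is an assumption that FastTextWebTableModel exposes the same get_header_vector interface as the installable package):

```python
import sys
sys.path.append('embedding')  # assumption: script is run from the repository root

from fasttext_web_table_embeddings import FastTextWebTableModel

# Load the model trained in the previous step
model = FastTextWebTableModel.load_model('data/combo_model.bin')

# Assumption: same lookup interface as the packaged TableEmbeddingModel
embedding = model.get_header_vector('headline')
```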
Besides the embedding training, this repository contains the code for four evaluation tasks:
- Representation of instance-of relations found in YAGO (yago_class_evaluation/)
- Unionable Table Search (unionability_search/)
- Table layout classification on Web tables (table_layout_classification/)
- Spreadsheet cell classification (deco_classifier/)
A detailed description of how to run each evaluation is provided in the respective folder.
If you use these models or this code, please cite the paper "Pre-Trained Web Table Embeddings for Table Discovery":
@inproceedings{gunther2021pre,
title={Pre-Trained Web Table Embeddings for Table Discovery},
author={G{\"u}nther, Michael and Thiele, Maik and Gonsior, Julius and Lehner, Wolfgang},
booktitle={Fourth Workshop in Exploiting AI Techniques for Data Management},
pages={24--31},
year={2021}
}