Tables on web pages ("web tables") cover a wide range of topics and can serve as a source of information for tasks such as knowledge base augmentation or the ad-hoc extension of datasets. To use this information, however, the tables must first be integrated, either with each other or into existing data sources. Matching methods for this purpose must overcome two challenges: the high heterogeneity of the tables and their small size.
To counter these problems, web tables from the same web site can be stitched before running any existing matching system. Stitching combines web tables based on a schema mapping, which results in fewer and larger stitched tables.
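As a rough illustration of the idea, the following Python sketch combines a group of small tables into one stitched table using a schema mapping. It is not the implementation in this repository: the function name `stitch`, the row-as-dict representation, and the example mapping are all assumptions made for illustration only.

```python
def stitch(tables, mapping):
    """Combine several small tables into one stitched table.

    tables:  dict of table id -> list of rows, each row a dict {column: value}
    mapping: dict of (table id, column) -> stitched column name
             (hypothetical representation of a schema mapping)
    Columns without a mapping keep their original name.
    """
    # Collect the stitched columns in first-seen order.
    columns = []
    for table_id, rows in tables.items():
        for row in rows:
            for col in row:
                target = mapping.get((table_id, col), col)
                if target not in columns:
                    columns.append(target)

    # Rewrite every row against the stitched schema; missing values stay None.
    stitched = []
    for table_id, rows in tables.items():
        for row in rows:
            new_row = dict.fromkeys(columns)
            for col, value in row.items():
                new_row[mapping.get((table_id, col), col)] = value
            stitched.append(new_row)
    return columns, stitched


# Two tiny tables from the same (made-up) web site with different headers.
tables = {
    "t1": [{"Name": "Berlin", "Pop.": "3.6M"}],
    "t2": [{"City": "Paris", "Population": "2.1M"}],
}
mapping = {
    ("t1", "Name"): "city",
    ("t2", "City"): "city",
    ("t1", "Pop."): "population",
    ("t2", "Population"): "population",
}
columns, stitched = stitch(tables, mapping)
```

In this toy example the two single-row tables become one table with the columns `city` and `population` and two rows, which gives a matcher more evidence per column than either input table alone.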
This project contains the code for all methods used in "Stitching Web Tables for Improving Matching Quality" [1]. The version of the code that was used to run the experiments can be found in the "original_version" branch.
The complete web table stitching process consists of three steps:
- create union tables: scripts/create_union_tables
- deduplicate & discover functional dependencies: scripts/discover_functional_dependencies
- create stitched union tables: scripts/create_stitched_union_tables
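To make the first two steps more concrete, here is a minimal Python sketch. It assumes that a union table is formed by concatenating the rows of all tables that share exactly the same column headers, and it only checks a single given functional dependency rather than discovering all of them. The function names and data layout are hypothetical and do not reflect what the scripts above actually do internally.

```python
from collections import defaultdict


def create_union_tables(tables):
    """Group tables by their exact column headers and concatenate their rows.

    tables: list of (header, rows) pairs; header is a list of column names,
            rows is a list of tuples in header order.
    """
    unions = defaultdict(list)
    for header, rows in tables:
        unions[tuple(header)].extend(rows)
    return dict(unions)


def deduplicate(rows):
    """Drop exact duplicate rows while keeping the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def holds(header, rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows."""
    lhs_idx = [header.index(c) for c in lhs]
    rhs_idx = header.index(rhs)
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs_idx)
        # The dependency is violated if one key maps to two different values.
        if seen.setdefault(key, row[rhs_idx]) != row[rhs_idx]:
            return False
    return True


# Made-up example data: two tables share a schema, one does not.
tables = [
    (["city", "country"], [("Berlin", "Germany"), ("Paris", "France")]),
    (["city", "country"], [("Berlin", "Germany"), ("Lyon", "France")]),
    (["team", "city"], [("Hertha", "Berlin")]),
]
unions = create_union_tables(tables)
header = ("city", "country")
rows = deduplicate(unions[header])
city_determines_country = holds(header, rows, ["city"], "country")
country_determines_city = holds(header, rows, ["country"], "city")
```

Here the first two tables merge into one union table, the duplicate Berlin row is removed, and `city -> country` holds while `country -> city` does not, since France appears with two different cities.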
To match the resulting stitched union tables to a knowledge base, see the T2K Match Project.
This project was developed at the Data and Web Science Group at the University of Mannheim using the WInte.r Framework.
The Web Table Stitching code can be used under the Apache 2.0 License.
[1] Lehmberg, Oliver and Bizer, Christian. "Stitching Web Tables for Improving Matching Quality." Proceedings of the VLDB Endowment, Preprint (2017).