Web tables have been used in many approaches to extend existing data sources with additional data. So far, however, these approaches have been limited to the extraction of binary relations. We find that a large number of web tables actually represent n-ary relations, which means that, in many cases, the data can only be understood if more than two columns of a web table are considered.
The SNoW system identifies and synthesizes n-ary relations from web tables and integrates them with a knowledge base.
Given a set of web tables and a target knowledge base, the SNoW method extends each web table with additional context columns, stitches matching web tables into larger tables, and applies functional dependency discovery to identify the relations that are represented in the web tables. Further, it normalises the stitched tables, guided by the schema of the knowledge base, to create an integrated schema.
The SNoW system has been described and applied in [1,2]; please cite these references if you use it in any publication.
The following figure visualises the integration process of the SNoW system. First, each web table is extended with context columns, which are generated from the web page that contains the web table. Then, web tables with identical schema are stitched into union tables. These union tables contain more tuples than the original web tables and improve the performance of schema matchers, which are executed in the next step. The schema matchers create a mapping among the union tables, to identify matching schemata which are expressed with different column headers, and a mapping from the union tables to the knowledge base. Based on the mapping among the union tables, the stitching is repeated to merge all union tables with matching schemata into stitched union tables. The stitched union tables contain the tuples of all web tables with the respective schema and enable the discovery of functional dependencies, which is infeasible on the much smaller original tables. Finally, all stitched union tables which are mapped to the same knowledge base class are merged into a stitched universal relation, which is normalised to create an integrated schema. This normalisation is guided by the mapping to the knowledge base and produces a star-shaped schema with the existing classes from the knowledge base at the centres and the n-ary relations that were discovered in the web tables referring to them.
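To make the stitching step more concrete, the following sketch shows the core idea in Java: web tables that share the same header schema are grouped and their rows are concatenated into one union table. This is not the SNoW implementation; the table representation and all names are simplified assumptions.

```java
import java.util.*;

// Illustrative sketch only (not the SNoW implementation): stitch web tables that share
// an identical header into a single union table by concatenating their rows.
public class UnionStitchSketch {

    // A web table is represented here simply as a header plus a list of rows (hypothetical).
    static class WebTable {
        final List<String> header;
        final List<List<String>> rows;
        WebTable(List<String> header, List<List<String>> rows) {
            this.header = header;
            this.rows = new ArrayList<>(rows);
        }
    }

    static List<WebTable> stitch(List<WebTable> tables) {
        Map<List<String>, WebTable> unions = new LinkedHashMap<>();
        for (WebTable t : tables) {
            // normalise the header so that trivial differences in casing do not split groups
            List<String> key = new ArrayList<>();
            for (String h : t.header) key.add(h.trim().toLowerCase());
            WebTable union = unions.get(key);
            if (union == null) {
                union = new WebTable(key, new ArrayList<List<String>>());
                unions.put(key, union);
            }
            union.rows.addAll(t.rows); // the union table collects the tuples of all matching tables
        }
        return new ArrayList<>(unions.values());
    }

    public static void main(String[] args) {
        WebTable t1 = new WebTable(Arrays.asList("state", "employment"),
                Arrays.asList(Arrays.asList("Ohio", "1200"), Arrays.asList("Texas", "3400")));
        WebTable t2 = new WebTable(Arrays.asList("State", "Employment"),
                Collections.singletonList(Arrays.asList("Utah", "800")));
        List<WebTable> result = stitch(Arrays.asList(t1, t2));
        System.out.println(result.size() + " union table(s), "
                + result.get(0).rows.size() + " tuples in the first one"); // 1 union table, 3 tuples
    }
}
```

In SNoW, the later stitching rounds additionally rely on the schema correspondences produced by the matchers, so that web tables whose matching schemata use different column headers can also be merged.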
As an example, consider the web table in the next figure, which contains employment statistics for a certain profession in several U.S. states at a given time. While a human can easily understand this table, it is hard for an algorithm, as the table contains only very few examples (the figure shows the complete table) and important attributes, such as the profession and the date, are missing from the table. Existing approaches use recognisers or matchers to find column pairs which represent an existing or an unknown binary relation. In the example, such systems would extract a binary relation {state, employment}, which can be found in thousands of web tables from the same web site. As a result, the extracted binary relation contains about 2,000 different employment numbers for a single state, with no way to choose a single, correct value or to interpret any individual value. To extract a meaningful relation for the employment attribute, we must include all attributes which determine its value, i.e., discover the functional dependency {state, page title, URI 1, URI 2} → {employment}. This is only possible if we take additional data from the context of the web table into account and combine the observations from multiple web tables, which contain employment numbers for different professions and dates.
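To illustrate the check behind such a dependency, the sketch below groups the tuples of a (stitched) table by the candidate determinant columns and verifies that each group maps to exactly one employment value. This is only an illustration of FD verification on hand-made example rows; SNoW discovers such dependencies automatically using the TANE algorithm (see the setup instructions below), and all names and values here are hypothetical.

```java
import java.util.*;

// Minimal sketch: check whether a candidate functional dependency X -> A holds on a table,
// i.e. whether every combination of values in the determinant columns X maps to exactly
// one value in the dependent column A. SNoW itself uses TANE to *discover* such
// dependencies; this snippet only *verifies* a given candidate.
public class FdCheckSketch {

    static boolean holds(List<Map<String, String>> rows, List<String> determinant, String dependent) {
        Map<List<String>, String> seen = new HashMap<>();
        for (Map<String, String> row : rows) {
            List<String> key = new ArrayList<>();
            for (String col : determinant) key.add(row.get(col));
            String value = row.get(dependent);
            String previous = seen.putIfAbsent(key, value);
            if (previous != null && !previous.equals(value)) {
                return false; // same determinant values, different dependent value -> FD violated
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // two tuples for the same state but from different web pages (different page title),
        // so only the larger determinant yields a valid dependency
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row("Ohio", "Registered Nurses, May 2017", "123400"));
        rows.add(row("Ohio", "Pharmacists, May 2017", "9800"));

        System.out.println(holds(rows, Arrays.asList("state"), "employment"));               // false
        System.out.println(holds(rows, Arrays.asList("state", "page title"), "employment")); // true
    }

    static Map<String, String> row(String state, String pageTitle, String employment) {
        Map<String, String> r = new HashMap<>();
        r.put("state", state);
        r.put("page title", pageTitle);
        r.put("employment", employment);
        return r;
    }
}
```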
The datasets and annotations used for evaluation can be found in the `datasets/` directory. Every sub-directory is a dataset for a single web site and contains the pre-processed union tables in `union_dedup_json/` and the annotations in the `evaluation/` sub-directory.
We run the SNoW system on our datasets in different configurations. The first three configurations do not apply any matchers and use the schema correspondences from the ground truth:
- Binary detects an entity label column in each table and assumes that all other attributes depend only on this column. This configuration uses the manually annotated schema mapping as input and extracts a binary relation for each reference relation; it resembles a perfect extractor for binary relations and serves as the baseline (a minimal sketch of this splitting follows after the list).
- N-Ary uses functional dependency discovery instead of extracting binary relations, but does not generate context attributes.
- N-Ary + C uses functional dependency discovery and generates context attributes.
- Match: the SNoW system (N-Ary + C), using the matching components instead of the correspondences from the annotations.
- Norm: the end-to-end configuration of the SNoW system. It uses functional dependency discovery, generates context attributes, uses the matching components, and applies normalisation.
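For illustration, the following sketch shows the splitting performed by the Binary baseline: given the (detected) entity label column, every other column is paired with it, yielding one binary relation per remaining attribute. The table representation and example values are hypothetical; this is not the actual configuration code.

```java
import java.util.*;

// Illustrative sketch of the idea behind the Binary baseline: pair the entity label column
// with every other column, producing one binary relation per remaining attribute.
public class BinaryBaselineSketch {

    static Map<String, List<String[]>> extractBinaryRelations(
            List<String> header, List<List<String>> rows, int entityLabelColumn) {
        Map<String, List<String[]>> relations = new LinkedHashMap<>();
        for (int col = 0; col < header.size(); col++) {
            if (col == entityLabelColumn) continue;
            // relation name: {entity label attribute, other attribute}
            String name = "{" + header.get(entityLabelColumn) + ", " + header.get(col) + "}";
            List<String[]> pairs = new ArrayList<>();
            for (List<String> row : rows) {
                pairs.add(new String[] { row.get(entityLabelColumn), row.get(col) });
            }
            relations.put(name, pairs);
        }
        return relations;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("state", "employment", "mean wage");
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Ohio", "123400", "61000"),
                Arrays.asList("Texas", "201300", "64000"));
        // assuming column 0 ("state") was detected as the entity label column
        System.out.println(extractBinaryRelations(header, rows, 0).keySet());
    }
}
```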
Table 1: Dataset statistics and evaluation results. A_O = original attributes, A_T = total attributes (original & context attributes), A_U = universal attributes; the experiment columns report the F1-measure.
Host | Web Tables | Union Tables | A_O | A_T | A_U | FDs | Schema | Binary | N-Ary | N-Ary + C | Match | Norm
---|---|---|---|---|---|---|---|---|---|---|---|---
d3football.com | 40,584 | 12 | 63 | 145 | 41 | 18 | 0.779 | 0.495 | 0.480 | 0.851 | 0.765 | 0.765 |
data.bls.gov | 10,824 | 12 | 61 | 181 | 51 | 12 | 0.895 | 0.415 | 0.485 | 0.984 | 0.900 | 0.900 |
flightaware.com | 2,888 | 6 | 22 | 72 | 35 | 13 | 0.982 | 0.459 | 0.405 | 1.000 | 0.942 | 0.733 |
itunes.apple.com | 42,729 | 76 | 470 | 1,095 | 59 | 6 | 0.827 | 0.680 | 0.674 | 0.953 | 0.809 | 0.763 |
seatgeek.com | 157,578 | 72 | 266 | 714 | 71 | 30 | 0.999 | 0.988 | 0.992 | 0.984 | 0.962 | 0.962 |
www.amoeba.com | 5,529 | 65 | 227 | 712 | 42 | 13 | 0.933 | 0.812 | 0.394 | 0.907 | 0.885 | 0.885 |
www.cia.gov | 30,569 | 213 | 562 | 2,225 | 323 | 162 | 0.976 | 0.858 | 0.836 | 0.821 | 0.660 | 0.635 |
www.nndb.com | 23,522 | 29 | 123 | 299 | 29 | 10 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 |
www.vgchartz.com | 23,258 | 8 | 39 | 87 | 36 | 13 | 1.000 | 0.448 | 0.253 | 1.000 | 1.000 | 1.000 |
Sum / Macro Avg. | 337,481 | 493 | 1,833 | 5,530 | 687 | 277 | 0.932 | 0.684 | 0.613 | 0.944 | 0.880 | 0.849 |
After cloning the repository, use the `unzip` script to extract the datasets:

```
./unzip
```
Before running SNoW, you must download an implementation of the TANE algorithm to your machine and adjust the paths in `SET_VARS` for your environment:

```
J8="path to your java 8 executable"
JAR="path to the SNoW jar and its dependencies"
VMARGS="-Xmx300G"
TANE_HOME="path to the TANE implementation"
```
To run the SNoW system in different configurations, use one of the following scripts, which accept the path to the dataset as a parameter:

```
./run_snow datasets/d3football.com
```
- `run_snow_b`: configuration Binary
- `run_snow_noContextColumns`: configuration N-Ary
- `run_snow`: configuration N-Ary + C
- `run_snow_match`: configuration Match
- `run_snow_match_normalised`: configuration Norm
- `run_snow_reference`: creates a reference extraction based on the annotations
To run the evaluation, use `evaluate_containment` and provide the path to the dataset as a parameter:

```
./evaluate_containment datasets/d3football.com
```

Or use `evaluate_containment_all` to run the evaluation on all datasets and configurations.
To create new datasets, the clustered union tables must be created before running SNoW. Copy all web tables for a single web site into a directory and use `create_clustered_union` to create the union tables:

```
./create_clustered_union directory_containing_a_single_web_site
```

If you want to create the union tables for multiple web sites, create a directory with one sub-directory for every web site and run `run_clustered_union` with the directory as a parameter:

```
./run_clustered_union directory_containing_all_web_sites
```
During the creation of the union tables, a directory `clustered_union_correspondences` is created, which contains schema correspondences among the generated context columns. These correspondences must be copied into the `evaluation` sub-directory of the dataset under the file name `context_correspondences.tsv`.
[1] Oliver Lehmberg and Christian Bizer. 2019. Synthesizing N-ary Relations from Web Tables. In 9th International Conference on Web Intelligence, Mining and Semantics (WIMS2019), June 26-28, 2019, Seoul, Republic of Korea. ACM, New York, NY, USA.
[2] Oliver Lehmberg and Christian Bizer. 2019. Profiling the Semantics of N-ary Web Table Data. In The International Workshop on Semantic Big Data (SBD'19), July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA.