-
Notifications
You must be signed in to change notification settings - Fork 1
SchemaMatching
Schema Matching is the process of finding correspondences between the attributes of two data sources. The result of schema matching is a mapping that allows the transformation of one schema into the other. This page shows examples for the different pre-implemented methods introduced in the Matching Section.
Label-based schema matching aligns the attributes of two datasets based on their names (labels).
scifi1.csv:
"id","title","studio","movie genre","budget","gross","director","date"
...
scifi2.csv:
"id","film title","production studio","genre","cost","earned","dir.","released"
...
Load the two datasets from CSV files using the default model:
// load data
DataSet<Record, Attribute> data1 = new HashedDataSet<>();
new CSVRecordReader(0).loadFromCSV(new File("scifi1.csv"), data1);
DataSet<Record, Attribute> data2 = new HashedDataSet<>();
new CSVRecordReader(0).loadFromCSV(new File("scifi2.csv"), data2);
Initialise the matching engine and run the label-based schema matching:
// Initialize Matching Engine
MatchingEngine<Record, Attribute> engine = new MatchingEngine<>();
// run the matching
Processable<Correspondence<Attribute, Record>> correspondences
= engine.runLabelBasedSchemaMatching(data1.getSchema(), data2.getSchema(), new LabelComparatorJaccard(), 0.5);
The result is the alignment/mapping represented by correspondences between the attributes:
// print results
for(Correspondence<Attribute, Record> cor : correspondences.get()) {
System.out.println(String.format("'%s' <-> '%s' (%.4f)",
cor.getFirstRecord().getName(),
cor.getSecondRecord().getName(),
cor.getSimilarityScore()));
}
Output:
'title' <-> 'film title' (0,5000)
'studio' <-> 'production studio' (0,5000)
'id' <-> 'id' (1,0000)
'movie genre' <-> 'genre' (0,5000)
Instance-based schema matching aligns the attributes of two datasets based on their values.
Load two datasets from CSV files using the default model:
// load data
DataSet<Record, Attribute> data1 = new HashedDataSet<>();
new CSVRecordReader(-1).loadFromCSV(new File("usecase/movie/input/scifi1.csv"), data1);
DataSet<Record, Attribute> data2 = new HashedDataSet<>();
new CSVRecordReader(-1).loadFromCSV(new File("usecase/movie/input/scifi2.csv"), data2);
Define a blocker that uses the attribute values to create pairs potentially matching attributes.
// define a blocker that uses the attribute values to generate pairs
InstanceBasedSchemaBlocker<Record, Attribute, MatchableValue> blocker
= new InstanceBasedSchemaBlocker<>(
new AttributeValueGenerator(data1.getSchema()),
new AttributeValueGenerator(data2.getSchema()));
Define an aggregator that calculates a similarity score based on all matching values between an attribute combination
// to calculate the similarity score, aggregate the pairs by counting
// and normalise with the number of record in the smaller dataset
// (= the maximum number of records that can match)
VotingAggregator<Attribute, MatchableValue> aggregator
= new VotingAggregator<>(
false,
Math.min(data1.size(), data2.size()),
0.0);
Run the instance-based schema matching via the matching engine
// Initialize Matching Engine
MatchingEngine<Record, Attribute> engine = new MatchingEngine<>();
// run the matching
Processable<Correspondence<Attribute, MatchableValue>> correspondences
= engine.runInstanceBasedSchemaMatching(data1, data2, blocker, aggregator);
Finally, print the results to the console
// print results
for(Correspondence<Attribute, MatchableValue> cor : correspondences.get()) {
System.out.println(String.format("'%s' <-> '%s' (%.4f)",
cor.getFirstRecord().getName(),
cor.getSecondRecord().getName(),
cor.getSimilarityScore()));
}
Output:
'budget' <-> 'cost' (0,0500)
'id' <-> 'id' (0,7500)
'director' <-> 'dir.' (0,7500)
'gross' <-> 'earned' (0,0500)
'title' <-> 'film title' (0,7500)
'gross' <-> 'cost' (0,0500)
'date' <-> 'released' (0,7000)
'budget' <-> 'earned' (0,0500)
Duplicate-based schema matching compares the attribute values of known duplicates in the two data sources to infer the attribute mapping. This requires a set of instance correspondences, which can be created by Identity Resolution].
Load two datasets with different schemas and overlapping records:
// load data
DataSet<Record, Attribute> data1 = new HashedDataSet<>();
new CSVRecordReader().loadFromCSV(new File("usecase/movie/input/scifi1.csv"), data1);
DataSet<Record, Attribute> data2 = new HashedDataSet<>();
new CSVRecordReader().loadFromCSV(new File("usecase/movie/input/scifi2.csv"), data2);
Load a set of duplicates:
// load duplicates
Processable<Correspondence<Record, Attribute>> duplicates
= Correspondence.loadFromCsv(
new File("scifi1_2_scifi2_instance_correspondences.csv"),
data1,
data2);
Define the rule for duplicate-based schema matching. Here, the similarity function only accepts exact matches.
// define the schema matching rule
SchemaMatchingRuleWithVoting<Record, Attribute, Attribute> schemaRule
= new DuplicateBasedSchemaMatchingRule<>(
(a1,a2,c) -> {
// get both attribute values
String value1 = c.getFirstRecord().getValue(a1);
String value2 = c.getSecondRecord().getValue(a2);
// check if they are equal
if(value1!=null && value2!=null && value1.equals(value2)) {
return 1.0;
} else {
return 0.0;
}
}
, 1.0);
Run the matching with the just defined rule:
// Initialize Matching Engine
MatchingEngine<Record, Attribute> engine = new MatchingEngine<>();
// Execute the matching
Result<Correspondence<Attribute, Record>> correspondences = engine.runDuplicateBasedSchemaMatching(data1.getSchema(), data2.getSchema(), duplicates, schemaRule);
And print the results to the console:
// print results
for(Correspondence<Attribute, Record> cor : correspondences.get()) {
System.out.println(
String.format("'%s' <-> '%s' (%.4f)",
cor.getFirstRecord().getName(),
cor.getSecondRecord().getName(),
cor.getSimilarityScore()));
}
Output:
'director' <-> 'dir.' (1,0000)
'title' <-> 'film title' (1,0000)
'id' <-> 'id' (1,0000)
'date' <-> 'released' (1,0000)