[FEAT] Interaction term between two correlated comparisons #2413
Comments
I think you can do this already using this kind of syntax:

import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.fake_1000
# Create a stand-in postcode column from the first three characters of email
df["postcode"] = df["email"].str.slice(0, 3)

# Define custom comparison for postcode and city
postcode_city_comparison = cl.CustomComparison(
    output_column_name="postcode_city",
    comparison_levels=[
        # Null level: both postcode and city are missing
        cll.And(cll.NullLevel("postcode"), cll.NullLevel("city")),
        {
            "sql_condition": "postcode_l = postcode_r AND city_l = city_r",
            "label_for_charts": "Exact match on both postcode and city",
        },
        cll.ExactMatchLevel("postcode").configure(
            label_for_charts="Different city, exact match on postcode"
        ),
        cll.ExactMatchLevel("city").configure(
            label_for_charts="Different postcode, exact match on city"
        ),
        cll.ElseLevel(),
    ],
)

# Define settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        postcode_city_comparison,
    ],
    max_iterations=5,
)

linker = Linker(df, settings, db_api=DuckDBAPI())

# Estimate u probabilities, then m probabilities with two EM passes
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

linker.visualisations.match_weights_chart()
@RobinL's solution is equivalent to method 1 in S4 of the appendix of the fastLink paper, and works great for some use-cases, but doesn't allow e.g. specifying c1 * c2 and c2 * c3 without c1 * c2 * c3. For that, the second method in that appendix, based on log-linear models, is the solution, and there is a discussion of adding it to Splink here: #1310
Thank you for the response about using AND; I think I can incorporate it. However, one complexity that comes up in practice is that each comparison usually has more than one comparison level. For example:

postcode_comparison = [postcode_exact_match, postcode_area_match, postcode_sector_match]
address_comparison = [address_exact_match, street_name_match, address_fuzzy_match]
location_comparison = ???

I would need to make 9 levels (3 * 3) with AND, plus the other 3 + 3 levels, to define a location_comparison (so 3*3 + 3 + 3 = 15 levels in total). The challenge is then to find the right order for these 15 comparison levels, since ordering has a very significant effect. In my case, I actually have more than 3 levels per comparison, more like 6-8. Have you run into this combinatorial explosion and the ordering problem that follows?
Yeah - you're right to highlight these challenges. It's typically best to order in terms of 'better matches higher': start with the most precise matches and work your way down. I appreciate it's not always obvious in practice, though; you may need some trial and error. I agree that the combinatorial explosion problem is real, but on a large dataset having (say) 9 comparison levels is totally fine. Ultimately, each one corresponds to two parameters to estimate, so 18 parameters is not very many at all (compared to, say, other ML approaches which can have thousands). In our production models we tend to have between about 2 and 10 comparison levels per comparison.
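One way to keep the cross product manageable is to generate the combined levels programmatically rather than writing them by hand. The sketch below is plain Python; the column names and SQL conditions are hypothetical placeholders, and ranking combinations by the sum of per-field positions is just one possible heuristic for 'better matches higher', not something Splink prescribes. The resulting dictionaries use the same sql_condition / label_for_charts keys as the custom level in the example above.

```python
from itertools import product

# Hypothetical per-field levels, ordered from most to least precise.
# The SQL conditions are illustrative placeholders, not tested Splink levels.
postcode_levels = [
    ("postcode exact", "postcode_l = postcode_r"),
    ("postcode sector", "postcode_sector_l = postcode_sector_r"),
    ("postcode area", "postcode_area_l = postcode_area_r"),
]
address_levels = [
    ("address exact", "address_l = address_r"),
    ("street name", "street_name_l = street_name_r"),
    ("address fuzzy", "jaro_winkler_similarity(address_l, address_r) >= 0.9"),
]


def build_location_levels(first, second):
    """Return level dicts: all pairwise AND levels, then single-field levels."""
    candidates = []
    # Pairwise combinations, ranked by the sum of positions (0 = most precise).
    for (i, (label_a, sql_a)), (j, (label_b, sql_b)) in product(
        enumerate(first), enumerate(second)
    ):
        candidates.append(
            (i + j, f"{label_a} AND {label_b}", f"({sql_a}) AND ({sql_b})")
        )
    # Single-field levels rank below every combination.
    offset = len(first) + len(second)
    for i, (label, sql) in enumerate(first):
        candidates.append((offset + i, f"{label} only", sql))
    for j, (label, sql) in enumerate(second):
        candidates.append((offset + j, f"{label} only", sql))
    # list.sort() is stable, so ties keep their insertion order.
    candidates.sort(key=lambda c: c[0])
    return [
        {"sql_condition": sql, "label_for_charts": label}
        for _, label, sql in candidates
    ]


location_levels = build_location_levels(postcode_levels, address_levels)
print(len(location_levels))  # 3*3 + 3 + 3 = 15
```

The generated list could then be placed between a null level and cll.ElseLevel() in the comparison_levels of a cl.CustomComparison. The ranking is only a starting point; the order would still need checking against the match weights chart.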
Is your proposal related to a problem?
Fields used in separate comparisons, e.g. postcode and city, can be highly correlated. Because the model treats comparisons as independent, two separate comparisons for postcode and city will overestimate the match score when both match, or underestimate it when only one matches.
Describe the solution you'd like
Some mechanism to score the interaction between two comparisons (usually negatively, like in term frequency).
Describe alternatives you've considered
So far I have put city as a lower comparison level to postcode, but I expect this problem of correlated comparisons to be more general.
Ordering of levels is also very sensitive to the precision of postcodes (e.g. UK postcode vs 5 digit US zip code). So an interaction would make the model less "hand-tuned" due to manual ordering of levels.
Additional context
Creating an interaction term is a common way of dealing with correlation in machine learning. If x1 and x2 are correlated, you can add a term x1*x2 to your model, e.g.
y = a*x1 + b*x2 + c*x1*x2 + d
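
The same idea carries over to match weights. As a rough worked illustration with made-up probabilities (assumptions, not estimates from any dataset), the sketch below compares the match weight obtained by treating postcode and city as independent comparisons with the weight from their joint agreement probability. The difference plays the role of the c*x1*x2 term, and it comes out negative when the two fields are strongly correlated among non-matches.

```python
from math import log2

# Made-up agreement probabilities, for illustration only.
# m_* are probabilities among true matches, u_* among non-matches.
m_postcode, u_postcode = 0.90, 0.001
m_city, u_city = 0.95, 0.05

# Joint probability that BOTH fields agree. Among non-matches, agreeing on
# postcode almost always implies agreeing on city, so u_both is close to
# u_postcode rather than u_postcode * u_city.
m_both, u_both = 0.88, 0.001

# Match weight when the two comparisons are treated as independent.
w_postcode = log2(m_postcode / u_postcode)
w_city = log2(m_city / u_city)
w_independent = w_postcode + w_city

# Match weight computed from the joint distribution.
w_joint = log2(m_both / u_both)

# Interaction term: the correction needed on top of the independent weights.
w_interaction = w_joint - w_independent

print(f"independent: {w_independent:.2f}")
print(f"joint:       {w_joint:.2f}")
print(f"interaction: {w_interaction:.2f}")
```

With these numbers the independent model adds the full city weight on top of the postcode weight even though city agreement carries almost no extra information once the postcode agrees.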