[FEAT] Interaction term between two correlated comparisons #2413

V-Lamp · 2024-09-18T09:17:39Z

Is your proposal related to a problem?

Fields that are scored by separate comparisons, e.g. postcode and city, can be highly correlated. Because each comparison is treated as independent, scoring postcode and city separately overestimates the match score when both match and underestimates it when only one matches.
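To make the overestimation concrete, here is a minimal sketch of the arithmetic. The m and u probabilities are hypothetical, and the extreme case where postcode fully determines city is an assumption chosen to keep the numbers simple.

```python
import math

# Hypothetical m and u probabilities, chosen only for illustration.
# m = P(fields agree | records match), u = P(fields agree | records do not match)
m_postcode, u_postcode = 0.9, 0.001
m_city, u_city = 0.9, 0.01


def match_weight(m: float, u: float) -> float:
    """Match weight as log2 of the Bayes factor m / u."""
    return math.log2(m / u)


# Treating the two comparisons as independent adds their match weights.
independent_weight = match_weight(m_postcode, u_postcode) + match_weight(m_city, u_city)

# Extreme assumption: postcode fully determines city, so "both agree"
# happens exactly as often as "postcode agrees".
joint_weight = match_weight(m_postcode, u_postcode)

# The surplus equals the city weight, which is added although it carries
# no extra evidence under this assumption.
overcount = independent_weight - joint_weight
print(f"independent: {independent_weight:.2f}, joint: {joint_weight:.2f}, overcount: {overcount:.2f}")
```

Real data sits somewhere between full independence and full dependence, so the actual surplus would be smaller than in this sketch.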

Describe the solution you'd like

Some mechanism to score the interaction between two comparisons (usually negatively, like in term frequency).

Describe alternatives you've considered

So far I have put city as a lower comparison level within the postcode comparison, but I expect this problem of correlated comparisons to be more general.
The ordering of levels is also very sensitive to the precision of postcodes (e.g. UK postcode vs 5-digit US zip code), so an interaction term would make the model depend less on hand-tuned level ordering.

Additional context

Adding an interaction term is a common way of dealing with correlation in machine learning: if x1 and x2 are correlated, you can add a term x1*x2 to your model, e.g. y = a*x1 + b*x2 + c*x1*x2 + d
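As a small illustration of that formula, the sketch below solves for a, b, c and d when x1 and x2 are binary (agree or disagree). The four scores are made up; the point is only that a negative c absorbs the double counting when both features agree.

```python
# Hypothetical scores for the four agreement patterns of two binary features
# (x1 = postcode agrees, x2 = city agrees). The numbers are made up.
y = {
    (0, 0): 0.0,  # neither agrees
    (1, 0): 9.8,  # postcode only
    (0, 1): 6.5,  # city only
    (1, 1): 9.9,  # both agree: barely more than postcode alone
}

# With binary features, y = a*x1 + b*x2 + c*x1*x2 + d is determined exactly
# by the four cell values.
d = y[(0, 0)]
a = y[(1, 0)] - d
b = y[(0, 1)] - d
c = y[(1, 1)] - a - b - d


def predict(x1: int, x2: int) -> float:
    """Evaluate the model with the interaction term."""
    return a * x1 + b * x2 + c * x1 * x2 + d


print(f"a={a:.1f}, b={b:.1f}, c={c:.1f}, d={d:.1f}")
```

Without the c term, the model would have to predict a + b + d when both agree, which is far above the assumed score for that cell.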


RobinL · 2024-09-18T09:51:17Z

I think you can do this already using this kind of syntax:

import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.fake_1000

# fake_1000 has no postcode column, so fake one from the first 3 characters of email
df["postcode"] = df["email"].str.slice(0, 3)

# Define custom comparison for postcode and city
postcode_city_comparison = cl.CustomComparison(
    output_column_name="postcode_city",
    comparison_levels=[
        cll.And(cll.NullLevel("postcode"), cll.NullLevel("city")),
        {
            "sql_condition": "postcode_l = postcode_r AND city_l = city_r",
            "label_for_charts": "Exact match on both postcode and city",
        },
        cll.ExactMatchLevel("postcode").configure(label_for_charts="Different city, exact match on postcode"),
        cll.ExactMatchLevel("city").configure(label_for_charts="Different postcode, exact match on city"),
        cll.ElseLevel(),
    ],
)

# Define settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        postcode_city_comparison,
    ],
    max_iterations=5,
)


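# Build the linker on a DuckDB backend
linker = Linker(df, settings, db_api=DuckDBAPI())

# Estimate u probabilities from randomly sampled record pairs
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

# Estimate the remaining parameters with EM, using a different blocking rule per pass
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

# Plot the estimated match weight of each comparison level
linker.visualisations.match_weights_chart()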

zmbc · 2024-09-20T17:15:15Z

@RobinL's solution is equivalent to method 1 in S4 of the appendix of the fastLink paper, and works great for some use-cases, but doesn't allow e.g. specifying c1 * c2 and c2 * c3 without c1 * c2 * c3. For that, the second method in that appendix based on log-linear models is the solution, and there is a discussion of adding it to Splink here: #1310

V-Lamp · 2024-10-13T15:47:39Z

Thank you for the suggestion to use AND; I think I can incorporate it. However, one complication in practice is that comparisons usually have more than one comparison level.

For example:

postcode_comparison = [postcode_exact_match, postcode_area_match, postcode_sector_match]
address_comparison = [address_exact_match, street_name_match, address_fuzzy_match]
location_comparison = ???

I would need to make 9 levels (3 * 3) with AND, plus the other 3 + 3 levels, to define a location_comparison (so 3*3 + 3 + 3 = 15 levels in total). The challenge is then to find the right order for these 15 comparison levels, since ordering has a very significant effect. In my case, I actually have more than 3 levels per comparison, more like 6-8.

Have you found yourself in this combinatorial explosion and then further ordering problem?
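The level count described above can be enumerated with a short sketch. The level names are the placeholders from the example, not Splink objects, and `level_count` is a hypothetical helper that ignores the null and else levels.

```python
from itertools import product

# Level names taken from the example above; placeholders, not Splink objects.
postcode_levels = ["postcode_exact_match", "postcode_area_match", "postcode_sector_match"]
address_levels = ["address_exact_match", "street_name_match", "address_fuzzy_match"]

# Every pairing of a postcode level with an address level, joined with AND.
combined_levels = [f"{p} AND {a}" for p, a in product(postcode_levels, address_levels)]

# Levels where only one of the two fields satisfies a condition.
single_levels = postcode_levels + address_levels

all_levels = combined_levels + single_levels


def level_count(n_first: int, n_second: int) -> int:
    """Number of levels for two merged comparisons, excluding null and else levels."""
    return n_first * n_second + n_first + n_second


print(len(all_levels), level_count(8, 8))
```

The count grows with the product of the two level counts, which is why 6 to 8 levels per field quickly becomes hard to order by hand.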

RobinL · 2024-10-14T06:38:19Z

Yeah - you're right to highlight these challenges. It's typically best to try and order in terms of 'better matches higher' - start with the most precise matches and work your way down. Although I appreciate it's not always obvious in practice; you may need some trial and error.
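Why ordering matters can be shown with a plain-Python sketch. It assumes levels are checked top to bottom and the first satisfied condition wins, as in a SQL CASE expression. The `assign_level` helper and the record dictionaries are hypothetical and not part of Splink.

```python
from typing import Callable

Record = dict[str, str]
Level = tuple[str, Callable[[Record, Record], bool]]


def assign_level(levels: list[Level], left: Record, right: Record) -> str:
    """Return the label of the first level whose condition holds."""
    for label, condition in levels:
        if condition(left, right):
            return label
    return "All other comparisons"


both: Level = (
    "Exact match on both postcode and city",
    lambda a, b: a["postcode"] == b["postcode"] and a["city"] == b["city"],
)
postcode_only: Level = ("Exact match on postcode", lambda a, b: a["postcode"] == b["postcode"])
city_only: Level = ("Exact match on city", lambda a, b: a["city"] == b["city"])

# Most precise level first: a pair agreeing on both fields reaches the "both" level.
good_order = [both, postcode_only, city_only]
# Less precise level first: the "both" level can never be reached.
bad_order = [city_only, postcode_only, both]

left = {"postcode": "AB1", "city": "London"}
right = {"postcode": "AB1", "city": "London"}

print(assign_level(good_order, left, right))
print(assign_level(bad_order, left, right))
```

In the second ordering, any pair that agrees on both fields is captured by the city level first, so the more precise level never receives any pairs.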

I agree that the combinatorial explosion problem is real, but on a large dataset having (say) 9 comparison levels is totally fine. Ultimately, each one corresponds to two parameters to estimate, so 18 parameters is not very many at all (compared to, say, other ML approaches which can have thousands).

In our production models we tend to have between about 2 and 10 comparison levels per comparison.

V-Lamp added the enhancement label Sep 18, 2024

V-Lamp changed the title ~~[FEAT] Interaction term between two comparisons~~ [FEAT] Interaction term between two comparisons for correlated comparisons Sep 18, 2024

V-Lamp changed the title ~~[FEAT] Interaction term between two comparisons for correlated comparisons~~ [FEAT] Interaction term between two correlated comparisons Sep 18, 2024
