Weighting the matching fields #49

Open
paull71 opened this issue Mar 15, 2021 · 4 comments

Comments

@paull71

paull71 commented Mar 15, 2021

Hi Guys,

Nice work with the package.

I am currently using the package to link addresses in our company database to the Government National Address File (GNAF) (https://data.gov.au/data/dataset/19432f89-dc3a-4ef3-b943-5326ef1dbecc), which contains more than 14 million addresses in Australia.

Aside from the size of the GNAF (which requires clustering to manage), the challenge has been weighting the fields by relative importance. Given this is all spatial, the fields differ in how much they should count toward a match (Postcode > Suburb > Street name > house number), yet the weighting seems to be even. Is there any way of influencing the weighting placed on the various fields?

For example, the address that I am matching is 8, Unique, Road, Town, 7000, South Australia (row 1 in the following table).

| Num | Scenario | StreetNumber | StreetName | StreetType | Suburb | Postcode | State |
|---|---|---|---|---|---|---|---|
| 1 | Database | 8 | unique | street | orange | 7000 | south australia |
| 2 | GNAF | 9 | unique | street | apples | 7000 | south australia |
| 3 | Fastlink match | 8 | other | street | pears | 7000 | south australia |
| | Match status (row 3 vs. row 1) | match | no match | match | no match | match | match |

I can see a few addresses in the GNAF on Unique street (e.g. row 2); however, the Town/suburb and StreetNumber are incorrect in my database, taking the GNAF as the authority. Fastlink is instead picking up as a match an address that has the same number but a different street in a whole other suburb (row 3). By my count, row 3 has 4 matches and 2 non-matches against row 1, which is the same tally row 2 would have had if it had been picked.
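The tie described here can be checked with a few lines of base R. This is only the naive equal-weight tally from the question, not how fastLink scores pairs; the data frame simply retypes the three rows of the table above, and `count_agreements` is a hypothetical helper name.

```r
# The three rows from the table above, one column per field.
addr <- data.frame(
  Scenario     = c("Database", "GNAF", "Fastlink match"),
  StreetNumber = c("8", "9", "8"),
  StreetName   = c("unique", "unique", "other"),
  StreetType   = c("street", "street", "street"),
  Suburb       = c("orange", "apples", "pears"),
  Postcode     = c("7000", "7000", "7000"),
  State        = c("south australia", "south australia", "south australia"),
  stringsAsFactors = FALSE
)

fields <- c("StreetNumber", "StreetName", "StreetType",
            "Suburb", "Postcode", "State")

# Count exact field agreements between row 1 and a candidate row.
count_agreements <- function(candidate_row) {
  sum(unlist(addr[1, fields]) == unlist(addr[candidate_row, fields]))
}

agreements <- c(row2 = count_agreements(2), row3 = count_agreements(3))
print(agreements)  # both candidates agree with row 1 on 4 of the 6 fields
```

With every field counted equally, nothing separates the two candidates, which is why the relative weight of StreetName versus StreetNumber decides the outcome.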

Is there any way of prioritising the StreetName, or any other field, over the StreetNumber?

The subset of the GNAF in South Australia has 440k rows, while the South Australian subset of my database that I am matching has 164 rows. It is only 164 because 90% of my database can be matched identically; the 164 are the problematic rows with varying data quality issues.

@kosukeimai
Owner

One suggestion: you should include exact matches in your matching process. This helps the algorithm estimate the parameters. After that, you can set aside exact matches and focus on non-exact matches. The algorithm is designed to self-weight each field and my hope is that including exact matches will do just that.
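A minimal sketch of that workflow, assuming toy stand-ins for the two data sets: `company`, `gnaf`, the `exact` column, and the `row_key` helper are all hypothetical names, and the values are made up. The idea is to flag the exact matches instead of dropping them, link on everything, and only then narrow the review to the non-exact rows. The fastLink call is left commented out because it needs the package and real data.

```r
fields <- c("StreetNumber", "StreetName", "StreetType",
            "Suburb", "Postcode", "State")

# Toy stand-ins for the two data sets (hypothetical values).
company <- data.frame(
  StreetNumber = c("8", "12"),
  StreetName   = c("unique", "main"),
  StreetType   = c("street", "road"),
  Suburb       = c("orange", "apples"),
  Postcode     = c("7000", "7000"),
  State        = c("south australia", "south australia"),
  stringsAsFactors = FALSE
)
gnaf <- data.frame(
  StreetNumber = c("9", "12"),
  StreetName   = c("unique", "main"),
  StreetType   = c("street", "road"),
  Suburb       = c("apples", "apples"),
  Postcode     = c("7000", "7000"),
  State        = c("south australia", "south australia"),
  stringsAsFactors = FALSE
)

# Build one comparison key per row from all linkage fields.
row_key <- function(df) do.call(paste, c(df[fields], sep = "|"))

# Flag, rather than drop, the company rows with an identical GNAF row.
company$exact <- row_key(company) %in% row_key(gnaf)

# Link on every row so the exact matches inform the estimated parameters
# (needs the fastLink package; not run here):
# fl_out <- fastLink::fastLink(dfA = company, dfB = gnaf, varnames = fields)

# Afterwards, set the exact matches aside and review only the rest.
to_review <- company[!company$exact, ]
print(to_review)
```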

@tedenamorado
Collaborator

Hi @paull71,

As @kosukeimai mentioned, one good idea is to include the exact matches in your original data. If your dataset is around 1700 observations, I do not think that will create any additional computational bottleneck. Basically, adding exact matches could help the model learn that agreement on StreetName is more discriminative than it currently estimates.

If the problem becomes too large as a result of adding back the exact matches, I would divide the problem into smaller ones by blocking on a field you believe is measured with little to no error (e.g., State or Postcode).
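Blocking can be sketched in base R without relying on any particular helper. `block_by` below is a hypothetical function, and the two small data frames are made-up stand-ins; the per-block fastLink call is commented out because it needs the package and real data. If I recall correctly, fastLink also ships its own `blockData()` helper, so treat this only as an illustration of the idea.

```r
# Split two data frames into blocks on one field, keeping only the
# values that occur in both data sets.
block_by <- function(dfA, dfB, field) {
  shared <- intersect(unique(dfA[[field]]), unique(dfB[[field]]))
  lapply(setNames(shared, shared), function(value) {
    list(
      dfA = dfA[which(dfA[[field]] == value), , drop = FALSE],
      dfB = dfB[which(dfB[[field]] == value), , drop = FALSE]
    )
  })
}

# Hypothetical toy data.
company <- data.frame(id = 1:3,
                      Postcode = c("7000", "7000", "5000"),
                      stringsAsFactors = FALSE)
gnaf <- data.frame(id = 1:4,
                   Postcode = c("7000", "5000", "5000", "5001"),
                   stringsAsFactors = FALSE)

blocks <- block_by(company, gnaf, "Postcode")
block_sizes <- sapply(blocks, function(b) c(A = nrow(b$dfA), B = nrow(b$dfB)))
print(block_sizes)

# Within each block, link on the remaining fields. The blocking field is
# constant inside a block, so it carries no information there:
# results <- lapply(blocks, function(b)
#   fastLink::fastLink(dfA = b$dfA, dfB = b$dfB, varnames = other_fields))
```

One caveat with this design: very small blocks give the model little data to estimate from, so a coarser blocking field such as State may be the safer first attempt.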

Another idea: combine StreetName and StreetType to create StreetNameType. By combining both fields, you create a field that is more discriminative and can differentiate between orange street and orange avenue. Such a strategy has worked well for us in practice. My guess is that StreetType does not take many values, so the model does not consider it to be discriminative.
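Building the combined field is a one-liner in base R. A small sketch with made-up rows, assuming the column names from the question:

```r
# Hypothetical toy rows; only the two columns being combined are shown.
addr <- data.frame(
  StreetName = c("orange", "orange", "unique"),
  StreetType = c("street", "avenue", "street"),
  stringsAsFactors = FALSE
)

# Name first, then type. Note that paste() turns NA into the string "NA",
# so missing values should be handled before combining real data.
addr$StreetNameType <- trimws(paste(addr$StreetName, addr$StreetType))

# The combined field takes more distinct values than StreetType alone.
distinct_values <- c(
  StreetType     = length(unique(addr$StreetType)),
  StreetNameType = length(unique(addr$StreetNameType))
)
print(addr)
print(distinct_values)
```

The same column would need to be created in both data sets, and StreetNameType would then replace StreetName and StreetType in the list of linkage variables.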

Hope this helps!

All my best,

Ted

@aalexandersson

From my user perspective, I fully agree with the suggestions.

Another idea: add the argument partial.match = c("StreetName") if you want to prioritize StreetName. Without that argument, StreetName is compared only on full matches, not on partial matches.
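A sketch of what that call might look like. `link_addresses` is a hypothetical wrapper, and the choice of which fields to compare by string distance is my assumption, not something stated in this thread. As far as I recall from the fastLink documentation, any variable listed in `partial.match` also has to appear in `stringdist.match`, so StreetName is listed in both.

```r
# Hypothetical wrapper; dfA and dfB are the two address data frames.
# Calling it requires the fastLink package to be installed.
link_addresses <- function(dfA, dfB) {
  fastLink::fastLink(
    dfA = dfA,
    dfB = dfB,
    varnames = c("StreetNumber", "StreetName", "StreetType",
                 "Suburb", "Postcode", "State"),
    # Fields compared by string distance instead of exact equality
    # (an assumption made for this sketch).
    stringdist.match = c("StreetName", "Suburb"),
    # Adds a partial-agreement category for StreetName.
    partial.match = c("StreetName"),
    stringdist.method = "jw"
  )
}
```

Usage would be along the lines of `fl_out <- link_addresses(company, gnaf)`, where both data frames carry all six columns.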

Also, some care is required if you decide to combine two fields. For example, the default stringdist.method = "jw" (Jaro-Winkler) gives extra weight to the first part of the string, so the order matters: the StreetName part is favoured in StreetNameType, and the StreetType part in StreetTypeName.
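The ordering effect can be seen directly with the stringdist package, assuming it is installed. The strings are made up for illustration. Note that `stringdist()` defaults to `p = 0` (plain Jaro), so the Winkler prefix weight is set explicitly here to the conventional 0.1.

```r
if (requireNamespace("stringdist", quietly = TRUE)) {
  # Same street name, different street type, name placed first:
  # the shared prefix "uniq" earns the Winkler bonus.
  name_first <- stringdist::stringdist("unique street", "unique avenue",
                                       method = "jw", p = 0.1)

  # Same pair with the type placed first: no shared prefix, no bonus.
  type_first <- stringdist::stringdist("street unique", "avenue unique",
                                       method = "jw", p = 0.1)

  # Smaller distance means the two strings are treated as more similar.
  print(c(name_first = name_first, type_first = type_first))
}
```

With the name first, two addresses on the same street but with different street types look closer than they do with the type first, so the field order should reflect which part you want agreement on to count for more.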

@paull71
Author

paull71 commented Mar 18, 2021

Thanks everyone - will give your suggestions a go.
