Weighting the matching fields #49

paull71 · 2021-03-15T23:16:21Z

Hi Guys,

Nice work with the package.

I am currently using the package to link addresses in our company database to the Geocoded National Address File (GNAF) (https://data.gov.au/data/dataset/19432f89-dc3a-4ef3-b943-5326ef1dbecc), which contains more than 14 million addresses in Australia.

Aside from the size of the GNAF (which requires clustering to manage), the challenge has been weighting the fields by relative importance. Because this is all spatial data, the fields differ in how much they should count towards a match (Postcode > Suburb > Street name > house number), yet the weighting seems to be even. Is there any way of influencing the weighting placed on the various fields?

For example, the address that I am matching is 8, Unique, Road, Town, 7000, South Australia (row 1 in the following table).

Num	Scenario	StreetNumber	StreetName	StreetType	Suburb	Postcode	State
1	Database	8	unique	street	orange	7000	south australia
2	GNAF	9	unique	street	apples	7000	south australia
3	Fastlink match	8	other	street	pears	7000	south australia
	Row 3 vs row 1	match	no match	match	no match	match	match

I can see a few addresses in the GNAF on Unique street (e.g. row 2); however, the Town/Suburb and StreetNumber are incorrect in my database, going by the authority that is the GNAF. fastLink is picking as a match an address that has the same number but a different street in a whole other suburb (row 3). By my count, row 3 has 4 matches and 2 non-matches against row 1, which is the same tally it would have had if it had picked row 2 to match row 1.

Is there any way of prioritising the StreetName, or any other field, over the StreetNumber?

The subset of GNAF in South Australia has 440k rows, while the South Australian subset of my database is 164 rows. That subset is only 164 rows because 90% of my database can be matched identically; the 164 are the problematic ones with varying data quality issues.

kosukeimai · 2021-03-17T10:32:22Z

One suggestion: you should include exact matches in your matching process. This helps the algorithm estimate the parameters. After that, you can set aside exact matches and focus on non-exact matches. The algorithm is designed to self-weight each field and my hope is that including exact matches will do just that.
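To make the idea of self-weighting concrete, here is a minimal Python sketch of Fellegi-Sunter style log-ratio weights applied to rows 2 and 3 of the table in the question. It is not fastLink's code, and the m and u probabilities are invented for illustration; in practice the model estimates them from the data, which is why including the exact matches can matter.

```python
import math

# Agreement pattern of each candidate against row 1 of the table in the
# question (True = field agrees, False = field disagrees).
FIELDS = ["StreetNumber", "StreetName", "StreetType", "Suburb", "Postcode", "State"]
row2 = dict(zip(FIELDS, [False, True, True, False, True, True]))
row3 = dict(zip(FIELDS, [True, False, True, False, True, True]))

# Hypothetical probabilities, invented for this example only:
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
m = {"StreetNumber": 0.90, "StreetName": 0.95, "StreetType": 0.95,
     "Suburb": 0.90, "Postcode": 0.95, "State": 0.99}
u = {"StreetNumber": 0.05, "StreetName": 0.001, "StreetType": 0.60,
     "Suburb": 0.01, "Postcode": 0.01, "State": 0.99}

def field_weight(field, agrees):
    """Log-likelihood ratio contributed by one field."""
    if agrees:
        return math.log(m[field] / u[field])
    return math.log((1 - m[field]) / (1 - u[field]))

def total_weight(pattern):
    """Sum of the per-field weights for one agreement pattern."""
    return sum(field_weight(f, a) for f, a in pattern.items())

# Counting agreements treats every field the same, so the two rows tie.
count2 = sum(row2.values())
count3 = sum(row3.values())

# With field-specific weights, agreeing on a rare street name is worth
# more than agreeing on a common house number, so row 2 scores higher.
w2, w3 = total_weight(row2), total_weight(row3)
print(count2, count3, round(w2, 2), round(w3, 2))
```

Under these made-up numbers, both rows agree on four fields, but row 2 receives the larger total weight because a chance agreement on StreetName is assumed to be much rarer than a chance agreement on StreetNumber.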

tedenamorado · 2021-03-17T14:22:20Z

Hi @paull71,

As @kosukeimai mentioned, one good idea is to include the exact matches in your original data. If your dataset is around 1700 observations, I do not think that will create any additional bottleneck in terms of computational efficiency. Basically, adding exact matches could help the model learn that agreement on StreetName is more discriminative than it currently estimates.

If the problem becomes too large as a result of adding back the exact matches, I would divide the problem into smaller ones by blocking on a field you believe is measured with little to no error (e.g., State or Postcode).
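As a rough illustration of what blocking buys you, the Python sketch below only pairs records that share the blocking key. The records, ids, and the helper name block_pairs are made up for this example; it is not fastLink code.

```python
from collections import defaultdict

def block_pairs(records_a, records_b, key):
    """Yield only the candidate pairs whose blocking key is identical."""
    blocks_b = defaultdict(list)
    for rec in records_b:
        blocks_b[rec[key]].append(rec)
    for rec_a in records_a:
        for rec_b in blocks_b.get(rec_a[key], []):
            yield rec_a, rec_b

# Toy records (hypothetical ids and postcodes).
database = [
    {"id": "a1", "Postcode": "7000"},
    {"id": "a2", "Postcode": "7001"},
]
gnaf = [
    {"id": "b1", "Postcode": "7000"},
    {"id": "b2", "Postcode": "7000"},
    {"id": "b3", "Postcode": "7001"},
    {"id": "b4", "Postcode": "7002"},
]

pairs = list(block_pairs(database, gnaf, "Postcode"))
all_pairs = len(database) * len(gnaf)
print(len(pairs), "comparisons instead of", all_pairs)
```

The trade-off is that a true match with an error in the blocking field is never compared, which is why the blocking field should be one you trust.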

Another idea: combine StreetName and StreetType to create StreetNameType. By combining both fields, you create a field that is more discriminative and can differentiate between orange street and orange avenue. Such a strategy has worked well for us in practice. My guess is that StreetType does not take many values, so the model does not consider it to be discriminative.
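A small Python sketch of the combined field, using made-up street records: the combined value takes more distinct values than either field alone, so a chance agreement on it is less likely.

```python
def combine(record):
    """Build a StreetNameType value by joining name and type with a space."""
    return f"{record['StreetName']} {record['StreetType']}"

# Made-up records for illustration.
streets = [
    {"StreetName": "orange", "StreetType": "street"},
    {"StreetName": "orange", "StreetType": "avenue"},
    {"StreetName": "unique", "StreetType": "street"},
    {"StreetName": "other", "StreetType": "street"},
]

distinct_types = {r["StreetType"] for r in streets}
distinct_names = {r["StreetName"] for r in streets}
distinct_combined = {combine(r) for r in streets}

# StreetType alone has few values; the combined field separates
# "orange street" from "orange avenue".
print(len(distinct_types), len(distinct_names), len(distinct_combined))
```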

Hope this helps!

All my best,

Ted

aalexandersson · 2021-03-17T18:58:18Z

From my user perspective, I fully agree with the suggestions.

Another idea: Add the argument partial.match = c("StreetName") if you want to prioritize StreetName because without the argument you only compare full matches of StreetName, not partial matches.
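To illustrate the difference, here is a hedged Python sketch of how a string similarity score might be turned into agreement levels with and without a partial-match category. The function name and the thresholds 0.94 and 0.88 are assumptions for this example (they mirror what I understand the cut.a and cut.p settings to control); check the package documentation for the actual defaults.

```python
def agreement_level(similarity, cut_a=0.94, cut_p=0.88, partial=True):
    """Map a similarity in [0, 1] to an agreement level.

    cut_a and cut_p are assumed thresholds for full and partial agreement.
    With partial=False there are only two levels: agree or disagree.
    """
    if similarity >= cut_a:
        return "agree"
    if partial and similarity >= cut_p:
        return "partial"
    return "disagree"

# A near miss on the street name (hypothetical similarity of 0.90)
# is kept as evidence only when partial matching is enabled.
with_partial = agreement_level(0.90, partial=True)
without_partial = agreement_level(0.90, partial=False)
print(with_partial, without_partial)
```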

Also, some care is required if you decide to combine two fields. For example, the default stringdist.method = "jw" prioritizes the first part of the string (e.g., the part StreetName in StreetNameType and the part StreetType in StreetTypeName).
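The prefix effect can be seen in a small, self-contained Python implementation of the textbook Jaro and Jaro-Winkler similarities. This is a sketch, not the code that fastLink or the stringdist package runs, and the prefix weight p = 0.1 is an assumption for the example.

```python
def jaro(s, t):
    """Textbook Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    s_seq = [c for c, hit in zip(s, s_matched) if hit]
    t_seq = [c for c, hit in zip(t, t_matched) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1):
    """Jaro similarity plus a bonus for a shared prefix of up to 4 characters."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)

# The same kind of typo (two adjacent letters swapped) at the end
# versus at the start of the string.
typo_at_end = jaro_winkler("unique", "uniqeu")
typo_at_start = jaro_winkler("unique", "nuique")
print(round(typo_at_end, 4), round(typo_at_start, 4))
```

Plain Jaro scores the two typos identically; only the prefix bonus separates them. For a combined field, this means errors in whichever part comes first are penalised more than errors in the part that comes second.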

paull71 · 2021-03-18T09:07:01Z

Thanks everyone - will give your suggestions a go.

aalexandersson mentioned this issue Nov 1, 2022

Looking for a way to feed threshold cutoffs to individual variables #66

Open
