False Positives - TFN Recognizer gets a 1.0 confidence when passing checksum validation #1071

tunechr · 2023-05-15T00:05:47Z

Hi All

We are getting a lot of false positives from our recognizers, mainly TFN and PCI, along with other recognizers that use checksums.

We get many false matches in spreadsheets and even logs, simply because of the probability that an arbitrary number passes the pattern and the checksum.
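
To put a rough number on that probability: the 9-digit TFN check is a weighted sum modulo 11, so roughly one in eleven arbitrary 9-digit strings passes it. The sketch below assumes the commonly published weights (1, 4, 3, 7, 5, 8, 6, 9, 10) and is not Presidio's own code; `tfn_checksum_ok` and `pass_rate` are hypothetical helper names used only for illustration.

```python
# Weights commonly published for the 9-digit TFN check.
# Assumption: written independently here, not copied from Presidio.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def tfn_checksum_ok(digits: str) -> bool:
    """Return True if a 9-digit string passes the weighted mod-11 check."""
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(digits, TFN_WEIGHTS))
    return total % 11 == 0


def pass_rate(start: int, count: int) -> float:
    """Fraction of `count` consecutive 9-digit numbers from `start` that pass."""
    hits = sum(tfn_checksum_ok(f"{n:09d}") for n in range(start, start + count))
    return hits / count


if __name__ == "__main__":
    # Expect a value near 1/11, i.e. about 9% of arbitrary 9-digit numbers.
    print(f"pass rate: {pass_rate(100_000_000, 100_000):.3f}")
```

With a pass rate of that order, a spreadsheet full of long numeric cells will produce checksum-valid matches by chance alone.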

It seems the code sets the confidence to 1.0 as soon as a match passes the checksum, and the context words are only checked after this step.
So context words effectively have no influence on these matches?

Is there a way we can make these recognizers more accurate, for example by taking the context words into account?

Cheers
Chris

omri374 · 2023-05-15T07:45:24Z

Maintainer

Hi, checksum does raise the score to 1.0. Could you please provide some examples of false positives that pass a checksum?

9 replies

omri374 May 15, 2023
Maintainer

Then this would do the trick: analyzer.registry.remove_recognizer("AuTfnRecognizer")

wilfredchan1 May 15, 2023

Wouldn't this just remove TFN detection? We still need it.

omri374 May 15, 2023
Maintainer

Oh, I misread your previous comment. I am not an expert in Australian codes, but I would consider removing the TFN recognizer and creating a new recognizer class that inherits from it with slightly different logic. Alternatively, since Presidio does not remove multiple results for the same substring, I would consider a post-processing step that checks whether a string is detected as both ABN and TFN (as an example). That, however, would only help in specific cases like the ones you've given.
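
A minimal sketch of such a post-processing step, under stated assumptions: it works on any result objects exposing `entity_type`, `start` and `end`, and `Hit` is a hypothetical stand-in used only so the example runs without Presidio. The entity names `AU_TFN` and `AU_ABN` in the usage note are assumed labels, and dropping ambiguous hits is just one possible policy (lowering their score would be another).

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """Hypothetical stand-in for an analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


def ambiguous_spans(results: Iterable, watched: frozenset) -> dict:
    """Map (start, end) -> entity types for spans claimed by 2+ watched types."""
    by_span = defaultdict(set)
    for r in results:
        if r.entity_type in watched:
            by_span[(r.start, r.end)].add(r.entity_type)
    return {span: types for span, types in by_span.items() if len(types) > 1}


def drop_ambiguous(results: Iterable, watched: frozenset) -> list:
    """Keep every result except watched-type hits that sit on an ambiguous span."""
    results = list(results)
    clashes = ambiguous_spans(results, watched)
    return [
        r for r in results
        if not (r.entity_type in watched and (r.start, r.end) in clashes)
    ]
```

For example, given `Hit("AU_TFN", 10, 19, 1.0)` and `Hit("AU_ABN", 10, 19, 1.0)` on the same span plus an unrelated `Hit("EMAIL_ADDRESS", 30, 45, 1.0)`, calling `drop_ambiguous(hits, frozenset({"AU_TFN", "AU_ABN"}))` keeps only the email hit. Results of a watched type that are alone on their span are left untouched.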

tunechr May 16, 2023
Author

This will be a long reply, so apologies for the size!

I know you have already replied about the custom recognizer, thank you for that.

I thought I would add some info in case you were interested.
Here is an example of a string that matches PCI. (It comes from a spreadsheet, and one of the cells is a percentage.)

It gets picked up because it passes both the pattern and the checksum validation, even though there are no matching context words.
In the log at the bottom, you can see we get
'original_score': 0.3, 'score': 1.0,

TEXT:
COMPANY NAME MONTHLY BUDGET Date BUDGET TOTALS ESTIMATED ACTUAL DIFFERENCE Income 63300.0 57450.0 -5850.0 Expenses 54500.0 49630.0 4870.0 Balance (Income minus Expenses) 8800.0 7820.0 -980.0 Budget Overview chart is in this cell. Top 5 Operating Expenses are automatically updated in Top5Expenses table, below. WHAT ARE MY TOP 5 HIGHEST OPERATING EXPENSES? EXPENSE AMOUNT % OF EXPENSES 15% REDUCTION Maintenance and repairs 4600.000014 0.09268587576062866 690.0000021 Supplies 4500.00002 0.09067096554503326 675.000003 Rent or mortgage 4500.000017 0.09067096548458595 675.00000255 Taxes 3200.000021 0.06447713119081201 480.00000314999994 Advertising 2500.000005 0.05037275851299617 375.00000074999997 Total 19300.000077000004 0.388877696494056 2895.0000115499997 COMPANY NAME MONTHLY BUDGET INCOME ESTIMATED ACTUAL TOP 5 AMOUNT DIFFERENCE Net sales 60000.0 54000.0 54000.000005 -6000.0 Interest income 3000.0 3000.0 3000.000006 0.0 Asset sales (gain/loss) 300.0 450.0 450.000007 150.0 Total Income 63300.0 57450.0 -5850.0 COMPANY NAME MONTHLY BUDGET PERSONNEL EXPENSES ESTIMATED ACTUAL TOP 5 AMOUNT DIFFERENCE Wages 9500.0 9600.0 9600.000005 -100.0 Employee benefits 4000.0 0.0 6e-06 4000.0 Commission 5000.0 4500.0 4500.000007 500.0 Total Personnel Expenses 18500.0 14100.0 4400.0 COMPANY NAME MONTHLY BUDGET OPERATING EXPENSES ESTIMATED ACTUAL TOP 5 AMOUNT DIFFERENCE Advertising 3000.0 2500.0 2500.000005 500.0 Bad debts 2000.0 2000.0 2000.000006 0.0 Cash discounts 1500.0 2175.0 2175.000007 -675.0 Delivery costs 2000.0 1500.0 1500.000008 500.0 Depreciation 1000.0 1000.0 1000.000009 0.0 Dues and subscriptions 500.0 525.0 525.00001 -25.0 Insurance 1300.0 1275.0 1275.000011 25.0 Interest 2000.0 2200.0 2200.000012 -200.0 Legal and auditing 1000.0 800.0 800.000013 200.0 Maintenance and repairs 4500.0 4600.0 4600.000014 -100.0 Office supplies 800.0 750.0 750.000015 50.0 Postage 400.0 350.0 350.000016 50.0 Rent or mortgage 4100.0 4500.0 4500.000017 -400.0 Sales expenses 350.0 400.0 400.000018 
-50.0 Shipping and storage 900.0 840.0 840.000019 60.0 Supplies 5000.0 4500.0 4500.00002 500.0 Taxes 3000.0 3200.0 3200.000021 -200.0 Telephone 250.0 280.0 280.000022 -30.0 Utilities 1400.0 1385.0 1385.000023 15.0 Other 1000.0 750.0 750.000024 250.0 Total Operating Expenses 36000.0 35530.0 470.0

Here is the log output.
[2023-05-10 16:29:48,872][decision_process][INFO][None][nlp artifacts:{"entities": ["63300.0 57450.0", "54500.0 49630.0", "8800.0 7820.0", "5", "5", "15%", "4500.000017 0.09067096548458595", "19300.000077000004", "0.388877696494056 2895.0000115499997", "5", "60000.0", "54000.0 54000.000005", "3000.0 3000.0", "300.0", "150.0", "63300.0", "5", "9500.0 9600.0 9600.000005", "4000.0", "6e-06 4000.0 Commission", "5000.0 4500.0", "500.0", "18500.0 14100.0 4400.0", "5", "3000.0 2500.0 2500.000005", "500.0", "2000.0 2000.0", "2000.000006 0.0", "1500.0 2175.0", "2175.000007", "2000.0", "500.0 525.0", "1300.0", "1275.0 1275.000011 25.0", "2000.0 2200.0 2200.000012", "1000.0 800.0", "200.0", "4500.0", "800.0", "750.0 750.000015 50.0", "400.0", "4100.0 4500.0 4500.000017", "350.0", "900.0", "5000.0 4500.0 4500.00002", "500.0", "3000.0 3200.0", "3200.000021", "250.0", "250.0", "36000.0", "470.0"], "tokens": ["COMPANY", "NAME", "MONTHLY", "BUDGET", "Date", "BUDGET", "TOTALS", "ESTIMATED", "ACTUAL", "DIFFERENCE", "Income", "63300.0", "57450.0", "-5850.0", "Expenses", "54500.0", "49630.0", "4870.0", "Balance", "(", "Income", "minus", "Expenses", ")", "8800.0", "7820.0", "-980.0", "Budget", "Overview", "chart", "is", "in", "this", "cell", ".", "Top", "5", "Operating", "Expenses", "are", "automatically", "updated", "in", "Top5Expenses", "table", ",", "below", ".", "WHAT", "ARE", "MY", "TOP", "5", "HIGHEST", "OPERATING", "EXPENSES", "?", "EXPENSE", "AMOUNT", "%", "OF", "EXPENSES", "15", "%", "REDUCTION", "Maintenance", "and", "repairs", "4600.000014", "0.09268587576062866", "690.0000021", "Supplies", "4500.00002", "0.09067096554503326", "675.000003", "Rent", "or", "mortgage", "4500.000017", "0.09067096548458595", "675.00000255", "Taxes", "3200.000021", "0.06447713119081201", "480.00000314999994", "Advertising", "2500.000005", "0.05037275851299617", "375.00000074999997", "Total", "19300.000077000004", " ", "0.388877696494056", "2895.0000115499997", "COMPANY", "NAME", "MONTHLY", 
"BUDGET", "INCOME", "ESTIMATED", "ACTUAL", "TOP", "5", "AMOUNT", "DIFFERENCE", "Net", "sales", "60000.0", "54000.0", "54000.000005", "-6000.0", "Interest", "income", "3000.0", "3000.0", "3000.000006", "0.0", "Asset", "sales", "(", "gain", "/", "loss", ")", "300.0", "450.0", "450.000007", "150.0", "Total", "Income", "63300.0", "57450.0", "-5850.0", "COMPANY", "NAME", "MONTHLY", "BUDGET", "PERSONNEL", "EXPENSES", "ESTIMATED", "ACTUAL", "TOP", "5", "AMOUNT", "DIFFERENCE", "Wages", "9500.0", "9600.0", "9600.000005", "-100.0", "Employee", "benefits", "4000.0", "0.0", "6e-06", "4000.0", "Commission", "5000.0", "4500.0", "4500.000007", "500.0", "Total", "Personnel", "Expenses", "18500.0", "14100.0", "4400.0", "COMPANY", "NAME", "MONTHLY", "BUDGET", "OPERATING", "EXPENSES", "ESTIMATED", "ACTUAL", "TOP", "5", "AMOUNT", "DIFFERENCE", "Advertising", "3000.0", "2500.0", "2500.000005", "500.0", "Bad", "debts", "2000.0", "2000.0", "2000.000006", "0.0", "Cash", "discounts", "1500.0", "2175.0", "2175.000007", "-675.0", "Delivery", "costs", "2000.0", "1500.0", "1500.000008", "500.0", "Depreciation", "1000.0", "1000.0", "1000.000009", "0.0", "Dues", "and", "subscriptions", "500.0", "525.0", "525.00001", "-25.0", "Insurance", "1300.0", "1275.0", "1275.000011", "25.0", "Interest", "2000.0", "2200.0", "2200.000012", "-200.0", "Legal", "and", "auditing", "1000.0", "800.0", "800.000013", "200.0", "Maintenance", "and", "repairs", "4500.0", "4600.0", "4600.000014", "-100.0", "Office", "supplies", "800.0", "750.0", "750.000015", "50.0", "Postage", "400.0", "350.0", "350.000016", "50.0", "Rent", "or", "mortgage", "4100.0", "4500.0", "4500.000017", "-400.0", "Sales", "expenses", "350.0", "400.0", "400.000018", "-50.0", "Shipping", "and", "storage", "900.0", "840.0", "840.000019", "60.0", "Supplies", "5000.0", "4500.0", "4500.00002", "500.0", "Taxes", "3000.0", "3200.0", "3200.000021", "-200.0", "Telephone", "250.0", "280.0", "280.000022", "-30.0", "Utilities", "1400.0", "1385.0", 
"1385.000023", "15.0", "Other", "1000.0", "750.0", "750.000024", "250.0", "Total", "Operating", "Expenses", "36000.0", "35530.0", "470.0", "\t"], "lemmas": ["company", "name", "MONTHLY", "BUDGET", "Date", "BUDGET", "total", "estimate", "actual", "difference", "income", "63300.0", "57450.0", "-5850.0", "expense", "54500.0", "49630.0", "4870.0", "balance", "(", "Income", "minus", "expense", ")", "8800.0", "7820.0", "-980.0", "Budget", "Overview", "chart", "be", "in", "this", "cell", ".", "top", "5", "operating", "expense", "be", "automatically", "update", "in", "top5expense", "table", ",", "below", ".", "what", "be", "my", "top", "5", "highest", "operating", "expense", "?", "EXPENSE", "amount", "%", "of", "expense", "15", "%", "reduction", "maintenance", "and", "repair", "4600.000014", "0.09268587576062866", "690.0000021", "supply", "4500.00002", "0.09067096554503326", "675.000003", "rent", "or", "mortgage", "4500.000017", "0.09067096548458595", "675.00000255", "taxis", "3200.000021", "0.06447713119081201", "480.00000314999994", "advertising", "2500.000005", "0.05037275851299617", "375.00000074999997", "total", "19300.000077000004", " ", "0.388877696494056", "2895.0000115499997", "company", "name", "MONTHLY", "BUDGET", "income", "estimate", "ACTUAL", "top", "5", "amount", "difference", "net", "sale", "60000.0", "54000.0", "54000.000005", "-6000.0", "interest", "income", "3000.0", "3000.0", "3000.000006", "0.0", "Asset", "sale", "(", "gain", "/", "loss", ")", "300.0", "450.0", "450.000007", "150.0", "Total", "Income", "63300.0", "57450.0", "-5850.0", "company", "name", "MONTHLY", "BUDGET", "PERSONNEL", "expense", "estimate", "actual", "top", "5", "amount", "difference", "wage", "9500.0", "9600.0", "9600.000005", "-100.0", "employee", "benefit", "4000.0", "0.0", "6e-06", "4000.0", "Commission", "5000.0", "4500.0", "4500.000007", "500.0", "Total", "Personnel", "expense", "18500.0", "14100.0", "4400.0", "company", "name", "MONTHLY", "BUDGET", "operating", "expense", 
"estimate", "actual", "top", "5", "amount", "difference", "advertising", "3000.0", "2500.0", "2500.000005", "500.0", "bad", "debt", "2000.0", "2000.0", "2000.000006", "0.0", "cash", "discount", "1500.0", "2175.0", "2175.000007", "-675.0", "delivery", "cost", "2000.0", "1500.0", "1500.000008", "500.0", "depreciation", "1000.0", "1000.0", "1000.000009", "0.0", "due", "and", "subscription", "500.0", "525.0", "525.00001", "-25.0", "insurance", "1300.0", "1275.0", "1275.000011", "25.0", "interest", "2000.0", "2200.0", "2200.000012", "-200.0", "legal", "and", "audit", "1000.0", "800.0", "800.000013", "200.0", "maintenance", "and", "repair", "4500.0", "4600.0", "4600.000014", "-100.0", "Office", "supply", "800.0", "750.0", "750.000015", "50.0", "postage", "400.0", "350.0", "350.000016", "50.0", "rent", "or", "mortgage", "4100.0", "4500.0", "4500.000017", "-400.0", "sale", "expense", "350.0", "400.0", "400.000018", "-50.0", "shipping", "and", "storage", "900.0", "840.0", "840.000019", "60.0", "supply", "5000.0", "4500.0", "4500.00002", "500.0", "taxis", "3000.0", "3200.0", "3200.000021", "-200.0", "Telephone", "250.0", "280.0", "280.000022", "-30.0", "utility", "1400.0", "1385.0", "1385.000023", "15.0", "other", "1000.0", "750.0", "750.000024", "250.0", "total", "operating", "expense", "36000.0", "35530.0", "470.0", "\t"], "tokens_indices": [0, 8, 13, 21, 28, 33, 40, 47, 57, 64, 75, 82, 90, 98, 106, 115, 123, 131, 138, 146, 147, 154, 160, 168, 170, 177, 184, 191, 198, 207, 213, 216, 219, 224, 228, 230, 234, 236, 246, 255, 259, 273, 281, 284, 297, 302, 304, 309, 311, 316, 320, 323, 327, 329, 337, 347, 355, 357, 365, 372, 374, 377, 386, 388, 390, 400, 412, 416, 424, 436, 456, 468, 477, 488, 508, 519, 524, 527, 536, 548, 568, 581, 587, 599, 619, 638, 650, 662, 682, 701, 707, 726, 727, 745, 764, 772, 777, 785, 792, 799, 809, 816, 820, 822, 829, 840, 844, 850, 858, 866, 879, 887, 896, 903, 910, 917, 929, 933, 939, 945, 946, 950, 951, 955, 957, 963, 969, 980, 986, 992, 999, 
1007, 1015, 1023, 1031, 1036, 1044, 1051, 1061, 1070, 1080, 1087, 1091, 1093, 1100, 1111, 1117, 1124, 1131, 1143, 1150, 1159, 1168, 1175, 1179, 1185, 1192, 1203, 1210, 1217, 1229, 1235, 1241, 1251, 1260, 1268, 1276, 1283, 1291, 1296, 1304, 1311, 1321, 1330, 1340, 1347, 1351, 1353, 1360, 1371, 1383, 1390, 1397, 1409, 1415, 1419, 1425, 1432, 1439, 1451, 1455, 1460, 1470, 1477, 1484, 1496, 1503, 1512, 1518, 1525, 1532, 1544, 1550, 1563, 1570, 1577, 1589, 1593, 1598, 1602, 1616, 1622, 1628, 1638, 1644, 1654, 1661, 1668, 1680, 1685, 1694, 1701, 1708, 1720, 1727, 1733, 1737, 1746, 1753, 1759, 1770, 1776, 1788, 1792, 1800, 1807, 1814, 1826, 1833, 1840, 1849, 1855, 1861, 1872, 1877, 1885, 1891, 1897, 1908, 1913, 1918, 1921, 1930, 1937, 1944, 1956, 1963, 1969, 1978, 1984, 1990, 2001, 2007, 2016, 2020, 2028, 2034, 2040, 2051, 2056, 2065, 2072, 2079, 2090, 2096, 2102, 2109, 2116, 2128, 2135, 2145, 2151, 2157, 2168, 2174, 2184, 2191, 2198, 2210, 2215, 2221, 2228, 2234, 2245, 2251, 2257, 2267, 2276, 2284, 2292, 2297], "keywords": ["company", "monthly", "budget", "date", "budget", "total", "estimate", "actual", "difference", "income", "63300.0", "57450.0", "-5850.0", "expense", "54500.0", "49630.0", "4870.0", "balance", "income", "minus", "expense", "8800.0", "7820.0", "-980.0", "budget", "overview", "chart", "cell", "5", "operating", "expense", "automatically", "update", "top5expense", "table", "5", "highest", "operating", "expense", "expense", "expense", "15", "reduction", "maintenance", "repair", "4600.000014", "0.09268587576062866", "690.0000021", "supply", "4500.00002", "0.09067096554503326", "675.000003", "rent", "mortgage", "4500.000017", "0.09067096548458595", "675.00000255", "taxis", "3200.000021", "0.06447713119081201", "480.00000314999994", "advertising", "2500.000005", "0.05037275851299617", "375.00000074999997", "total", "19300.000077000004", " ", "0.388877696494056", "2895.0000115499997", "company", "monthly", "budget", "income", "estimate", "actual", "5", 
"difference", "net", "sale", "60000.0", "54000.0", "54000.000005", "-6000.0", "interest", "income", "3000.0", "3000.0", "3000.000006", "0.0", "asset", "sale", "gain", "loss", "300.0", "450.0", "450.000007", "150.0", "total", "income", "63300.0", "57450.0", "-5850.0", "company", "monthly", "budget", "personnel", "expense", "estimate", "actual", "5", "difference", "wage", "9500.0", "9600.0", "9600.000005", "-100.0", "employee", "benefit", "4000.0", "0.0", "6e-06", "4000.0", "commission", "5000.0", "4500.0", "4500.000007", "500.0", "total", "personnel", "expense", "18500.0", "14100.0", "4400.0", "company", "monthly", "budget", "operating", "expense", "estimate", "actual", "5", "difference", "advertising", "3000.0", "2500.0", "2500.000005", "500.0", "bad", "debt", "2000.0", "2000.0", "2000.000006", "0.0", "cash", "discount", "1500.0", "2175.0", "2175.000007", "-675.0", "delivery", "cost", "2000.0", "1500.0", "1500.000008", "500.0", "depreciation", "1000.0", "1000.0", "1000.000009", "0.0", "subscription", "500.0", "525.0", "525.00001", "-25.0", "insurance", "1300.0", "1275.0", "1275.000011", "25.0", "interest", "2000.0", "2200.0", "2200.000012", "-200.0", "legal", "audit", "1000.0", "800.0", "800.000013", "200.0", "maintenance", "repair", "4500.0", "4600.0", "4600.000014", "-100.0", "office", "supply", "800.0", "750.0", "750.000015", "50.0", "postage", "400.0", "350.0", "350.000016", "50.0", "rent", "mortgage", "4100.0", "4500.0", "4500.000017", "-400.0", "sale", "expense", "350.0", "400.0", "400.000018", "-50.0", "shipping", "storage", "900.0", "840.0", "840.000019", "60.0", "supply", "5000.0", "4500.0", "4500.00002", "500.0", "taxis", "3000.0", "3200.0", "3200.000021", "-200.0", "telephone", "250.0", "280.0", "280.000022", "-30.0", "utility", "1400.0", "1385.0", "1385.000023", "15.0", "1000.0", "750.0", "750.000024", "250.0", "total", "operating", "expense", "36000.0", "35530.0", "470.0", "\t"]}] [2023-05-10 
16:29:48,967][decision_process][INFO][None][["{'entity_type': 'CREDIT_CARD', 'start': 729, 'end': 744, 'score': 1.0, 'analysis_explanation': {'recognizer': 'CreditCardRecognizer', 'pattern_name': 'All Credit Cards (weak)', 'pattern': '\\\\b((4\\\\d{3})|(5[0-5]\\\\d{2})|(6\\\\d{3})|(1\\\\d{3})|(3\\\\d{3}))[- ]?(\\\\d{3,4})[- ]?(\\\\d{3,4})[- ]?(\\\\d{3,5})\\\\b', 'original_score': 0.3, 'score': 1.0, 'textual_explanation': None, 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': True}, 'recognition_metadata': {'recognizer_name': 'CreditCardRecognizer', 'recognizer_identifier': 'CreditCardRecognizer_1846346470672'}}"]]
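
The CREDIT_CARD hit in the log above spans offsets 729 to 744. The token 0.388877696494056 starts at index 727 in the log, so the match is the 15 digits after its decimal point. The sketch below is a plain Luhn check (the usual credit card checksum, written here independently of Presidio, with `luhn_valid` as a hypothetical helper name) showing that this digit run does pass, which is consistent with the score jumping from 0.3 to 1.0.

```python
def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right, sum, test mod 10."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    cell = "0.388877696494056"      # percentage cell from the spreadsheet text
    candidate = cell.split(".")[1]  # the 15 digits covered by the reported span
    print(candidate, luhn_valid(candidate))  # prints: 388877696494056 True
```

Luhn only has ten possible outcomes for the final digit, so about one in ten digit runs of a matching length will pass by chance.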

omri374 May 16, 2023
Maintainer

Checksum validation automatically boosts the confidence to 1.0 regardless of context words, on the assumption that a checksum has very specific logic and should therefore produce very few false positives.

If you would like to have a lower score for a string that passed the checksum validation, you could configure it this way:

from presidio_analyzer import EntityRecognizer

# MAX_SCORE is a class-level attribute, so this affects every recognizer in the process.
EntityRecognizer.MAX_SCORE = 0.5

Then, all checksum validations would result in a boost to 0.5, and context words could potentially boost this even further.
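
To illustrate the arithmetic, here is a toy model of the scoring order discussed in this thread: a match that passes the checksum is set to the maximum score, and a context boost is applied afterwards. It is not Presidio's code. The function name is hypothetical, and the 0.35 context boost and the 1.0 ceiling are assumed values that should be checked against your Presidio version.

```python
def score_after_checksum(
    has_context_word: bool,
    max_score: float = 1.0,       # stands in for EntityRecognizer.MAX_SCORE
    context_boost: float = 0.35,  # assumed boost for a matching context word
    ceiling: float = 1.0,         # assumed hard cap applied after the boost
) -> float:
    """Toy model of a checksum-valid match: start at max_score, then add context."""
    score = max_score
    if has_context_word:
        score = min(score + context_boost, ceiling)
    return score


if __name__ == "__main__":
    for max_score in (1.0, 0.5):
        without = score_after_checksum(False, max_score=max_score)
        with_ctx = score_after_checksum(True, max_score=max_score)
        print(f"MAX_SCORE={max_score}: no context={without:.2f}, context={with_ctx:.2f}")
```

Under these assumptions, the default setting gives 1.0 with or without context, so context cannot separate true hits from chance matches. With the maximum lowered to 0.5, a match without context stays at 0.5 while one with a context word rises to about 0.85, and a score threshold between the two can filter out the spreadsheet noise.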

Hope this helps!
