Hi, I'm trying to create a custom PatternRecognizer for detecting Swedish personal identification numbers, but it doesn't seem to be registered properly for use by the analyzer. Testing the recognizer on its own works fine. Any clue what I'm doing wrong?
# Install and imports
!pip install presidio_analyzer
!pip install -U spacy_stanza
import stanza
stanza.download("sv")
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry, PatternRecognizer, EntityRecognizer, Pattern, RecognizerResult
from presidio_analyzer.nlp_engine import NlpEngineProvider
# Custom recognizer
pnr_pattern = Pattern(name="pnr_pattern", regex=r"\d{6}(?:\d{2})?[-\s]?\d{4}", score=0.8)
pnr_recognizer = PatternRecognizer(supported_entity="PNR", patterns=[pnr_pattern])
# Create configuration, engine etc.
configuration = { "nlp_engine_name": "stanza", "models": [{"lang_code": "sv", "model_name": "sv"}] }
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["sv"])
# Add PNR-recognizer
analyzer.registry.add_recognizer(pnr_recognizer)
# Testing recognizer alone, works fine
text_to_analyze = "Mvh Adam Svensson 821011-0217. Ring på 073-1212123."
pnr_result = pnr_recognizer.analyze(text=text_to_analyze, entities=["PNR"])
print(pnr_result)
# prints: '[type: PNR, start: 18, end: 29, score: 0.8]'
# Testing analyzer, doesn't work as expected...
text_to_analyze = "Mvh Adam Svensson 821011-0217. Ring på 073-1212123."
results = analyzer.analyze(text=text_to_analyze, language="sv", entities=['PHONE_NUMBER', 'PNR'])
print(results)
# prints: WARNING:presidio-analyzer:Entity PNR doesn't have the corresponding recognizer in language : sv
# [type: PHONE_NUMBER, start: 18, end: 29, score: 0.4, type: PHONE_NUMBER, start: 39, end: 50, score: 0.4]
# Listing the recognizers yields the corresponding result (i.e. no PNR-recognizer)
recs = analyzer.get_recognizers(language='sv')
for rec in recs:
    print(f"-{rec.name}: {rec.supported_entities}")
# prints:
# -IpRecognizer: ['IP_ADDRESS']
# -EmailRecognizer: ['EMAIL_ADDRESS']
# -IbanRecognizer: ['IBAN_CODE']
# -CreditCardRecognizer: ['CREDIT_CARD']
# -MedicalLicenseRecognizer: ['MEDICAL_LICENSE']
# -StanzaRecognizer: ['DATE_TIME', 'NRP', 'LOCATION', 'PERSON']
# -CryptoRecognizer: ['CRYPTO']
# -UrlRecognizer: ['URL']
# -PhoneRecognizer: ['PHONE_NUMBER']
# -DateRecognizer: ['DATE_TIME']
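The pattern itself can be checked without Presidio. The following is a minimal sketch using only Python's standard `re` module, with the regex and sample text taken from the post above. The names `PNR_REGEX` and `matches` are illustrative only. It suggests that the regex matches the personal identification number but not the phone number, so the pattern is probably not the source of the problem.

```python
import re

# Same regex as in pnr_pattern: 6 or 8 digits, an optional hyphen or
# whitespace separator, then 4 digits.
PNR_REGEX = re.compile(r"\d{6}(?:\d{2})?[-\s]?\d{4}")

text = "Mvh Adam Svensson 821011-0217. Ring på 073-1212123."

# Collect (start, end, matched text) for every match in the sample text.
matches = [(m.start(), m.end(), m.group()) for m in PNR_REGEX.finditer(text)]
print(matches)
# prints: [(18, 29, '821011-0217')]
```

The offsets agree with the `start: 18, end: 29` reported by `pnr_recognizer.analyze` above.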
Answered by omri374 on Sep 5, 2023
Hi, every recognizer can only support one language, so the only thing missing is to define the language in pnr_recognizer to support sv:
pnr_pattern = Pattern(name="pnr_pattern", regex=r"\d{6}(?:\d{2})?[-\s]?\d{4}", score=0.8)
pnr_recognizer = PatternRecognizer(supported_entity="PNR", patterns=[pnr_pattern], supported_language="sv")
Please check and let us know if you still experience issues.
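To illustrate why the missing language matters, here is a toy model of language-based filtering. It is not Presidio's actual implementation: `ToyRecognizer` and `ToyRegistry` are hypothetical names, and the sketch assumes that a recognizer created without a language defaults to `"en"`, as `PatternRecognizer` does. Under that assumption, a recognizer added without a language is stored in the registry but never returned for `"sv"`.

```python
from dataclasses import dataclass


@dataclass
class ToyRecognizer:
    """Hypothetical stand-in for a recognizer; not a Presidio class."""
    name: str
    supported_entity: str
    supported_language: str = "en"  # assumed default, mirroring PatternRecognizer


class ToyRegistry:
    """Hypothetical registry that filters recognizers by language."""

    def __init__(self):
        self.recognizers = []

    def add_recognizer(self, recognizer):
        self.recognizers.append(recognizer)

    def get_recognizers(self, language):
        # Only recognizers whose language matches the request are returned.
        return [r for r in self.recognizers if r.supported_language == language]


registry = ToyRegistry()
registry.add_recognizer(ToyRecognizer("PnrDefault", "PNR"))  # language falls back to "en"
registry.add_recognizer(ToyRecognizer("PnrSwedish", "PNR", supported_language="sv"))

sv_names = [r.name for r in registry.get_recognizers("sv")]
print(sv_names)
# prints: ['PnrSwedish']
```

In the real library, the equivalent check is to rerun `analyzer.get_recognizers(language='sv')` after adding the recognizer with `supported_language="sv"` and confirm that a recognizer supporting `PNR` appears in the list.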
Answer selected by mitch99