Connected words in Portuguese #12929
Replies: 1 comment
-
Hi! So if I understand you correctly, you have automatically generated transcripts in which words that should be separated by a space are run together, like "casadodespachoda", which should have been "casa do despacho da". Additionally, some words may contain (internal) spelling errors, right?

In general, spelling mistakes are not easy to address within the spaCy framework. Processing a text with a spaCy pipeline typically assumes that the original text can always be reconstructed, so we usually advise doing text preprocessing as a separate step. That said, a few packages have attempted something like this (perhaps with a focus on English), like https://github.com/R1j1t/contextualSpellCheck, but I'm not sure how immediately useful that will be in your project.

For the tokenization part, you can define a custom tokenizer or tokenizer rules that split one word into several, as showcased in the docs, where "gimme" is split into the tokens "gim" and "me". The issue is that you'd end up with a huge list of rules because, as I understand it, any two words can be arbitrarily lumped together. My first idea is that you'll likely want to implement some dictionary-based heuristics that try to recognize subwords, suggest splits that make statistical sense, and then split the tokens accordingly.
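As a rough illustration of the dictionary-based idea, here is a minimal sketch in plain Python. It is not a spaCy feature. The lexicon, its counts, the penalty constants, and the names `LEXICON`, `piece_cost`, and `segment` are all made up for this example; a real lexicon would have to be built from a corpus of period Portuguese. The function scores each possible split by word frequency and keeps the cheapest one, using dynamic programming.

```python
import math

# Hypothetical toy lexicon with invented counts (an assumption for this
# sketch); in practice it would come from a corpus of period Portuguese.
LEXICON = {
    "e": 600, "de": 500, "do": 400, "da": 400, "em": 300, "que": 350,
    "casa": 50, "casado": 2, "despacho": 20, "cinco": 30, "sendo": 25,
    "pos": 10,
}

# Arbitrary penalties for pieces that are not in the lexicon. The high
# base cost discourages carving short known words out of unknown ones.
UNKNOWN_BASE = 20.0
UNKNOWN_PER_CHAR = 1.0


def piece_cost(piece, lexicon, total):
    """Cost of one piece: negative log relative frequency, or a penalty."""
    if piece in lexicon:
        return -math.log(lexicon[piece] / total)
    return UNKNOWN_BASE + UNKNOWN_PER_CHAR * len(piece)


def segment(token, lexicon=LEXICON):
    """Split a token into the lowest-cost sequence of pieces."""
    total = sum(lexicon.values())
    n = len(token)
    best = [0.0] + [math.inf] * n  # best[i]: cheapest cost of token[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + piece_cost(token[start:end], lexicon, total)
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(token[start:end])
        end = start
    return pieces[::-1]


print(segment("casadodespachoda"))  # ['casa', 'do', 'despacho', 'da']
print(segment("emquepos"))          # ['em', 'que', 'pos']
```

With these invented counts, "casa do despacho da" beats "casado despacho da" only because "casado" was given a low frequency, so the quality of the splits depends heavily on the lexicon. Unknown words that happen to contain short known words (names, for instance) can still be over-split, so the output is best treated as a suggestion to review in a preprocessing step before the text reaches the pipeline.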
-
Hello there,
I am learning spaCy.
I am facing an issue. My documents are automatic transcriptions of Inquisition trials from the first half of the 17th century.
Words are run together and the spelling is not modern. When I tried to use spaCy for NER, the results were not promising.
Are there any steps I could take to separate the words? Any ideas on how to modernise word spelling in Portuguese?
I have included an example below.
Thank you for your attention.
Cordially,
Lucia
"Aostrinta ehumdias domesdeMayo
doannodemil esiescentos quarenta ecinco
emLxa nosestaose casadodespachoda Sta
Inquisiçaõ estandoahy emaudienca damanhã
o senhor inquisidor Pedro de Castilho mandou
vir dantesy aMathias deAlbuquerque conde
deAlegretteesendo presente lhe foy dado
iuramentodossantos euangelhos emquepos"
It should be "Aos trinta e hum dias do mes de Mayo
do anno de mil e siescentos quarenta e cinco
em Lxa nos estaos e casa do despacho da Sta
Inquisiçaõ estando ahy em audiencia da manhã
o senhor inquisidor Pedro de Castilho mandou
vir dante sy a Mathias de Albuquerque conde
de Alegrette e sendo presente lhe foy dado
iuramento dos santos euangelhos em que pos"