This repository has been archived by the owner on Dec 14, 2020. It is now read-only.

Rule-based handling of punctuation #14

Open
danielhers opened this issue May 9, 2018 · 3 comments

Comments

@danielhers
Owner

Since punctuation has a specific location it has to appear in (according to UCCA normalization rules, it has to be a child of the lowest common ancestor of its preceding and following terminal), there is no need to make the classifier decide where it should go.
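The rule above can be illustrated with a minimal sketch (not TUPA's actual code; the tree, node names, and helper functions here are invented for the example): a punctuation token between two terminals is attached under the lowest common ancestor of those terminals.

```python
# Illustrative sketch of LCA-based punctuation attachment.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def ancestors(node):
    """Return the path from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def lowest_common_ancestor(a, b):
    """First ancestor of b (walking upward) that is also an ancestor of a."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    raise ValueError("nodes are not in the same tree")

# Tiny example tree: root -> (left -> w1), (right -> w2)
root = Node("root")
left = Node("left", root)
right = Node("right", root)
w1 = Node("w1", left)
w2 = Node("w2", right)

# A comma between terminals w1 and w2 would be attached under their LCA:
print(lowest_common_ancestor(w1, w2).name)  # root
```

Because the attachment point is fully determined by the neighboring terminals, the classifier never needs to predict it.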

Punctuation should stay in the list of terminals so that the BiLSTM sees it when going over the text, but it should not go in the buffer as nodes.

@CarolLi

CarolLi commented May 10, 2019

I have a large dataset that I want to parse with UCCA. However, a kind of punctuation commonly used in this dataset is recognized as a word after parsing. How can I deal with this problem?
<node ID="0.7" type="Word"> <attributes paragraph="1" paragraph_position="7" text="``" /> </node>

@danielhers
Owner Author

For plain text, TUPA uses spaCy for tokenization and punctuation identification. This is the relevant line of code: https://github.com/danielhers/ucca/blob/master/ucca/convert.py#L769
Now, spaCy (at least with the en_core_web_md model) seems to treat `` as non-punctuation.
To fix this, either replace all occurrences of this with something else, like ", or use a different spaCy model that treats it as punctuation.
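The first option (replacing the backtick quotes before parsing) can be sketched like this; `normalize_quotes` is a hypothetical helper, not part of TUPA or ucca:

```python
# Sketch: normalize PTB/LaTeX-style backtick quotes to plain double quotes
# before feeding plain text to TUPA, so spaCy tags them as punctuation.
def normalize_quotes(text):
    # `` and '' are the PTB/LaTeX conventions for opening/closing quotes
    return text.replace("``", '"').replace("''", '"')

print(normalize_quotes("He said ``hello'' to me."))  # He said "hello" to me.
```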
To change the spaCy model used, do

from ucca import textutil
textutil.nlp["en"] = my_custom_spacy_model()

Here my_custom_spacy_model is a function you created that returns a custom spaCy model. You can use a language other than English by passing its two-letter code instead of en; just remember to pass the same two-letter code as --lang to TUPA.
To create a custom spaCy model with different tokenization/punctuation-identification, see this question and its answer: https://stackoverflow.com/questions/51012476/spacy-custom-tokenizer-to-include-only-hyphen-words-as-tokens-using-infix-regex

@CarolLi

CarolLi commented May 20, 2019

Thank you for your detailed answer!
