Content update: inflected word-form entries in dictionaries should not receive independent morphodict entries #119
Comments
Although the \lemma field should suffice, the preferred approach is to use the FST analyzer |
Another example: nîpit |
This bug reflects a problem at the |
Aggregation is not detecting that the entry provided by the FST matches the entry in the database. This is because the FST generates the analysis |
I would assume there's a high likelihood that these small ordering gaps on word class codes would remain or reappear between sources, so I'm changing the comparison code to check for permutations at the subclass level. Because we are already checking constant length strings at this juncture it should not provide extra overhead. An alternative approach would be to always ensure that all sources follow the same ordering convention, but I think making the |
Also this requires a new |
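As a rough sketch of the permutation check described above (not the actual morphodict comparison code): two word class codes are treated as equal when their class letters match up to reordering and their subclass suffixes are identical. The code shapes "NDA-1" and "NAD-1" and the function name `same_word_class` are assumptions for illustration only.

```python
def same_word_class(code_a: str, code_b: str) -> bool:
    """Return True when two word class codes differ at most in letter order.

    Assumes a hypothetical "LETTERS-SUBCLASS" shape such as "NDA-1".
    Codes without a "-" are compared on their letters alone.
    """
    letters_a, _, subclass_a = code_a.partition("-")
    letters_b, _, subclass_b = code_b.partition("-")
    # Sorting compares the letters as a multiset, so "NDA" matches "NAD"
    # but not "NDI".
    return subclass_a == subclass_b and sorted(letters_a) == sorted(letters_b)


if __name__ == "__main__":
    print(same_word_class("NDA-1", "NAD-1"))
    print(same_word_class("NDA-1", "NDI-1"))
```

Since the codes are only a few characters long, sorting them costs very little compared with an exact string comparison.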
I was thinking about the same thing: there can be small discrepancies, and while we could fix this in the FST, the morphodict code, or the database, we'd like a language-independent solution that will also work for non-Algonquian languages like Tsuut'ina. In this respect, what is the current requirement for establishing that an entry is an inflected form of another entry? That is, how is the FST analysis parsed in this respect? |
I'm actually wondering if we should turn this into a linguist problem, though I'm not fully certain: we might want a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out. I.e./E.g. Alternatively, I'm wondering whether the comparison should be done on the same type of input, that is, comparing the FST analyses of nitâs and mitâs rather than comparing the FST analysis of nitâs with the part-of-speech code of mitâs. |
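One way such a linguist-defined mapping could look, as a minimal sketch: a table from sets of FST tags to word class codes, maintained by linguists and looked up by the code. The tag spellings ("+N", "+A", "+D", "+I"), the table contents, and the names `FST_TAGS_TO_WORD_CLASS` and `word_class_from_tags` are assumptions for illustration, not the project's actual data.

```python
from typing import Iterable, Optional

# Hypothetical linguist-maintained table: class-relevant FST tags -> word class.
FST_TAGS_TO_WORD_CLASS = {
    frozenset({"+N", "+A"}): "NA",
    frozenset({"+N", "+I"}): "NI",
    frozenset({"+N", "+A", "+D"}): "NDA",
    frozenset({"+N", "+I", "+D"}): "NDI",
}

# Every tag that appears in some key; all other tags are ignored on lookup.
CLASS_TAGS = frozenset().union(*FST_TAGS_TO_WORD_CLASS)


def word_class_from_tags(tags: Iterable[str]) -> Optional[str]:
    """Look up the word class for an FST analysis, ignoring tag order."""
    relevant = frozenset(tags) & CLASS_TAGS
    return FST_TAGS_TO_WORD_CLASS.get(relevant)


if __name__ == "__main__":
    print(word_class_from_tags(["+N", "+A", "+D", "+Px1Sg", "+Sg"]))
```

Keying the table on a frozenset means the order of tags in the analysis does not matter, and a language without these classes would only need a different table rather than different code.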
Also, this is an artifact of our computational modeling treating NA and NDA as more similar than NDA and NDI. |
Either would work, but the fundamental problem is whether order is truly necessary for the analysis information (that is, whether it should be a list at all or a set instead).
The current comparison is done in the But it did not fix |
Updated the |
Currently going through the list to ensure that all entries with a lemma are added as wordforms. It seems that this is still not the case. |
There are several (different) observable causes for this behaviour after checking the
Limitations on the FST are manifesting as well:
Most likely solution would be to attempt to match first against toolbox's |
Implementing the change to rely on
|
There was an agreement to implement a linguist-provided approach to override the lemma, and use the FST as backup. Ideally, |
|
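A minimal sketch of the agreed precedence, assuming the linguist-provided lemma arrives as an optional string and the FST is wrapped in a callable that returns candidate lemmas. The function name `resolve_lemma`, its parameters, and the choice to give up on ambiguous FST output are all hypothetical, not the project's actual behaviour.

```python
from typing import Callable, List, Optional


def resolve_lemma(
    wordform: str,
    linguist_lemma: Optional[str],
    fst_lemmatize: Callable[[str], List[str]],
) -> Optional[str]:
    """Prefer the linguist-provided lemma; fall back to the FST otherwise."""
    if linguist_lemma and linguist_lemma.strip():
        return linguist_lemma.strip()
    candidates = fst_lemmatize(wordform)
    # Assumption: only an unambiguous FST result is trusted; anything else
    # is left unresolved for manual review.
    if len(candidates) == 1:
        return candidates[0]
    return None


if __name__ == "__main__":
    # Stub analyzer standing in for the real FST.
    stub = {"nîminâniwan": ["nîmiw"]}
    print(resolve_lemma("nitâs", "mitâs", lambda w: stub.get(w, [])))
    print(resolve_lemma("nîminâniwan", None, lambda w: stub.get(w, [])))
```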
The script referred to in #122 does not deal with Many of the 170 get merged, but the rest need some extra detailed linguist analysis. E.g. |
|
My guess is that we should consider CW's |
Fixing ordering of tags in crk.altlab.dev also addresses some of the lack of emojis for UAlbertaALTLab/morphodict#1174 |
Entries that are inflected word-forms of other entries, e.g. nîminâniwan and nitâs, should not get their independent entries in morphodict, but should rather become formof cases. This works for nîminâniwan (--> nîmiw) but not for nitâs (--> mitâs).
When creating the importjson version of the dictionary content, this should either be recognized by the analyzing FST, or then via the \lemma field in the *.toolbox source. See:
Based on the presence of \lemma fields, there are at least 170 cases, and there might be more based on the FST scrutiny.
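To illustrate the requested behaviour, here is a small sketch of how an entry might be emitted during import: a headword whose lemma differs from itself becomes a form-of record pointing at the lemma, and anything else stays an independent entry. The dictionary keys ("head", "formOf", "senses") and the function `to_import_entry` are assumptions for illustration and may not match the real importjson schema.

```python
from typing import Dict, List, Optional


def to_import_entry(head: str, lemma: Optional[str], senses: List[str]) -> Dict:
    """Build an import record, using hypothetical key names.

    An inflected form (lemma known and different from the head) points at
    its lemma instead of becoming an independent entry.
    """
    if lemma and lemma != head:
        return {"head": head, "formOf": lemma, "senses": senses}
    return {"head": head, "senses": senses}


if __name__ == "__main__":
    # The sense text here is a placeholder, not taken from any dictionary source.
    print(to_import_entry("nitâs", "mitâs", ["<definition>"]))
    print(to_import_entry("mitâs", None, ["<definition>"]))
```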