Content update: inflected word-form entries in dictionaries should not receive independent morphodict entries #119

Open
aarppe opened this issue Jun 28, 2024 · 20 comments
Assignees
Labels
bug Something isn't working requires-linguistic-work get a linguist to deal with this!

Comments

@aarppe
Contributor

aarppe commented Jun 28, 2024

Entries that are inflected word-forms of other entries, e.g. nîminâniwan and nitâs, should not get independent entries in morphodict; they should instead become formof cases.

This works for nîminâniwan (--> nîmiw) but not for nitâs (--> mitâs).

When creating the importjson version of the dictionary content, this should be recognized either by the analyzing FST or via the \lemma field in the *.toolbox source. See:

  1. Correct behavior
image
  2. Incorrect behavior (the first two entry blocks) vs. partially correct behavior (the next two entry blocks, though the inflected word-form should show the definition from the dictionary)
image

Based on the presence of \lemma fields, there are at least 170 cases, and there might be more based on the FST scrutiny.

less crk/dicts/Wolvengrey_altlab.toolbox| gawk 'BEGIN { FS="\n"; RS=""; } { for(i=1; i<=NF; i++) if(index($i,"\\lemma")!=0) print $1, $i; }' | wc -l
     170
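For reference, the same count can be sketched in Python. This is only an illustration of what the gawk one-liner does, assuming records are separated by blank lines and every field line containing \lemma is counted. The sample records and the \sro marker below are made up for the example and are not taken from Wolvengrey_altlab.toolbox.

```python
def count_lemma_fields(toolbox_text: str) -> int:
    """Count field lines that contain a \\lemma marker, record by record."""
    count = 0
    # Records are separated by blank lines (the equivalent of gawk's RS="").
    for record in toolbox_text.strip().split("\n\n"):
        # Each line of a record is one field (the equivalent of FS="\n").
        for field in record.splitlines():
            if "\\lemma" in field:
                count += 1
    return count


# Invented sample data: two records, one of which carries a \lemma field.
SAMPLE = "\\sro nitâs\n\\ps NDI-1\n\\lemma mitâs\n\n\\sro mitâs\n\\ps NDI-1\n"
print(count_lemma_fields(SAMPLE))
```

Like the gawk version, this counts matching fields rather than records, so a record with two \lemma fields would be counted twice.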
aarppe added the bug label on Jun 28, 2024
@fbanados
Member

Although the \lemma field should suffice, the preferred approach is to use the FST analyzer.

@fbanados
Member

Another example: nîpit

@fbanados
Member

This bug reflects a problem at the crk-db level. Migrating the issue.

fbanados transferred this issue from UAlbertaALTLab/morphodict on Jun 28, 2024
@fbanados
Member

Aggregation is not detecting that the entry provided by the FST matches the entry in the database. The FST generates the analyses mitâs+N+I+D+Px1Sg+Sg and mîpit+N+I+D+Px1Sg+Sg, respectively, while the Wolvengrey entries for mitâs and mîpit both have \ps NDI-1. Because the merging step does direct string comparisons, it fails to detect that NID should be considered equal to NDI.

@fbanados
Member

I would assume there's a high likelihood that these small ordering differences in word-class codes will remain or reappear between sources, so I'm changing the comparison code to check for permutations at the subclass level. Because we are already comparing constant-length strings at this point, it should not add noticeable overhead. An alternative would be to ensure that all sources always follow the same ordering convention, but I think making the importjson generation more resilient is the better approach.
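A minimal sketch of an order-insensitive comparison, written in Python for illustration only. It is not the crk-db code: the function names are hypothetical, and it assumes that whatever follows the hyphen in a \ps value (the "-1" in NDI-1) can be ignored for this check.

```python
from collections import Counter


def normalize_ps(ps: str) -> str:
    """Drop the part after the first hyphen, e.g. 'NDI-1' -> 'NDI' (assumption)."""
    return ps.split("-", 1)[0]


def is_pos_match(fst_code: str, ps_code: str) -> bool:
    """True when both codes contain the same letters, in any order."""
    return Counter(fst_code) == Counter(normalize_ps(ps_code))


print(is_pos_match("NID", "NDI-1"))  # same letters, different order
print(is_pos_match("NDA", "NDI-1"))  # different subclass
print(is_pos_match("NA", "NDA-1"))   # different length
```

Comparing letter counts rather than sets keeps codes of different lengths apart, so NA still does not match NDA.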

@fbanados
Member

Also this requires a new importjson, so I'll restart the import mentioned in UAlbertaALTLab/morphodict#1178, which was about 50% done.

@aarppe
Contributor Author

aarppe commented Jun 28, 2024

I was thinking along the same lines: there can be small discrepancies, and while we could fix this in the FST, in the morphodict code, or in the database, we'd like a language-independent solution that will also work for non-Algonquian languages like Tsuut'ina.

In this respect, what is the current requirement for establishing that an entry is an inflected form of another entry? That is, how is the FST analysis parsed in this respect?

@aarppe
Contributor Author

aarppe commented Jun 28, 2024

I'm actually wondering if we should turn this into a linguist problem, but not fully certain. In that we might want to have a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out. I.e./E.g. {+N, +A, +D} --> NDA.

Alternatively, I'm wondering whether the comparison should be done with the same type of input, that is comparing the FST analyses of nitâs and mitâs, rather than comparing the FST analysis of nitâs with the p-o-s code of mitâs.

@aarppe
Contributor Author

aarppe commented Jun 28, 2024

Also, this is an artifact of us in the computational modeling considering NA and NDA more similar than NDA and NDI.

fbanados self-assigned this on Jun 28, 2024
@fbanados
Member

fbanados commented Jul 2, 2024

> I'm actually wondering if we should turn this into a linguist problem, but not fully certain. In that we might want to have a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out. I.e./E.g. {+N, +A, +D} --> NDA.

Either would work, but the fundamental problem is whether order is truly necessary for the analysis information (that is, whether it should be a list at all or a set instead).

> Alternatively, I'm wondering whether the comparison should be done with the same type of input, that is comparing the FST analyses of nitâs and mitâs, rather than comparing the FST analysis of nitâs with the p-o-s code of mitâs.

The current comparison is done in the isPOSMatch method in https://github.com/UAlbertaALTLab/crk-db/blob/aecd/lib/aggregate/index.js. Changing how ordering is handled there fixes nîpit:
Screenshot 2024-06-28 at 12 28 25 PM

But it did not fix nitâs. nitâs is a different test case: the key difference is that nitâs has multiple entries in the dictionary (@ndi and @nda). Previously, addFormOf gave up when there were multiple candidates. I've changed the code to attempt to find a unique match depending on the category. The change for nitâs is independent of the decision to make this a linguist problem, as it was an issue at the mapping level, which happens in a separate pass after the FST information has been collected and added to all entries.

Screenshot 2024-07-02 at 11 15 09 AM
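The "find a unique match depending on the category" step could look roughly like the sketch below. It is written in Python for illustration and is not the actual addFormOf code; the function names and the record fields (`head`, `pos`) are assumptions.

```python
from collections import Counter
from typing import Optional


def same_class(a: str, b: str) -> bool:
    """Order-insensitive comparison of two word-class codes."""
    return Counter(a) == Counter(b)


def pick_lemma_entry(fst_class: str, candidates: list[dict]) -> Optional[dict]:
    """Return the single candidate whose class matches, or None if ambiguous."""
    matches = [c for c in candidates if same_class(fst_class, c["pos"])]
    return matches[0] if len(matches) == 1 else None


# Hypothetical candidate records for a form with two homographic lemmas.
candidates = [
    {"head": "mitâs", "pos": "NDI"},
    {"head": "mitâs", "pos": "NDA"},
]
chosen = pick_lemma_entry("NID", candidates)
print(chosen)
```

Returning None when zero or several candidates match keeps the earlier "give up" behaviour for the cases that are still ambiguous.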

@fbanados
Member

fbanados commented Jul 3, 2024

Updated the importjson on the dev branch of itwêwina to compare. For example, see
https://itwewina.altlab.dev/search?q=nîminâniwan
https://itwewina.altlab.dev/search?q=nîpit
https://itwewina.altlab.dev/search?q=nitâs

@fbanados
Member

fbanados commented Jul 3, 2024

Currently going through the list to ensure that all entries with a lemma are added as wordforms. It seems that this is still not the case.

@fbanados
Member

fbanados commented Jul 3, 2024

There are several (different) observable causes for this behaviour after checking the \lemma cases previously discussed.
In general, it looks like crk-db is relying on the strict FST and ignoring annotations from Wolvengrey.

  • kôhtâwînaw shows that multiple definitions appearing in the same toolbox entry are not merged. This may be expected behaviour, but perhaps multiple \def entries should be merged into a single entry, not just the ones separated by a semicolon (;). That is a linguist decision.

Limitations on the FST are manifesting as well:

  • Given that ý characters are rejected, entries like aýwêpinâniwan in Wolvengrey are only accepted by the relaxed FST. The solution is either to remove ý before analyzing or to change the FST to accept ý.
  • Some new Wolvengrey entries are still rejected by the FST: e.g. mêscakâs and mêstakay.

The most likely solution would be to match first against the toolbox \lemma and, only if that is not available, fall back to the FST. I would also expect a report on the differences (say, that the FST generates a different lemma than the toolbox entry, or that the FST rejects an entry included in the dictionary) to be useful for debugging and for guiding linguist decisions (e.g., deciding whether those are bugs in the toolbox file or at the FST level, limitations of the model that need updating, etc.).
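As a rough sketch of that proposal (Python, illustration only): prefer the toolbox \lemma, fall back to the FST, and log every disagreement or rejection. Everything here is hypothetical, including the entry fields, the `analyze` callable standing in for the FST, and the placeholder forms `form-x`, `lemma-x` and `lemma-y`.

```python
from typing import Callable, Optional


def resolve_lemma(
    entry: dict,
    analyze: Callable[[str], Optional[str]],
    report: list,
) -> Optional[str]:
    """Prefer the toolbox \\lemma; fall back to the FST; log disagreements."""
    head = entry["head"]
    toolbox_lemma = entry.get("lemma")
    fst_lemma = analyze(head)  # None means the analyzer rejected the form

    if fst_lemma is None:
        report.append((head, "rejected by FST"))
    elif toolbox_lemma is not None and fst_lemma != toolbox_lemma:
        report.append((head, "FST lemma differs from toolbox lemma"))

    return toolbox_lemma if toolbox_lemma is not None else fst_lemma


# Stand-in for the real analyzer: a lookup table with invented coverage.
FAKE_FST = {"nîminâniwan": "nîmiw", "form-x": "lemma-y"}

report: list = []
entries = [
    {"head": "nîminâniwan"},                 # no \lemma field: the FST result is used
    {"head": "form-x", "lemma": "lemma-x"},  # placeholder disagreement
    {"head": "mêscakâs"},                    # rejected by the stand-in analyzer
]
resolved = [resolve_lemma(e, FAKE_FST.get, report) for e in entries]
print(resolved)
print(report)
```

The `report` list is the part that would feed the linguist review: it records both lemma disagreements and forms the analyzer rejects.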

@fbanados
Member

fbanados commented Jul 3, 2024

Implementing the change to rely on \lemma has the following impact:

  • 86 entries from AECD stop being merged, due to an unrelated bug that must be fixed (currently crk-db gives up when there are multiple candidate mappings to merge; this should be handled in a more regular fashion and not ad hoc)
  • Analysis of 5 entries changes from +Px12Pl to +Px1Sg, e.g. kikâwînaw form of nikâwiy
  • Analysis of ~100 entries changes from +Px1Sg to +PxX, e.g. nacâs form of macâs

@fbanados
Member

fbanados commented Jul 3, 2024

There was an agreement to implement a linguist-provided approach to override the lemma, and use the FST as backup. Ideally, crk-db would also have a way to avoid the FST altogether as an option.

@fbanados
Member

fbanados commented Jul 3, 2024

ý entry issues should be handled by the FST, so discussion about those is to be continued at #115

@fbanados
Member

fbanados commented Jul 8, 2024

The script referred to in #122 does not handle IPJ as a word class, which was required to ensure that all lemma annotations in CW were properly handled by crk-db (otherwise, interjection analyses were not included and broke the importjson when their formOf references were manually added).

Many of the 170 get merged, but the rest need more detailed linguist analysis. E.g. nitôkimâm and nitiýinîmak do not match because NDA != NA, which makes sense. If anything is to change there, it is at the linguist level.

fbanados added the requires-linguistic-work label on Jul 8, 2024
@aarppe
Contributor Author

aarppe commented Jul 8, 2024

IPJ ~ Interjection is supposed to be a subclass of IPC, namely Independent (indeclining) particle. Currently, the FST generation should result in the tag sequence +Ipc+Interj for such lexemes. The pattern should be the same for other types of particles, i.e. +Ipc+....

@fbanados
Member

fbanados commented Jul 8, 2024

My guess is that we should consider CW's IPJ as equivalent to the FST's +Ipc+Interj (currently IPJ != +Ipc+Interj when generating entries, when they should be treated equally unless I'm missing an important linguistic concept). There are several other IP* tags in crk.altlabel.tsv. It would be useful to have the standard mappings of FST tags to these to ensure that the POS comparison works correctly.
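One possible shape for such a mapping, sketched in Python for illustration. Only IPJ ~ +Ipc+Interj and {+N, +A, +D} --> NDA come from this thread; the NDI and NA rows, the function names, and the assumption that all tags follow the lemma (no prefix tags) are mine.

```python
# Linguist-maintained table: dictionary POS code -> set of FST class tags.
POS_TO_FST_TAGS = {
    "IPJ": frozenset({"+Ipc", "+Interj"}),
    "NDA": frozenset({"+N", "+A", "+D"}),
    "NDI": frozenset({"+N", "+I", "+D"}),  # assumed by analogy with NDA
    "NA": frozenset({"+N", "+A"}),         # assumed by analogy with NDA
}

# Every tag that counts as a word-class tag according to the table.
CLASS_TAGS = frozenset().union(*POS_TO_FST_TAGS.values())


def fst_tags(analysis: str) -> list[str]:
    """Split 'lemma+Tag1+Tag2' into ['+Tag1', '+Tag2']."""
    _lemma, *tags = analysis.split("+")
    return ["+" + t for t in tags]


def matches_pos(analysis: str, pos_code: str) -> bool:
    """True when the class tags in the analysis equal the set mapped to pos_code."""
    expected = POS_TO_FST_TAGS.get(pos_code)
    if expected is None:
        return False  # unmapped code: leave it for a linguist to decide
    found = {t for t in fst_tags(analysis) if t in CLASS_TAGS}
    return found == expected


print(matches_pos("mitâs+N+I+D+Px1Sg+Sg", "NDI"))
print(matches_pos("mitâs+N+I+D+Px1Sg+Sg", "NDA"))
print(matches_pos("word+Ipc+Interj", "IPJ"))  # 'word' is a placeholder lemma
```

Requiring set equality rather than a subset test keeps NA distinct from NDA, in line with the NDA != NA cases above, while tag order stops mattering.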

@fbanados
Member

Fixing the ordering of tags in crk.altlab.dev also addresses some of the missing emojis reported in UAlbertaALTLab/morphodict#1174
