Content update: inflected word-form entries in dictionaries should not receive independent morphodict entries #119

aarppe · 2024-06-28T05:50:18Z

Entries that are inflected word-forms of other entries, e.g. nîminâniwan and nitâs, should not get their own independent entries in morphodict; they should instead become formof cases.

This works for nîminâniwan (--> nîmiw) but not for nitâs (--> mitâs).

When creating the importjson version of the dictionary content, this should be recognized either by the analyzing FST or via the \lemma field in the *.toolbox source. See:

Correct behavior

Incorrect behavior (the first two entry blocks) vs. partially correct behavior (the next two entry blocks, though the inflected word-form should show the definition from the dictionary)

Based on the presence of \lemma fields, there are at least 170 cases, and there might be more based on the FST scrutiny.

less crk/dicts/Wolvengrey_altlab.toolbox| gawk 'BEGIN { FS="\n"; RS=""; } { for(i=1; i<=NF; i++) if(index($i,"\\lemma")!=0) print $1, $i; }' | wc -l
     170


fbanados · 2024-06-28T16:34:46Z

Although the \lemma field should suffice, the preferred approach is to use the FST analyzer.

fbanados · 2024-06-28T16:36:49Z

Another example: nîpit

fbanados · 2024-06-28T17:55:55Z

This bug reflects a problem at the crk-db level. Migrating the issue.

fbanados · 2024-06-28T18:03:42Z

Aggregation is not detecting that the entry provided by the FST matches the entry in the database. This is because the FST generates the analyses mitâs+N+I+D+Px1Sg+Sg and mîpit+N+I+D+Px1Sg+Sg, respectively, while the Wolvengrey entries for mitâs and mîpit both have \ps NDI-1. Because the merging analysis does direct string comparisons, it fails to detect that NID should be considered equal to NDI.

fbanados · 2024-06-28T18:20:56Z

I would assume there's a high likelihood that these small ordering differences in word class codes will remain or reappear between sources, so I'm changing the comparison code to check for permutations at the subclass level. Because we are already checking constant-length strings at this point, it should not add extra overhead. An alternative approach would be to always ensure that all sources follow the same ordering convention, but I think making the importjson generation more resilient is a better approach.
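A minimal sketch of what an order-insensitive comparison could look like. This is not the actual crk-db code (which is JavaScript); the function names are hypothetical, and ignoring the numeric suffix after the hyphen (the `-1` in `NDI-1`) is an assumption made only for illustration.

```python
from collections import Counter


def class_letters(code: str) -> str:
    """Return the letters before the first hyphen, e.g. 'NDI-1' -> 'NDI'.

    Assumption: the part after the hyphen is a subclass number that the
    FST-derived code does not carry, so it is ignored here.
    """
    letters, _, _suffix = code.partition("-")
    return letters


def same_word_class(code_a: str, code_b: str) -> bool:
    """True when the class letters of both codes are permutations of each other."""
    return Counter(class_letters(code_a)) == Counter(class_letters(code_b))


print(same_word_class("NID", "NDI-1"))  # the mismatch described in this thread
print(same_word_class("NDA", "NDI-1"))  # genuinely different classes
```

Comparing letter counts rather than sets keeps codes with repeated letters distinct, and it stays cheap because the codes are only a few characters long.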

fbanados · 2024-06-28T18:23:14Z

Also, this requires a new importjson, so I'll restart the import mentioned in UAlbertaALTLab/morphodict#1178, which was about 50% done.

aarppe · 2024-06-28T18:24:24Z

I was thinking about the same thing: there can be small discrepancies, and while we could fix this in the FST, the morphodict code, or the database, we'd like to have a language-independent solution that will also work for non-Algonquian languages like Tsuut'ina.

In this respect, what is the current requirement for establishing that an entry is an inflected form of another entry? That is, how is the FST analysis parsed in this respect?

aarppe · 2024-06-28T18:28:54Z

I'm actually wondering whether we should turn this into a linguist problem, though I'm not fully certain. That is, we might want a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out, e.g. {+N, +A, +D} --> NDA.
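One way such a linguist-defined mapping could be represented is sketched below. Only the {+N, +A, +D} --> NDA row comes from this comment; the other rows, the `WORD_CLASS_TAGS` set, and the function name are illustrative assumptions, not existing crk-db or FST definitions.

```python
# Hypothetical linguist-maintained table: an unordered set of FST tags
# maps to one POS class code. Rows other than NDA are illustrative.
TAGS_TO_POS = {
    frozenset({"+N", "+A"}): "NA",
    frozenset({"+N", "+I"}): "NI",
    frozenset({"+N", "+A", "+D"}): "NDA",
    frozenset({"+N", "+I", "+D"}): "NDI",
}

# Assumption: only these tags take part in deciding the word class.
WORD_CLASS_TAGS = {"+N", "+A", "+I", "+D"}


def pos_from_analysis(analysis: str):
    """Look up the POS class for an analysis such as 'mitâs+N+I+D+Px1Sg+Sg'.

    Returns None when the table has no row for the tags found.
    """
    tags = {"+" + tag for tag in analysis.split("+")[1:]}
    return TAGS_TO_POS.get(frozenset(tags & WORD_CLASS_TAGS))


print(pos_from_analysis("mitâs+N+I+D+Px1Sg+Sg"))
```

Because the keys are sets, the order in which the FST emits the tags no longer matters, and adding a language only means supplying a different table.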

Alternatively, I'm wondering whether the comparison should be done with the same type of input, that is comparing the FST analyses of nitâs and mitâs, rather than comparing the FST analysis of nitâs with the p-o-s code of mitâs.

aarppe · 2024-06-28T18:30:19Z

Also, this is an artifact of our computational modeling, in which we consider NA and NDA more similar than NDA and NDI.

fbanados · 2024-07-02T17:21:08Z

> I'm actually wondering if we should turn this into a linguist problem, but not fully certain. In that we might want to have a linguist-defined mapping between certain FST codes and POS classes, rather than having the code try to figure this out. I.e./E.g. {+N, +A, +D} --> NDA.

Either would work, but the fundamental problem is whether order is truly necessary for the analysis information (that is, whether it should be a list at all or a set instead).

> Alternatively, I'm wondering whether the comparison should be done with the same type of input, that is comparing the FST analyses of nitâs and mitâs, rather than comparing the FST analysis of nitâs with the p-o-s code of mitâs.

The current comparison is done in the isPOSMatch method https://github.com/UAlbertaALTLab/crk-db/blob/aecd/lib/aggregate/index.js. Changing this ordering fixes nîpit:

But it did not fix nitâs, which is a different test case: the key difference is that nitâs has multiple entries in the dictionary (@ndi and @nda). Previously, addFormOf gave up in the case of multiple candidates. I've changed the code to attempt to find a unique match depending on the category. The change for nitâs is independent of the decision to make this a linguist problem, as it was an issue at the mapping level, which happens in a separate pass after the FST information has been collected and added to all entries.
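The idea of narrowing multiple candidates by category can be sketched as follows. This is not the actual addFormOf code; the entry structure, the function names, and the `NDA-1` code in the sample data are assumptions for illustration only.

```python
from collections import Counter

# Illustrative data: two entries share a headword but differ in category.
# The structure and the NDA-1 code are assumed, not copied from the dictionary.
ENTRIES = [
    {"head": "mitâs", "pos": "NDI-1"},
    {"head": "mitâs", "pos": "NDA-1"},
    {"head": "mîpit", "pos": "NDI-1"},
]


def class_letters(pos: str) -> Counter:
    """Letter counts of the class part of a code, ignoring any '-N' suffix."""
    return Counter(pos.partition("-")[0])


def find_form_of_target(lemma: str, fst_pos: str, entries: list):
    """Return the single entry matching both lemma and category, else None."""
    matches = [
        entry
        for entry in entries
        if entry["head"] == lemma
        and class_letters(entry["pos"]) == class_letters(fst_pos)
    ]
    # Still refuse to guess when the category does not single out one entry.
    return matches[0] if len(matches) == 1 else None


print(find_form_of_target("mitâs", "NID", ENTRIES))
```

Returning None when zero or several candidates remain keeps the old conservative behaviour for the cases the category cannot disambiguate.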

fbanados · 2024-07-03T16:47:09Z

Updated the importjson on the dev branch of itwêwina to compare. For example, see
https://itwewina.altlab.dev/search?q=nîminâniwan
https://itwewina.altlab.dev/search?q=nîpit
https://itwewina.altlab.dev/search?q=nitâs

fbanados · 2024-07-03T16:49:50Z

Currently going through the list to ensure that all entries with a lemma are added as wordforms. It seems that this is still not the case.

fbanados · 2024-07-03T17:21:21Z

There are several (different) observable causes for this behaviour after checking the \lemma cases previously discussed.
In general, it looks like crk-db is relying on the strict FST and ignoring annotations from Wolvengrey.

kôhtâwînaw shows that multiple definitions appearing in the same toolbox entry are not merged. This may be expected behaviour, but perhaps multiple \def entries should be merged into a single entry, not just the ones separated by a semicolon (;). That is a linguist decision.

Limitations on the FST are manifesting as well:

Given that ý characters are rejected, entries like aýwêpinâniwan in Wolvengrey are only accepted by the relaxed FST. The solution is either to remove ý before analyzing or to change the FST to accept ý.
Some new Wolvengrey entries are still rejected by the FST: e.g. mêscakâs and mêstakay.

The most likely solution would be to attempt to match first against the toolbox's \lemma and, only if that is not available, fall back to the FST. Also, I would expect a report on the differences (say, either that the FST generates a different lemma than the toolbox entry or that the FST rejects an entry included in the dictionary) to be useful for debugging and for guiding linguist decisions (e.g., deciding whether those are bugs in the toolbox file or at the FST level, limitations of the model that need an update, etc.).
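A rough sketch of the lemma-first strategy with a discrepancy report, under stated assumptions: the entry dictionaries, `resolve_lemma`, and the `stub_fst` lookup are hypothetical stand-ins, and the stub only contains pairs mentioned in this thread rather than real FST output.

```python
def resolve_lemma(entry: dict, fst_lemma_for):
    """Prefer the toolbox lemma annotation and fall back to the FST lemma.

    Returns (chosen_lemma, notes); notes lists discrepancies for review.
    """
    toolbox_lemma = entry.get("lemma")        # from the \lemma field, if any
    fst_lemma = fst_lemma_for(entry["head"])  # None when the FST rejects the form
    notes = []
    if fst_lemma is None:
        notes.append(f"{entry['head']}: rejected by the FST")
    elif toolbox_lemma is not None and toolbox_lemma != fst_lemma:
        notes.append(
            f"{entry['head']}: toolbox lemma {toolbox_lemma} != FST lemma {fst_lemma}"
        )
    chosen = toolbox_lemma if toolbox_lemma is not None else fst_lemma
    return chosen, notes


# Stand-in for the analyzer: a plain lookup table, not real FST behaviour.
stub_fst = {"nîminâniwan": "nîmiw", "nitâs": "mitâs"}.get

for entry in (
    {"head": "nitâs", "lemma": "mitâs"},
    {"head": "nîminâniwan"},
    {"head": "mêscakâs"},
):
    print(resolve_lemma(entry, stub_fst))
```

Collecting the notes instead of raising keeps the import running while still producing the kind of difference report described above.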

fbanados · 2024-07-03T18:04:27Z

Implementing the change to rely on \lemma has the following impact:

86 entries from AECD stop being merged, due to an unrelated bug that must be fixed: currently crk-db gives up when there are multiple candidate mappings to merge. This should definitely be handled in a more regular fashion and not in an ad-hoc way.
Analysis of 5 entries changes from +Px12Pl to +Px1Sg, e.g. kikâwînaw form of nikâwiy
Analysis of ~100 entries changes from +Px1Sg to +PxX, e.g. nacâs form of macâs

fbanados · 2024-07-03T21:31:28Z

There was an agreement to implement a linguist-provided approach to override the lemma, and use the FST as backup. Ideally, crk-db would also have a way to avoid the FST altogether as an option.

fbanados · 2024-07-03T21:33:00Z

ý entry issues should be handled by the FST, so discussion about those is to be continued at #115

fbanados · 2024-07-08T19:33:38Z

The script referred to in #122 does not deal with IPJ as a wordclass, which was required to ensure that all lemma annotations in CW were properly handled by crk-db (otherwise, interjection analyses were not included and broke importjson when their formOf references were manually added).

Many of the 170 get merged, but the rest need some extra detailed linguist analysis. E.g. nitôkimâm and nitiýinîmak do not match because NDA != NA, which makes sense. If anything is to change there, it is at the linguist level.

aarppe · 2024-07-08T20:59:22Z

IPJ ~ Interjection is supposed to be a subclass of IPC, namely Independent (indeclinable) particle. Currently, the FST generation should result in the tag sequence +Ipc+Interj for such lexemes. The pattern should be the same for other types of particles, i.e. +Ipc+....

fbanados · 2024-07-08T21:06:15Z

My guess is that we should consider CW's IPJ as equivalent to the FST's +Ipc+Interj (currently IPJ != +Ipc+Interj when generating entries, when they should be treated equally unless I'm missing an important linguistic concept). There are several other IP* tags in crk.altlabel.tsv. It would be useful to have the standard mappings of FST tags to these to ensure that the POS comparison works correctly.

fbanados · 2024-07-24T23:38:06Z

Fixing ordering of tags in crk.altlab.dev also addresses some of the lack of emojis for UAlbertaALTLab/morphodict#1174

aarppe added the bug label Jun 28, 2024

fbanados transferred this issue from UAlbertaALTLab/morphodict Jun 28, 2024

fbanados self-assigned this Jun 28, 2024

fbanados mentioned this issue Jul 2, 2024

Change in POS description in Wolvengrey #120

Open

fbanados added a commit that referenced this issue Jul 3, 2024

Merge definitions, temporary fix for #119

ed993c9

fbanados added the requires-linguistic-work label Jul 8, 2024

fbanados mentioned this issue Jul 8, 2024

Searches result in duplicate dictionary entries UAlbertaALTLab/morphodict#374

Open
