Addition of FST analysis as part of entries in importjson does not completely reproduce previous behaviour #122

Open
fbanados opened this issue Jul 5, 2024 · 10 comments
Labels
requires-linguistic-work get a linguist to deal with this!

Comments

@fbanados
Member

fbanados commented Jul 5, 2024

(Was "Search regression: my cats / my dogs", but that behaviour has been fixed. Keeping the issue open for the major source of inconsistencies that caused the previously observable bug. See discussion after #122 (comment))

There is some issue (likely associated with the English Phrase FST not adding a +A tag) that prevents the dev version from correctly providing an inflected form when searching for my cats/my dogs. However, the FST behaviours are equivalent, so a different explanation for the failure must be identified to make the problem reproducible. Needs fixing.

@fbanados fbanados added the bug Something isn't working label Jul 5, 2024
@fbanados fbanados self-assigned this Jul 5, 2024
@fbanados
Member Author

fbanados commented Jul 5, 2024

The cause is that entries in importjson should include an analysis. Migrating this issue to crk-db.

@fbanados fbanados transferred this issue from UAlbertaALTLab/morphodict Jul 5, 2024
@fbanados
Member Author

fbanados commented Jul 5, 2024

e.g. entry for cats should have:

{ "analysis": [ [], "minôs", [ "+N", "+A", "+Sg" ] ], ...

Currently, there are 7247 entries that should have an analysis and do not, such as ['oski-kinosêw', 'pwâkamowin', 'iskwâsam', 'ocipwêw', 'kîmîwin', 'pîhtwâkan', 'namêpîsis', 'miskîsik-maskihkiy', 'mihtot', 'matokahp', 'kaskikwâsopaýihcikan', 'macânês', 'wâýicihcêw', 'asinîwiýâkan', 'miskâcis', 'kwayaskosîhowin', 'nawatahikêwin', 'tipahamâkêstamâkêwin', 'côhkâp', 'sâpostawisiwin'].

Also, there are 4834 entries that did not have an analysis and now do, such as ['kîhkâtêyihtâkwan', 'yâyikisâwâtêw', 'kakêhtawêyihtam', 'nîkânipayîstâkêw', 'otamêyihtâkwan', 'pîcicipayiw', 'nanwêyacimiwêw', 'kwayaskopayihêw', 'otânisihkâwêw', 'pahpawipayihow', 'misamêw', 'miyâmâc', 'nôtiniwêw', 'wiyê', 'âyîtahiwêw', 'pakosêyimow', 'iyinito-pahkwêsikan', 'namôya cî', 'kitimâkêyihtowak', 'atâmêyimowin'].
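
A minimal sketch of how two such lists could be computed, assuming each importjson version is a JSON list of entries with a `head` key and an optional `analysis` key. The `analysis` key appears in the example above; the `head` key and both function names are assumptions made for illustration, not taken from the importer.

```python
def heads_with_analysis(entries):
    """Return the set of heads whose entry carries a non-empty analysis."""
    # "head" is an assumed key name; "analysis" follows the example entry above.
    return {e["head"] for e in entries if e.get("analysis")}


def compare_analysis_coverage(old_entries, new_entries):
    """Return (lost, gained): heads that lost or gained an analysis."""
    old = heads_with_analysis(old_entries)
    new = heads_with_analysis(new_entries)
    return old - new, new - old


# Toy data only: the heads and tags below are placeholders, not real entries.
old_entries = [
    {"head": "entry-a", "analysis": [[], "entry-a", ["+N", "+A", "+Sg"]]},
    {"head": "entry-b"},
]
new_entries = [
    {"head": "entry-a"},
    {"head": "entry-b", "analysis": [[], "entry-b", ["+Ipc"]]},
]
lost, gained = compare_analysis_coverage(old_entries, new_entries)
print(sorted(lost), sorted(gained))
```

Keying on the head alone conflates homographs, so a real comparison would probably need a finer key (for example head plus POS).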

@fbanados
Member Author

fbanados commented Jul 5, 2024

There will be several notes to add about these examples, but for starters,
we should add analyses to all entries with a +A suffix tag.

This reduces, again, to an issue with POS tag matching.

Therefore, the first step here is to document and implement an appropriate linguist-based approach for matching POS tags between dictionaries and FSTs. There has been considerable discussion about this (some of it in emails), which will be added to the appropriate (new) issue.

@fbanados
Member Author

fbanados commented Jul 5, 2024

After matching against the referenced https://github.com/giellalt/lang-crk/blob/main/tools/shellscripts/add-explicit-fields-to-crkeng.sh, issues with +A suffix are solved. However, there is still work remaining:

  • Classify, categorize and decide on extra analysis: It is likely that these should sometimes not be added if they weren't added before.
  • Classify, categorize, and provide a fix to add an analysis to the 30 pending entries (down from approx. 1k, then 77) that still do not include an analysis but did before.

@fbanados
Member Author

fbanados commented Jul 8, 2024

To fix the regression, after ensuring that the analysis includes +A, the docker container must be restarted. Otherwise search results are not properly sorted (that is, cosine_vector_distance is null instead of 0.0 in some cases, leading to incorrect results). That is a separate bug.

@fbanados
Member Author

fbanados commented Jul 8, 2024

Most of the missing entries were Ipc, caused by a buggy comparison where IPC != Ipc. The 30 leftover entries are issues with the FST that require linguistic feedback. They are mostly heads that are no longer recognized by the strict FST ('nipâskâkow', 'mac-âyiwiwin', 'mac-âtocikêw', 'oski-ôsi', 'mac-âcimoskiw', 'mac-âcimowin', 'mac-âcimow', 'osk-âya', 'osk-âyi', 'mac-âyiwiw', 'mihtos', 'osk-âyisis', 'mac-âcimiwêw', 'mistiko-mahkahkos', 'waskway-ôsi', 'mistahi-ôsi', 'mac-âyisiwiw', 'nipêskâkow', 'môhkocikêwikamik', 'pîhtawêwayiwinisa', 'wâsitêpimâkanihkêw', 'mac-âcimêw', 'âpihtawakimâw', 'mêstakimâw', 'mac-âyisiw', 'iskotêw-ôsi', 'osk-âyis') or heads where the strict analysis produces multiple equal analyses ('akik', 'okosisimâw'). The entry in CW for kôhkomipaninaw has a different POS than the one produced by the FST analysis. None of these entry differences seem to affect the presentation in the dictionary, but this should be double-checked by a linguist.
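
As an illustration of the IPC != Ipc mismatch only, here is a hedged sketch of a case-insensitive comparison. The function name and the handling of a leading `+` are assumptions; the full dictionary-to-FST POS mapping is broader than this and is left to the separate issue mentioned earlier.

```python
def same_pos(dictionary_pos: str, fst_tag: str) -> bool:
    """Compare a dictionary POS label with an FST tag, ignoring case.

    A leading '+' on the FST tag (as in '+Ipc') is dropped before comparing.
    Hypothetical helper for illustration only.
    """
    left = dictionary_pos.strip().casefold()
    right = fst_tag.strip().lstrip("+").casefold()
    return left == right


print(same_pos("IPC", "+Ipc"))
```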

@fbanados fbanados changed the title Search regression: my cats / my dogs Addition of FST analysis as part of entries in importjson does not completely reproduce previous behaviour Jul 8, 2024
@fbanados fbanados added requires-linguistic-work get a linguist to deal with this! and removed bug Something isn't working labels Jul 8, 2024
@aarppe
Contributor

aarppe commented Jul 8, 2024

Some analyses:

  • Entries such as nipâwaskow are excluded because they are inflected forms (inanimate actor) of TA verbs, coded in CW as \gr1 Independent, 0'-3s (inanimate actor). They haven't yet been implemented in the FST (that should happen soon, as something like +0Sg+3SgO), and they've been excluded from the LEXC code because they do not fit under any of the full verbal paradigms. Another entry of this sort is nipêskâkow in the list above. Altogether, there are 17 such cases:

less crk/Wolvengrey_altlab.toolbox | gawk 'BEGIN { FS="\n"; RS=""; } $0 ~ /ps VTA/ && $0 ~ /gr1[^\n]+\(inanimate actor\)/ { print $1, $4; }'       
\sro akâwêýihtamihikow \def s/he is bothered by a promise s/he made to do s.t.
\sro astâhikow \def it frightens s.o.; it causes s.o. to be wary, it worries s.o.
\sro câhcâmoskâkow \def s/he is made to sneeze by s.t., it makes her sneeze
\sro kipêýihtamiskâkow \def s/he overeats and feels badly, it (e.g. food) has the effect of making him/her feel bad
\sro kisiwaskatêskâkow \def it gives s.o. a stomach ache or indigestion
\sro kîskwêpêskâkow \def it makes s.o. drunk
\sro kîsposkâkow \def it filled s.o. up, it was a filling meal for s.o.
\sro mâýiskâkow \def it affects s.o. badly, it has an adverse effect on s.o.; it makes s.o. ill, it makes s.o. react allergically
\sro miýoskâkow \def it goes through s.o.'s body with good affect, it does s.o. good (e.g. animate food as actor); it fits s.o. well (e.g. pants)
\sro nanâtawiskâkow \def it has a healing effect on him/her
\sro nipâskâkow \def it makes s.o. sleep
\sro nipêskâkow \def it makes s.o. sleep
\sro paspinatikow \def s/he has a narrow escape, s.t. just misses him/her
\sro pêkatêskâkow \def it makes him/her belch, burp
\sro piscipôskâkow \def it poisons him/her
\sro sâposkâkow \def it goes through s.o., it enters s.o.'s body; it purges s.o.
\sro tawipaýihikow \def s/he has time
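
For readers without gawk, the same filter can be sketched in Python. This assumes, as the gawk command does, that records are separated by blank lines and that the first and fourth lines of a record are the ones to print. The function name and the toy records are made up for illustration, and `\xx` stands in for a field not shown in this thread.

```python
import re


def inanimate_actor_vta(toolbox_text):
    """Yield (line 1, line 4) of each VTA record marked '(inanimate actor)'.

    Mirrors the gawk one-liner: blank-line-separated records, one field per line.
    """
    for record in re.split(r"\n{2,}", toolbox_text.strip("\n")):
        lines = record.split("\n")
        is_vta = "ps VTA" in record
        has_note = re.search(r"gr1[^\n]+\(inanimate actor\)", record)
        if is_vta and has_note:
            # Like awk's $4, fall back to an empty string for short records.
            yield lines[0], lines[3] if len(lines) > 3 else ""


# Toy records; the field layout is an assumption.
sample = "\n".join([
    r"\sro nipâskâkow",
    r"\xx placeholder",
    r"\ps VTA",
    r"\def it makes s.o. sleep",
    r"\gr1 Independent, 0'-3s (inanimate actor)",
    "",
    r"\sro toy-entry",
    r"\xx placeholder",
    r"\ps NI-1",
    r"\def toy definition",
])
for sro, definition in inanimate_actor_vta(sample):
    print(sro, definition)
```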

@aarppe
Contributor

aarppe commented Jul 8, 2024

As for many of the other elements such as mac-âyiwiwin, they are a case where there is orthographical variation at the preverb/prenoun-stem junction, based on reduction in speech. The full form would be maci-âyiwiwin, but because the stem starts with a vowel the preverb-final -i- is often dropped.

We started a discussion with Arok about how to deal with these forms. One would be inclined to choose one variant as the more standard form and then accept the other variants, rather than creating two FST lemmas (which is what happens if one enumerates both in the LEXC file for stems). Currently these are sort of caught by the script, in that the \fststem field is marked, cf.


\sro mac-âyiwiwin
...
\ps NI-1
\def being bad, being mean, being wicked; doing evil; having a bad temper; being a dangerous being
\stm maci-ayiwiwin-
\fststem CHECK:maci-ayiwiwinw?- OLD:mac-âyiwiwin
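
To make the variation concrete, here is a string-level sketch of the reduction described above: a preverb/prenoun-final -i- is dropped before a vowel-initial stem. This is an illustration only, not how the FST or the script handles these forms. The function name and the vowel inventory are assumptions, and only the first hyphen-separated element is considered.

```python
# Assumed SRO vowel inventory, with and without length marking.
VOWELS = "aâiîoôeê"


def reduced_variant(full_form: str) -> str:
    """Return the reduced spelling of a preverb-stem form, if one applies.

    'maci-âyiwiwin' becomes 'mac-âyiwiwin'; other forms are returned unchanged.
    """
    prefix, sep, rest = full_form.partition("-")
    if sep and prefix.endswith("i") and rest and rest[0] in VOWELS:
        return prefix[:-1] + "-" + rest
    return full_form


print(reduced_variant("maci-âyiwiwin"))
```

Going from the full form to the reduced one keeps a single standard lemma and treats the reduced spelling as an accepted variant, which matches the inclination stated above.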

@fbanados
Member Author

The FSTs seem to be detecting the variation, but from your comment it sounds like it may need to be the opposite of this:
[Image: Screenshot 2024-07-24 at 5 14 51 PM]

Also, several of the other entries you mentioned as inflected forms (inanimate actor) of TA verbs are already detected as forms of. The ones that are not link to an empty paradigm.

@aarppe
Contributor

aarppe commented Jul 26, 2024

Yes, I'd need to revise some parts of how the FST is generated for these sandhi forms.


2 participants