Requires linguist check: lib/constants/MD-pos.json #129

Open
fbanados opened this issue Jul 25, 2024 · 2 comments
Labels
requires-linguistic-work get a linguist to deal with this!

Comments

@fbanados
Member

fbanados commented Jul 25, 2024

The file MD-pos.json contains heuristics to guess the POS annotation for MD entries. However, they may not be as precise as they could be; e.g. yahaw and ahpô used to map to Part, while they could map to IPJ and IPC, respectively. I've made some candidate changes, but I believe a linguist should check (and perhaps extend) this set of mappings to get more precise info in the importjson.

Screenshot 2024-07-25 at 5 19 48 PM
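
For illustration only, here is a minimal sketch of how a coarse heuristic could be refined with more specific mappings. The table layout, the per-word keys, and the function name `guess_pos` are all assumptions made for this example and do not reflect the actual structure of MD-pos.json; only the two example words and their tags (Part, IPJ, IPC) come from the description above.

```python
# Hypothetical sketch: the real layout of MD-pos.json is not shown in this issue.

# Coarse heuristic: MD part-of-speech label -> POS tag (assumed shape).
HEURISTIC_POS = {"Part": "Part"}

# Finer mappings a linguist would review; these two are the examples above.
REFINED_POS = {"yahaw": "IPJ", "ahpô": "IPC"}


def guess_pos(word, md_label):
    """Prefer a refined mapping for the word, else fall back to the coarse one."""
    if word in REFINED_POS:
        return REFINED_POS[word]
    # Returns None when the label is not covered by the heuristics.
    return HEURISTIC_POS.get(md_label)
```

Keeping the refined mappings in a separate table would let a linguist review and extend them without touching the coarse fallback.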

@fbanados fbanados added the requires-linguistic-work get a linguist to deal with this! label Jul 25, 2024
@aarppe
Contributor

aarppe commented Jul 26, 2024

I have a fuzzy recollection of creating this equivalence file ages ago. The problem is that it is underspecified with respect to the FST, so we might need to add the actual lexical (inflectional) classes manually anyhow, in the MD_LexCat field that we have for MD.

In this and other respects, I've been tweaking a script that tries to copy the stem and lexical category fields from CW to MD whenever they are available, so that we are not reliant on Arok's CW having that information (e.g. for generating FSTs), should he happen to change his mind. The script for that is crk/bin/add-cw_lemma-stem-and-lexcat-from-cw-2-md.sh, and the resulting file is crk/generated/Maskwacis_altlab2.tsv; the latter will need some manual editing before it replaces the ALTLab version of MD in crk/dicts/.

@fbanados
Member Author

An alternative would be to parse the POS directly from the FST analysis that is stored in MD_LexCat when available.
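
A rough sketch of that idea, under stated assumptions: it supposes that MD_LexCat holds values such as `VTA-1` or `IPC` (a POS tag optionally followed by an inflectional class), which is not confirmed in this thread. The tag inventory `KNOWN_POS` and the function names `pos_from_lexcat` and `resolve_pos` are hypothetical and would need to be checked against the FST.

```python
import re

# Assumed inventory of POS tags; the real set should come from the FST.
KNOWN_POS = ("VTA", "VTI", "VAI", "VII", "NA", "NI", "NDA", "NDI", "IPC", "IPJ")


def pos_from_lexcat(lexcat):
    """Extract a POS tag from an MD_LexCat value, or None if unrecognised."""
    if not lexcat:
        return None
    # Assumption: the POS is the part before the first hyphen or space.
    head = re.split(r"[-\s]", lexcat.strip(), maxsplit=1)[0]
    return head if head in KNOWN_POS else None


def resolve_pos(lexcat, heuristic_pos):
    """Use the MD_LexCat-derived POS when available, else the heuristic guess."""
    return pos_from_lexcat(lexcat) or heuristic_pos
```

With this arrangement the MD-pos.json heuristics would only be consulted for entries whose MD_LexCat is empty or unrecognised.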
