Mapping MD to CW #125
Comments
Just a note, that some of the entries with a |
Most of the entries have been merged. Current status is:
|
Are there any entries from MD that are not incorporated at all? In my scripting I noticed perhaps just over 300 cases where an MD entry maps to multiple CW entries; these would need manual disambiguation.
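A minimal sketch of how such one-to-many cases could be listed for manual review. It assumes the MD rows and CW entries have already been loaded as dictionaries, that the MD field `CW_lemma` is compared against the CW `\sro` field, and that the sample heads (`head-a`, `head-b`) are invented placeholders; `find_ambiguous_mappings` is a hypothetical helper, not the script mentioned above.

```python
from collections import defaultdict


def find_ambiguous_mappings(md_rows, cw_entries):
    """Return {CW_lemma: [CW entries]} for heads that match more than one CW entry."""
    # Group CW entries by their entry head (the \sro field).
    by_head = defaultdict(list)
    for entry in cw_entries:
        by_head[entry["sro"]].append(entry)

    ambiguous = {}
    for row in md_rows:
        head = row.get("CW_lemma")
        candidates = by_head.get(head, [])
        if len(candidates) > 1:
            ambiguous[head] = candidates
    return ambiguous


# Invented placeholder data, for illustration only.
cw_entries = [
    {"sro": "head-a", "ps": "NA"},
    {"sro": "head-a", "ps": "VAI"},
    {"sro": "head-b", "ps": "NI"},
]
md_rows = [{"CW_lemma": "head-a"}, {"CW_lemma": "head-b"}, {"CW_lemma": ""}]

ambiguous = find_ambiguous_mappings(md_rows, cw_entries)
```

Each key in `ambiguous` is a head that needs a human decision; the attached list shows the competing CW entries.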
Concerning these, we'd want to run an old script of mine scrutinizing AECD with the FST again, and see if we could reduce the number of unanalyzed cases (with |
For the
That would definitely be helpful. |
I'm also wondering whether we should introduce a persistent unique identifier (PID) for each of the dictionaries, which would allow us to link entries unambiguously. This could be based on existing information, such as the entry head and lexical category, plus an index to deal with ambiguity, or it could be just a numeric code. I'd be inclined toward a transparent PID, but I can be convinced otherwise. |
We can definitely add these identifiers to the dictionaries that do not change (MD, AECD, etc.). To account for ambiguities, the index should be recorded in the entries themselves, to avoid issues such as entries being swapped in the source files. We can just add a column to the source TSV files for the identifier, whichever kind we choose. I do not have a particular preference either way. I am less certain about introducing identifiers in the CW toolbox files; we could have a discussion about that as well. |
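To make the transparent option concrete, here is a rough sketch of assigning a PID from entry head, lexical category, and an index, and storing it as an extra TSV column. The column names (`head`, `class`, `pid`), the `@` separator, and the sample rows are all assumptions made for illustration; nothing here reflects an agreed format.

```python
import csv
import io
from collections import Counter


def assign_pids(rows, head_field="head", class_field="class"):
    """Add a 'pid' of the form head@class@index to each row, in file order."""
    seen = Counter()
    for row in rows:
        key = (row[head_field], row[class_field])
        seen[key] += 1  # index distinguishes entries sharing head and class
        row["pid"] = f"{row[head_field]}@{row[class_field]}@{seen[key]}"
    return rows


def to_tsv(rows, fieldnames):
    """Serialize rows as TSV text, with the PID stored as its own column."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Invented placeholder rows, for illustration only.
rows = assign_pids([
    {"head": "head-a", "class": "NA"},
    {"head": "head-a", "class": "NA"},
    {"head": "head-a", "class": "VAI"},
])
tsv_text = to_tsv(rows, ["head", "class", "pid"])
```

The assignment would be run once; after that the `pid` column is part of the source file, so reordering rows does not change which identifier belongs to which entry.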
As the extensions to the ALTLab version of the Maskwacîs Dictionary were planned, there are two fields intended to help establish that an entry in MD can be matched to an entry in CW (at some static point).

`CW_lemma` indicates that the MD entry has (at some point) been mapped to an entry in CW. "Lemma" here means "entry head" in the lexicographical sense, rather than "baseform" in the computational sense. There are 7241 such MD entries, cf.

`CW_lemma` by itself is not sufficient to provide an unambiguous match with a CW entry, so some additional information is needed. Early on, we used the full English definitions as manually copied from CW to the MD database, but since those are under continuous editing, they are not reliable in the long term. See: `MD_class`.

`MD_lemma` (and its associates `MD_stem` and `MD_class`) were created to provide the necessary ingredients for including in the LEXC code those MD entries that could not be mapped to CW. There are 2566 such cases.

Increasingly, these entries originally missing from CW have since been added there, so for FST generation purposes our script checks whether the combination of `MD_lemma` and `MD_class` matches the `\sro` and `\ps` fields in CW, in which case they are not added to the LEXC code.

Besides the above, there are a number of entries in MD that are neither mapped to CW nor provided with an `MD_lemma`, etc. While these would not be included in the FST, they could nevertheless be included in the `*.importjson`, but without getting a paradigm.

Similar comparisons for LEXC inclusion have not yet been completed in the case of AECD to CW (but not MD).
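The LEXC inclusion check described above can be sketched roughly as follows. This is not the actual script: it assumes MD rows and CW entries are already loaded as dictionaries, that `MD_class` values are directly comparable to CW `\ps` values (in practice some normalization may be needed), and that the sample data are invented placeholders. `md_entries_for_lexc` is a hypothetical name.

```python
def md_entries_for_lexc(md_rows, cw_entries):
    """Select MD rows that should contribute their own LEXC entries."""
    # Every (entry head, lexical category) pair currently present in CW.
    cw_keys = {(e["sro"], e["ps"]) for e in cw_entries}

    selected = []
    for row in md_rows:
        if row.get("CW_lemma"):
            continue  # already mapped to a CW entry
        lemma, md_class = row.get("MD_lemma"), row.get("MD_class")
        if not lemma or not md_class:
            continue  # no MD_lemma etc., so it cannot go into the FST
        if (lemma, md_class) in cw_keys:
            continue  # has since been added to CW, so skip it
        selected.append(row)
    return selected


# Invented placeholder data, for illustration only.
cw_entries = [
    {"sro": "head-a", "ps": "NI"},
    {"sro": "head-c", "ps": "VAI"},
]
md_rows = [
    {"CW_lemma": "head-a", "MD_lemma": "", "MD_class": ""},
    {"CW_lemma": "", "MD_lemma": "head-b", "MD_class": "NA"},
    {"CW_lemma": "", "MD_lemma": "head-c", "MD_class": "VAI"},
    {"CW_lemma": "", "MD_lemma": "", "MD_class": ""},
]

selected = md_entries_for_lexc(md_rows, cw_entries)
```

Only the `head-b` row survives here: the first row is mapped to CW, the third matches a CW `\sro` and `\ps` pair, and the last has no `MD_lemma`.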