Document steps for updating LEXC source for FSTs #108

Open
aarppe opened this issue Jul 19, 2023 · 2 comments
Labels: documentation, meta

Comments


aarppe commented Jul 19, 2023

The following are the individual steps needed to update the LEXC source used for the itwêwina (and other) FSTs. Compiling the FSTs themselves is outlined in #109.

  1. Update the Cree Words (CW) source file CreeDict-x in the Carleton repo:

    • svn up
  2. Remove Windows-style CR characters from the CW source, and copy the result over to the ALTLab repo:

    • cat PlainsLexUni/CreeDict-x | tr -d '\r' > altlab/crk/dicts/Wolvengrey_altlab.toolbox
  3. Convert this Toolbox file into TSV format:

    • cat altlab/crk/dicts/Wolvengrey_altlab.toolbox | altlab/crk/bin/toolbox2tsv.sh > altlab/crk/generated/Wolvengrey_altlab.tsv
  4. Compare against Maskwacîs Dictionary content, and add unique entries (and associated stem and inflectional class information) after the CW entries:

    • altlab/crk/bin/add-md-entries-2-after-cw-tsv.sh altlab/crk/generated/Wolvengrey_altlab.tsv altlab/crk/dicts/Maskwacis_altlab.tsv > altlab/crk/generated/altlab.tsv
  5. Generate LEXC source for individual parts-of-speech from this ALTLab aggregated TSV file:

    • cat altlab/crk/generated/altlab.tsv | altlab/crk/bin/altlab2lexc.sh 'N' > altlab/crk/generated/noun_stems.lexc
    • cat altlab/crk/generated/altlab.tsv | altlab/crk/bin/altlab2lexc.sh 'V' > altlab/crk/generated/verb_stems.lexc
  6. Add copyright headers to the LEXC sources, and copy the results over to giellalt/lang-crk/src/fst/morphology/stems/:

    • cat giellalt/lang-crk/src/fst/morphology/stems/noun_header.lexc altlab/crk/generated/noun_stems.lexc > giellalt/lang-crk/src/fst/morphology/stems/noun_stems.lexc
    • cat giellalt/lang-crk/src/fst/morphology/stems/verb_header.lexc altlab/crk/generated/verb_stems.lexc > giellalt/lang-crk/src/fst/morphology/stems/verb_stems.lexc
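
Steps 2 and 6 above are plain file operations, so they can be wrapped in small helper functions. The following is a minimal sketch only: the function names strip_cr and add_header are hypothetical, and the write-to-temporary-file-then-rename pattern in add_header is an added safeguard, not something the steps above require.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for steps 2 and 6 above.
set -eu

# Step 2: strip Windows-style CR characters.
#   $1 = source file, $2 = destination file
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Step 6: prepend a copyright header to a generated LEXC file.
#   $1 = header file, $2 = generated stems file, $3 = destination file
# The output goes to a temporary file next to the destination first,
# so a failed run does not leave a truncated destination behind.
add_header() {
    cat "$1" "$2" > "$3.tmp"
    mv "$3.tmp" "$3"
}
```

With these defined, step 2 would read `strip_cr PlainsLexUni/CreeDict-x altlab/crk/dicts/Wolvengrey_altlab.toolbox`, and the noun half of step 6 would read `add_header giellalt/lang-crk/src/fst/morphology/stems/noun_header.lexc altlab/crk/generated/noun_stems.lexc giellalt/lang-crk/src/fst/morphology/stems/noun_stems.lexc`.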

aarppe added the documentation and meta labels on Jul 19, 2023

aarppe commented Jul 20, 2023

There is now a shell script that performs all of the above steps in one go: altlab/crk/bin/update-crk-dictionary-sources-2-lexc.sh.


aarppe commented Sep 8, 2023

@M1Al3x When incorporating updated dictionary content into itwêwina, the process outlined above for updating the LEXC source (and thus the FSTs) needs to be done first, before these same dictionary sources are processed into *.importjson for uploading into itwêwina. Thus, the steps are:

  1. Update dictionary sources
  2. Update LEXC sources based on updated dictionary sources
  3. Compile new FSTs in giellalt/lang-crk
  4. Aggregate and process dictionary sources into *.importjson for uploading to the intelligent dictionary
  5. Update the internal database for the intelligent dictionary, including any generation of forms in paradigms or of English translation equivalents.

@M1Al3x This issue describes steps 1-2 above, and the compilation in step 3 is outlined in #109. The front-page Readme.Md describes steps 4-5.
