Document steps for creating dictionary FSTs from updated LEXC source #109

Open
aarppe opened this issue Sep 19, 2023 · 2 comments
Labels
documentation Improvements or additions to documentation FST Tasks relating to FST creation meta Issues for tracking issues

Comments

aarppe commented Sep 19, 2023

The following are explicit instructions on creating a descriptive analyzer and normative generator (with morpheme boundaries) from updated LEXC source (undertaken in #108):

  1. Create basic morphological model

If one has compiled the aggregate LEXC file, lexicon.lexc (used to be lexicon.tmp.lexc), with the regular GiellaLT compilation scheme, one can use that file as the primary source.

read lexc src/fst/morphology/lexicon.lexc
define Morphology

Otherwise, one can compile the aggregate file as follows:

cat src/fst/root.lexc src/fst/stems/noun_stems.lexc src/fst/morphology/stems/verb_stems.lexc src/fst/morphology/stems/particles.lexc src/fst/morphology/stems/pronouns.lexc src/fst/morphology/stems/numerals.lexc src/fst/morphology/affixes/noun_affixes.lexc src/fst/morphology/affixes/verb_affixes.lexc > lexicon.lexc

  2. Create basic phonological model

source src/fst/phonology.xfscript
define Phonology

  3. Create filters for removing a) word fragments and b) orthographically non-standard forms.

regex ~[ $[ "+Err/Frag" ]];
define removeFragments

regex ~[ $[ "+Err/Orth" ]];
define removeNonStandardForms

  4. Create filter to select only forms belonging to dictionary parts-of-speech.

regex $[ "+N" | "+V" | "+Ipc" | "+Pron" ];
define selectDictPOS

  5. Compose normative generator.

set flag-is-epsilon ON
regex [ selectDictPOS .o. removeNonStandardForms .o. removeFragments .o. Morphology .o. Phonology ];
save stack generator-gt-dict-norm.hfst
define NormativeGenerator

  6. Specify transcriptor to remove special morpheme boundary characters.

regex [ [ "<" | ">" | "/" ] -> 0 ];
define removeBoundaries

  7. Load in basic model for spell relaxation.

load src/orthography/spellrelax.compose.hfst
define SpellRelax

  8. Compose descriptive analyzer

regex [ selectDictPOS .o. removeFragments .o. Morphology .o. Phonology .o. removeBoundaries .o. SpellRelax ];
# regex [ NormativeGenerator .o. removeBoundaries .o. SpellRelax ];
invert net
save stack analyser-gt-dict-desc.hfst
define DescriptiveAnalyser
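
To make the effect of the tag filters in the steps above concrete, here is a rough Python sketch of what they accept and reject on the analysis side. It is only a toy model, not how the FSTs are built: the sample analysis strings and all function names are made up for illustration, and it assumes every tag starts with "+". It also shows why the filters should be read as matching whole tags (multicharacter symbols) rather than substrings: "+Num" is not matched by "+N".

```python
import re

# Tags copied from the xfst filters above; everything else is hypothetical.
DICT_POS = {"+N", "+V", "+Ipc", "+Pron"}
FRAGMENT = "+Err/Frag"
NON_STANDARD = "+Err/Orth"


def tags(analysis):
    """Return the tags of an analysis string, e.g. 'x+N+Sg' -> ['+N', '+Sg']."""
    return re.findall(r"\+[^+]+", analysis)


def select_dict_pos(analysis):
    """Toy counterpart of: $[ "+N" | "+V" | "+Ipc" | "+Pron" ]"""
    return any(tag in DICT_POS for tag in tags(analysis))


def remove_fragments(analysis):
    """Toy counterpart of: ~[ $[ "+Err/Frag" ]]"""
    return FRAGMENT not in tags(analysis)


def remove_non_standard(analysis):
    """Toy counterpart of: ~[ $[ "+Err/Orth" ]]"""
    return NON_STANDARD not in tags(analysis)


def normative(analysis):
    """Filters composed into the normative generator (step 5)."""
    return (select_dict_pos(analysis)
            and remove_non_standard(analysis)
            and remove_fragments(analysis))


def descriptive(analysis):
    """Filters composed into the descriptive analyzer (step 8)."""
    return select_dict_pos(analysis) and remove_fragments(analysis)


# Made-up analyses, not real lexicon entries.
samples = ["stem+N+Sg", "stem+Num", "stem+V+Err/Orth", "stem+N+Err/Frag"]

for s in samples:
    print(s, "normative:", normative(s), "descriptive:", descriptive(s))
```

The only difference between the two filter chains is that the descriptive analyzer leaves out removeNonStandardForms, so an analysis tagged "+Err/Orth" passes there but not in the normative generator.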

Normally, the necessary FSTs would be created according to the standard GiellaLT compilation configuration, with the option --enable-dicts.

fbanados commented Jun 6, 2024

Note: Special morpheme boundary characters may need to also be removed from the Normative Generator FST.

fbanados commented Jun 6, 2024

After discussion with @aarppe, it was established that the expected behaviour for generator FSTs should be to include special morpheme boundary characters, and it is the job of the app to discard them when irrelevant. As shown in the instructions in this thread, it is ok for analyser FSTs to drop them.
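
On the app side, discarding the boundary characters could look roughly like the sketch below. This is only an illustration under assumptions: the function names are hypothetical and not taken from any existing codebase, the sample string is made up, and the character set is the one deleted by the removeBoundaries rule in the instructions above ("<", ">" and "/").

```python
import re

# Characters deleted by the removeBoundaries rule in the instructions above.
BOUNDARY_CHARS = "<>/"
_STRIP_TABLE = str.maketrans("", "", BOUNDARY_CHARS)


def strip_boundaries(wordform):
    """Drop morpheme boundary characters from a generated wordform."""
    return wordform.translate(_STRIP_TABLE)


def split_morphemes(wordform):
    """Split a generated wordform at boundary characters, skipping empty parts."""
    return [part for part in re.split("[<>/]", wordform) if part]


# Made-up generator output, not a real wordform.
generated = "pre<stem>suf"
print(strip_boundaries(generated))
print(split_morphemes(generated))
```

Keeping the boundaries in the generator output means the app can choose per use case: strip them for plain display, or split on them when the morpheme structure is relevant.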
