Data and analysis scripts for "Finding Structure in Spelling and Pronunciation Using Latent Dirichlet Allocation", presented at NLP30 (2024)
Text files
- English spells (.csv)
- French spells (.csv)
- German spells (.csv)
- Russian spells (.csv)
- Swahili spells (.csv)
Scripts for analysis (Jupyter notebooks)
Execution has been confirmed on Python 3.9, 3.10, and 3.11.
Important Parameters:
- n_topics [integer]: number of topics for LDA
- doc_attr [string]: one of "spell", "sound"
- max_doc_size [integer]: maximum character length of documents to process
- term_type [string]: one of "1gram", "2gram", "3gram", "skippy2gram", "skippy3gram"
- ngram_is_inclusive [boolean]: a flag to make n-grams inclusive (lower-order n-grams are also included)
- max_distance_val [integer, bounded by max_doc_size]: maximum span of skippy n-gram links
- term_min_freq [integer]: a filter against overly infrequent terms (passed as the value of gensim's "minfreq")
- term_abuse_threshold [float, 0.0 to 1.0]: a filter against overly frequent terms (passed as the value of gensim's "abuse_threshold")
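For reference, the contiguous and skippy character n-grams controlled by term_type and max_distance_val can be sketched as follows. This is a hypothetical re-implementation for illustration, not the notebooks' actual code; the function names are ours:

```python
from itertools import combinations

def char_ngrams(word, n):
    """Contiguous character n-grams (term_type "1gram".."3gram")."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def skippy_ngrams(word, n, max_distance):
    """Possibly non-contiguous n-grams ("skippy2gram"/"skippy3gram").
    The span from first to last linked character is capped by
    max_distance (the max_distance_val parameter)."""
    return ["".join(word[i] for i in idx)
            for idx in combinations(range(len(word)), n)
            if idx[-1] - idx[0] <= max_distance]

# Example: char_ngrams("night", 2) yields ['ni', 'ig', 'gh', 'ht'],
# while skippy_ngrams("night", 2, 2) also links characters one apart.
```

With max_distance equal to n - 1, skippy n-grams reduce to ordinary contiguous n-grams; larger values add longer-range character pairings.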
Other parameters are not meant to be modified. Do so at your own risk.
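The two frequency filters above behave like gensim's Dictionary.filter_extremes(no_below, no_above). A minimal stdlib sketch of that behavior, hypothetical and not the repository's code:

```python
from collections import Counter

def filter_terms(docs, term_min_freq, term_abuse_threshold):
    """Drop terms that appear in fewer than term_min_freq documents,
    or in more than term_abuse_threshold (a fraction, 0.0 to 1.0)
    of all documents. Mirrors gensim's
    Dictionary.filter_extremes(no_below, no_above)."""
    doc_freq = Counter(t for doc in docs for t in set(doc))
    n_docs = len(docs)
    keep = {t for t, c in doc_freq.items()
            if c >= term_min_freq and c / n_docs <= term_abuse_threshold}
    return [[t for t in doc if t in keep] for doc in docs]
```

Raising term_min_freq removes rare, noisy terms; lowering term_abuse_threshold removes near-ubiquitous terms that carry little topical signal.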
Required Python packages
- pyLDAvis [recommended to install first]
- WordCloud
- plotly
- adjustText
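Assuming a standard pip environment, the packages above can be installed as follows (a sketch; the PyPI distribution names for WordCloud and adjustText are assumed to be "wordcloud" and "adjustText"):

```shell
# install pyLDAvis first, as recommended above
pip install pyLDAvis
# then the remaining dependencies
pip install wordcloud plotly adjustText
```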
Results (.html files)