The project aims to identify and inspect latent topics across all community posts (from 2017 to 2019) to help Inspire's Data Science Team better understand hidden patterns and monitor topic drift through an interactive interface. A topic in this context is a set of words that can be used to represent a document.
The output is a well-defined set of topics that describe each document in the collection. A specific visualization technique, Termite, is adopted for assessing the textual LDA topic model. "Termite is a visualization tool for inspecting the output of statistical topic models such as Latent Dirichlet allocation (LDA) using an interactive interface as shown above. Termite is an alternative to lists of per-topic words, the standard practice: Users can drill down to examine a specific topic by clicking on a circle or topic label in the matrix, revealing the word-frequency view. The order of the terms presented in this view also uses seriation, which accounts for co-occurrence and collocation likelihood between all pairs of words. Term probabilities are encoded in circles."[2] For more details, see Chuang et al. [1].
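To make the "lists of per-topic words" baseline concrete, here is a minimal sketch that ranks the terms of each topic by probability. The function name `top_terms`, the toy vocabulary, and the probabilities are all hypothetical and chosen only for illustration; they do not come from the project's data.

```python
def top_terms(topic_term, vocab, n=3):
    """Return the n highest-probability terms for each topic.

    topic_term: list of rows, one per topic; row[j] is P(term j | topic).
    vocab:      list of term strings, aligned with the columns of topic_term.
    """
    result = []
    for row in topic_term:
        # Column indices sorted from most to least probable term.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n]])
    return result


# Hypothetical toy model: 2 topics over a 4-term vocabulary.
vocab = ["sleep", "pain", "diet", "doctor"]
topic_term = [
    [0.50, 0.10, 0.30, 0.10],
    [0.05, 0.55, 0.05, 0.35],
]

for topic_id, terms in enumerate(top_terms(topic_term, vocab, n=2)):
    print(topic_id, terms)
```

A flat list like this hides how a term is shared between topics, which is the gap the Termite matrix view is meant to close.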
- Revive Termite in Streamlit!
- Separate the modeling process from the visualization
- Remove dependencies
- Take a topic-term matrix as input
- Implement the interactive interface as a Streamlit app
- Refine the evaluation methodology for topic quality
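As a rough sketch of how the goals above could fit together, the function below takes only a topic-term matrix and a vocabulary (no model object) and produces one record per circle in a Termite-style matrix. Everything here is an assumption made for illustration: the name `termite_records`, its parameters, and the toy data are hypothetical. Terms are selected by total probability mass rather than by the saliency and seriation measures described by Chuang et al. [1], and circle area is assumed to be proportional to term probability.

```python
import math


def termite_records(topic_term, vocab, n_terms=3, threshold=0.01, max_radius=20.0):
    """Build long-form records (one per visible circle) from a topic-term matrix.

    Simplification: terms are chosen by total probability across topics,
    not by the saliency measure used in the original Termite.
    """
    n_topics = len(topic_term)
    # Total probability mass of each term across all topics.
    totals = [sum(topic_term[t][j] for t in range(n_topics)) for j in range(len(vocab))]
    keep = sorted(range(len(vocab)), key=lambda j: totals[j], reverse=True)[:n_terms]
    # Largest probability among the kept terms; it maps to max_radius.
    peak = max(topic_term[t][j] for t in range(n_topics) for j in keep)

    records = []
    for j in keep:
        for t in range(n_topics):
            p = topic_term[t][j]
            if p < threshold:
                continue  # too small to draw a circle
            # Area proportional to probability, so radius scales with sqrt(p).
            radius = max_radius * math.sqrt(p / peak)
            records.append({"term": vocab[j], "topic": t, "prob": p, "radius": radius})
    return records


# Hypothetical toy model: 2 topics over a 4-term vocabulary.
vocab = ["sleep", "pain", "diet", "doctor"]
topic_term = [
    [0.50, 0.10, 0.30, 0.10],
    [0.05, 0.55, 0.05, 0.35],
]

for rec in termite_records(topic_term, vocab, n_terms=3, threshold=0.08):
    print(rec)
```

Because the function depends only on plain lists, the modeling step (for example, LDA in PySpark) can run elsewhere and hand over just the matrix. A Streamlit app could then render these records with whichever charting component the project settles on.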
[1] Termite: Visualization Techniques for Assessing Textual Topic Models. Jason Chuang, Christopher D. Manning, Jeffrey Heer. Computer Science Dept, Stanford University.
[2] Revised Version of Termite: https://github.com/sailuh/termite
- Python/PySpark
- Colab/Jupyter Notebook
- Streamlit Web Application
- Cloud Platform: AWS