quangngoc/text-clustering

 
 


Text clustering: HDBSCAN is probably all you need

License Python 3.9+ Code style: black

Goal

Segment common items in a text dataset to pinpoint core themes and their distribution.

  • Clusters cover the main topics and subtopics in the dataset
  • Clusters are backed by accurate, LLM-generated summaries

Background

We use HDBSCAN for probabilistic clustering. The algorithm has several advantages:

  • Don’t be wrong: Clusters can have varying densities, need not be globular, and exclude noise points
  • Intuitive parameters: A minimum cluster size is easy to reason about, and the number of clusters k does not need to be specified (HDBSCAN selects k from the data)
  • Stability: HDBSCAN is stable over runs and subsampling and has good stability over parameter choices
  • Performance: When implemented well, HDBSCAN can be very efficient; the current implementation has similar performance to fastcluster’s agglomerative clustering

See the HDBSCAN docs on comparing clustering algorithms and how hdbscan works for more information.

Citations

Experiments

1. Visualizing core themes in fka/awesome-chatgpt-prompts

These figures correspond to experiments/02_09_2023_16_54_32



Figure 1. HDBSCAN splits the 153 text-to-text prompts from fka/awesome-chatgpt-prompts into two clusters: Cluster 1 with 44 prompts (orange) and Cluster 2 with 105 prompts (blue). The remaining 4 prompts (gray) were filtered out as outliers/noise.
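The counts in Figure 1 can be tallied from a label array as sketched below. The label list is hypothetical and built only to reproduce the figure's counts; it assumes the HDBSCAN convention that -1 marks noise, and the ids 1 and 2 follow the figure's naming rather than the library's zero-based labels.

```python
from collections import Counter

# Hypothetical label list reproducing the Figure 1 counts.
# Assumption: -1 marks noise, as in HDBSCAN's output convention;
# ids 1 and 2 follow the figure's naming.
labels = [1] * 44 + [2] * 105 + [-1] * 4

counts = Counter(labels)
total = len(labels)

# Report each group's size and its share of the dataset.
for label, count in sorted(counts.items()):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {count} prompts ({count / total:.1%})")
```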

Figure 2. The most persistent prompts in each leaf cluster are known as "exemplars". These represent the hearts around which the ultimate cluster formed. See the HDBSCAN docs on soft clustering explanation for supporting information and functions.

Figure 3. Additional clustering is conducted around the exemplars to identify sub-topics in the dataset. The cases in each sub-cluster subsequently serve as retrieved context for the LLM theme summarization calls below.

Figure 4. Visualizing the "Computer Programming and Software Development" theme, which covers 13% of the dataset. The summary was generated by gpt-3.5-turbo-16k. The diagram was created with jsoncrack.com/editor.


These figures correspond to experiments/04_09_2023_03_02_25


HDBSCAN splits the 73,718 text-to-image prompts from gustavosta/stable-diffusion-prompts into 78 clusters that together represent 25,019 prompts (34% of the dataset). The remaining 48,699 prompts (66%) were filtered out as outliers/noise. The 5 largest clusters cover about 9.5% of the dataset; these are the segments we examine for drift below.

| cluster id | theme |
| --- | --- |
| 56 | Portraits and artistic depictions of female anime characters, beautiful women, and fashionable young women |
| 13 | Symmetrical portraits of people, characters, and sci-fi figures |
| 61 | Futuristic sci-fi spaceship concept art |
| 50 | Portraits of famous actresses as characters in various roles, outfits, and styles |
| 74 | Surreal, cinematic, and futuristic digital art |

| cluster id | train count (73.7k rows) | test count (8.19k rows) | drift detection (% change) |
| --- | --- | --- | --- |
| 56 | 2530 (3.43%) | 310 (3.79%) | 10.50 |
| 13 | 1343 (1.82%) | 149 (1.82%) | 0.00 |
| 61 | 1287 (1.75%) | 131 (1.60%) | -8.57 |
| 50 | 1055 (1.43%) | 135 (1.65%) | 15.38 |
| 74 | 749 (1.02%) | 109 (1.33%) | 30.39 |

Tables 1 & 2. Drift detection for the 5 largest clusters (bottom), alongside their claude-2 summaries (top).
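The drift column appears consistent with the relative change between each cluster's rounded train and test shares. This formula is inferred from the table values and is not confirmed by the source, so treat the sketch below as one plausible reading. The helper name `percent_change` is hypothetical.

```python
def percent_change(train_share: float, test_share: float) -> float:
    """Relative change of a cluster's share from train to test, in percent."""
    return (test_share - train_share) / train_share * 100


# (train share %, test share %) per cluster id, copied from the drift table.
shares = {
    56: (3.43, 3.79),
    13: (1.82, 1.82),
    61: (1.75, 1.60),
    50: (1.43, 1.65),
    74: (1.02, 1.33),
}

# Assumed formula: relative change between the rounded shares, to 2 decimals.
drift = {cid: round(percent_change(tr, te), 2) for cid, (tr, te) in shares.items()}

for cid, change in drift.items():
    print(f"cluster {cid}: {change:+.2f}%")
```

A positive value means the cluster takes up a larger share of the test split than of the train split. Small clusters swing more for the same absolute difference, which is worth keeping in mind when reading the 30.39 figure for cluster 74.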


Prompt: "Beautiful painting of an Aspen forest at sunset, digital art, award winning illustration, golden hour, smooth, sharp lines, concept art, trending on artstation"
Model: Runway Gen-2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Beautiful landscape paintings and matte art (cluster id: 75)


Prompt: "Futuristic batman, brush strokes, oil painting, greg rutkowski"
Model: Midjourney V5.2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Art and portraits of Batman characters (cluster id: 41)

Prompt: "Futuristic Porsche designed by Apple, a detailed matte painting by Kitagawa Utamaro, cgsociety, octane render, highly detailed, matte painting, concept art, sci-fi"
Model: Midjourney V5.2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Futuristic and fantasy vehicle concept art (cluster id: 52)

Figure 5. A sample of 3 text-to-image generations from various models for prompts in the gustavosta/stable-diffusion-prompts dataset (alongside their cluster ids).
