nr_samples #1492
-
Hi, I am enjoying my research thanks to the BERTopic framework you created. I'm posting a question because I need to check something before citing it in my paper. When describing the KeyBERTInspired algorithm, you said that "(...) we randomly sample a number of candidate documents per cluster which is controlled by the nr_samples parameter. (...)". The size of each topic (cluster) can vary: some topics might contain 10 documents, while others might contain 10,000. Logically, it seems impossible to set the parameter higher than the size of the smallest topic, while setting it lower would make the sample less representative of the larger topics. Nevertheless, setting the parameter to 500 seems to produce results for all topics, and I'm confused as to how this is possible. Am I misunderstanding the algorithm or the meaning of the parameter?
-
If a cluster contains 10 documents and nr_samples is set to 500, it will simply use all 10 documents from that cluster. So setting nr_samples higher than the size of a cluster poses no issues.
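To make the capping behaviour concrete, here is a minimal sketch. It is not BERTopic's actual implementation: the function `sample_candidates`, the `docs_per_topic` dictionary, and the cluster names are hypothetical and exist only for illustration. The only assumption is that the number of documents drawn from a cluster is capped at that cluster's size.

```python
import random


def sample_candidates(docs_per_topic, nr_samples=500, seed=42):
    """Hypothetical helper: draw up to nr_samples documents from each cluster.

    Assumption for illustration only: the draw is capped at the cluster
    size, so a small cluster contributes all of its documents.
    """
    rng = random.Random(seed)
    sampled = {}
    for topic, docs in docs_per_topic.items():
        # Never ask for more documents than the cluster contains.
        k = min(nr_samples, len(docs))
        sampled[topic] = rng.sample(docs, k)
    return sampled


# Two made-up clusters matching the sizes mentioned in the question.
clusters = {
    "small": [f"doc_{i}" for i in range(10)],
    "large": [f"doc_{i}" for i in range(10_000)],
}

sampled = sample_candidates(clusters, nr_samples=500)
print({topic: len(docs) for topic, docs in sampled.items()})
# {'small': 10, 'large': 500}
```

Under this assumption, nr_samples acts as an upper bound per cluster rather than a required count, which is why a value of 500 still produces results for a topic with only 10 documents.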