nr_samples #1492

Realfriend-Logistics · 2023-08-29T03:05:37Z

Hi, I am enjoying my research thanks to BERTopic, the framework you created.

I'm just posting a question because I need to check something before citing your framework in my paper.

When describing the simple algorithm of KeyBERTInspired, you said that "(...) we randomly sample a number of candidate documents per cluster which is controlled by the nr_samples parameter. (...)".

The size of each topic (cluster) can vary. Some topics might contain 10 documents, while others might contain 10,000.

From a logical point of view, it seems impossible to set the parameter higher than the size of the smallest topic, and setting it lower would cause representativeness issues from a statistical point of view.

Nevertheless, setting the parameter to 500 seems to produce results for all topics, and I'm confused as to how this is possible. Am I misunderstanding the algorithm or the meaning of the parameter?

MaartenGr · 2023-08-29T11:53:22Z
Maintainer

If a cluster contains 10 documents and we want to sample 500 documents, it will simply extract all 10 documents from that specific cluster. So setting it higher poses no issues.
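To make the behavior concrete, here is a minimal sketch of capped per-cluster sampling. It is not BERTopic's actual implementation: the helper `sample_per_cluster`, the `clusters` dictionary, and the `seed` argument are hypothetical names used only for illustration. The only point it shows is that the requested sample size is limited to the cluster size, so a small cluster is taken whole.

```python
import random


def sample_per_cluster(clusters, nr_samples, seed=42):
    """Hypothetical helper: draw up to nr_samples documents from each cluster."""
    rng = random.Random(seed)
    sampled = {}
    for topic, docs in clusters.items():
        # Cap the sample size at the cluster size, so a cluster with fewer
        # documents than nr_samples contributes all of its documents.
        k = min(nr_samples, len(docs))
        sampled[topic] = rng.sample(docs, k)
    return sampled


# Toy data (assumed for illustration): one tiny topic and one large topic.
clusters = {
    0: [f"doc_{i}" for i in range(10)],
    1: [f"doc_{i}" for i in range(10_000)],
}

sampled = sample_per_cluster(clusters, nr_samples=500)
print({topic: len(docs) for topic, docs in sampled.items()})  # {0: 10, 1: 500}
```

With `nr_samples=500`, the 10-document topic yields all 10 documents and the 10,000-document topic yields a random subset of 500, which is why a single setting produces results for every topic.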

2 replies

Realfriend-Logistics Aug 29, 2023
Author

Aha, so if the size of the cluster is smaller than the parameter value I set, all the documents in that cluster are sampled? If that's correct, then your statement that there are no statistical issues in this case makes sense. Thanks, your framework is very nice and useful.

MaartenGr Aug 29, 2023
Maintainer

That's correct! Thanks for the kind words 😄
