Expectation of transform results for repeated document runs #630

dereke55 · 2022-07-21T17:59:59Z


Given a trained model, if the same documents are passed to transform, is the result of the transform method expected to be the same on every run (assuming the same model each time)?

For example, if I have trained a model and call transform on a set of 10 documents, should the result for those 10 documents be the same EVERY time (of course, the topics of the 10 documents may differ from one another)?

There are no side effects of predicting documents, correct? And there are no assumptions about the documents, correct?

The reason I ask is that I am currently seeing some documents receive the -1 (unknown) topic_id, but on subsequent runs a valid topic_id is predicted for them. Before digging in further, I want to make sure my expectation is correct.

Thank you,
Derek

Here's a snippet of code if this helps:

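import bertopic
from sklearn.feature_extraction.text import CountVectorizer

# module-level handle so train_model() and predict() share one model
model = None


def train_model(docs):
    vectorizer_model = CountVectorizer(
        min_df=5,
        ngram_range=(1, 3),
    )

    global model
    model = bertopic.BERTopic(
        verbose=True,
        calculate_probabilities=False,
        vectorizer_model=vectorizer_model,
    )

    # fit the model that predict() will use later
    model.fit_transform(docs)


def predict(docs):
    global model
    # given the SAME docs EVERY time, should the (topic, probs) results be the same?
    topic, probs = model.transform(docs)
    return topic, probs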

drob-xx · 2022-07-21T21:56:28Z


If I'm not mistaken, the reason is that UMAP, the algorithm that reduces the BERT embeddings to a manageable number of dimensions (5) for HDBSCAN, is stochastic, meaning that it relies on randomly generated values that affect the output from run to run. This is expected and is likely what is causing the behavior you are seeing.

The only way to deal with this is to seed UMAP with a static value to control its random number generator. It doesn't matter what value you use, only that you use the same number across all runs that you want to produce the same output. You can do this by creating a UMAP instance with random_state set and passing it to BERTopic as umap_model before calling BERTopic.fit() or BERTopic.fit_transform(). However, I prefer to set the value directly after a new BERTopic object has been created. In your case it would look like this:

global model
model = bertopic.BERTopic(
    verbose=True,
    calculate_probabilities=False,
    vectorizer_model=vectorizer_model,
)
# any fixed integer works; set it before calling fit() / fit_transform()
model.umap_model.random_state = 42

3 replies

dereke55 Jul 22, 2022
Author

Thank you for the response. I'll try the random_state parameter. I know I need to do some additional parameter tweaking as well.

Follow-up question (which you may not know the answer to): is it standard for users to set the random_state parameter? Or is there a particular use case where it makes sense to use it?

drob-xx Jul 22, 2022

Like everything else, it depends. It is really specific to your use case. With the stuff I do, I don't see much sense in setting it. The 'drift' I see from one run to the next is minimal, and I'm not much interested in classifying a particular document with a high degree of accuracy. Mostly I think it is useful if you are running some sort of comparison or test, already understand the statistical looseness of the underlying algorithm, and just want to stabilize your outputs so you can concentrate on other issues.
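
One way to measure that run-to-run drift before deciding whether to fix the seed is to call transform several times on the same documents and list the ones whose topic changes. The sketch below is illustrative only: unstable_docs and make_fake_transform are hypothetical names, and the fake transform stands in for a real model so the example runs without BERTopic installed.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def unstable_docs(
    transform: Callable[[Sequence[str]], Sequence[int]],
    docs: Sequence[str],
    runs: int = 5,
) -> Dict[int, List[int]]:
    """Call `transform` on the same docs `runs` times.

    Returns a dict mapping the index of every document whose topic
    changed between runs to the list of topics it received per run.
    """
    results = [list(transform(docs)) for _ in range(runs)]
    changed: Dict[int, List[int]] = {}
    for i in range(len(docs)):
        seen = [run[i] for run in results]
        if len(set(seen)) > 1:
            changed[i] = seen
    return changed


def make_fake_transform(seed: Optional[int] = None):
    """Hypothetical stand-in for a stochastic model (not BERTopic)."""

    def transform(docs: Sequence[str]) -> List[int]:
        # With a fixed seed the generator restarts identically on every call;
        # with seed=None it is seeded from system entropy each time.
        rng = random.Random(seed)
        return [rng.choice([-1, 0, 1]) for _ in docs]

    return transform


docs = [f"document {i}" for i in range(10)]

# Seeded: every run yields the same topics, so nothing is reported.
print(unstable_docs(make_fake_transform(seed=42), docs))  # {}

# Unseeded: topics will usually differ between runs.
print(unstable_docs(make_fake_transform(seed=None), docs))
```

With a fitted BERTopic model, something like unstable_docs(lambda d: model.transform(d)[0], docs) should work, assuming transform returns a (topics, probabilities) pair as in the snippet earlier in this thread.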

dereke55 Jul 22, 2022
Author

Of course. I greatly appreciate your explanations, recommendations, and overall help on this.
