LSH Implementation with TFIDF Dense Matrix #101

Open
girishmt4 opened this issue Apr 24, 2018 · 5 comments

Comments

@girishmt4

I am currently working on a document similarity project. We process text documents to generate a TF-IDF vector for each document in the corpus. In a nutshell, we are working with DENSE DATA: the documents are the data points, and the TF-IDF values of the terms occurring in each document are their features.
We succeeded in implementing LSH with sparse data, but it is not very efficient.
Is it possible to use FALCONN with dense data for an LSH implementation?

@ludwigschmidt
Collaborator

Yes, FALCONN supports dense data. In fact, the support for dense data is better than for sparse data. But if your data is very high-dimensional, the dense approach might not be efficient. What dimension do you work with?

@girishmt4
Author

I am currently working with a dataset that stores TF-IDF values only for the terms that occur in each particular document, so every point has a different number of stored entries.
What is your take on this?
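For readers following along, here is a minimal sketch of how per-document term weights like these are usually mapped into one shared vector space. It uses plain Python only; the helper names `build_vocabulary` and `to_sparse` and the toy weights are hypothetical and are not part of FALCONN or of the dataset described above. The assumption is that each document arrives as a term-to-weight mapping.

```python
def build_vocabulary(docs):
    """Assign each distinct term a fixed column index (sorted for determinism)."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: idx for idx, term in enumerate(terms)}


def to_sparse(doc, vocab):
    """Return (index, value) pairs sorted by index; absent terms are implicit zeros."""
    return sorted((vocab[term], weight) for term, weight in doc.items())


# Toy documents (made-up weights), each listing only the terms it contains.
docs = [
    {"lsh": 0.8, "hash": 0.6},
    {"hash": 0.5, "tfidf": 0.7, "corpus": 0.5},
]

vocab = build_vocabulary(docs)
sparse_docs = [to_sparse(doc, vocab) for doc in docs]
dimension = len(vocab)  # every point lives in this same vocabulary-sized space
```

Under this view all points share the same dimension (the vocabulary size); what differs from document to document is only how many nonzero entries are stored.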

@ludwigschmidt
Collaborator

In that case, using a sparse representation might be better.

@girishmt4
Author

Can you explain the reason behind that? I am still wondering why a sparse representation can perform better than a dense one.

@ludwigschmidt
Collaborator

With a dense representation, the code will be performing many unnecessary multiplications with zero.
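To illustrate the point about multiplications with zero, here is a rough sketch in plain Python that counts the multiplications in one inner product, computed once over a dense vector and once over only the stored nonzeros. It is not FALCONN's internal code; the function names, the dimension of 10000, and the three nonzero entries are assumptions chosen for illustration, on the premise that LSH hash evaluation is dominated by inner products with projection vectors.

```python
def dense_dot(x, y):
    """Inner product over every coordinate; returns (value, multiplications)."""
    total, mults = 0.0, 0
    for a, b in zip(x, y):
        total += a * b
        mults += 1
    return total, mults


def sparse_dot(x_nonzeros, y):
    """Inner product touching only stored nonzeros of x (index -> value)."""
    total, mults = 0.0, 0
    for idx, val in x_nonzeros.items():
        total += val * y[idx]
        mults += 1
    return total, mults


dimension = 10000                            # assumed vocabulary size
nonzeros = {3: 0.5, 42: 0.25, 9000: 0.75}    # assumed TF-IDF entries of one document

# Dense copy of the same document: mostly zeros.
dense_vec = [0.0] * dimension
for idx, val in nonzeros.items():
    dense_vec[idx] = val

projection = [1.0] * dimension               # stand-in for one projection vector

dense_value, dense_mults = dense_dot(dense_vec, projection)
sparse_value, sparse_mults = sparse_dot(nonzeros, projection)
```

Both calls return the same value, but the dense version performs one multiplication per vocabulary entry while the sparse version performs one per nonzero term, which is where the saving comes from when documents contain only a small fraction of the vocabulary.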

A-Guldborg pushed a commit to duckth/FOENNIX that referenced this issue May 12, 2024