Search: Fix third-party integrations #98

Open
josephjclark opened this issue Sep 19, 2024 · 1 comment
josephjclark (Collaborator) commented Sep 19, 2024

Before we can deploy the search service, we need to work out who our integrations should be with.

Right now, we are using a third-party vector database (Zilliz) to store embeddings, and a third-party embeddings service (OpenAI) to generate embeddings for queries.

Our requirements are:

  • We need to encode our docs site as a set of embeddings in the database
  • We need to take user queries in natural language and convert them into embeddings
  • We need to search our database for matching vectors.
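To make the three requirements above concrete, here is a minimal sketch in Python. It is only an illustration: `toy_embed` is a hypothetical stand-in (a hashed bag-of-words), not a real embeddings model, and `build_index` and `search` are made-up names. In practice the vectors would come from whichever model we pick and the search would run inside the vector database.

```python
import hashlib
import math

DIMS = 64  # assumption: toy dimensionality; real models use hundreds or thousands


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embeddings model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIMS
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIMS
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are already normalised, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def build_index(docs: dict[str, str]) -> dict[str, list[float]]:
    """Requirement 1: encode each docs page (or chunk) as an embedding."""
    return {doc_id: toy_embed(text) for doc_id, text in docs.items()}


def search(index: dict[str, list[float]], query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Requirements 2 and 3: embed the query, then rank stored vectors by similarity."""
    q = toy_embed(query)
    scored = sorted(((cosine(q, v), doc_id) for doc_id, v in index.items()), reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]
```

The one design point worth noting is that the same embedding function has to be used for the docs and for the query, otherwise the vectors are not comparable.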

Self-hosted Database

I am convinced that we should be able to host our own vector database in the container.

The database should be built offline as part of a builder image. We can use whatever dev dependencies are needed in the builder image, and drop them for the final production image.

Once the build is complete, we don't need to write to the database again: we only need read and query capability.
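A rough sketch of that build-once, read-only pattern, using only the Python standard library. SQLite here is just a stand-in to show the shape of the idea, not a proposal for the actual vector database, and the table layout and function names (`build_db`, `open_readonly`, `load_vectors`) are hypothetical.

```python
import sqlite3
from array import array


def build_db(path: str, embeddings: dict[str, list[float]]) -> None:
    """Build step (builder image): write every (doc_id, vector) pair once."""
    conn = sqlite3.connect(path)
    with conn:  # commits the transaction on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (doc_id TEXT PRIMARY KEY, vec BLOB)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO embeddings VALUES (?, ?)",
            # Store each vector as packed 32-bit floats
            [(doc_id, array("f", vec).tobytes()) for doc_id, vec in embeddings.items()],
        )
    conn.close()


def open_readonly(path: str) -> sqlite3.Connection:
    """Runtime (production image): open the prebuilt file read-only, so writes fail loudly."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def load_vectors(conn: sqlite3.Connection) -> dict[str, list[float]]:
    """Read every stored vector back into memory for querying."""
    out = {}
    for doc_id, blob in conn.execute("SELECT doc_id, vec FROM embeddings"):
        vec = array("f")
        vec.frombytes(blob)
        out[doc_id] = vec.tolist()
    return out
```

The builder image would call `build_db` and copy only the resulting file into the final image; the server would only ever call `open_readonly`.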

We can even trigger a new Apollo build every time the docsite is updated to keep things in sync. But the doc site doesn't update THAT often so we don't really need a live sync. A weekly update would be fine.

We could, I suppose, use the database to cache searches later (but even then, the database might not be the best way to do this).

We should choose an open source database from the options available.

I don't actually know how big the embeddings are, in memory size, for the docsite. But I doubt it's gigabytes?
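A back-of-envelope way to estimate this: the raw size is roughly number of chunks × dimensions × bytes per value. The sketch below assumes 32-bit floats and 1536 dimensions (the output size of OpenAI's `text-embedding-3-small`); the chunk count of 10,000 is purely a guess for illustration, and index overhead in a real vector database is not included.

```python
def embeddings_size_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, ignoring ids, metadata and index overhead."""
    return num_chunks * dims * bytes_per_value


# Assumption: 10,000 chunks is a made-up figure, not a measurement of the docsite.
size = embeddings_size_bytes(num_chunks=10_000, dims=1536)
print(f"{size / 1024**2:.1f} MiB")  # prints "58.6 MiB"
```

So even with a generous chunk count, the raw vectors would sit in the tens of megabytes under these assumptions, not gigabytes.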

Note that this means the apollo server needs to actually run queries against the DB. Up until now apollo has really just been a proxy server - from here it'll start doing its own actual work.

Third-party Database

If we really can't bundle up our own database in the container, we'll need to use a third party.

We're currently using Zilliz, which is the SaaS version of Milvus.

We should choose a partner which is open source, isn't too expensive, and ideally which aligns to our values.

Self-hosted Embeddings Model

In a perfect world we would keep the embeddings model in the image too. This would mean that the apollo server needs to be big enough and powerful enough to run an LLM.

Note that the dev dependencies for the model don't need to be in the final production image - we shouldn't need to store torch and all its built-in models.

I suspect that we can build an embeddings model in a builder image, then remove all the dev dependencies, and use a final model that's around 1GB in size.

We would need to be careful about which model we pick to ensure that a) it is ethically trained and b) it generates good quality embeddings. We can compare against OpenAI's embeddings and the existing Milvus search to get a sense of how good they are.
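One simple way to run that comparison is to send the same test queries through both setups and measure how much the top results overlap. A minimal sketch, assuming each setup returns a ranked list of doc ids per query; `overlap_at_k` and `mean_overlap` are hypothetical helper names, and the metric is my suggestion, not something we already use.

```python
def overlap_at_k(baseline: list[str], candidate: list[str], k: int = 5) -> float:
    """Fraction of the baseline's top-k doc ids that also appear in the candidate's top-k."""
    top_base = set(baseline[:k])
    top_cand = set(candidate[:k])
    if not top_base:
        return 0.0
    return len(top_base & top_cand) / len(top_base)


def mean_overlap(
    baseline_results: dict[str, list[str]],
    candidate_results: dict[str, list[str]],
    k: int = 5,
) -> float:
    """Average overlap@k across every query present in the baseline results."""
    if not baseline_results:
        return 0.0
    total = sum(
        overlap_at_k(ranked, candidate_results.get(query, []), k)
        for query, ranked in baseline_results.items()
    )
    return total / len(baseline_results)
```

Note this only measures agreement with the existing search, not absolute quality: a candidate model could disagree with the baseline and still be better, so low-overlap queries would need a manual look.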

The model would be called:

  • At build time, to generate embeddings for the doc site
  • At runtime, to generate embeddings for a user query

Third-party Embeddings Model

If we can't self-host the model, we'll have to stick with a third party. This may well be appropriate, but we'd need a cost-effective solution.

We currently use the OpenAI embeddings service. Anthropic recommends https://www.voyageai.com/, which I'd at least like to take a serious look at.

josephjclark (Collaborator, Author) commented

I think the right strategy here, as a first pass, is:

  • Get Milvus embedded in the image
  • Use OpenAI or Voyage AI for embeddings for now
