Before we can deploy the search service, we need to work out which integrations it should rely on.
Right now we are using a third-party vector database to store embeddings (Zilliz) and a third-party embeddings service to generate embeddings for queries (OpenAI).
Our requirements are:
We need to encode our docs site as a set of embeddings in the database.
We need to take user queries in natural language and convert them into embeddings.
We need to search our database for matching vectors.
Self-Hosted Database
I am convinced that we should be able to host our own vector database in the container.
The database should be built offline as part of a builder image. We can use whatever dev dependencies are needed in the builder image and drop them for the final production image.
Once the build is complete, we don't need to write to the database again: we only need read and query capability.
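As a rough sketch of that build step, assuming we stay in the Milvus ecosystem: pymilvus ships Milvus Lite, a file-backed mode that would let the builder stage write a single database file for the production image to ship read-only. The load_doc_chunks() and embed() helpers here are placeholders for our pipeline:

```python
# build_index.py -- runs in the builder image only (sketch).
# Writes a self-contained Milvus Lite database file; the production image
# ships the file read-only, with pymilvus as its only database dependency.
from pymilvus import MilvusClient

from our_pipeline import load_doc_chunks, embed  # placeholder helpers

client = MilvusClient("docs.db")  # a local file path selects Milvus Lite
client.create_collection(collection_name="docs", dimension=384)  # 384 assumes a small encoder

rows = [
    {"id": i, "vector": embed(chunk.text), "text": chunk.text, "url": chunk.url}
    for i, chunk in enumerate(load_doc_chunks())
]
client.insert(collection_name="docs", data=rows)
```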
We could even trigger a new Apollo build every time the doc site is updated to keep things in sync. But the doc site doesn't update THAT often, so we don't really need a live sync. A weekly rebuild would be fine.
I suppose we could use the database to cache searches later (though even then, the database might not be the best way to do this).
We should choose an open source database from the options available.
I don't actually know how big the embeddings for the doc site are in memory, but I doubt it's gigabytes. Back of the envelope: even 50,000 chunks at 1,536 float32 dimensions is about 300 MB, and a 384-dimension model would put the same corpus under 100 MB.
Note that this means the Apollo server needs to actually run queries against the DB. Up until now Apollo has really just been a proxy server; from here it'll start doing its own actual work.
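The runtime query path would stay small; a minimal sketch, again assuming Milvus Lite and a placeholder embed() for the query encoder:

```python
# search.py -- query path inside the Apollo server (sketch).
from pymilvus import MilvusClient

from our_pipeline import embed  # placeholder query encoder

client = MilvusClient("docs.db")  # the read-only file produced at build time

def search_docs(query: str, limit: int = 5) -> list[dict]:
    """Embed the user's query and return the closest doc chunks."""
    hits = client.search(
        collection_name="docs",
        data=[embed(query)],
        limit=limit,
        output_fields=["text", "url"],
    )
    # pymilvus returns one result list per query vector; we sent exactly one.
    return [hit["entity"] for hit in hits[0]]
```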
Third-Party Database
If we really can't bundle up our own database in the container, we'll need to use a third party.
We're currently using Zilliz, which is the SaaS version of Milvus.
We should choose a partner whose product is open source, isn't too expensive, and ideally aligns with our values.
Self-Hosted Embeddings Model
In a perfect world we would keep the embeddings model in the image too. This would mean that the Apollo server needs to be big enough and powerful enough to run an LLM.
Note that the dev dependencies for the model don't need to be in the final production image; we shouldn't need to ship torch and all its built-in models.
I suspect that we can build an embeddings model in a builder image, then remove all the dev dependencies, and end up with a final model that's around 1 GB in size.
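One way to get there, sketched under the assumption that we use a Hugging Face sentence-transformers model: export it to ONNX in the builder stage with optimum, so the production image only needs onnxruntime rather than torch. The model choice here is hypothetical:

```python
# export_model.py -- builder image only (sketch).
# Exports a sentence-transformers model to ONNX so the production image
# can run it with onnxruntime instead of shipping torch.
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # hypothetical choice

model = ORTModelForFeatureExtraction.from_pretrained(MODEL_ID, export=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

model.save_pretrained("model_onnx/")     # writes model.onnx
tokenizer.save_pretrained("model_onnx/")
```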
We would need to be careful about which model we pick to ensure that a) it is ethically trained and b) it generates good-quality embeddings. We can compare against OpenAI's embeddings and the existing Milvus search to get a sense of how good they are, e.g. with the overlap check sketched below.
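A cheap sanity check: run a handful of known queries through both pipelines and measure how much their top-k results overlap. The search_candidate() and search_current() helpers are hypothetical, standing in for the new stack and the OpenAI-plus-Zilliz stack respectively, each returning a ranked list of doc URLs:

```python
# eval_overlap.py -- rough quality comparison (sketch; helpers hypothetical).
from our_pipeline import search_candidate, search_current  # hypothetical

def overlap_at_k(queries: list[str], k: int = 5) -> float:
    """Average top-k result overlap between the two pipelines."""
    total = 0.0
    for q in queries:
        candidate = set(search_candidate(q)[:k])  # new model + local DB
        current = set(search_current(q)[:k])      # OpenAI + Zilliz today
        total += len(candidate & current) / k
    return total / len(queries)
```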
The model would be called (see the encoder sketch after this list):
At build time, to generate embeddings for the doc site
At runtime, to generate embeddings for a user query
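Under the ONNX approach above, one encoder would serve both call sites. A minimal runtime sketch, assuming the model_onnx/ directory exported in the builder stage (we embed one text at a time, so plain mean pooling without padding masks is fine here):

```python
# embed.py -- shared encoder for build-time and query-time use (sketch).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession("model_onnx/model.onnx")
tokenizer = AutoTokenizer.from_pretrained("model_onnx/")

def embed(text: str) -> list[float]:
    """Encode one text into a normalised embedding vector."""
    inputs = tokenizer(text, return_tensors="np", truncation=True)
    last_hidden = session.run(None, dict(inputs))[0]  # shape (1, seq, dim)
    vec = last_hidden.mean(axis=1)[0]  # mean-pool over tokens
    return (vec / np.linalg.norm(vec)).tolist()  # L2-normalise
```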
Third-Party Embeddings Model
If we can't self-host the model, we'll have to stick with a third party. That may well be appropriate, but we'd need a cost-effective solution.
We currently use the OpenAI embeddings service. Anthropic recommends https://www.voyageai.com/, which I'd at least like to take a serious look at.
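For reference, the current query-time call is roughly this (a sketch using the official openai Python client; the model name is an assumption and may not match what we actually run):

```python
# embed_remote.py -- current third-party query path (sketch).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumption: our actual model may differ
        input=text,
    )
    return response.data[0].embedding
```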