Retrieval Augmented Generation (RAG) in Vespa

This sample application demonstrates an end-to-end Retrieval Augmented Generation application in Vespa, where all the steps are run within Vespa. No other systems are required.

This sample application focuses on the generation part of RAG, and builds upon the MS Marco passage ranking sample application. Please refer to that sample application for details on more advanced forms of retrieval, such as vector search and cross-encoder re-ranking. The generation steps in this sample application happen after retrieval, so the techniques there can easily be used in this application as well. For the purposes of this sample application, we will use a simple text search using BM25.

We will show three versions of an end-to-end RAG application here:

  1. Using an external LLM service to generate the final response.
  2. Using local LLM inference to generate the final response.
  3. Deploying to Vespa Cloud and using GPU accelerated LLM inference to generate the final response.

For details on using retrieval augmented generation in Vespa, please refer to the RAG in Vespa documentation page. For more on the general use of LLMs in Vespa, please refer to LLMs in Vespa.

Quick start

The following is a quick start recipe that uses a tiny slice of the MS Marco passage ranking dataset. To download the entire dataset, follow the instructions in the MS Marco passage ranking sample application.

In the following we will deploy the sample application either to a local Docker (or Podman) container or to Vespa Cloud. Querying the sample application does not depend on the type of deployment, and is shown in the querying section below.

Make sure that Vespa CLI is installed. Update to the newest version:

$ brew install vespa-cli

Download this sample application:

$ vespa clone retrieval-augmented-generation rag && cd rag

Deploying to Vespa Cloud using GPU

Deploy the sample application to Vespa Cloud on a GPU instance to perform the generative part. Note that this application can fit within the free quota, so it is free to try.

In the following section, we will set the Vespa CLI target to the cloud. Make sure you have created a tenant at console.vespa-cloud.com. Make a note of the tenant's name; it will be used in the next steps. For more information, see the Vespa Cloud getting started guide.

Add your OpenAI API key to the Vespa secret store as described in Secret Management. Unless you already have one, create a new vault, and add your OpenAI API key as a secret.

The services.xml file must refer to the newly added secret in the secret store. Replace <my-vault-name> and <my-secret-name> below with your own values:

<secrets>
    <openai-api-key vault="<my-vault-name>" name="<my-secret-name>"/>
</secrets>

Configure the Vespa CLI. Replace tenant-name below with your tenant name. We use the application name rag-app here, but you are free to choose your own application name:

$ vespa config set target cloud
$ vespa config set application tenant-name.rag-app

Log in and add your public certificates to the application for Dataplane access:

$ vespa auth login
$ vespa auth cert

Grant the application access to the secret. Access is granted in the Vespa Cloud Console, which requires that the application already exists. The easiest way to create it is to deploy, which auto-creates the application. This first deployment will fail:

$ vespa deploy --wait 900
[09:47:43] warning Deployment failed: Invalid application: Vault 'my_vault' does not exist,
or application does not have access to it

At this point, open the console (the link is like https://console.vespa-cloud.com/tenant/mytenant/account/secrets) and grant access:

(Screenshot: edit application access dialog)

Deploy the application again. It can take some time for all nodes to be provisioned:

$ vespa deploy --wait 900

The application should now be deployed! You can continue to the querying section below to test it.

Deploying locally to a Docker container

Here, we will deploy the sample application locally to a Docker or Podman container. Please ensure that either Docker or Podman is installed and running with at least 12 GB of available memory.

Validate the container engine's memory setting, which should be at least 12 GB:

$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"
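As a sketch of how this check could be scripted, the function below compares a byte count against a minimum given in GiB. The name `has_min_memory` is hypothetical, and treating the 12 GB requirement as 12 GiB is an assumption made for this example.

```shell
# Sketch: succeed if a byte count is at least a minimum number of GiB.
# has_min_memory is a hypothetical helper, not part of Docker, Podman or Vespa.
# Usage: has_min_memory <total_bytes> [min_gib]   (min_gib defaults to 12)
has_min_memory() {
  total_bytes="$1"
  min_gib="${2:-12}"
  # 1 GiB = 1024 * 1024 * 1024 bytes
  min_bytes=$((min_gib * 1024 * 1024 * 1024))
  [ "$total_bytes" -ge "$min_bytes" ]
}
```

One way to use it, assuming your Docker version exposes the `MemTotal` template field, is `has_min_memory "$(docker info --format '{{.MemTotal}}')" && echo "enough memory"`. The total reported by the engine can differ slightly from the value you configured, so lower the threshold if the check is too strict for your setup.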

In the following commands, you can replace docker with podman; this should work out of the box.

Pull and start the most recent Vespa container image:

$ docker pull vespaengine/vespa
$ docker run --detach --name vespa-rag --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  vespaengine/vespa

Set the Vespa CLI target to this local container:

$ vespa config set target local

Verify that the configuration service (deploy API) is ready:

$ vespa status deploy --wait 300
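If you want to script this wait without the Vespa CLI, a generic poll loop is one option. This is a minimal sketch and not a description of what `vespa status` does internally. `wait_until` is a hypothetical helper, and the health URL in the usage note is an assumption about the config server's state API.

```shell
# Sketch: retry a command once per second until it succeeds or a timeout passes.
# wait_until is a hypothetical helper for this example.
# Usage: wait_until <timeout_seconds> <command> [args...]
wait_until() {
  timeout_s="$1"
  shift
  waited=0
  # Run the given command; its output is discarded, only the exit code matters.
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

For example, `wait_until 300 curl --silent --fail http://localhost:19071/state/v1/health` would keep polling for up to about 300 seconds. It only checks that the endpoint answers with a successful HTTP status and does not inspect the response body.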

Deploy the application. This downloads the LLM file, which can take some time. If you don't want to run local LLM inference, you can remove the corresponding section in services.xml so that the application skips this download.

$ vespa deploy --wait 900

The application should now be deployed!

Querying

Let's feed the documents:

$ vespa feed ext/docs.jsonl

Run a query first to check the retrieval:

$ vespa query query="what was the manhattan project?" hits=5

OpenAI

To test generation using the OpenAI client, post a query that runs the openai search chain:

$ vespa query \
    --timeout 60 \
    --header="X-LLM-API-KEY:insert-api-key-here" \
    query="what was the manhattan project?" \
    hits=5 \
    searchChain=openai \
    format=sse \
    traceLevel=1

On Vespa Cloud, skip the --header parameter, as the API key is already set up in services.xml and will be retrieved from the Vespa secret store.

Here, we specifically set the search chain to openai. This calls the RAGSearcher which is set up to use the OpenAI client. Note that this requires an OpenAI API key. We also add a timeout as token generation can take some time.
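The same request can also be sent without the Vespa CLI. The sketch below assumes a local container that serves the query API at http://localhost:8080/search/ and passes the same parameters as the command above. `rag_query`, `VESPA_URL`, `LLM_API_KEY` and `CURL` are hypothetical names introduced for this example. A Vespa Cloud endpoint would also need the data plane certificate, which is not covered here.

```shell
# Sketch: send a RAG query to a local Vespa container over HTTP with curl.
# rag_query, VESPA_URL, LLM_API_KEY and CURL are hypothetical names.
# Usage: rag_query <search_chain> <question> [hits]
rag_query() {
  chain="$1"
  question="$2"
  hits="${3:-5}"
  # Build the argument list step by step so the API key header stays optional.
  set -- --silent --no-buffer --max-time 120 --get \
    "${VESPA_URL:-http://localhost:8080}/search/" \
    --data-urlencode "query=${question}" \
    --data-urlencode "hits=${hits}" \
    --data-urlencode "searchChain=${chain}" \
    --data-urlencode "format=sse" \
    --data-urlencode "traceLevel=1"
  if [ -n "${LLM_API_KEY:-}" ]; then
    set -- "$@" --header "X-LLM-API-KEY: ${LLM_API_KEY}"
  fi
  # Setting CURL=echo prints the arguments instead of sending the request.
  "${CURL:-curl}" "$@"
}
```

With `LLM_API_KEY` exported, `rag_query openai "what was the manhattan project?"` would run the openai search chain. Running `CURL=echo rag_query openai "what was the manhattan project?"` first shows what would be sent.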

Local

To test generation using the local LLM, post a query that runs the local search chain:

$ vespa query \
    --timeout 120 \
    query="what was the manhattan project?" \
    hits=5 \
    searchChain=local \
    format=sse \
    traceLevel=1

Note that if you are submitting this query to a local Docker deployment, it can take some time before the tokens start appearing. This is because the prompt evaluation can take a significant amount of time, particularly on CPUs without a lot of cores. To alleviate this a bit, you can reduce the number of hits retrieved by Vespa to, for instance, 3.

Prompt evaluation and token generation are much more efficient on the GPU.

Query parameters

The parameters here are:

  • query: the query text, used both for retrieval and as the question in the prompt.
  • hits: the number of hits that Vespa should return in the retrieval stage.
  • searchChain: the search chain set up in services.xml that calls the generative process.
  • format: sets the format to server-sent events (sse), which streams the tokens as they are generated.
  • traceLevel: outputs some debug information, such as the actual prompt that was sent to the LLM and token timing.
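With format=sse the answer arrives as a stream of events and not as a single document. As a minimal sketch of post-processing such a stream, the filter below joins the generated tokens into one string. It assumes each token arrives on a line of the form `data: {"token":"..."}`, which is an assumption about the event shape, so check it against the output of your own deployment. `sse_tokens` is a hypothetical helper name.

```shell
# Sketch: join the tokens of a server-sent event stream into one string.
# Assumes token lines look like:  data: {"token":" Project"}
# sse_tokens is a hypothetical helper. JSON escapes such as \n or \" inside
# a token are passed through unchanged, not decoded.
sse_tokens() {
  # Keep only the token payload of matching data lines, then drop the newlines
  # so that the tokens form continuous text.
  sed -n 's/^data: {"token":"\(.*\)"}[[:space:]]*$/\1/p' | tr -d '\n'
}
```

Pipe the streamed output of a query into `sse_tokens` to collect the final answer. Because sed may buffer its output when used in a pipe, this suits collecting the complete answer better than displaying tokens live. For robust JSON handling, a real JSON parser is the safer choice.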

For more information on how to customize the prompt, please refer to the RAG in Vespa documentation.

Shut down and remove the RAG application

For the local deployment, shut down and remove the container:

$ docker rm -f vespa-rag

To remove the application from Vespa Cloud:

$ vespa destroy