diff --git a/examples/rag/assets/New Project.png b/examples/rag/assets/New Project.png new file mode 100644 index 0000000..98f8fba Binary files /dev/null and b/examples/rag/assets/New Project.png differ diff --git a/examples/rag/assets/create_secret.png b/examples/rag/assets/create_secret.png new file mode 100644 index 0000000..a8c0c3f Binary files /dev/null and b/examples/rag/assets/create_secret.png differ diff --git a/examples/rag/assets/create_trigger.png b/examples/rag/assets/create_trigger.png new file mode 100644 index 0000000..f0be315 Binary files /dev/null and b/examples/rag/assets/create_trigger.png differ diff --git a/examples/rag/assets/create_value.png b/examples/rag/assets/create_value.png new file mode 100644 index 0000000..6e41638 Binary files /dev/null and b/examples/rag/assets/create_value.png differ diff --git a/examples/rag/mongodb_triggers.ipynb b/examples/rag/mongodb_triggers.ipynb new file mode 100644 index 0000000..9569c79 --- /dev/null +++ b/examples/rag/mongodb_triggers.ipynb @@ -0,0 +1,356 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# RAG with Fireworks and MongoDB Triggers\n", + "\n", + "When you are working with a real-time application, you want to embed documents as soon as they are inserted. Fireworks can help with that when used together with MongoDB triggers." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Trigger Setup\n", + "### Adding Fireworks Key to Values\n", + "API keys are an important part of the overall setup, and we want to store them securely. We will obtain a Fireworks API key and then store it as a Value in MongoDB Atlas.\n", + "\n", + "First, register or log in at [Fireworks](https://fireworks.ai/), then get a free API key to start with.\n", + "\n", + "![New Project.png](./assets/New%20Project.png)\n", + "\n", + "Store this API key in Atlas as a secret.
Note that a trigger cannot access a secret directly, so we will first create a secret named `fireworksApiKey`\n", + "![image-2.png](./assets/create_secret.png)\n", + "\n", + "Then we create a value that is linked to the secret, and we will call this value `FIREWORKS_API_KEY`\n", + "![image.png](./assets/create_value.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating the trigger to fetch embeddings\n", + "Now we will create the trigger for fetching the embeddings. To do that, we will use the sample restaurant dataset from Atlas. For details on how to load the dataset, please check out [this guide](https://www.mongodb.com/docs/atlas/sample-data/#std-label-available-sample-datasets). For now we will assume the Sample Restaurants Dataset is loaded and that each restaurant document looks like this:\n", + "```json\n", + "{\n", + " \"_id\": {\n", + " \"$oid\": \"5eb3d668b31de5d588f4292a\"\n", + " },\n", + " \"address\": {}, \n", + " \"borough\": \"Brooklyn\",\n", + " \"cuisine\": \"American\",\n", + " \"grades\": [],\n", + " \"name\": \"Riviera Caterer\",\n", + " \"restaurant_id\": \"40356018\"\n", + "}\n", + "```\n", + "\n", + "We will create a new trigger named `Restaurant-Trigger`. We will point that trigger to the MongoDB database `sample_restaurants` and collection `restaurants`.
Make sure `Full Document` is enabled, since we need the whole document to be able to generate a good embedding for it.\n", + "\n", + "![image.png](./assets/create_trigger.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The trigger code\n", + "\n", + "To build the trigger code, we need a basic function to:\n", + "- request the embedding from Fireworks\n", + "- store the embedding on the document" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```typescript\n", + "exports = async function(changeEvent) {\n", + " const url = 'https://api.fireworks.ai/inference/v1/embeddings';\n", + "\n", + " // Fetch the Fireworks API key stored in the context values.\n", + " const fireworks_api_key = context.values.get(\"FIREWORKS_API_KEY\");\n", + "\n", + " // Access the _id and the full content of the changed document.\n", + " const docId = changeEvent.documentKey._id;\n", + " const doc = changeEvent.fullDocument;\n", + " \n", + " // Skip processing if the document already has an embedding\n", + " if ('embedding' in doc) {\n", + " console.log(\"Document already has an embedding, skipping processing.\");\n", + " return;\n", + " }\n", + "\n", + "\n", + "\n", + " // Prepare the request string for the nomic-ai model served by Fireworks.\n", + " // When generating a document embedding, make sure we prefix with search_document.\n", + " // For more information on how to query the nomic-ai model, please check\n", + " // https://huggingface.co/nomic-ai/nomic-embed-text-v1.5\n", + " // Documents that already have an embedding were skipped above.\n", + " const reqString = `search_document: ` + JSON.stringify(doc);\n", + " console.log(`reqString: ${reqString}`);\n", + "\n", + " // Call the Fireworks API to get the response.\n", + "\n", + " let resp = await context.http.post({\n", + " url: url,\n", + " headers: {\n", + " 'Authorization': [`Bearer ${fireworks_api_key}`],\n", + " 'Content-Type': ['application/json']\n", + " },\n", + " body: JSON.stringify({\n", + " model: 
\"nomic-ai/nomic-embed-text-v1.5\",\n", + " input: reqString,\n", + " })\n", + " });\n", + "\n", + " // Parse the JSON response\n", + " let responseData = JSON.parse(resp.body.text());\n", + "\n", + " // Check the response status.\n", + " if(resp.statusCode === 200) {\n", + " console.log(\"Successfully received embedding.\");\n", + " console.log(JSON.stringify(responseData));\n", + " console.log(docId);\n", + " const collection = context.services.get(\"test-rag-cluster\").db(\"sample_restaurants\").collection(\"restaurants\");\n", + " await collection.updateOne({ _id: docId }, { $set: { embedding: responseData.data[0].embedding } });\n", + " return true;\n", + "\n", + " } else {\n", + " console.log(\"Request failed with status code:\");\n", + " console.log(resp.statusCode);\n", + " return false;\n", + " }\n", + "}" + ] + }, + { + "attachments": { + "image.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a new function with the definition above and hit save.\n", + "Now we can make an arbitrary update to a document and wait for the trigger to execute.
We can see that the embedding field is now filled **automatically**!\n", + "\n", + "![image.png](attachment:image.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Reload the dataset and set up the index\n", + "\n", + "Now let's refresh the dataset and then query it" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" + ] + } + ], + "source": [ + "!pip install -q pymongo fireworks-ai tqdm openai" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import pymongo\n", + "\n", + "mongo_url = input()\n", + "client = pymongo.MongoClient(mongo_url)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we are going to trigger some updates so that the embeddings get generated" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Completed updates.\n" + ] + } + ], + "source": [ + "# Specify the database and collection\n", + "db = client.sample_restaurants\n", + "collection = db.restaurants\n", + "\n", + "# Iterate over the first 100 documents\n", + "for doc in collection.find().limit(100):\n", + " # Perform a minor update (e.g., adding a temporary field and then removing it)\n", + " collection.update_one({'_id': doc['_id']}, {'$set': {'temp_field': 1}})\n", + " collection.update_one({'_id': doc['_id']}, {'$unset': {'temp_field': \"\"}})\n", + "\n", + 
"print(\"Completed updates.\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You should see in the logs that the trigger was called many times, and that every document that has been touched now has an embedding associated with it." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Querying the embedded documents\n", + "Now let's reuse the examples from the previous cookbooks to find the best burger joint in Brooklyn. Make sure the vector index is set up as follows. We will name it `restaurant_index`\n", + "\n", + "```json\n", + "{\n", + " \"fields\": [\n", + " {\n", + " \"type\": \"vector\",\n", + " \"path\": \"embedding\",\n", + " \"numDimensions\": 768,\n", + " \"similarity\": \"dotProduct\"\n", + " }\n", + " ]\n", + "}\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "From your query \"What is the best burger joint in Brooklyn?\", the following restaurants were found:\n", + "\n", + "1. White Castle\n", + "2. Sonny'S Heros\n", + "3. Wendy'S\n", + "4. Shashemene Int'L Restaura\n", + "5. Taste The Tropics Ice Cream\n", + "6. Mejlander & Mulgannon\n", + "7. Shell Lanes\n", + "8. Wilken'S Fine Food\n", + "9. Carvel Ice Cream\n", + "10. Seuda Foods\n" + ] + } + ], + "source": [ + "from typing import List\n", + "embedding_model_string = \"nomic-ai/nomic-embed-text-v1.5\"\n", + "vector_database_field_name = \"embedding\"\n", + "\n", + "def generate_embeddings(input_texts: List[str], model_api_string: str, prefix=\"\") -> List[float]:\n", + " \"\"\"Generate an embedding through the Fireworks OpenAI-compatible API\n", + "\n", + " Args:\n", + " input_texts: a list of string input texts.\n", + " model_api_string: str. An API string for a specific embedding model of your choice.\n", + " prefix: the prefix to prepend to each input text before embedding, which is required for nomic 1.5.
Please check out https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage for more information\n", + "\n", + " Returns:\n", + " The embedding of the first input text, as a list of floats. Embeddings of any further inputs are discarded.\n", + " \"\"\"\n", + " if prefix:\n", + " input_texts = [prefix + text for text in input_texts]\n", + " return fw_client.embeddings.create(\n", + " input=input_texts,\n", + " model=model_api_string,\n", + " ).data[0].embedding\n", + "\n", + "import openai\n", + "fw_client = openai.OpenAI(\n", + " api_key=input(),\n", + " base_url=\"https://api.fireworks.ai/inference/v1\"\n", + ")\n", + "\n", + "# Example query.\n", + "query = \"What is the best burger joint in Brooklyn?\"\n", + "prefix = \"search_query: \"\n", + "query_emb = generate_embeddings([query], embedding_model_string, prefix=prefix)\n", + "\n", + "results = collection.aggregate([\n", + " {\n", + " \"$vectorSearch\": {\n", + " \"queryVector\": query_emb,\n", + " \"path\": vector_database_field_name,\n", + " \"numCandidates\": 100, # this should be 10-20x the limit\n", + " \"limit\": 10, # the number of documents to return in the results\n", + " \"index\": 'restaurant_index', # the index name you used in the earlier step\n", + " }\n", + " }\n", + "])\n", + "results_as_dict = {doc['name']: doc for doc in results}\n", + "\n", + "print(f\"From your query \\\"{query}\\\", the following restaurants were found:\\n\")\n", + "print(\"\\n\".join([str(i+1) + \". \" + name for (i, name) in enumerate(results_as_dict.keys())]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looks like the model decided that White Castle is the best burger joint in Brooklyn, which I cannot argue with.\n", + "\n", + "## What's Next\n", + "- Triggers can be very sophisticated and can be used for further processing.
If you are interested, please check out the MongoDB blog [here](https://www.mongodb.com/developer/products/mongodb/atlas-open-ai-review-summary/)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}