From 075e2946bbf9c38b6e20a4a2c73adba22e7016a1 Mon Sep 17 00:00:00 2001
From: "Yufei (Benny) Chen" <1585539+benjibc@users.noreply.github.com>
Date: Mon, 4 Mar 2024 09:59:26 -0800
Subject: [PATCH] Koda Retriever turn key example (#79)

---
 examples/rag/mongodb_koda_retriever.ipynb | 405 ++++++++++++++++++++++
 1 file changed, 405 insertions(+)
 create mode 100644 examples/rag/mongodb_koda_retriever.ipynb

diff --git a/examples/rag/mongodb_koda_retriever.ipynb b/examples/rag/mongodb_koda_retriever.ipynb
new file mode 100644
index 0000000..42d82bd
--- /dev/null
+++ b/examples/rag/mongodb_koda_retriever.ipynb
@@ -0,0 +1,405 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Koda Retriever + MongoDB + Fireworks: package solution for best RAG data quality\n",
+    "\n",
+    "As people are building more advanced abstractions on top of RAG, we can compose that together with MongoDB and Fireworks to provide a turn key solution that provides the best quality out of the box. Koda retriever from LlamaIndex is a good example, it implements hybrid retrieval out of the box, and would only need users to some config tunings on the weight of different search methods to be able to get the best results. In this case, we will make use of\n",
+    "- Fireworks embeddings, reranker and LLMs to drive the ranking end to end\n",
+    "- MongoDB Atlas for embedding features\n",
+    "\n",
+    "For more information on how Koda Retriever pack for LlamaIndex works, please check out [LlamaIndex's page for Koda Retriever](https://github.com/run-llama/llama_index/tree/7ce7058d0f781e7ebd8f73d40e8888471f867af0/llama-index-packs/llama-index-packs-koda-retriever)\n",
+    "\n",
+    "We will start from the beginning, perform the data import, embed the data, and then use Koda Retriever on top of the imported data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Basic installations\n",
+    "We will install all the relevant dependencies for Fireworks, MongoDB, Koda Retriever"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 87,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q llama-index llama-index-llms-fireworks llama-index-embeddings-fireworks pymongo\n",
+    "!pip install -q llama-index-packs-koda-retriever llama-index-vector-stores-mongodb datasets"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_index.llms.fireworks import Fireworks\n",
+    "from llama_index.core import VectorStoreIndex\n",
+    "from llama_index.embeddings.fireworks import FireworksEmbedding\n",
+    "from llama_index.core.postprocessor import LLMRerank \n",
+    "from llama_index.core import VectorStoreIndex\n",
+    "from llama_index.core import Settings\n",
+    "from llama_index.core.query_engine import RetrieverQueryEngine\n",
+    "from llama_index.packs.koda_retriever import KodaRetriever\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pymongo\n",
+    "\n",
+    "mongo_url = input()\n",
+    "mongo_client = pymongo.MongoClient(mongo_url)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "DB_NAME = \"movies\"\n",
+    "COLLECTION_NAME = \"movies_records\"\n",
+    "\n",
+    "db = mongo_client[DB_NAME]\n",
+    "collection = db[COLLECTION_NAME]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "we are going to delete this collection to clean up all the previous documents inserted."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 61,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "DeleteResult({'n': 18067, 'electionId': ObjectId('7fffffff00000000000001cf'), 'opTime': {'ts': Timestamp(1709534085, 5295), 't': 463}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1709534085, 5295), 'signature': {'hash': b'\\xac;\\x86\\x9b\\xbe\\xb4\\xa7\\xed\\xe6\\x03\\xc0jY\\xb8p\\xef\\x9a\\x05\\xe7\\xce', 'keyId': 7294687148333072386}}, 'operationTime': Timestamp(1709534085, 5295)}, acknowledged=True)"
+      ]
+     },
+     "execution_count": 61,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "collection.delete_many({})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# set up Fireworks.ai Key\n",
+    "import os\n",
+    "import getpass\n",
+    "\n",
+    "fw_api_key = getpass.getpass(\"Fireworks API Key:\")\n",
+    "os.environ[\"FIREWORKS_API_KEY\"] = fw_api_key"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch\n",
+    "\n",
+    "Settings.llm = Fireworks()\n",
+    "Settings.embed_model = FireworksEmbedding()\n",
+    "\n",
+    "\n",
+    "vector_store = MongoDBAtlasVectorSearch(    mongo_client,\n",
+    "    db_name=DB_NAME,\n",
+    "    collection_name=COLLECTION_NAME,\n",
+    "    index_name=\"vector_index\",\n",
+    ")\n",
+    "vector_index = VectorStoreIndex.from_vector_store(\n",
+    "    vector_store=vector_store, embed_model=Settings.embed_model\n",
+    ")\n",
+    "\n",
+    "reranker = LLMRerank(llm=Settings.llm)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## data import\n",
+    "We are going to import from MongoDB's huggingface dataset with 25k restaurants"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset\n",
+    "import pandas as pd\n",
+    "\n",
+    "# https://huggingface.co/datasets/AIatMongoDB/whatscooking.restaurants\n",
+    "dataset = load_dataset(\"AIatMongoDB/whatscooking.restaurants\")\n",
+    "\n",
+    "# Convert the dataset to a pandas dataframe\n",
+    "\n",
+    "dataset_df = pd.DataFrame(dataset[\"train\"])\n",
+    "\n",
+    "import json\n",
+    "from llama_index.core import Document\n",
+    "from llama_index.core.schema import MetadataMode\n",
+    "\n",
+    "# Convert the DataFrame to a JSON string representation\n",
+    "documents_json = dataset_df.to_json(orient=\"records\")\n",
+    "# Load the JSON string into a Python list of dictionaries\n",
+    "documents_list = json.loads(documents_json)\n",
+    "\n",
+    "llama_documents = []\n",
+    "\n",
+    "for document in documents_list:\n",
+    "    # Value for metadata must be one of (str, int, float, None)\n",
+    "    document[\"name\"] = json.dumps(document[\"name\"])\n",
+    "    document[\"cuisine\"] = json.dumps(document[\"cuisine\"])\n",
+    "    document[\"attributes\"] = json.dumps(document[\"attributes\"])\n",
+    "    document[\"menu\"] = json.dumps(document[\"menu\"])\n",
+    "    document[\"borough\"] = json.dumps(document[\"borough\"])\n",
+    "    document[\"address\"] = json.dumps(document[\"address\"])\n",
+    "    document[\"PriceRange\"] = json.dumps(document[\"PriceRange\"])\n",
+    "    document[\"HappyHour\"] = json.dumps(document[\"HappyHour\"])\n",
+    "    document[\"review_count\"] = json.dumps(document[\"review_count\"])\n",
+    "    del document[\"embedding\"]\n",
+    "    del document[\"location\"]\n",
+    "\n",
+    "    # Create a Document object with the text and excluded metadata for llm and embedding models\n",
+    "    llama_document = Document(\n",
+    "        text=json.dumps(document),\n",
+    "        metadata=document,\n",
+    "        metadata_template=\"{key}=>{value}\",\n",
+    "        text_template=\"Metadata: {metadata_str}\\n-----\\nContent: {content}\",\n",
+    "    )\n",
+    "\n",
+    "    llama_documents.append(llama_document)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "embed_model = FireworksEmbedding(\n",
+    "    embed_batch_size=512,\n",
+    "    model_name=\"nomic-ai/nomic-embed-text-v1.5\",\n",
+    "    api_key=fw_api_key,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "The Embedding model sees this: \n",
+      " Metadata: _id=>{'$oid': '6095a34a7c34416a90d3206b'}\n",
+      "DogsAllowed=>None\n",
+      "TakeOut=>True\n",
+      "sponsored=>None\n",
+      "review_count=>10\n",
+      "OutdoorSeating=>True\n",
+      "HappyHour=>null\n",
+      "cuisine=>\"Tex-Mex\"\n",
+      "PriceRange=>1.0\n",
+      "address=>{\"building\": \"627\", \"coord\": [-73.975981, 40.745132], \"street\": \"2 Avenue\", \"zipcode\": \"10016\"}\n",
+      "restaurant_id=>40366661\n",
+      "menu=>null\n",
+      "attributes=>{\"Alcohol\": \"'none'\", \"Ambience\": \"{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}\", \"BYOB\": null, \"BestNights\": null, \"BikeParking\": null, \"BusinessAcceptsBitcoin\": null, \"BusinessAcceptsCreditCards\": null, \"BusinessParking\": \"None\", \"Caters\": \"True\", \"DriveThru\": null, \"GoodForDancing\": null, \"GoodForKids\": \"True\", \"GoodForMeal\": null, \"HasTV\": \"True\", \"Music\": null, \"NoiseLevel\": \"'average'\", \"RestaurantsAttire\": \"'casual'\", \"RestaurantsDelivery\": \"True\", \"RestaurantsGoodForGroups\": \"True\", \"RestaurantsReservations\": \"True\", \"RestaurantsTableService\": \"False\", \"WheelchairAccessible\": \"True\", \"WiFi\": \"'free'\"}\n",
+      "name=>\"Baby Bo'S Burritos\"\n",
+      "borough=>\"Manhattan\"\n",
+      "stars=>2.5\n",
+      "-----\n",
+      "Content: {\"_id\": {\"$oid\": \"6095a34a7c34416a90d3206b\"}, \"DogsAllowed\": null, \"TakeOut\": true, \"sponsored\": null, \"review_count\": \"10\", \"OutdoorSeating\": true, \"HappyHour\": \"null\", \"cuisine\": \"\\\"Tex-Mex\\\"\", \"PriceRange\": \"1.0\", \"address\": \"{\\\"building\\\": \\\"627\\\", \\\"coord\\\": [-73.975981, 40.745132], \\\"street\\\": \\\"2 Avenue\\\", \\\"zipcode\\\": \\\"10016\\\"}\", \"restaurant_id\": \"40366661\", \"menu\": \"null\", \"attributes\": \"{\\\"Alcohol\\\": \\\"'none'\\\", \\\"Ambience\\\": \\\"{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}\\\", \\\"BYOB\\\": null, \\\"BestNights\\\": null, \\\"BikeParking\\\": null, \\\"BusinessAcceptsBitcoin\\\": null, \\\"BusinessAcceptsCreditCards\\\": null, \\\"BusinessParking\\\": \\\"None\\\", \\\"Caters\\\": \\\"True\\\", \\\"DriveThru\\\": null, \\\"GoodForDancing\\\": null, \\\"GoodForKids\\\": \\\"True\\\", \\\"GoodForMeal\\\": null, \\\"HasTV\\\": \\\"True\\\", \\\"Music\\\": null, \\\"NoiseLevel\\\": \\\"'average'\\\", \\\"RestaurantsAttire\\\": \\\"'casual'\\\", \\\"RestaurantsDelivery\\\": \\\"True\\\", \\\"RestaurantsGoodForGroups\\\": \\\"True\\\", \\\"RestaurantsReservations\\\": \\\"True\\\", \\\"RestaurantsTableService\\\": \\\"False\\\", \\\"WheelchairAccessible\\\": \\\"True\\\", \\\"WiFi\\\": \\\"'free'\\\"}\", \"name\": \"\\\"Baby Bo'S Burritos\\\"\", \"borough\": \"\\\"Manhattan\\\"\", \"stars\": 2.5}\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\n",
+    "    \"\\nThe Embedding model sees this: \\n\",\n",
+    "    llama_documents[0].get_content(metadata_mode=MetadataMode.EMBED),\n",
+    ")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We are going to use SentenceSplitter to split the documents. We will also reduce the size to 2.5k documents to make sure this example fits into a free MongoDB Atlas instance"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_index.core.node_parser import SentenceSplitter\n",
+    "parser = SentenceSplitter(chunk_size=4096)\n",
+    "nodes = parser.get_nodes_from_documents(llama_documents[:2500])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 56,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# embed all the nodes\n",
+    "node_embeddings = embed_model(nodes)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for idx, n in enumerate(nodes):\n",
+    "  n.embedding = node_embeddings[idx].embedding\n",
+    "  if \"_id\" in n.metadata:\n",
+    "    del n.metadata[\"_id\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vector_store.add(nodes)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And add the following index in MongoDB Atlas. We will name it `vector_index`\n",
+    "\n",
+    "```json\n",
+    "{\n",
+    "  \"fields\": [\n",
+    "    {\n",
+    "      \"type\": \"vector\",\n",
+    "      \"path\": \"embedding\",\n",
+    "      \"numDimensions\": 768,\n",
+    "      \"similarity\": \"dotProduct\"\n",
+    "    }\n",
+    "  ]\n",
+    "}\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Using Koda Retriever\n",
+    "\n",
+    "With all that preparation work, now we are finally going to show case Koda Retriever!\n",
+    "\n",
+    "Koda Retriever has most of the settings prepared out of the box for you, so we can plug the index, llm, reranker and make the example run out of the box. For more advanced settings on how to run KodaRetriever, please check out the guide [here](https://github.com/run-llama/llama_index/tree/7ce7058d0f781e7ebd8f73d40e8888471f867af0/llama-index-packs/llama-index-packs-koda-retriever)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "retriever = KodaRetriever(\n",
+    "    index=vector_index,\n",
+    "    llm=Settings.llm,\n",
+    "    reranker=reranker,\n",
+    "    verbose=True,\n",
+    ")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And now we will query the retriever to find some bakery recommendations in Manhattan"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 89,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Based on the context provided, I would recommend two bakeries in Manhattan. The first one is Lung Moon Bakery, which offers a variety of baked goods such as Stuffed Croissants, Gourmet Doughnuts, Brownies, and Cookie sandwiches with icing in the middle. They have outdoor seating and take-out options available. The second recommendation is Zaro's Bread Basket, which has a selection including Stuffed Croissants, Pecan tart, Chocolate strawberries, and Lemon cupcakes. Zaro's Bread Basket also offers delivery and has a bike parking facility. Both bakeries are rated quite well by their customers.\n"
+     ]
+    }
+   ],
+   "source": [
+    "query = \"search_query: Any recommendations for bakeries in Manhattan?\"\n",
+    "query_engine = RetrieverQueryEngine.from_args(retriever=retriever)\n",
+    "response = query_engine.query(query)\n",
+    "print(response)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "That is it! A solution with that combines query engine, embedding, llms and reranker all in just a few lines of LlamaIndex code. Fireworks is here to support you for your complete RAG journey, and please let us know if there are any other high quality RAG setups like Koda Retriever that you want us to support in our Discord channel."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}