Update MongoDB tutorials round 2 (#76)

fw-ai · Mar 1, 2024 · 005b51a
1 parent 5f491e3
commit 005b51a
Showing 2 changed files with 491 additions and 80 deletions.
diff --git a/examples/rag/mongo_basic.ipynb b/examples/rag/mongo_basic.ipynb
@@ -1,13 +1,22 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/fw-ai/cookbook/blob/main/examples/rag/mongo_basic.ipynb\">\n",
+    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
+    "</a>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# Movie recommender example with Fireworks + MongoDB + Nomic embedding model\n",
     "\n",
     "## Introduction\n",
-    "In this tutorial, we'll explore how to create an advanced movie recommendation system. We'll leverage the Fireworks API for embedding generation, MongoDB for data storage and retrieval, and the Nomic-AI embedding model for nuanced understanding of movie data."
+    "In this tutorial, we'll explore how to create a basic movie recommendation system. We'll leverage the Fireworks API for embedding generation, MongoDB for data storage and retrieval, and the Nomic-AI embedding model for nuanced understanding of movie data."
    ]
   },
   {
@@ -20,38 +29,21 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Requirement already satisfied: pymongo in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (4.6.1)\n",
-      "Requirement already satisfied: openai in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (1.9.0)\n",
-      "Requirement already satisfied: tqdm in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (4.66.1)\n",
-      "Requirement already satisfied: dnspython<3.0.0,>=1.16.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from pymongo) (2.5.0)\n",
-      "Requirement already satisfied: anyio<5,>=3.5.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (4.2.0)\n",
-      "Requirement already satisfied: distro<2,>=1.7.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (1.9.0)\n",
-      "Requirement already satisfied: httpx<1,>=0.23.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (0.25.2)\n",
-      "Requirement already satisfied: pydantic<3,>=1.9.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (2.5.3)\n",
-      "Requirement already satisfied: sniffio in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (1.3.0)\n",
-      "Requirement already satisfied: typing-extensions<5,>=4.7 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (4.9.0)\n",
-      "Requirement already satisfied: idna>=2.8 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai) (3.6)\n",
-      "Requirement already satisfied: exceptiongroup>=1.0.2 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai) (1.2.0)\n",
-      "Requirement already satisfied: certifi in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (2023.11.17)\n",
-      "Requirement already satisfied: httpcore==1.* in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (1.0.2)\n",
-      "Requirement already satisfied: h11<0.15,>=0.13 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.14.0)\n",
-      "Requirement already satisfied: annotated-types>=0.4.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (0.6.0)\n",
-      "Requirement already satisfied: pydantic-core==2.14.6 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (2.14.6)\n",
       "\n",
       "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
       "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
      ]
     }
    ],
    "source": [
-    "!pip install pymongo openai tqdm"
+    "!pip install -q pymongo fireworks-ai tqdm openai"
    ]
   },
   {
@@ -64,11 +56,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 16,
    "metadata": {},
    "outputs": [],
    "source": [
-    "import openai\n",
     "import pymongo\n",
     "\n",
     "mongo_url = input()\n",
@@ -77,10 +68,11 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 12,
    "metadata": {},
    "outputs": [],
    "source": [
+    "import openai\n",
     "fw_client = openai.OpenAI(\n",
     "  api_key=input(),\n",
     "  base_url=\"https://api.fireworks.ai/inference/v1\"\n",
@@ -91,35 +83,38 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Understanding the Nomic-ai 1.5 Model\n",
+    "## Indexing and retrieval for movies\n",
+    "We are going to build a system to index and retrieve movie recommendations. We will set up the most basic RAG example on top of MongoDB, which involves:\n",
+    "- a MongoDB Atlas database that indexes movies based on embeddings\n",
+    "- a system for document embedding generation. We'll use the Nomic-AI model to create embeddings from text data. The function `generate_embeddings` takes a list of texts and returns embeddings.\n",
+    "- a basic search engine that responds to a user query by embedding the query, fetching the corresponding movies, and then using an LLM to generate the recommendations.\n",
     "\n",
-    "The Nomic AI model, specifically the `nomic-ai/nomic-embed-text-v1.5` variant, is a great open source model that has support for Matryoshka Representation Learning, which means you can change your embedding dimensions from 768 all the way down to 64, and to the quality/data trade-off you need.\n",
+    "## Understanding the Nomic-ai 1.5 Model\n",
     "\n",
-    "## Embedding Generation Function\n",
-    "The core of our recommender system is embedding generation. We'll use the Nomic-AI model to create embeddings from text data. The function generate_embeddings takes a list of texts and returns dimensionality-reduced embeddings."
+    "The Nomic AI model, specifically the `nomic-ai/nomic-embed-text-v1.5` variant, is a great open source embedding model. It has other features such as dimensionality reduction, but it needs special prefixes to be used properly, which we will cover in the next section."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 17,
    "metadata": {},
    "outputs": [],
    "source": [
     "from typing import List\n",
     "\n",
     "def generate_embeddings(input_texts: List[str], model_api_string: str, prefix=\"\") -> List[List[float]]:\n",
-    "    \"\"\"Generate embeddings from Fireworks python library and reduce their size by averaging adjacent elements.\n",
+    "    \"\"\"Generate embeddings from Fireworks python library\n",
     "\n",
     "    Args:\n",
     "        input_texts: a list of string input texts.\n",
     "        model_api_string: str. An API string for a specific embedding model of your choice.\n",
+    "        prefix: str. A prefix to prepend to each input text before generating the embeddings, which is required for nomic 1.5. Please check out https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage for more information.\n",
     "\n",
     "    Returns:\n",
     "        reduced_embeddings_list: a list of reduced-size embeddings. Each element corresponds to each input text.\n",
     "    \"\"\"\n",
     "    if prefix:\n",
     "        input_texts = [prefix + text for text in input_texts] \n",
-    "        print(\"show updated input texts\", input_texts)\n",
     "    return [x.embedding for x in \n",
     "        fw_client.embeddings.create(\n",
     "        input=input_texts,\n",
@@ -137,7 +132,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 25,
    "metadata": {},
    "outputs": [
     {
@@ -149,34 +144,19 @@
     }
    ],
    "source": [
-    "embedding_model_string = 'nomic-ai/nomic-embed-text-v1.5' # model API string from Together.\n",
-    "vector_database_field_name = 'embedding_2k_movies_fw_nomic_1_5' # define your embedding field name.\n",
-    "NUM_DOC_LIMIT = 400 # the number of documents you will process and generate embeddings.\n",
+    "embedding_model_string = 'nomic-ai/nomic-embed-text-v1.5'\n",
+    "vector_database_field_name = 'embeddings' # define your embedding field name.\n",
+    "NUM_DOC_LIMIT = 2000 # the number of documents you will process and generate embeddings.\n",
     "\n",
     "sample_output = generate_embeddings([\"This is a test.\"], embedding_model_string)\n",
     "print(f\"Embedding size is: {str(len(sample_output[0]))}\")\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Document Processing : 0it [00:00, ?it/s]"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Document Processing : 400it [00:35, 11.24it/s]\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "from tqdm import tqdm\n",
     "from datetime import datetime\n",
@@ -193,7 +173,7 @@
     ").limit(NUM_DOC_LIMIT), desc=\"Document Processing \"):\n",
     "  extracted_str = \"\\n\".join([k + \": \" + str(doc[k]) for k in keys_to_extract if k in doc])\n",
     "  if vector_database_field_name not in doc:\n",
-    "    doc[vector_database_field_name] = generate_embeddings([extracted_str], embedding_model_string)[0]\n",
+    "    doc[vector_database_field_name] = generate_embeddings([extracted_str], embedding_model_string, \"search_document: \")[0]\n",
     "  collection.replace_one({'_id': doc['_id']}, doc)\n"
    ]
   },
@@ -207,16 +187,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 21,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "'\\n{\\n  \"fields\": [\\n    {\\n      \"type\": \"vector\",\\n      \"path\": \"embedding_2k_movies_fw_e5_mistral\",\\n      \"numDimensions\": 2048,\\n      \"similarity\": \"dotProduct\"\\n    }\\n  ]\\n}\\n\\n'"
+       "'\\n{\\n  \"fields\": [\\n    {\\n      \"type\": \"vector\",\\n      \"path\": \"embeddings\",\\n      \"numDimensions\": 768,\\n      \"similarity\": \"dotProduct\"\\n    }\\n  ]\\n}\\n\\n'"
       ]
      },
-     "execution_count": 6,
+     "execution_count": 21,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -227,7 +207,7 @@
     "  \"fields\": [\n",
     "    {\n",
     "      \"type\": \"vector\",\n",
-    "      \"path\": \"embedding_2k_movies_fw_nomic_1_5\",\n",
+    "      \"path\": \"embeddings\",\n",
     "      \"numDimensions\": 768,\n",
     "      \"similarity\": \"dotProduct\"\n",
     "    }\n",
@@ -247,33 +227,33 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 34,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "show updated input texts ['Instruct: Given a user query for movies, retrieve the relevant movie that can fulfill the query.\\nQuery: I love superhero movies, any recommendations?']\n",
-      "From your query \"I love superhero movies, any recommendations?\", the following movie listings were found:\n",
+      "show updated input texts ['search_query: I like Christmas movies, any recommendations?']\n",
+      "From your query \"I like Christmas movies, any recommendations?\", the following movie listings were found:\n",
       "\n",
-      "1. Spider-Man\n",
-      "2. Lara Croft: Tomb Raider\n",
-      "3. Monkeybone\n",
-      "4. Fantastic Four\n",
-      "5. Titan A.E.\n",
-      "6. Sinbad: Legend of the Seven Seas\n",
-      "7. Charlie's Angels\n",
-      "8. X-Men\n",
-      "9. Alaska: Spirit of the Wild\n",
-      "10. Planet of the Apes\n"
+      "1. Surviving Christmas\n",
+      "2. Christmas Carol: The Movie\n",
+      "3. How the Grinch Stole Christmas\n",
+      "4. 'Twas the Night\n",
+      "5. Love Actually\n",
+      "6. Dead End\n",
+      "7. Bad Santa\n",
+      "8. 'R Xmas\n",
+      "9. Casper's Haunted Christmas\n",
+      "10. The Ultimate Christmas Present\n"
      ]
     }
    ],
    "source": [
     "# Example query.\n",
-    "query = \"I love superhero movies, any recommendations?\"\n",
-    "prefix=\"Instruct: Given a user query for movies, retrieve the relevant movie that can fulfill the query.\\nQuery: \"\n",
+    "query = \"I like Christmas movies, any recommendations?\"\n",
+    "prefix=\"search_query: \"\n",
     "query_emb = generate_embeddings([query], embedding_model_string, prefix=prefix)[0]\n",
     "\n",
     "results = collection.aggregate([\n",
@@ -283,7 +263,7 @@
     "      \"path\": vector_database_field_name,\n",
     "      \"numCandidates\": 100, # this should be 10-20x the limit\n",
     "      \"limit\": 10, # the number of documents to return in the results\n",
-    "      \"index\": 'movie_index', # the index name you used in Step 4, here we default to basics\n",
+    "      \"index\": 'movie', # the index name you used in the earlier step\n",
     "    }\n",
     "  }\n",
     "])\n",
@@ -304,14 +284,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 23,
+   "execution_count": 37,
    "metadata": {},
    "outputs": [],
    "source": [
     "your_task_prompt = (\n",
-    "    \"From the given movie listing data, choose a great movie recommendation for superhero movies. \"\n",
-    "    \"I don't like spider man though. \"\n",
-    "    \"Tell me the name of the movie and why it works for me.\"\n",
+    "    \"From the given movie listing data, choose a few great movie recommendations given the user query. \"\n",
+    "    f\"User query: {query}\"\n",
     ")\n",
     "\n",
     "listing_data = \"\"\n",
@@ -332,14 +311,19 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": 38,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Based on your preference to exclude Spider-Man movies, I would recommend \"X-Men.\" This movie is a great choice for superhero fans as it features a team of mutants with unique abilities who fight to protect humanity from a dangerous terrorist organization. The film features impressive special effects, engaging action sequences, and well-developed characters, making it an exciting and entertaining viewing experience. Additionally, the themes of acceptance and prejudice add depth to the story, making it a great pick for those who enjoy thought-provoking superhero movies.\n"
+      "Based on the user's query, I would recommend the following Christmas movies from the provided data:\n",
+      "\n",
+      "1. \"Love Actually\" - A romantic comedy that takes place in the five weeks preceding Christmas, following the lives of eight couples in dealing with their love lives in various interrelated tales all set in London, England.\n",
+      "2. \"How the Grinch Stole Christmas\" - A live-action adaptation of Dr. Seuss's classic holiday tale about a green, revenge-seeking Grinch who decides to ruin Christmas for the cheery residents of Whoville.\n",
+      "3. \"Surviving Christmas\" - A comedy about a wealthy Chicago advertisement executive who, after being left by his girlfriend right before Christmas, hires a family to spend the holiday with him in his childhood home.\n",
+      "4. \"Christmas Carol: The Movie\" - An animated retelling of Charles Dickens' classic story, where Ebenezer Scrooge learns\n"
      ]
     }
    ],
@@ -357,8 +341,9 @@
    "metadata": {},
    "source": [
     "## Conclusion\n",
-    "And that's it! You've successfully built a movie recommendation system using Fireworks, MongoDB, and the Mistral E5 embedding model. This system can be further customized and scaled to suit various needs.\n",
-    "\n"
+    "And that's it! You've successfully built a movie recommendation system using Fireworks, MongoDB, and the nomic-ai embedding model. This system can be further customized and scaled to suit various needs. There are still a few things missing from this guide:\n",
+    "- We used the default embedding dimension of 768 in this example. When the cost of storing embeddings is high, you might want to reduce it. We walk through another example that uses MongoDB with Matryoshka embeddings to reduce the embedding size in [this guide](examples/rag/mongo_reduced_embeddings.ipynb).\n",
+    "- We only embed up to 2000 movies (the `NUM_DOC_LIMIT` set earlier) in this example, which is not a lot. We wanted to keep this tutorial simple, so instead of batching the embedding requests we use a plain for loop that embeds the documents one at a time. This method does not scale. We cover basic batching in the [following guide](examples/rag/mongo_reduced_embeddings.ipynb). Many frameworks also offer batching out of the box; for example, see our guide for [LlamaIndex](https://github.com/run-llama/llama_index/blob/cf0da01e0cc756383e07eb499cb9825cfa17984d/docs/examples/vector_stores/MongoDBAtlasVectorSearchRAGFireworks.ipynb)."
    ]
   }
  ],