From 005b51a432224bca1a435bcac1faf26232c0b9f4 Mon Sep 17 00:00:00 2001 From: "Yufei (Benny) Chen" <1585539+benjibc@users.noreply.github.com> Date: Thu, 29 Feb 2024 17:48:08 -0800 Subject: [PATCH] Update MongoDB tutorials round 2 (#76) --- examples/rag/mongo_basic.ipynb | 145 ++++--- examples/rag/mongo_resize_embeddings.ipynb | 426 +++++++++++++++++++++ 2 files changed, 491 insertions(+), 80 deletions(-) create mode 100644 examples/rag/mongo_resize_embeddings.ipynb diff --git a/examples/rag/mongo_basic.ipynb b/examples/rag/mongo_basic.ipynb index 0f49d0a..65e0526 100644 --- a/examples/rag/mongo_basic.ipynb +++ b/examples/rag/mongo_basic.ipynb @@ -1,5 +1,14 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + " \"Open\n", + "" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -7,7 +16,7 @@ "# Movie recommender example with Fireworks + MongoDB + Nomic embedding model\n", "\n", "## Introduction\n", - "In this tutorial, we'll explore how to create an advanced movie recommendation system. We'll leverage the Fireworks API for embedding generation, MongoDB for data storage and retrieval, and the Nomic-AI embedding model for nuanced understanding of movie data." + "In this tutorial, we'll explore how to create a basic movie recommendation system. We'll leverage the Fireworks API for embedding generation, MongoDB for data storage and retrieval, and the Nomic-AI embedding model for nuanced understanding of movie data." 
] }, { @@ -20,30 +29,13 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Requirement already satisfied: pymongo in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (4.6.1)\n", - "Requirement already satisfied: openai in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (1.9.0)\n", - "Requirement already satisfied: tqdm in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (4.66.1)\n", - "Requirement already satisfied: dnspython<3.0.0,>=1.16.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from pymongo) (2.5.0)\n", - "Requirement already satisfied: anyio<5,>=3.5.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (4.2.0)\n", - "Requirement already satisfied: distro<2,>=1.7.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (1.9.0)\n", - "Requirement already satisfied: httpx<1,>=0.23.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (0.25.2)\n", - "Requirement already satisfied: pydantic<3,>=1.9.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (2.5.3)\n", - "Requirement already satisfied: sniffio in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (1.3.0)\n", - "Requirement already satisfied: typing-extensions<5,>=4.7 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from openai) (4.9.0)\n", - "Requirement already satisfied: idna>=2.8 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai) (3.6)\n", - "Requirement already satisfied: exceptiongroup>=1.0.2 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai) (1.2.0)\n", - "Requirement already satisfied: certifi in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (2023.11.17)\n", - "Requirement already satisfied: httpcore==1.* in 
/home/bchen/cookbook/.venv/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (1.0.2)\n", - "Requirement already satisfied: h11<0.15,>=0.13 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.14.0)\n", - "Requirement already satisfied: annotated-types>=0.4.0 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (0.6.0)\n", - "Requirement already satisfied: pydantic-core==2.14.6 in /home/bchen/cookbook/.venv/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (2.14.6)\n", "\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" @@ -51,7 +43,7 @@ } ], "source": [ - "!pip install pymongo openai tqdm" + "!pip install -q pymongo fireworks-ai tqdm openai" ] }, { @@ -64,11 +56,10 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ - "import openai\n", "import pymongo\n", "\n", "mongo_url = input()\n", @@ -77,10 +68,11 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ + "import openai\n", "fw_client = openai.OpenAI(\n", " api_key=input(),\n", " base_url=\"https://api.fireworks.ai/inference/v1\"\n", @@ -91,35 +83,38 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Understanding the Nomic-ai 1.5 Model\n", + "## Indexing and retrieval for movies.\n", + "We are going to build a model to index and retrieve movie recommendations. 
We will set up the most basic RAG example on top of MongoDB, which involves:\n", + "- a MongoDB Atlas database that indexes movies based on embeddings\n", + "- a system for document embedding generation. We'll use the Nomic-AI model to create embeddings from text data. The function generate_embeddings takes a list of texts and returns embeddings.\n", + "- a basic search engine that responds to a user query by embedding the query, fetching the matching movies, and then using an LLM to generate the recommendations.\n", "\n", - "The Nomic AI model, specifically the `nomic-ai/nomic-embed-text-v1.5` variant, is a great open source model that has support for Matryoshka Representation Learning, which means you can change your embedding dimensions from 768 all the way down to 64, and to the quality/data trade-off you need.\n", + "## Understanding the Nomic-ai 1.5 Model\n", "\n", - "## Embedding Generation Function\n", - "The core of our recommender system is embedding generation. We'll use the Nomic-AI model to create embeddings from text data. The function generate_embeddings takes a list of texts and returns dimensionality-reduced embeddings." + "The Nomic AI model, specifically the `nomic-ai/nomic-embed-text-v1.5` variant, is a great open source embedding model. It has other features such as dimensionality reduction, but it needs special prefixes to be used properly, which we will get into in the next section." ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "from typing import List\n", "\n", "def generate_embeddings(input_texts: List[str], model_api_string: str, prefix=\"\") -> List[List[float]]:\n", - " \"\"\"Generate embeddings from Fireworks python library and reduce their size by averaging adjacent elements.\n", + " \"\"\"Generate embeddings through the Fireworks API.\n", "\n", " Args:\n", " input_texts: a list of string input texts.\n", " model_api_string: str. 
An API string for a specific embedding model of your choice.\n", + " prefix: a prefix to prepend to each input text before embedding, which is required for nomic 1.5. See https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage for more information.\n", "\n", " Returns:\n", " reduced_embeddings_list: a list of reduced-size embeddings. Each element corresponds to each input text.\n", " \"\"\"\n", " if prefix:\n", " input_texts = [prefix + text for text in input_texts] \n", - " print(\"show updated input texts\", input_texts)\n", " return [x.embedding for x in \n", " fw_client.embeddings.create(\n", " input=input_texts,\n", @@ -137,7 +132,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 25, "metadata": {}, "outputs": [ { @@ -149,9 +144,9 @@ } ], "source": [ - "embedding_model_string = 'nomic-ai/nomic-embed-text-v1.5' # model API string from Together.\n", - "vector_database_field_name = 'embedding_2k_movies_fw_nomic_1_5' # define your embedding field name.\n", - "NUM_DOC_LIMIT = 400 # the number of documents you will process and generate embeddings.\n", + "embedding_model_string = 'nomic-ai/nomic-embed-text-v1.5'\n", + "vector_database_field_name = 'embeddings' # define your embedding field name.\n", + "NUM_DOC_LIMIT = 2000 # the number of documents to process and embed.\n", "\n", "sample_output = generate_embeddings([\"This is a test.\"], embedding_model_string)\n", "print(f\"Embedding size is: {str(len(sample_output[0]))}\")\n" @@ -159,24 +154,9 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Document Processing : 0it [00:00, ?it/s]" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Document Processing : 400it [00:35, 11.24it/s]\n" - ] - } - ], + "outputs": [], "source": [ "from tqdm import tqdm\n", "from datetime import datetime\n", @@ -193,7 +173,7 @@ 
").limit(NUM_DOC_LIMIT), desc=\"Document Processing \"):\n", " extracted_str = \"\\n\".join([k + \": \" + str(doc[k]) for k in keys_to_extract if k in doc])\n", " if vector_database_field_name not in doc:\n", - " doc[vector_database_field_name] = generate_embeddings([extracted_str], embedding_model_string)[0]\n", + " doc[vector_database_field_name] = generate_embeddings([extracted_str], embedding_model_string, \"search_document: \")[0]\n", " collection.replace_one({'_id': doc['_id']}, doc)\n" ] }, @@ -207,16 +187,16 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "'\\n{\\n \"fields\": [\\n {\\n \"type\": \"vector\",\\n \"path\": \"embedding_2k_movies_fw_e5_mistral\",\\n \"numDimensions\": 2048,\\n \"similarity\": \"dotProduct\"\\n }\\n ]\\n}\\n\\n'" + "'\\n{\\n \"fields\": [\\n {\\n \"type\": \"vector\",\\n \"path\": \"embeddings\",\\n \"numDimensions\": 768,\\n \"similarity\": \"dotProduct\"\\n }\\n ]\\n}\\n\\n'" ] }, - "execution_count": 6, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } @@ -227,7 +207,7 @@ " \"fields\": [\n", " {\n", " \"type\": \"vector\",\n", - " \"path\": \"embedding_2k_movies_fw_nomic_1_5\",\n", + " \"path\": \"embeddings\",\n", " \"numDimensions\": 768,\n", " \"similarity\": \"dotProduct\"\n", " }\n", @@ -247,33 +227,33 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "show updated input texts ['Instruct: Given a user query for movies, retrieve the relevant movie that can fulfill the query.\\nQuery: I love superhero movies, any recommendations?']\n", - "From your query \"I love superhero movies, any recommendations?\", the following movie listings were found:\n", + "show updated input texts ['search_query: I like Christmas movies, any recommendations?']\n", + "From your query \"I like Christmas movies, any 
recommendations?\", the following movie listings were found:\n", "\n", - "1. Spider-Man\n", - "2. Lara Croft: Tomb Raider\n", - "3. Monkeybone\n", - "4. Fantastic Four\n", - "5. Titan A.E.\n", - "6. Sinbad: Legend of the Seven Seas\n", - "7. Charlie's Angels\n", - "8. X-Men\n", - "9. Alaska: Spirit of the Wild\n", - "10. Planet of the Apes\n" + "1. Surviving Christmas\n", + "2. Christmas Carol: The Movie\n", + "3. How the Grinch Stole Christmas\n", + "4. 'Twas the Night\n", + "5. Love Actually\n", + "6. Dead End\n", + "7. Bad Santa\n", + "8. 'R Xmas\n", + "9. Casper's Haunted Christmas\n", + "10. The Ultimate Christmas Present\n" ] } ], "source": [ "# Example query.\n", - "query = \"I love superhero movies, any recommendations?\"\n", - "prefix=\"Instruct: Given a user query for movies, retrieve the relevant movie that can fulfill the query.\\nQuery: \"\n", + "query = \"I like Christmas movies, any recommendations?\"\n", + "prefix=\"search_query: \"\n", "query_emb = generate_embeddings([query], embedding_model_string, prefix=prefix)[0]\n", "\n", "results = collection.aggregate([\n", @@ -283,7 +263,7 @@ " \"path\": vector_database_field_name,\n", " \"numCandidates\": 100, # this should be 10-20x the limit\n", " \"limit\": 10, # the number of documents to return in the results\n", - " \"index\": 'movie_index', # the index name you used in Step 4, here we default to basics\n", + " \"index\": 'movie', # the index name you used in the earlier step\n", " }\n", " }\n", "])\n", @@ -304,14 +284,13 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "your_task_prompt = (\n", - " \"From the given movie listing data, choose a great movie recommendation for superhero movies. \"\n", - " \"I don't like spider man though. \"\n", - " \"Tell me the name of the movie and why it works for me.\"\n", + " \"From the given movie listing data, choose a few great movie recommendations given the user query. 
\"\n", + " f\"User query: {query}\"\n", ")\n", "\n", "listing_data = \"\"\n", @@ -332,14 +311,19 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Based on your preference to exclude Spider-Man movies, I would recommend \"X-Men.\" This movie is a great choice for superhero fans as it features a team of mutants with unique abilities who fight to protect humanity from a dangerous terrorist organization. The film features impressive special effects, engaging action sequences, and well-developed characters, making it an exciting and entertaining viewing experience. Additionally, the themes of acceptance and prejudice add depth to the story, making it a great pick for those who enjoy thought-provoking superhero movies.\n" + "Based on the user's query, I would recommend the following Christmas movies from the provided data:\n", + "\n", + "1. \"Love Actually\" - A romantic comedy that takes place in the five weeks preceding Christmas, following the lives of eight couples in dealing with their love lives in various interrelated tales all set in London, England.\n", + "2. \"How the Grinch Stole Christmas\" - A live-action adaptation of Dr. Seuss's classic holiday tale about a green, revenge-seeking Grinch who decides to ruin Christmas for the cheery residents of Whoville.\n", + "3. \"Surviving Christmas\" - A comedy about a wealthy Chicago advertisement executive who, after being left by his girlfriend right before Christmas, hires a family to spend the holiday with him in his childhood home.\n", + "4. \"Christmas Carol: The Movie\" - An animated retelling of Charles Dickens' classic story, where Ebenezer Scrooge learns\n" ] } ], @@ -357,8 +341,9 @@ "metadata": {}, "source": [ "## Conclusion\n", - "And that's it! You've successfully built a movie recommendation system using Fireworks, MongoDB, and the Mistral E5 embedding model. 
This system can be further customized and scaled to suit various needs.\n", - "\n" + "And that's it! You've successfully built a movie recommendation system using Fireworks, MongoDB, and the nomic-ai embedding model. This system can be further customized and scaled to suit various needs. A few things are still missing from this guide:\n", + "- We used the default embedding dimension of 768 in this example. When the cost of storing embeddings is high, you might want to reduce it. We walk through another MongoDB example that leverages Matryoshka embeddings to reduce the embedding size in [this guide](examples/rag/mongo_resize_embeddings.ipynb).\n", + "- We only index 2000 movies in this example, which is not a lot. To keep this tutorial simple, we do not batch the embedding lookups; a plain for loop goes through the documents and embeds them one at a time. This method does not scale. We cover basic batching in the [following guide](examples/rag/mongo_resize_embeddings.ipynb). 
There are also a lot of great frameworks that offer batching out of the box; check out our guide for [LlamaIndex](https://github.com/run-llama/llama_index/blob/cf0da01e0cc756383e07eb499cb9825cfa17984d/docs/examples/vector_stores/MongoDBAtlasVectorSearchRAGFireworks.ipynb)." ] } ], diff --git a/examples/rag/mongo_resize_embeddings.ipynb b/examples/rag/mongo_resize_embeddings.ipynb new file mode 100644 index 0000000..2cef1a6 --- /dev/null +++ b/examples/rag/mongo_resize_embeddings.ipynb @@ -0,0 +1,426 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + " \"Open\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Reduced embedding dimension example with Fireworks + MongoDB + Nomic\n", + "\n", + "## Introduction\n", + "Hopefully you have gone through the [previous cookbook](examples/rag/mongo_basic.ipynb) to cover the basics. In this tutorial, we'll explore how to create a basic movie recommendation system with an adjustable trade-off between storage cost and quality. We'll leverage the Fireworks API for embedding generation, MongoDB for data storage and retrieval, and the Nomic-AI embedding model for nuanced understanding of movie data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up Your Environment\n", + "Before we dive into the code, make sure to set up your environment. This involves installing necessary packages like pymongo and openai. 
Run the following command in your notebook to install these packages:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" + ] + } + ], + "source": [ + "!pip install -q pymongo fireworks-ai tqdm openai" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initializing Fireworks and MongoDB Clients\n", + "To interact with Fireworks and MongoDB, we need to initialize their respective clients. Replace \"YOUR FIREWORKS API KEY\" and \"YOUR MONGO URL\" with your actual credentials." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import pymongo\n", + "\n", + "mongo_url = input()\n", + "client = pymongo.MongoClient(mongo_url)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "fw_client = openai.OpenAI(\n", + " api_key=input(),\n", + " base_url=\"https://api.fireworks.ai/inference/v1\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Indexing and retrieval for movies.\n", + "We are going to build a model to index and retrieve movie recommendations. We will setup the most basic RAG example on top of MongoDB which involves\n", + "- MongoDB Atlas database that indexes movies based on embeddings\n", + "- a system for document embedding generation. We'll use the Nomic-AI model to create embeddings from text data. 
The function generate_embeddings takes a list of texts and returns dimensionality-reduced embeddings.\n", + " - The Nomic AI model, specifically the `nomic-ai/nomic-embed-text-v1.5` variant, is a great open source embedding model. You can ask it to produce not only embeddings of size 768, but also embeddings with smaller dimensions all the way down to 64. In this example, we try dimension 128 and see whether we can get the example up and running without a noticeable quality impact.\n", + "- a basic search engine that responds to a user query by embedding the query, fetching the matching movies, and then using an LLM to generate the recommendations.\n", + "\n", + "We will update our generate_embeddings function slightly so that it can request embeddings with a variable number of dimensions." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List\n", + "\n", + "def generate_embeddings(\n", + " input_texts: List[str],\n", + " model_api_string: str,\n", + " embedding_dimensions: int = 768,\n", + " prefix=\"\"\n", + ") -> List[List[float]]:\n", + " \"\"\"Generate embeddings through the Fireworks API.\n", + "\n", + " Args:\n", + " input_texts: a list of string input texts.\n", + " model_api_string: str. An API string for a specific embedding model of your choice.\n", + " embedding_dimensions: int. The number of dimensions to request for each embedding (default 768).\n", + " prefix: a prefix to prepend to each input text before embedding, which is required for nomic 1.5. See https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage for more information.\n", + "\n", + " Returns:\n", + " reduced_embeddings_list: a list of reduced-size embeddings. 
Each element corresponds to each input text.\n", + " \"\"\"\n", + " if prefix:\n", + " input_texts = [prefix + text for text in input_texts] \n", + " return [x.embedding for x in \n", + " fw_client.embeddings.create(\n", + " input=input_texts,\n", + " model=model_api_string,\n", + " dimensions=embedding_dimensions,\n", + " ).data]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Processing\n", + "Now, let's process our movie data. We'll extract key information from our MongoDB collection and generate embeddings for each movie. Ensure NUM_DOC_LIMIT is set to limit the number of documents processed." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Embedding size is: 128\n" + ] + } + ], + "source": [ + "embedding_model_string = 'nomic-ai/nomic-embed-text-v1.5'\n", + "vector_database_field_name = 'embeddings_128' # define your embedding field name.\n", + "NUM_DOC_LIMIT = 2000 # the number of documents to process and embed.\n", + "\n", + "sample_output = generate_embeddings([\"This is a test.\"], embedding_model_string, embedding_dimensions=128)\n", + "print(f\"Embedding size is: {str(len(sample_output[0]))}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Batching\n", + "We will also walk through basic batching. When you query the Fireworks API, you can send more than one document per call, and the embeddings are returned in the same order as the inputs. We will batch the 2000 documents into units of 200." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Document Processing : 2000it [00:02, 837.45it/s] \n", + "generate and insert embeddings: 10it [02:54, 17.48s/it]\n" + ] + } + ], + "source": [ + "from tqdm import tqdm\n", + "from datetime import datetime\n", + "\n", + "db = client.sample_mflix\n", + "collection = db.movies\n", + "\n", + "keys_to_extract = [\"plot\", \"genre\", \"cast\", \"title\", \"fullplot\", \"countries\", \"directors\"]\n", + "\n", + "extracted_str_list = []\n", + "for doc in tqdm(collection.find(\n", + " {\n", + " \"fullplot\":{\"$exists\": True},\n", + " \"released\": { \"$gt\": datetime(2000, 1, 1, 0, 0, 0)},\n", + " }\n", + ").limit(NUM_DOC_LIMIT), desc=\"Document Processing \"):\n", + " extracted_str = \"\\n\".join([k + \": \" + str(doc[k]) for k in keys_to_extract if k in doc])\n", + " extracted_str_list.append((doc['_id'], extracted_str))\n", + "\n", + "# Chunk extracted_str_list into batches of 200 (zip drops any trailing partial batch)\n", + "str_batches = zip(*(iter(extracted_str_list),) * 200)\n", + "\n", + "# Iterate over each batch\n", + "for batch in tqdm(str_batches, desc=\"generate and insert embeddings\"):\n", + " # Generate embeddings for the current batch\n", + " embeddings = generate_embeddings(\n", + " [t[1] for t in batch], # take the text from each (_id, text) tuple\n", + " embedding_model_string,\n", + " prefix=\"search_document: \",\n", + " embedding_dimensions=128,\n", + " )\n", + "\n", + " # Update documents with the generated embeddings\n", + " for i, embedding in enumerate(embeddings):\n", + " doc = collection.find_one({'_id': batch[i][0]})\n", + " doc[vector_database_field_name] = embedding\n", + " collection.replace_one({'_id': batch[i][0]}, doc)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setting Up the Search Index\n", + "For our system to efficiently search through movie embeddings, we need to set up a 
search index in MongoDB. Define the index structure as shown:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\n{\\n \"fields\": [\\n {\\n \"type\": \"vector\",\\n \"path\": \"embeddings\",\\n \"numDimensions\": 768,\\n \"similarity\": \"dotProduct\"\\n }\\n ]\\n}\\n\\n'" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\"\"\n", + "{\n", + " \"fields\": [\n", + " {\n", + " \"type\": \"vector\",\n", + " \"path\": \"embeddings\",\n", + " \"numDimensions\": 768,\n", + " \"similarity\": \"dotProduct\"\n", + " },\n", + " {\n", + " \"type\": \"vector\",\n", + " \"path\": \"embeddings_128\",\n", + " \"numDimensions\": 128,\n", + " \"similarity\": \"dotProduct\"\n", + " }\n", + " ]\n", + "}\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Querying the Recommender System\n", + "Let's test our recommender system. We create a query for Christmas movies, embed it with the same 128 dimensions, and run a vector search against the `embeddings_128` field." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "From your query \"I like Christmas movies, any recommendations?\", the following movie listings were found:\n", + "\n", + "1. Christmas Carol: The Movie\n", + "2. Love Actually\n", + "3. Surviving Christmas\n", + "4. Almost Famous\n", + "5. Dead End\n", + "6. Up, Up, and Away!\n", + "7. Do Fish Do It?\n", + "8. Let It Snow\n", + "9. The Little Polar Bear\n", + "10. 
One Point O\n" + ] + } + ], + "source": [ + "# Example query.\n", + "query = \"I like Christmas movies, any recommendations?\"\n", + "prefix=\"search_query: \"\n", + "query_emb = generate_embeddings([query], embedding_model_string, prefix=prefix, embedding_dimensions=128)[0]\n", + "\n", + "results = collection.aggregate([\n", + " {\n", + " \"$vectorSearch\": {\n", + " \"queryVector\": query_emb,\n", + " \"path\": vector_database_field_name,\n", + " \"numCandidates\": 100, # this should be 10-20x the limit\n", + " \"limit\": 10, # the number of documents to return in the results\n", + " \"index\": 'movie', # the index name you used in the earlier step\n", + " }\n", + " }\n", + "])\n", + "results_as_dict = {doc['title']: doc for doc in results}\n", + "\n", + "print(f\"From your query \\\"{query}\\\", the following movie listings were found:\\n\")\n", + "print(\"\\n\".join([str(i+1) + \". \" + name for (i, name) in enumerate(results_as_dict.keys())]))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The top results remain similar with just 128 dimensions, although the lower-ranked ones differ from the 768-dimension run in the previous cookbook. So if you feel that 128 dimensions are good enough for your use case, you can reduce the dimensions and save some database cost." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generating Recommendations\n", + "Finally, we use Fireworks' chat API to generate a personalized movie recommendation based on the user's query and preferences.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "your_task_prompt = (\n", + " \"From the given movie listing data, choose a few great movie recommendations. 
\"\n", + " f\"User query: {query}\"\n", + ")\n", + "\n", + "listing_data = \"\"\n", + "for doc in results_as_dict.values():\n", + " listing_data += f\"Movie title: {doc['title']}\\n\"\n", + " for (k, v) in doc.items():\n", + " if not(k in keys_to_extract) or (\"embedding\" in k): continue\n", + " if k == \"name\": continue\n", + " listing_data += k + \": \" + str(v) + \"\\n\"\n", + " listing_data += \"\\n\"\n", + "\n", + "augmented_prompt = (\n", + " \"movie listing data:\\n\"\n", + " f\"{listing_data}\\n\\n\"\n", + " f\"{your_task_prompt}\"\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Based on the user's preference for Christmas movies, here are a few great recommendations from the given movie listing data:\n", + "\n", + "1. Christmas Carol: The Movie - A beautiful animated movie adaptation of Charles Dickens' classic Christmas tale, featuring an all-star cast including Simon Callow, Kate Winslet, and Nicolas Cage.\n", + "2. Love Actually - A heartwarming ensemble romantic comedy set during the Christmas season in London, starring Bill Nighy, Colin Firth, Hugh Grant, and Liam Neeson, among many others.\n", + "3. 
Surviving Christmas - A funny and touching holiday movie about a rich and lonely man (Ben Affleck) who hires a family to spend Christmas with him, only to find that their presence helps him rediscover the true meaning of the season.\n", + "\n", + "Hope these recommendations fit your taste and bring you some holiday cheer!\n" + ] + } + ], + "source": [ + "response = fw_client.chat.completions.create(\n", + " messages=[{\"role\": \"user\", \"content\": augmented_prompt}],\n", + " model=\"accounts/fireworks/models/mixtral-8x7b-instruct\",\n", + ")\n", + "\n", + "print(response.choices[0].message.content)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "You've successfully updated the movie recommendation system with batching and variable embedding dimensions. If you are interested in going further and integrating MongoDB + Fireworks into your systems, check out our:\n", + "- [LangChain integration, with function calling](https://github.com/fw-ai/cookbook/blob/main/examples/rag/mongodb_agent.ipynb)\n", + "- [LlamaIndex](https://github.com/run-llama/llama_index/blob/cf0da01e0cc756383e07eb499cb9825cfa17984d/docs/examples/vector_stores/MongoDBAtlasVectorSearchRAGFireworks.ipynb)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}