From 0efa45c9998c88f8b54cdead18e2392e7b2e65d9 Mon Sep 17 00:00:00 2001 From: Aravind Putrevu Date: Tue, 9 Apr 2024 01:39:44 +0530 Subject: [PATCH] Updated basic mongo notebook with some API changes (#81) * Updated basic mongo notebook with some API changes * Final update to the basic notebook --- examples/rag/mongo_basic.ipynb | 840 ++++++++++++++++++--------------- 1 file changed, 466 insertions(+), 374 deletions(-) diff --git a/examples/rag/mongo_basic.ipynb b/examples/rag/mongo_basic.ipynb index fe75a86..c26ecae 100644 --- a/examples/rag/mongo_basic.ipynb +++ b/examples/rag/mongo_basic.ipynb @@ -1,384 +1,476 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - " \"Open\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Movie recommender example with Fireworks + MongoDB + Nomic embedding model\n", - "\n", - "## Introduction\n", - "In this tutorial, we'll explore how to create a basic movie recommendation system. We'll leverage the Fireworks API for embedding generation, MongoDB for data storage and retrieval, and the Nomic-AI embedding model for nuanced understanding of movie data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up Your Environment\n", - "Before we dive into the code, make sure to set up your environment. This involves installing necessary packages like pymongo and openai. Run the following command in your notebook to install these packages:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ + "cells": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" - ] - } - ], - "source": [ - "!pip install -q pymongo fireworks-ai tqdm openai" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initializing Fireworks and MongoDB Clients\n", - "To interact with Fireworks and MongoDB, we need to initialize their respective clients. Replace \"YOUR FIREWORKS API KEY\" and \"YOUR MONGO URL\" with your actual credentials." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import pymongo\n", - "\n", - "mongo_url = input()\n", - "client = pymongo.MongoClient(mongo_url)" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import openai\n", - "fw_client = openai.OpenAI(\n", - " api_key=input(),\n", - " base_url=\"https://api.fireworks.ai/inference/v1\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Indexing and retrieval for movies.\n", - "We are going to build a model to index and retrieve movie recommendations. We will setup the most basic RAG example on top of MongoDB which involves\n", - "- MongoDB Atlas database that indexes movies based on embeddings\n", - "- a system for document embedding generation. We'll use the Nomic-AI model to create embeddings from text data. The function generate_embeddings takes a list of texts and returns embeddings.\n", - "- a basic search engine that responds to user query by embedding the user query, fetching the corresponding movies, and then use an LLM to generate the recommendations.\n", - "\n", - "## Understanding the Nomic-ai 1.5 Model\n", - "\n", - "The Nomic AI model, specifically the `nomic-ai/nomic-embed-text-v1.5` variant, is a great open source model embedding model. It has other features such as dimensionality reduction, but needs some special prefixes to be used properly, which we can get into in the next section" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import List\n", - "\n", - "def generate_embeddings(input_texts: str, model_api_string: str, prefix=\"\") -> List[float]:\n", - " \"\"\"Generate embeddings from Fireworks python library\n", - "\n", - " Args:\n", - " input_texts: a list of string input texts.\n", - " model_api_string: str. An API string for a specific embedding model of your choice.\n", - " prefix: what prefix to attach to the generate the embeddings, which is required for nomic 1.5. Please check out https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage for more information\n", - "\n", - " Returns:\n", - " reduced_embeddings_list: a list of reduced-size embeddings. Each element corresponds to each input text.\n", - " \"\"\"\n", - " if prefix:\n", - " input_texts = [prefix + text for text in input_texts] \n", - " return fw_client.embeddings.create(\n", - " input=input_texts,\n", - " model=model_api_string,\n", - " ).data[0].embedding" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In the function above, we did not implement batching and always return the embedding at position zero. For how to do batching, we will cover it in the next tutorial." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data Processing\n", - "Now, let's process our movie data. We'll extract key information from our MongoDB collection and generate embeddings for each movie. Ensure NUM_DOC_LIMIT is set to limit the number of documents processed." - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "metadata": { + "id": "ZFJLN_BhQUo5" + }, + "source": [ + "\n", + " \"Open\n", + "" + ] + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "Embedding size is: 768\n" - ] - } - ], - "source": [ - "embedding_model_string = 'nomic-ai/nomic-embed-text-v1.5'\n", - "vector_database_field_name = 'embed' # define your embedding field name.\n", - "NUM_DOC_LIMIT = 2000 # the number of documents you will process and generate embeddings.\n", - "\n", - "sample_output = generate_embeddings([\"This is a test.\"], embedding_model_string)\n", - "print(f\"Embedding size is: {str(len(sample_output))}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "metadata": { + "id": "6sxuDF2UQUo6" + }, + "source": [ + "# Movie recommender example with Fireworks + MongoDB + Nomic embedding model\n", + "\n", + "## Introduction\n", + "In this tutorial, we'll explore how to create a basic movie recommendation system. We'll leverage the Fireworks API for embedding generation, MongoDB for data storage and retrieval, and the Nomic-AI embedding model for nuanced understanding of movie data." + ] + }, { - "name": "stderr", - "output_type": "stream", - "text": [ - "Document Processing : 2000it [01:56, 17.22it/s]\n" - ] - } - ], - "source": [ - "from tqdm import tqdm\n", - "from datetime import datetime\n", - "\n", - "db = client.sample_mflix\n", - "collection = db.movies\n", - "\n", - "keys_to_extract = [\"plot\", \"genre\", \"cast\", \"title\", \"fullplot\", \"countries\", \"directors\"]\n", - "for doc in tqdm(collection.find(\n", - " {\n", - " \"fullplot\":{\"$exists\": True},\n", - " \"released\": { \"$gt\": datetime(2000, 1, 1, 0, 0, 0)},\n", - " }\n", - ").limit(NUM_DOC_LIMIT), desc=\"Document Processing \"):\n", - " extracted_str = \"\\n\".join([k + \": \" + str(doc[k]) for k in keys_to_extract if k in doc])\n", - " if vector_database_field_name not in doc:\n", - " doc[vector_database_field_name] = generate_embeddings([extracted_str], embedding_model_string, \"search_document: \")\n", - " collection.replace_one({'_id': doc['_id']}, doc)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting Up the Search Index\n", - "For our system to efficiently search through movie embeddings, we need to set up a search index in MongoDB. Define the index structure as shown:" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "metadata": { + "id": "XzyUpYYyQUo7" + }, + "source": [ + "## Setting Up Your Environment\n", + "Before we dive into the code, make sure to set up your environment. This involves installing necessary packages like pymongo and openai. Run the following command in your notebook to install these packages:" + ] + }, { - "data": { - "text/plain": [ - "'\\n{\\n \"fields\": [\\n {\\n \"type\": \"vector\",\\n \"path\": \"embeddings\",\\n \"numDimensions\": 768,\\n \"similarity\": \"dotProduct\"\\n }\\n ]\\n}\\n\\n'" + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0z1XI0QJQUo7", + "outputId": "3e3ee40e-5351-4d8c-c1f3-e5b5d6b8fd4c" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/676.9 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[91m━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m112.6/676.9 kB\u001b[0m \u001b[31m3.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━\u001b[0m \u001b[32m450.6/676.9 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m675.8/676.9 kB\u001b[0m \u001b[31m7.3 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m676.9/676.9 kB\u001b[0m \u001b[31m5.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m80.5/80.5 kB\u001b[0m \u001b[31m7.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m267.1/267.1 kB\u001b[0m \u001b[31m9.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m307.7/307.7 kB\u001b[0m \u001b[31m9.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.6/75.6 kB\u001b[0m \u001b[31m1.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m3.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h" + ] + } + ], + "source": [ + "!pip install -q pymongo fireworks-ai tqdm openai" ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\"\"\"\n", - "{\n", - " \"fields\": [\n", - " {\n", - " \"type\": \"vector\",\n", - " \"path\": \"embed\",\n", - " \"numDimensions\": 768,\n", - " \"similarity\": \"dotProduct\"\n", - " }\n", - " ]\n", - "}\n", - "\n", - "\"\"\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Querying the Recommender System\n", - "Let's test our recommender system. We create a query for Christmas movies, as per user preference." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "From your query \"I like Christmas movies, any recommendations?\", the following movie listings were found:\n", - "\n", - "1. Surviving Christmas\n", - "2. Christmas Carol: The Movie\n", - "3. How the Grinch Stole Christmas\n", - "4. 'Twas the Night\n", - "5. Love Actually\n", - "6. Dead End\n", - "7. Bad Santa\n", - "8. 'R Xmas\n", - "9. Casper's Haunted Christmas\n", - "10. The Ultimate Christmas Present\n" - ] - } - ], - "source": [ - "# Example query.\n", - "query = \"I like Christmas movies, any recommendations?\"\n", - "prefix=\"search_query: \"\n", - "query_emb = generate_embeddings([query], embedding_model_string, prefix=prefix)\n", - "\n", - "results = collection.aggregate([\n", - " {\n", - " \"$vectorSearch\": {\n", - " \"queryVector\": query_emb,\n", - " \"path\": vector_database_field_name,\n", - " \"numCandidates\": 100, # this should be 10-20x the limit\n", - " \"limit\": 10, # the number of documents to return in the results\n", - " \"index\": 'movie', # the index name you used in the earlier step\n", - " }\n", - " }\n", - "])\n", - "results_as_dict = {doc['title']: doc for doc in results}\n", - "\n", - "print(f\"From your query \\\"{query}\\\", the following movie listings were found:\\n\")\n", - "print(\"\\n\".join([str(i+1) + \". \" + name for (i, name) in enumerate(results_as_dict.keys())]))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Generating Recommendations\n", - "Finally, we use Fireworks' chat API to generate a personalized movie recommendation based on the user's query and preferences.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [], - "source": [ - "your_task_prompt = (\n", - " \"From the given movie listing data, choose a few great movie recommendation given the user query. \"\n", - " f\"User query: {query}\"\n", - ")\n", - "\n", - "listing_data = \"\"\n", - "for doc in results_as_dict.values():\n", - " listing_data += f\"Movie title: {doc['title']}\\n\"\n", - " for (k, v) in doc.items():\n", - " if not(k in keys_to_extract) or (\"embedding\" in k): continue\n", - " if k == \"name\": continue\n", - " listing_data += k + \": \" + str(v) + \"\\n\"\n", - " listing_data += \"\\n\"\n", - "\n", - "augmented_prompt = (\n", - " \"movie listing data:\\n\"\n", - " f\"{listing_data}\\n\\n\"\n", - " f\"{your_task_prompt}\"\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "metadata": { + "id": "lrwAcZFsQUo7" + }, + "source": [ + "## Initializing Fireworks and MongoDB Clients\n", + "To interact with Fireworks and MongoDB, we need to initialize their respective clients. Replace \"YOUR_FIREWORKS_API_KEY\" and \"YOUR_MONGO_URL\" with your actual credentials.\n", + "\n", + "\n", + "Please create a Mongodb Atlas cluster using the link [here](https://www.mongodb.com/atlas)." + ] + }, + { + "cell_type": "markdown", + "source": [ + "```\n", + "Note:\n", + "\n", + "1. You should create a create a user name, password pair and fill those details in the URI below. MongoDB URI would like `mongodb+srv://:@`. \n", + "2. In MongoDB Atlas, you can only connect to a cluster from a trusted IP address. You must add your IP address to the IP access list before you can connect to your cluster. OR open access to public internet by configuring `0.0.0.0/0`.\n", + "\n", + "```" + ], + "metadata": { + "id": "jZYR4w7UAF-a" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7_Vq6YC5QUo8" + }, + "outputs": [], + "source": [ + "from pymongo.mongo_client import MongoClient\n", + "from pymongo.server_api import ServerApi\n", + "\n", + "\n", + "uri = \"YOUR_MONGO_URL\" # you can copy uri from MongoDB Atlas Cloud Console https://cloud.mongodb.com\n", + "\n", + "# Create a new client and connect to the server\n", + "client = MongoClient(uri, server_api=ServerApi('1'))" + ] + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "Based on the user's query, I would recommend the following Christmas movies from the given data:\n", - "\n", - "1. \"Love Actually\" (2003) - A romantic comedy that follows the lives of eight couples in London during the Christmas season, dealing with various aspects of love and relationships.\n", - "2. \"The Grinch\" (2000) - A family-friendly animated film about the Grinch, a creature who despises Christmas and sets out to steal it from the residents of Whoville, but is eventually won over by the spirit of the holiday.\n", - "3. \"Surviving Christmas\" (2004) - A comedy about a wealthy man who hires a family to spend Christmas with him in his childhood home, leading to unexpected consequences and a journey of self-discovery.\n", - "4. \"Christmas Carol: The Movie\" (2001) - An animated retelling of Charles Dickens' classic story of E\n" - ] + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "F8XMc9oYQUo8" + }, + "outputs": [], + "source": [ + "import openai\n", + "fw_client = openai.OpenAI(\n", + " api_key=\"YOUR_FIREWORKS_API_KEY\", # you can find Fireworks API key under accounts -> API keys\n", + " base_url=\"https://api.fireworks.ai/inference/v1\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_SXYrvWxQUo8" + }, + "source": [ + "## Indexing and retrieval for movies.\n", + "We are going to build a model to index and retrieve movie recommendations. We will setup the most basic RAG example on top of MongoDB which involves\n", + "- MongoDB Atlas database that indexes movies based on embeddings\n", + "- a system for document embedding generation. We'll use the Nomic-AI model to create embeddings from text data. The function generate_embeddings takes a list of texts and returns embeddings.\n", + "- a basic search engine that responds to user query by embedding the user query, fetching the corresponding movies, and then use an LLM to generate the recommendations.\n", + "\n", + "## Understanding the Nomic-ai 1.5 Model\n", + "\n", + "The Nomic AI model, specifically the `nomic-ai/nomic-embed-text-v1.5` variant, is a great open source model embedding model. It has other features such as dimensionality reduction, but needs some special prefixes to be used properly, which we can get into in the next section" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NLxxWRIYQUo8" + }, + "outputs": [], + "source": [ + "from typing import List\n", + "\n", + "def generate_embeddings(input_texts: str, model_api_string: str, prefix=\"\") -> List[float]:\n", + " \"\"\"Generate embeddings from Fireworks python library\n", + "\n", + " Args:\n", + " input_texts: a list of string input texts.\n", + " model_api_string: str. An API string for a specific embedding model of your choice.\n", + " prefix: what prefix to attach to the generate the embeddings, which is required for nomic 1.5. Please check out https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage for more information\n", + "\n", + " Returns:\n", + " reduced_embeddings_list: a list of reduced-size embeddings. Each element corresponds to each input text.\n", + " \"\"\"\n", + " if prefix:\n", + " input_texts = [prefix + text for text in input_texts]\n", + " return fw_client.embeddings.create(\n", + " input=input_texts,\n", + " model=model_api_string,\n", + " ).data[0].embedding" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rr--Yl_DQUo8" + }, + "source": [ + "In the function above, we did not implement batching and always return the embedding at position zero. For how to do batching, we will cover it in the next tutorial." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-zCf7mabQUo8" + }, + "source": [ + "## Data Processing\n", + "Now, let's process our movie data. We'll extract key information from our MongoDB collection and generate embeddings for each movie. Ensure NUM_DOC_LIMIT is set to limit the number of documents processed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "hFGimiaPQUo9", + "outputId": "f5616422-1510-4071-e024-aabf178d859f" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Embedding size is: 768\n" + ] + } + ], + "source": [ + "embedding_model_string = 'nomic-ai/nomic-embed-text-v1.5'\n", + "vector_database_field_name = 'embed' # define your embedding field name.\n", + "NUM_DOC_LIMIT = 2000 # the number of documents you will process and generate embeddings.\n", + "\n", + "sample_output = generate_embeddings([\"This is a test.\"], embedding_model_string)\n", + "print(f\"Embedding size is: {str(len(sample_output))}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pg3tBBqaQUo9", + "outputId": "7f0db198-4d55-443f-8733-306abda9ec54" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Document Processing : 0it [00:01, ?it/s]\n" + ] + } + ], + "source": [ + "from tqdm import tqdm\n", + "from datetime import datetime\n", + "\n", + "db = client.sample_mflix # loading sample dataset from MongoDB Atlas\n", + "collection = db.movies\n", + "\n", + "keys_to_extract = [\"plot\", \"genre\", \"cast\", \"title\", \"fullplot\", \"countries\", \"directors\"]\n", + "for doc in tqdm(collection.find(\n", + " {\n", + " \"fullplot\":{\"$exists\": True},\n", + " \"released\": { \"$gt\": datetime(2000, 1, 1, 0, 0, 0)},\n", + " }\n", + ").limit(NUM_DOC_LIMIT), desc=\"Document Processing \"):\n", + " extracted_str = \"\\n\".join([k + \": \" + str(doc[k]) for k in keys_to_extract if k in doc])\n", + " if vector_database_field_name not in doc:\n", + " doc[vector_database_field_name] = generate_embeddings([extracted_str], embedding_model_string, \"search_document: \")\n", + " collection.replace_one({'_id': doc['_id']}, doc)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UpECpjuLQUo9" + }, + "source": [ + "## Setting Up the Search Index\n", + "For our system to efficiently search through movie embeddings, we need to set up a search index in MongoDB. Please run the below cell to define the index structure as shown:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "id": "sInWVcVHQUo9", + "outputId": "72c08f9c-ecbf-4931-d2f7-d85810b3d6f8" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "'\\n{\\n \"fields\": [\\n {\\n \"type\": \"vector\",\\n \"path\": \"embed\",\\n \"numDimensions\": 768,\\n \"similarity\": \"dotProduct\"\\n }\\n ]\\n}\\n\\n'" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + } + }, + "metadata": {}, + "execution_count": 10 + } + ], + "source": [ + "\"\"\"\n", + "{\n", + " \"fields\": [\n", + " {\n", + " \"type\": \"vector\",\n", + " \"path\": \"embed\",\n", + " \"numDimensions\": 768,\n", + " \"similarity\": \"dotProduct\"\n", + " }\n", + " ]\n", + "}\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KYruuAQiQUo9" + }, + "source": [ + "## Querying the Recommender System\n", + "Let's test our recommender system. We create a query for superhero movies and exclude Spider-Man movies, as per user preference." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "irLBhxvpQUo9", + "outputId": "b84c6a69-cd8f-440e-a568-87cdca61ed99" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "From your query \"I like Christmas movies, any recommendations?\", the following movie listings were found:\n", + "\n", + "\n" + ] + } + ], + "source": [ + "# Example query.\n", + "query = \"I like Christmas movies, any recommendations?\"\n", + "prefix=\"search_query: \"\n", + "query_emb = generate_embeddings([query], embedding_model_string, prefix=prefix)\n", + "\n", + "results = collection.aggregate([\n", + " {\n", + " \"$vectorSearch\": {\n", + " \"queryVector\": query_emb,\n", + " \"path\": vector_database_field_name,\n", + " \"numCandidates\": 100, # this should be 10-20x the limit\n", + " \"limit\": 10, # the number of documents to return in the results\n", + " \"index\": 'movie', # the index name you used in the earlier step\n", + " }\n", + " }\n", + "])\n", + "results_as_dict = {doc['title']: doc for doc in results}\n", + "\n", + "print(f\"From your query \\\"{query}\\\", the following movie listings were found:\\n\")\n", + "print(\"\\n\".join([str(i+1) + \". \" + name for (i, name) in enumerate(results_as_dict.keys())]))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LX8T8ihzQUo9" + }, + "source": [ + "## Generating Recommendations\n", + "Finally, we use Fireworks' chat API to generate a personalized movie recommendation based on the user's query and preferences.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "w_Y1OhbIQUo9" + }, + "outputs": [], + "source": [ + "your_task_prompt = (\n", + " \"From the given movie listing data, choose a few great movie recommendation given the user query. \"\n", + " f\"User query: {query}\"\n", + ")\n", + "\n", + "listing_data = \"\"\n", + "for doc in results_as_dict.values():\n", + " listing_data += f\"Movie title: {doc['title']}\\n\"\n", + " for (k, v) in doc.items():\n", + " if not(k in keys_to_extract) or (\"embedding\" in k): continue\n", + " if k == \"name\": continue\n", + " listing_data += k + \": \" + str(v) + \"\\n\"\n", + " listing_data += \"\\n\"\n", + "\n", + "augmented_prompt = (\n", + " \"movie listing data:\\n\"\n", + " f\"{listing_data}\\n\\n\"\n", + " f\"{your_task_prompt}\"\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "PIfbE8J9QUo9", + "outputId": "1f72db83-5e75-4e07-df5b-27138cc530ca" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Based on the movie listing data you provided, here are some great Christmas movie recommendations for the user:\n", + "\n", + "1. \"Elf\" (2003) - A comedic and heartwarming story about a man raised as an elf at the North Pole who travels to New York to find his biological father.\n", + "2. \"National Lampoon's Christmas Vacation\" (1989) - A classic comedy about a family's chaotic and hilarious holiday season.\n", + "3. \"The Polar Express\" (2004) - An animated adventure that follows a young boy's journey to the North Pole on a magical train.\n", + "4. \"Home Alone\" (1990) - A classic comedy about a young boy who is accidentally left behind at home while his family goes on vacation for Christmas.\n", + "5. \"A Christmas Story\" (1983) - A classic comedy about a\n" + ] + } + ], + "source": [ + "response = fw_client.chat.completions.create(\n", + " messages=[{\"role\": \"user\", \"content\": augmented_prompt}],\n", + " model=\"accounts/fireworks/models/mixtral-8x7b-instruct\",\n", + ")\n", + "\n", + "print(response.choices[0].message.content)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lRMKeTrcQUo9" + }, + "source": [ + "## Conclusion\n", + "And that's it! You've successfully built a movie recommendation system using Fireworks, MongoDB, and the nomic-ai embedding model. This system can be further customized and scaled to suit various needs. There are still a few things that is missing in our guides\n", + "- we used the default 768 embedding dimension in the example. There are cases where the cost for storing the embedding is high, and you might want to reduce that, and we will walk you through another example with MongoDB + leveraging Matryoshka embedding to reduce embedding size in [this guide](examples/rag/mongo_reduced_embeddings.ipynb)\n", + "- we are only documenting 400 movies in this example, which is not a lot. This is because we wanted to keep this tutorial simple and not batching the embedding lookups, and just have a for loop that goes through all the documents and embed them manually. This method does not scale. First, we will cover basic batching in the [following guide](examples/rag/mongo_reduced_embeddings.ipynb). There are a lot of great frameworks that offer batching out of the box, and please check out our guides here for [LlamaIndex](https://github.com/run-llama/llama_index/blob/cf0da01e0cc756383e07eb499cb9825cfa17984d/docs/examples/vector_stores/MongoDBAtlasVectorSearchRAGFireworks.ipynb)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + }, + "colab": { + "provenance": [] } - ], - "source": [ - "response = fw_client.chat.completions.create(\n", - " messages=[{\"role\": \"user\", \"content\": augmented_prompt}],\n", - " model=\"accounts/fireworks/models/mixtral-8x7b-instruct\",\n", - ")\n", - "\n", - "print(response.choices[0].message.content)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "And that's it! You've successfully built a movie recommendation system using Fireworks, MongoDB, and the nomic-ai embedding model. This system can be further customized and scaled to suit various needs. There are still a few things that is missing in our guides\n", - "- we used the default 768 embedding dimension in the example. There are cases where the cost for storing the embedding is high, and you might want to reduce that, and we will walk you through another example with MongoDB + leveraging Matryoshka embedding to reduce embedding size in [this guide](examples/rag/mongo_reduced_embeddings.ipynb)\n", - "- we are only documenting 400 movies in this example, which is not a lot. This is because we wanted to keep this tutorial simple and not batching the embedding lookups, and just have a for loop that goes through all the documents and embed them manually. This method does not scale. First, we will cover basic batching in the [following guide](examples/rag/mongo_reduced_embeddings.ipynb). There are a lot of great frameworks that offer batching out of the box, and please check out our guides here for [LlamaIndex](https://github.com/run-llama/llama_index/blob/cf0da01e0cc756383e07eb499cb9825cfa17984d/docs/examples/vector_stores/MongoDBAtlasVectorSearchRAGFireworks.ipynb)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.12" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file