Skip to content

An application that crawls websites, generates embeddings using Azure OpenAI, and stores them in an Azure AI Search Index for RAG systems.

License

Notifications You must be signed in to change notification settings

amgdy/azure-ai-search-website-crawler

Repository files navigation

Azure AI Search Website Crawler

Open in GitHub Codespaces Open in Dev Containers

Description

The Azure AI Search Website Crawler is an application designed to crawl a website, generate embeddings for the crawled content using Azure OpenAI, and store these embeddings in an Azure AI Search Index. This setup is particularly useful for Retrieval-Augmented Generation (RAG) systems.

Used Technologies

The application is built using the following technologies:

  • .NET 9: The core application is developed using .NET 9, which provides a robust and scalable framework for building modern applications.
  • Containerization: The application is packaged as container images, ensuring consistency across different environments and simplifying the deployment process.
  • Azure Container App Job: The containerized application is deployed to Azure Container App Job, which provides a managed environment for running containerized applications in the cloud.
  • Abot NuGet Library: To crawl the website, the application uses the Abot NuGet library. Abot is a powerful and flexible web crawler library for .NET that allows for efficient and customizable web crawling.
  • Azure OpenAI: During the indexing process, the application uses the Azure OpenAI embedding model to generate vector representations of the crawled content. These embeddings are then stored in the Azure AI Search Index, enabling advanced search capabilities.
  • Azure AI Search: Azure AI Search is used as a vector store to store the crawled data along with the embeddings generated by Azure OpenAI. This allows for efficient retrieval and search of the indexed content.

Deployment

Prerequisites

  • Azure Subscription
  • Azure CLI
  • Docker
  • .NET SDK
  • Azure Developer CLI (azd)

Steps

  1. Clone the repository:

    git clone https://github.com/amgdy/azure-ai-search-website-crawler.git
    cd azure-ai-search-website-crawler
  2. Login to Azure:

    Authenticate with both Azure CLI and Azure Developer CLI.

    az login
    azd auth login
  3. Deploy the Application:

    Use the Azure Developer CLI to deploy the application.

    azd up

Environment Variables

The following environment variables are required for the application to run. These can be set in the .env file or directly in the deployment scripts.

Variable Name Description Required Default Value
APPLICATIONINSIGHTS_CONNECTION_STRING Connection string for Azure Application Insights ✔️
AzureAiSearch__ApiKey API key for Azure AI Search service
AzureAiSearch__EndpointUrl Endpoint URL for Azure AI Search service ✔️
AzureAiSearch__IndexName Index name for the Azure AI Search service
AzureOpenAi__ApiKey API key for Azure OpenAI service
AzureOpenAi__EmbeddingModelDeployment Deployment name for the Azure OpenAI embedding model ✔️ text-embedding-ada-002
AzureOpenAi__EmbeddingModelDimensions Dimensions for the Azure OpenAI embedding model ✔️ 1536
AzureOpenAi__EmbeddingModelMaxTokens Maximum tokens for the Azure OpenAI embedding model ✔️ 8100
AzureOpenAi__EndpointUrl Endpoint URL for Azure OpenAI service ✔️
TextSplitter__DefaultOverlapPercent Default overlap percentage for the text splitter ✔️ 10
TextSplitter__DefaultSectionLength Default section length for the text splitter ✔️ 1000
TextSplitter__MaxTokensPerSection Maximum tokens per section for the text splitter ✔️ 500
TextSplitter__SentenceSearchLimit Sentence search limit for the text splitter ✔️ 100
WebCrawler__MaxBatchSize Maximum batch size for the web crawler ✔️ 100
WebCrawler__MaxCrawlDepth Maximum crawl depth for the web crawler ✔️ 5
WebCrawler__MaxPagesToCrawl Maximum pages to crawl for the web crawler ✔️ 300
WebCrawler__MaxRetryAttempts Maximum retry attempts for the web crawler ✔️ 3
WebCrawler__Url URL of the website to crawl ✔️

Running the Application

To run the application locally, use the following command:

dotnet run --project app/AzureAiSearchWebsiteCrawler

About

An application that crawls websites, generates embeddings using Azure OpenAI, and stores them in an Azure AI Search Index for RAG systems.

Resources

License

Stars

Watchers

Forks

Packages