The Azure AI Search Website Crawler is an application designed to crawl a website, generate embeddings for the crawled content using Azure OpenAI, and store these embeddings in an Azure AI Search Index. This setup is particularly useful for Retrieval-Augmented Generation (RAG) systems.
The application is built using the following technologies:
- .NET 9: The core application is written in .NET 9.
- Containerization: The application is packaged as container images, which keeps it consistent across environments and simplifies deployment.
- Azure Container Apps job: The containerized application is deployed as an Azure Container Apps job, a managed environment for running containerized tasks in the cloud.
- Abot NuGet Library: To crawl the website, the application uses the Abot NuGet library, a configurable web crawler library for .NET.
- Azure OpenAI: During indexing, the application uses an Azure OpenAI embedding model to generate vector representations of the crawled content. These embeddings are then stored in the Azure AI Search index, where they support vector-based search.
- Azure AI Search: Azure AI Search serves as the vector store. It holds the crawled content alongside the embeddings generated by Azure OpenAI, so the indexed content can be searched and retrieved efficiently.

The following prerequisites are needed to build and deploy the application:
- Azure Subscription
- Azure CLI
- Docker
- .NET SDK
- Azure Developer CLI (`azd`)
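
Before cloning, it can help to confirm that the command-line tools listed above are installed. The following is a minimal sketch, not part of the repository. It assumes the tools are available under their usual executable names (`az`, `azd`, `docker`, `dotnet`), adds `git` for the clone step, and uses a helper name, `missing_tools`, that is made up for illustration.

```shell
# missing_tools is a hypothetical helper (not part of the repository).
# It prints the name of every given command that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the prerequisite CLIs plus git, which the clone step needs.
missing="$(missing_tools az azd docker dotnet git)"
if [ -n "$missing" ]; then
  # $missing is left unquoted so the names print on a single line.
  echo "Missing prerequisites:" $missing
else
  echo "All prerequisite tools found."
fi
```

The check only confirms that each executable is on `PATH`. It does not verify versions or that you are signed in to an Azure subscription.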

To get started, follow these steps:

- Clone the repository:

      git clone https://github.com/amgdy/azure-ai-search-website-crawler.git
      cd azure-ai-search-website-crawler

- Log in to Azure. Authenticate with both the Azure CLI and the Azure Developer CLI:

      az login
      azd auth login

- Deploy the application with the Azure Developer CLI:

      azd up

The application is configured through the following environment variables. They can be set in the `.env` file or directly in the deployment scripts. Variables marked as required that have no default value must be provided.

| Variable Name | Description | Required | Default Value |
|---|---|---|---|
| `APPLICATIONINSIGHTS_CONNECTION_STRING` | Connection string for Azure Application Insights | ✔️ | |
| `AzureAiSearch__ApiKey` | API key for Azure AI Search service | ❌ | |
| `AzureAiSearch__EndpointUrl` | Endpoint URL for Azure AI Search service | ✔️ | |
| `AzureAiSearch__IndexName` | Index name for the Azure AI Search service | ❌ | |
| `AzureOpenAi__ApiKey` | API key for Azure OpenAI service | ❌ | |
| `AzureOpenAi__EmbeddingModelDeployment` | Deployment name for the Azure OpenAI embedding model | ✔️ | text-embedding-ada-002 |
| `AzureOpenAi__EmbeddingModelDimensions` | Dimensions for the Azure OpenAI embedding model | ✔️ | 1536 |
| `AzureOpenAi__EmbeddingModelMaxTokens` | Maximum tokens for the Azure OpenAI embedding model | ✔️ | 8100 |
| `AzureOpenAi__EndpointUrl` | Endpoint URL for Azure OpenAI service | ✔️ | |
| `TextSplitter__DefaultOverlapPercent` | Default overlap percentage for the text splitter | ✔️ | 10 |
| `TextSplitter__DefaultSectionLength` | Default section length for the text splitter | ✔️ | 1000 |
| `TextSplitter__MaxTokensPerSection` | Maximum tokens per section for the text splitter | ✔️ | 500 |
| `TextSplitter__SentenceSearchLimit` | Sentence search limit for the text splitter | ✔️ | 100 |
| `WebCrawler__MaxBatchSize` | Maximum batch size for the web crawler | ✔️ | 100 |
| `WebCrawler__MaxCrawlDepth` | Maximum crawl depth for the web crawler | ✔️ | 5 |
| `WebCrawler__MaxPagesToCrawl` | Maximum pages to crawl for the web crawler | ✔️ | 300 |
| `WebCrawler__MaxRetryAttempts` | Maximum retry attempts for the web crawler | ✔️ | 3 |
| `WebCrawler__Url` | URL of the website to crawl | ✔️ | |
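
In .NET's configuration system, a double underscore in an environment variable name stands for the `:` section separator. Assuming the application uses the standard environment variable configuration provider, `AzureAiSearch__EndpointUrl` would bind to the `AzureAiSearch:EndpointUrl` setting. The sketch below checks the four variables that the table marks as required without a default. It is not part of the repository, and the helper name `missing_vars` is made up for illustration.

```shell
# Variables the table marks as required and that have no default value.
required_vars="APPLICATIONINSIGHTS_CONNECTION_STRING AzureAiSearch__EndpointUrl AzureOpenAi__EndpointUrl WebCrawler__Url"

# missing_vars is a hypothetical helper (not part of the repository).
# It prints each given variable name whose value is unset or empty.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup; safe here because the names come from a fixed list.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Report anything that still needs a value before running the crawler.
# $required_vars is left unquoted so it splits into separate names.
missing="$(missing_vars $required_vars)"
if [ -n "$missing" ]; then
  # $missing is left unquoted so the names print on a single line.
  echo "Set these variables first:" $missing
else
  echo "All required variables without defaults are set."
fi
```

The check only looks at whether each variable has a value in the current shell. It does not validate the endpoint URLs or the connection string.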

To run the application locally, use the following command:

    dotnet run --project app/AzureAiSearchWebsiteCrawler
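
For a local run, the environment variables have to be visible to the `dotnet` process. One possible way to do that, sketched below, is to export the contents of the `.env` file into the current shell first. This assumes the file contains only shell-compatible `KEY=value` lines, which may not hold for every `.env` file. The helper name `load_env_file` is made up for illustration and is not part of the repository.

```shell
# load_env_file is a hypothetical helper (not part of the repository).
# It exports every KEY=value assignment from the given file into the
# current shell. Pass a path that contains a slash, such as ./.env.
load_env_file() {
  [ -f "$1" ] || { echo "env file not found: $1" >&2; return 1; }
  set -a        # auto-export every variable assigned while sourcing
  . "$1"
  set +a        # restore the default behavior
}

# Usage (assumes a .env file in the current directory):
#   load_env_file ./.env && dotnet run --project app/AzureAiSearchWebsiteCrawler
```

Because the file is sourced as shell code, only use this with `.env` files you trust.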