Azure AI Search Website Crawler

Description

The Azure AI Search Website Crawler is an application designed to crawl a website, generate embeddings for the crawled content using Azure OpenAI, and store these embeddings in an Azure AI Search Index. This setup is particularly useful for Retrieval-Augmented Generation (RAG) systems.

Used Technologies

The application is built using the following technologies:

.NET 9: The core application is developed using .NET 9, which provides a robust and scalable framework for building modern applications.
Containerization: The application is packaged as container images, ensuring consistency across different environments and simplifying the deployment process.
Azure Container App Job: The containerized application is deployed to Azure Container App Job, which provides a managed environment for running containerized applications in the cloud.
Abot NuGet Library: To crawl the website, the application uses the Abot NuGet library. Abot is a powerful and flexible web crawler library for .NET that allows for efficient and customizable web crawling.
Azure OpenAI: During the indexing process, the application uses the Azure OpenAI embedding model to generate vector representations of the crawled content. These embeddings are then stored in the Azure AI Search Index, enabling advanced search capabilities.
Azure AI Search: Azure AI Search is used as a vector store to store the crawled data along with the embeddings generated by Azure OpenAI. This allows for efficient retrieval and search of the indexed content.

Deployment

Prerequisites

Azure Subscription
Azure CLI
Docker
.NET SDK
Azure Developer CLI (azd)

Steps

Clone the repository:

git clone https://github.com/amgdy/azure-ai-search-website-crawler.git
cd azure-ai-search-website-crawler

Login to Azure:

Authenticate with both Azure CLI and Azure Developer CLI.
```
az login
azd auth login
```
Deploy the Application:

Use the Azure Developer CLI to deploy the application.
```
azd up
```

Environment Variables

The following environment variables are required for the application to run. These can be set in the .env file or directly in the deployment scripts.

Variable Name	Description	Required	Default Value
`APPLICATIONINSIGHTS_CONNECTION_STRING`	Connection string for Azure Application Insights	✔️
`AzureAiSearch__ApiKey`	API key for Azure AI Search service	❌
`AzureAiSearch__EndpointUrl`	Endpoint URL for Azure AI Search service	✔️
`AzureAiSearch__IndexName`	Index name for the Azure AI Search service	❌
`AzureOpenAi__ApiKey`	API key for Azure OpenAI service	❌
`AzureOpenAi__EmbeddingModelDeployment`	Deployment name for the Azure OpenAI embedding model	✔️	`text-embedding-ada-002`
`AzureOpenAi__EmbeddingModelDimensions`	Dimensions for the Azure OpenAI embedding model	✔️	`1536`
`AzureOpenAi__EmbeddingModelMaxTokens`	Maximum tokens for the Azure OpenAI embedding model	✔️	`8100`
`AzureOpenAi__EndpointUrl`	Endpoint URL for Azure OpenAI service	✔️
`TextSplitter__DefaultOverlapPercent`	Default overlap percentage for the text splitter	✔️	`10`
`TextSplitter__DefaultSectionLength`	Default section length for the text splitter	✔️	`1000`
`TextSplitter__MaxTokensPerSection`	Maximum tokens per section for the text splitter	✔️	`500`
`TextSplitter__SentenceSearchLimit`	Sentence search limit for the text splitter	✔️	`100`
`WebCrawler__MaxBatchSize`	Maximum batch size for the web crawler	✔️	`100`
`WebCrawler__MaxCrawlDepth`	Maximum crawl depth for the web crawler	✔️	`5`
`WebCrawler__MaxPagesToCrawl`	Maximum pages to crawl for the web crawler	✔️	`300`
`WebCrawler__MaxRetryAttempts`	Maximum retry attempts for the web crawler	✔️	`3`
`WebCrawler__Url`	URL of the website to crawl	✔️

Running the Application

To run the application locally, use the following command:

dotnet run --project app/AzureAiSearchWebsiteCrawler

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.devcontainer		.devcontainer
.github		.github
.vscode		.vscode
app/AzureAiSearchWebsiteCrawler		app/AzureAiSearchWebsiteCrawler
infra		infra
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
azure.yaml		azure.yaml
next-steps.md		next-steps.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Azure AI Search Website Crawler

Description

Used Technologies

Deployment

Prerequisites

Steps

Environment Variables

Running the Application

About

Packages

Languages

License

amgdy/azure-ai-search-website-crawler

Folders and files

Latest commit

History

Repository files navigation

Azure AI Search Website Crawler

Description

Used Technologies

Deployment

Prerequisites

Steps

Environment Variables

Running the Application

About

Resources

License

Stars

Watchers

Forks

Packages 0

Languages

Packages