This project demonstrates semantic document search (document retrieval) using Elasticsearch and sentence-transformers. In contrast to traditional lexical search (e.g. BM25), semantic search can tolerate spelling mistakes, because vector representations capture a notion of similarity between semantically related words (thanks to word embeddings!). Elasticsearch 7.3 provides a cosineSimilarity
function for vector fields, which enables convenient document retrieval based on vector similarity.
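As a rough illustration of the idea above, the sketch below builds an index mapping with a dense_vector field and a script_score query that ranks documents by cosineSimilarity. It is not taken from this project's code: the field names `text` and `text_vector`, the helper names, and the default result size are assumptions made for the example. The script uses the `doc['field']` form from the Elasticsearch 7.3 documentation; later 7.x releases changed how the field is passed, so check the documentation for your version. A small pure-Python cosine similarity is included only to show what the score measures.

```python
import math


def build_index_mapping(dims, vector_field="text_vector"):
    """Index body with a text field and a dense_vector field.

    `dims` must match the embedding size of the sentence-transformers
    model in use. The field names are assumptions for this sketch.
    """
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                vector_field: {"type": "dense_vector", "dims": dims},
            }
        }
    }


def build_semantic_query(query_vector, vector_field="text_vector", size=5):
    """Search body that scores every document by cosine similarity.

    1.0 is added because Elasticsearch does not accept negative scores,
    while cosine similarity ranges from -1 to 1.
    """
    source = (
        f"cosineSimilarity(params.query_vector, doc['{vector_field}']) + 1.0"
    )
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": source,
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

The two dictionaries can be passed as the request body when creating the index and when searching, with the query vector produced by encoding the user's question with the same model that was used to encode the documents.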
The demo application included in this project lets the user search health-related questions over data scraped from the NHS website.
Download the NHS website data from here and copy the data
directory into the project.
docker compose up
conda create -n es python=3.8
conda activate es
pip install -r es/requirements.txt
python es/es_pipeline.py
Open the search interface at http://localhost:8501/ in your browser.
If you are not happy with the results, try pretrained models relevant to your domain, or consider adapting the "general domain" models to your target domain.
I consulted the codebase of bertsearch & pinecone-io/examples for this project.