This sample application combines two sample applications to implement cost-efficient large scale image search over multimodal AI powered vector representations; text-image-search and billion-scale-vector-search.
This sample app uses the LAION-5B dataset, the largest openly accessible image-text dataset in the world.
Large image-text models like ALIGN, BASIC, Turing Bletchly, FLORENCE & GLIDE have shown better and better performance compared to previous flagship models like CLIP and DALL-E. Most of them had been trained on billions of image-text pairs and unfortunately, no datasets of this size had been openly available until now. To address this problem we present LAION 5B, a large-scale dataset for research purposes consisting of 5,85B CLIP-filtered image-text pairs. 2,3B contain English language, 2,2B samples from 100+ other languages and 1B samples have texts that do not allow a certain language assignment (e.g. names ).
The LAION-5B dataset was used to train the popular text-to-image generative StableDiffusion model.
Note the following about the LAION 5B dataset:
Be aware that this large-scale dataset is un-curated. Keep in mind that the un-curated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer.
The released dataset does not contain image data itself, but CLIP encoded vector representations of the images, and metadata like url and caption.
The app can be used to implement several use cases over the LAION dataset, or adapted to your own large-scale vector dataset:
- Search with a free text prompt over the caption or url fields in the LAION dataset using Vespa's standard text-matching functionality.
- CLIP retrieval using vector search: given a text prompt, search the image vector representations (CLIP ViT-L/14), for example for 'french cat'.
- Given an image vector representation, search for similar images in the dataset. This can for example be used to take the output image of StableDiffusion to find similar images in the training dataset.
All of these can be combined using Vespa's query language, also together with filters.
The sample application demonstrates many Vespa primitives:
- Importing an ONNX-exported version of CLIP ViT-L/14 for accelerated inference in Vespa stateless containers. The exported CLIP model encodes a free-text prompt to a joint image-text embedding space with 768 dimensions.
- HNSW indexing of vector centroids drawn from the dataset, and combination with classic Inverted File as described in Billion-scale vector search using hybrid HNSW-IF.
- Decoupling of vector storage and vector similarity computations. The stateless layer performs vector similarity computation over the full precision vectors. By using Vespa's support for accelerated inference with onnxruntime, moving the majority of the vector compute to the stateless layer allows for faster auto-scaling with daily query volume changes. The full precision vectors are stored in Vespa's summary log store, using lossless compression (zstd).
- Dimension reduction with PCA - The centroid vectors are compressed from 768 dimensions to 128 dimensions. This allows indexing 6x more centroids on the same instance type due to the reduced memory footprint. With Vespa's support for distributed search, coupled with powerful high memory instances, this allows Vespa to scale cost efficiently to trillion-sized vector datasets.
- The trained PCA matrix and matrix multiplication which projects the 768-dim vectors to 128-dimensions is evaluated in Vespa using accelerated inference, both at indexing time and at query time. The PCA weights are represented also using ONNX.
- Phased ranking. The image embedding vectors are also projected to 128 dimensions and stored using memory-mapped paged attribute tensors. Full precision vectors are stored on disk in the Vespa summary store. The first-phase coarse search ranks vectors in the reduced vector space, per node, and results are merged from all nodes before the final ranking phase in the stateless layer. The final ranking phase is implemented in the stateless container layer using accelerated inference.
- Combining approximate nearest neighbor search with filters. Filtering can be on url, caption, image height, width, safety probability, NSFW label, and more.
- Hybrid ranking, both textual sparse matching features and the CLIP similarity, can be used when ranking images.
- Reduced tensor cell precision. The original LAION-5B dataset uses float16. The app uses Vespa's support for bfloat16 tensors, saving 50% of storage compared to full float representation.
- Caching, both reduced vectors (128) cached by the OS buffer cache, and full version 768 dims are cached using Vespa summary cache.
- Query-time vector de-duping and diversification of the search engine result page using document to document similarity instead of query to document similarity. Also accelerated by stateless model inference.
- Scale, from a single node deployment to multi-node deployment using managed Vespa Cloud, or self-hosted on-premise.
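The storage saving from reduced tensor cell precision mentioned in the list above can be illustrated with a small Python sketch. It assumes the simplest possible conversion, keeping only the upper 16 bits of a 32-bit float; Vespa's actual conversion may round differently, and the helper names here are hypothetical.

```python
import struct


def float32_to_bfloat16_bits(value: float) -> int:
    """Return a 16-bit bfloat16 pattern by truncating the float32 encoding.

    Assumption: plain truncation (no rounding), chosen for simplicity.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def storage_bytes(dimensions: int, bytes_per_cell: int) -> int:
    """Bytes needed to store one dense vector."""
    return dimensions * bytes_per_cell


full = storage_bytes(768, 4)      # 32-bit float cells
reduced = storage_bytes(768, 2)   # 16-bit bfloat16 cells
approx_pi = bfloat16_bits_to_float(float32_to_bfloat16_bits(3.14159))
print(full, reduced, approx_pi)
```

bfloat16 keeps the float32 exponent range and gives up mantissa bits, which is why the round trip of 3.14159 loses precision while each cell takes half the space.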
The app contains several container components:
- RankingSearcher implements the last-stage ranking over full-precision vectors, using an ONNX model for accelerated inference.
- DedupingSearcher implements run-time de-duping after ranking, based on a document to document similarity matrix computed with an ONNX model for accelerated inference.
- DimensionReducer applies PCA dimension reduction, projecting vectors from 768 dimensions to 128 dimensions.
- AssignCentroidsDocProc searches the HNSW graph content cluster during ingestion to find the nearest centroids of the incoming vector.
- SPANNSearcher
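As an illustration of what centroid assignment during ingestion amounts to, the following Python sketch finds the nearest centroids of an incoming vector by brute force. This is a stand-in only: the app searches an HNSW graph rather than scanning all centroids, and the dot product scoring, the value of k, and the function names are assumptions made for this example.

```python
import heapq


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_centroids(vector, centroids, k=2):
    """Return the ids of the k centroids with the highest dot product.

    Brute-force scan, used here in place of an HNSW graph search.
    """
    scored = ((dot(vector, c), cid) for cid, c in centroids.items())
    return [cid for _, cid in heapq.nlargest(k, scored)]


# Toy 2-dimensional centroids; real centroids have 128 dimensions after PCA.
centroids = {"c0": [1.0, 0.0], "c1": [0.0, 1.0], "c2": [0.7, 0.7]}
assigned = nearest_centroids([0.9, 0.1], centroids, k=2)
print(assigned)
```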
These reproduction steps demonstrate the app using a smaller subset of the LAION-5B vector dataset, suitable for playing around with the app on a laptop.
Requirements:
- Docker Desktop installed and running. 6GB available memory for Docker is recommended. Refer to Docker memory for details and troubleshooting
- Alternatively, deploy using Vespa Cloud
- Operating system: Linux, macOS or Windows 10 Pro (Docker requirement)
- Architecture: x86_64 or arm64
- Homebrew to install Vespa CLI, or download a vespa cli release from GitHub releases.
- Java 17 installed.
- Python3 and numpy to process the vector dataset
- Apache Maven - this sample app uses custom Java components and Maven is used to build the application.
Verify Docker Memory Limits:
$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"
Install Vespa CLI:
$ brew install vespa-cli
For local deployment using docker image:
$ vespa config set target local
Use the multi-node high availability template for inspiration for multi-node, on-premise deployments.
Pull and start the vespa docker container image:
$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  vespaengine/vespa
Verify that the configuration service (deploy api) is ready:
$ vespa status deploy --wait 300
Download this sample application:
$ vespa clone billion-scale-image-search myapp && cd myapp
These instructions use the first split file (0000) of a total of 2314 files in the LAION2B-en split. Download the vector data file:
$ curl --http1.1 -L -o img_emb_0000.npy \
  https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/img_emb_0000.npy
Download the metadata file:
$ curl -L -o metadata_0000.parquet \
  https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/laion2B-en-metadata/metadata_0000.parquet
Install python dependencies to process the files:
$ python3 -m pip install pandas numpy requests mmh3 pyarrow
Generate centroids. This process randomly selects vectors from the dataset to represent centroids. Performing an incremental clustering can improve vector search recall and allow indexing fewer centroids. For simplicity, this tutorial uses random sampling.
$ python3 src/main/python/create-centroid-feed.py img_emb_0000.npy > centroids.jsonl
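The idea of random centroid selection can be sketched in a few lines of Python. This is not the contents of create-centroid-feed.py; the sampling fraction, the seed, and the function name are hypothetical and only illustrate picking a random subset of vector positions to act as centroids.

```python
import random


def sample_centroid_ids(num_vectors, fraction, seed=42):
    """Pick a random subset of vector positions to serve as centroids.

    `fraction` and `seed` are illustrative values, not the script's settings.
    """
    rng = random.Random(seed)
    count = max(1, int(num_vectors * fraction))
    return sorted(rng.sample(range(num_vectors), count))


centroid_ids = sample_centroid_ids(num_vectors=1000, fraction=0.1)
print(len(centroid_ids), centroid_ids[:5])
```

Sampling without replacement guarantees that no vector is selected as a centroid twice.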
Generate the image feed. This merges the embedding data with the metadata and creates a Vespa jsonl feed file, with one json operation per line.
$ python3 src/main/python/create-joined-feed.py metadata_0000.parquet img_emb_0000.npy > feed.jsonl
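To show what "one json operation per line" looks like, here is a minimal Python sketch that joins a metadata row with its embedding and emits a Vespa put operation. It is not the create-joined-feed.py script: the field name `vector`, the toy rows, and the use of a running index as document id are assumptions for illustration.

```python
import json


def to_feed_operation(doc_id, metadata, vector):
    """Build one JSON feed line (a put operation) from metadata and a vector.

    The `vector` field name is hypothetical; check the app's schema for the
    real field names.
    """
    fields = dict(metadata)
    fields["vector"] = {"values": [float(v) for v in vector]}
    return json.dumps({"put": f"id:laion:image::{doc_id}", "fields": fields})


# Toy stand-ins for rows of the parquet metadata and the .npy embeddings.
rows = [{"url": "https://example.com/a.jpg", "caption": "two dogs"}]
vectors = [[0.1, 0.2]]
lines = [to_feed_operation(i, m, v) for i, (m, v) in enumerate(zip(rows, vectors))]
print("\n".join(lines))
```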
To process the entire dataset, we recommend starting several processes, each operating on separate split files as the processing implementation is single-threaded.
src/main/application/models has three small ONNX models:
- vespa_innerproduct_ranker.onnx for vector similarity (inner dot product) between the query and the vectors in the stateless container.
- vespa_pairwise_similarity.onnx for matrix multiplication between the top retrieved vectors.
- pca_transformer.onnx for dimension reduction, projecting the 768-dim vector space to a 128-dimensional space.
These ONNX model files are generated by specifying the compute operation using PyTorch and using torch's ability to export the model to ONNX format.
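The compute operations behind the three models are small. The sketch below expresses them in plain Python rather than PyTorch, so it is not the export code itself; the function names are hypothetical, and the projection is assumed to be a pure matrix multiplication, as described for the PCA step.

```python
def inner_product_scores(query, vectors):
    """Score each vector by its inner product with the query."""
    return [sum(q * v for q, v in zip(query, vec)) for vec in vectors]


def pairwise_similarity(vectors):
    """Similarity matrix: entry [i][j] is the inner product of vectors i and j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def project(vector, matrix):
    """Project a vector with a weight matrix that has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = inner_product_scores([1.0, 0.0], vectors)
similarity = pairwise_similarity(vectors)
# Toy 3-to-2 projection; the app projects 768 dimensions down to 128.
reduced = project([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]])
print(scores, similarity, reduced)
```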
Build the sample app (make sure you have JDK 17, verify with mvn -v). This step also downloads a pre-exported ONNX model for mapping the prompt text to the CLIP vector embedding space.
$ mvn clean package -U
Deploy the application. This step deploys the application package built in the previous step:
$ vespa deploy --wait 300
It is possible to deploy this app to Vespa Cloud. For Vespa Cloud deployments to the perf env, replace src/main/application/services.xml with src/main/application/services-cloud.xml. The cloud deployment uses dedicated clusters for feed and query.
Wait for the application endpoint to become available:
$ vespa status --wait 300
Run the Vespa System Tests, which run a set of basic tests to verify that the application is working as expected:
$ vespa test src/test/application/tests/system-test/feed-and-search-test.json
The centroid vectors must be indexed first:
$ vespa feed centroids.jsonl
$ vespa feed feed.jsonl
Track number of documents while feeding:
$ vespa query 'yql=select * from image where true' \
  hits=0 \
  ranking=unranked
Fetch a single document using document api:
$ vespa document get \
  id:laion:image::5775990047751962856
The response contains all fields, including the full vector representation and the reduced vector, plus all the metadata. Everything is represented in the same schema.
The following provides a few query examples. prompt is a run-time query parameter used by the CLIPEmbeddingSearcher, which encodes the prompt text into a CLIP vector representation using the embedded CLIP model:
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely"' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
Results are filtered by a constraint on the nsfw field. Note that even if the image is classified as unlikely, the image content might still be explicit, as the NSFW classifier is not 100% accurate.
The returned images are ranked by CLIP similarity (the score is found in each hit's relevance field).
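The same request can also be sent over HTTP instead of through the Vespa CLI. The Python sketch below only builds the request URL; it assumes the local container from the steps above is reachable on port 8080 and that the query API is served under /search/. The function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(yql, prompt, hits=10,
                     endpoint="http://localhost:8080/search/"):
    """Return a query URL carrying the YQL filter and the CLIP prompt.

    The endpoint is an assumption based on the local Docker setup.
    """
    params = {"yql": yql, "hits": hits, "prompt": prompt}
    return endpoint + "?" + urlencode(params)


url = build_search_url(
    yql='select documentid, caption, url, height, width from image '
        'where nsfw contains "unlikely"',
    prompt="two dogs running on a sandy beach",
)
# Decode the query string again to confirm the parameters survive encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
```

Using urlencode avoids hand-escaping the quotes and spaces in the YQL statement.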
The following query adds another filter, restricting the search so that only images crawled from URLs containing shutterstock.com are retrieved:
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely" and url contains "shutterstock.com"' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
Another query restricts the search further, adding a phrase constraint caption contains phrase("sandy", "beach"):
$ vespa query \
  'yql=select documentid, caption, url, height, width from image where nsfw contains "unlikely" and url contains "shutterstock.com" and caption contains phrase("sandy", "beach")' \
  'hits=10' \
  'prompt=two dogs running on a sandy beach'
Regular query, matching over the default fieldset, searching the caption and the url field, ranked by the text ranking profile:
$ vespa query \
  'yql=select documentid, caption, url from image where nsfw contains "unlikely" and userQuery()' \
  'hits=10' \
  'query=two dogs running on a sandy beach' \
  'ranking=text'
The text rank profile uses nativeRank, one of Vespa's many text matching rank features.
There are several non-native query request parameters that control the vector search accuracy and performance tradeoffs. These can be set with the request, e.g., /search/&spann.clusters=12.
- spann.clusters, default 64, the number of centroids in the reduced vector space used to restrict the image search. A higher number improves recall, but increases computational complexity and disk reads.
- rank-count, default 1000, the number of vectors that are fully re-ranked in the container using the full vector representation. A higher number improves recall, but increases the computational complexity and network load.
- collapse.enable, default true, controls de-duping of the top ranked results using image to image similarity.
- collapse.similarity.max-hits, default 1000, the number of top-ranked hits to perform de-duping of. Must be less than rank-count.
- collapse.similarity.threshold, default 0.95, how similar a given image to image must be before it is considered a duplicate.
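To show how the collapse parameters interact, here is a minimal Python sketch of threshold-based de-duping over ranked hits. It is an assumption-laden illustration, not the DedupingSearcher implementation: it uses a greedy pass, treats the dot product of normalized vectors as the image to image similarity, and lets hits beyond max_hits pass through unchanged.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dedupe(hits, threshold=0.95, max_hits=1000):
    """Drop hits that are too similar to a better-ranked hit.

    `hits` is a list of (id, vector) pairs in rank order. Only the first
    `max_hits` entries are compared; the greedy strategy is an assumption.
    """
    kept = []
    for hit_id, vector in hits[:max_hits]:
        if all(dot(vector, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((hit_id, vector))
    return [h for h, _ in kept] + [h for h, _ in hits[max_hits:]]


ranked = [("a", [1.0, 0.0]), ("b", [1.0, 0.0]), ("c", [0.0, 1.0])]
unique_ids = dedupe(ranked, threshold=0.95)
print(unique_ids)
```

Lowering the threshold removes more near-duplicates at the cost of hiding images that are merely similar.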
There are several areas that could be improved.
- CLIP model. The exported text transformer model uses a fixed sequence length (77). This wastes computation and makes the model a lot slower than it has to be for shorter sequences. A dynamic sequence length would make encoding short queries a lot faster than the current model. It would also be interesting to use the text encoder as a teacher and train a smaller distilled model using a different architecture (for example based on smaller MiniLM models).
- CLIP query embedding caching. The CLIP model is fixed and only uses the text input. Caching the map from text to embedding would save resources.
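The caching idea above could look roughly like the following Python sketch. The encoder here is a hypothetical stand-in that returns a fake vector, since the point is only that a deterministic text-to-embedding function can be memoized; the cache size is an arbitrary choice.

```python
from functools import lru_cache

# Counts how often the (expensive) encoder actually runs.
calls = {"count": 0}


def encode_prompt(prompt: str) -> tuple:
    """Hypothetical stand-in for the CLIP text encoder."""
    calls["count"] += 1
    return tuple(float(ord(c) % 7) for c in prompt[:4])


@lru_cache(maxsize=10_000)
def cached_embedding(prompt: str) -> tuple:
    """Return the embedding for a prompt, computing it at most once."""
    return encode_prompt(prompt)


first = cached_embedding("french cat")
second = cached_embedding("french cat")
print(calls["count"], cached_embedding.cache_info())
```

Returning an immutable tuple keeps cached embeddings from being modified by callers.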
Shut down and remove the Docker container:
$ docker rm -f vespa