This sample application demonstrates how to improve product search with Learning to Rank (LTR).
Blog post series:
- Improving Product Search with Learning to Rank - part one: introduces the dataset used in this sample application and several baseline ranking models.
- Improving Product Search with Learning to Rank - part two: demonstrates how to train neural methods for search ranking. The neural training routine is found in this notebook.
- Improving Product Search with Learning to Rank - part three: demonstrates how to train GBDT methods for search ranking. The GBDT model also uses neural signals as features. See the XGBoost and LightGBM notebooks.
This work uses the largest product relevance dataset released by Amazon:
We introduce the “Shopping Queries Data Set”, a large dataset of difficult search queries, released with the aim of fostering research in the area of semantic matching of queries and products. For each query, the dataset provides a list of up to 40 potentially relevant results, together with ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant) indicating the relevance of the product to the query. Each query-product pair is accompanied by additional information. The dataset is multilingual, as it contains queries in English, Japanese, and Spanish.
The dataset is available at amazon-science/esci-data and is released under the Apache 2.0 license.
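For a quick look at the data, the examples file can be loaded with pandas. This is only a sketch: the column names used below (query, product_id, esci_label, product_locale, split) are assumptions based on the published dataset, so verify them against the repository.
# Sketch: inspect the Shopping Queries Data Set examples with pandas.
# Column names are assumptions based on the published dataset;
# verify against the amazon-science/esci-data repository.
import pandas as pd

url = ("https://github.com/amazon-science/esci-data/blob/main/"
       "shopping_queries_dataset/shopping_queries_dataset_examples.parquet?raw=true")
examples = pd.read_parquet(url)

# English (us locale) test-split judgments only
test = examples[(examples["product_locale"] == "us") & (examples["split"] == "test")]
print(test[["query", "product_id", "esci_label"]].head())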
The following is a quick-start recipe for this application.
- Docker Desktop installed and running. 6 GB of available memory for Docker is recommended. Refer to Docker memory for details and troubleshooting.
- Alternatively, deploy using Vespa Cloud
- Operating system: Linux, macOS or Windows 10 Pro (Docker requirement)
- Architecture: x86_64 or arm64
- Homebrew to install the Vespa CLI, or download a Vespa CLI release from GitHub releases.
- zstd:
brew install zstd
- Python 3 with requests, pyarrow, and pandas installed
Validate the Docker resource settings, which should be a minimum of 6 GB:
$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"
Install Vespa CLI:
$ brew install vespa-cli
For local deployment using the Docker image:
$ vespa config set target local
Pull and start the Vespa Docker container image:
$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  vespaengine/vespa
Verify that the configuration service (deploy API) is ready:
$ vespa status deploy --wait 300
Download this sample application:
$ vespa clone commerce-product-ranking my-app && cd my-app
Download the cross-encoder model:
$ curl -L -o application/models/title_ranker.onnx \
  https://data.vespa-cloud.com/sample-apps-data/title_ranker.onnx
See scripts/export-bi-encoder.py and scripts/export-cross-encoder.py for how to export models from PyTorch to ONNX format.
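As a rough sketch of what such an export looks like, the snippet below exports a Hugging Face cross-encoder to ONNX with torch.onnx.export. The model name, input/output names, and opset are illustrative assumptions; the exact export used by this application is in scripts/export-cross-encoder.py.
# Sketch: export a Hugging Face cross-encoder to ONNX.
# Model name, input/output names, and opset are illustrative assumptions;
# see scripts/export-cross-encoder.py for the actual export.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example model only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
model.eval()

# Trace the graph with a dummy query/title pair
encoded = tokenizer("example query", "example product title", return_tensors="pt")

torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "title_ranker.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)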
Deploy the application:
$ vespa deploy --wait 600 application
If the above fails, check the logs:
$ docker logs vespa
It is possible to deploy this app to Vespa Cloud.
The following step is optional; it indexes two documents and runs a query test:
$ (cd application; vespa test tests/system-test/feed-and-search-test.json)
Feed the pre-processed sample data (16 products):
$ zstdcat sample-data/sample-products.jsonl.zstd | vespa feed -
Evaluate the semantic-title rank profile using the evaluation script (scripts/evaluate.py).
Install the requirements:
pip3 install numpy pandas pyarrow requests
$ python3 scripts/evaluate.py \
  --endpoint http://localhost:8080/search/ \
  --example_file sample-data/test-sample.parquet \
  --ranking semantic-title
evaluate.py runs all the queries in the test split using the given --ranking <rank-profile> and produces a <ranking>.run file with the top-ranked results. This file is in the format that trec_eval expects.
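Conceptually, such a script issues each test query against the Vespa query API with the chosen rank profile and writes one TREC run line per hit. The sketch below illustrates that loop; the YQL, request parameters, and hit field names are assumptions, and scripts/evaluate.py is the authoritative implementation.
# Sketch of an evaluate.py-style loop. The YQL, request parameters, and the
# "id" hit field are assumptions for illustration - see scripts/evaluate.py.
import requests

def write_run_file(queries, endpoint, ranking, hits=16):
    """queries: iterable of (query_id, query_text) pairs."""
    with open(f"{ranking}.run", "w") as out:
        for query_id, query_text in queries:
            response = requests.post(endpoint, json={
                "yql": "select * from sources * where userQuery()",  # assumed query
                "query": query_text,
                "ranking": ranking,
                "hits": hits,
            })
            response.raise_for_status()
            for rank, hit in enumerate(response.json()["root"].get("children", []), start=1):
                product_id = hit["fields"]["id"]  # assumed field name for the product id
                out.write(f"{query_id} Q0 {product_id} {rank} {hit['relevance']} {ranking}\n")

write_run_file([("535", "example query")], "http://localhost:8080/search/", "semantic-title")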
$ cat semantic-title.run
Example ranking produced by Vespa using the semantic-title rank profile for query 535 (columns: query id, the literal Q0, product id, rank, relevance score, run tag):
535 Q0 B08PB9TTKT 1 0.46388297538130346 semantic-title
535 Q0 B00B4PJC9K 2 0.4314163871097326 semantic-title
535 Q0 B0051GN8JI 3 0.4199624989861286 semantic-title
535 Q0 B084TV3C1B 4 0.4177780086570998 semantic-title
535 Q0 B08NVQ8MZX 5 0.4175260475587483 semantic-title
535 Q0 B00DHUA9VA 6 0.41558328517364673 semantic-title
535 Q0 B08SHMLP5S 7 0.41512211873088406 semantic-title
535 Q0 B08VSJGP1N 8 0.41479904241634674 semantic-title
535 Q0 B08QGZMCYQ 9 0.41107229418202607 semantic-title
535 Q0 B0007KPRIS 10 0.4073851390694049 semantic-title
535 Q0 B08VJ66CNL 11 0.4040355668337184 semantic-title
535 Q0 B000J1HDWI 12 0.40354871020728317 semantic-title
535 Q0 B0007KPS3C 13 0.39775755175088207 semantic-title
535 Q0 B0072LFB68 14 0.39334250744409155 semantic-title
535 Q0 B01M0SFMIH 15 0.3920197770681833 semantic-title
535 Q0 B0742BZXC2 16 0.3778094352830984 semantic-title
This run file can then be evaluated using the trec_eval utility.
Download the pre-processed query-product relevance judgments in TREC qrels format:
$ curl -L -o test.qrels \
  https://data.vespa-cloud.com/sample-apps-data/test.qrels
Install trec_eval (your mileage may vary):
git clone --depth 1 --branch v9.0.8 https://github.com/usnistgov/trec_eval && cd trec_eval && make install && cd ..
Run the evaluation:
$ trec_eval test.qrels semantic-title.run -m 'ndcg.1=0,2=0.01,3=0.1,4=1'
This particular product ranking for the query produces an NDCG score of 0.7046.
Note that the sample-data/test-sample.parquet file only contains one query. To get the overall score, one must compute the NDCG scores of all queries in the test split and report the average NDCG score.
Note that the evaluation uses custom NDCG label gains (a small gain-aware NDCG sketch follows the list):
- Label 1 is Irrelevant with 0 gain
- Label 2 is Substitute with 0.01 gain
- Label 3 is Complement with 0.1 gain
- Label 4 is Exact with 1 gain
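To make the gain mapping concrete, here is a minimal NDCG sketch using these gains with a standard log2 discount; trec_eval's exact conventions (for example around ties) may differ slightly.
# Sketch: NDCG with the custom ESCI label gains listed above.
import math

GAIN = {1: 0.0, 2: 0.01, 3: 0.1, 4: 1.0}  # Irrelevant, Substitute, Complement, Exact

def dcg(labels):
    return sum(GAIN.get(label, 0.0) / math.log2(rank + 1)
               for rank, label in enumerate(labels, start=1))

def ndcg(ranked_labels, all_judged_labels):
    # ranked_labels: judged labels in the order the system returned the products
    # all_judged_labels: every judged label for the query, used for the ideal ranking
    ideal = dcg(sorted(all_judged_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0

# The overall score is the mean NDCG over all queries in the test split.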
We can also try another ranking model:
$ python3 scripts/evaluate.py \
  --endpoint http://localhost:8080/search/ \
  --example_file sample-data/test-sample.parquet \
  --ranking cross-title
$ trec_eval test.qrels cross-title.run -m 'ndcg.1=0,2=0.01,3=0.1,4=1'
For this query, the cross-title model produces an NDCG score of 0.8208, better than the semantic-title model.
Shutdown and remove the container:
$ docker rm -f vespa
For a full evaluation, download a pre-processed feed file with all (1,215,854) products:
$ curl -L -o product-search-products.jsonl.zstd \
  https://data.vespa-cloud.com/sample-apps-data/product-search-products.jsonl.zstd
This step is resource-intensive, as the semantic embedding model encodes the product title and description into the dense embedding vector space (a conceptual sketch follows the feed command below).
$ zstdcat product-search-products.jsonl.zstd | vespa feed -
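For intuition only, the snippet below shows what the dense encoding step amounts to in plain Python with sentence-transformers; the model name is just an example, and in this application the exported ONNX bi-encoder runs inside Vespa while feeding.
# For intuition only: encoding text into a dense vector space with a bi-encoder.
# The model name is an example; this application runs its exported ONNX
# bi-encoder inside Vespa during feeding.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model
embeddings = model.encode(["Product title. Product description."], normalize_embeddings=True)
print(embeddings.shape)  # (1, 384) for this example model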
Evaluate the hybrid baseline rank profile using the evaluation script (scripts/evaluate.py).
$ python3 scripts/evaluate.py \
  --endpoint http://localhost:8080/search/ \
  --example_file "https://github.com/amazon-science/esci-data/blob/main/shopping_queries_dataset/shopping_queries_dataset_examples.parquet?raw=true" \
  --ranking semantic-title
For Vespa Cloud deployments, we need to pass the data plane certificate and private key:
$ python3 scripts/evaluate.py \
  --endpoint https://productsearch.samples.aws-us-east-1c.perf.z.vespa-app.cloud/search/ \
  --example_file "https://github.com/amazon-science/esci-data/blob/main/shopping_queries_dataset/shopping_queries_dataset_examples.parquet?raw=true" \
  --ranking semantic-title \
  --cert <path-to-data-plane-cert.pem> \
  --key <path-to-data-plane-private-key.pem>
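For reference, in plain Python the data plane certificate and key are passed to requests as a client certificate tuple; the endpoint and file paths below are placeholders.
# Sketch: querying a Vespa Cloud endpoint with mutual TLS via requests.
# The endpoint URL and certificate paths are placeholders.
import requests

response = requests.post(
    "https://productsearch.samples.aws-us-east-1c.perf.z.vespa-app.cloud/search/",
    json={"query": "example query", "ranking": "semantic-title", "hits": 10},
    cert=("data-plane-public-cert.pem", "data-plane-private-key.pem"),
)
print(response.json())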
Run the evaluation using trec_eval:
$ trec_eval test.qrels semantic-title.run -m 'ndcg.1=0,2=0.01,3=0.1,4=1'