SeaSearch uses HTTP Basic Auth for permission verification; API requests must carry the corresponding token in the header.
# headers
{
    'Authorization': 'Basic <basic auth token>'
}
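The basic auth token is the standard base64 encoding of "username:password". A minimal Python sketch for building the header (the credentials shown are placeholders):

```python
import base64

def basic_auth_header(username, password):
    # Encode "username:password" as base64, the standard HTTP Basic scheme.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

# Placeholder credentials; use your real SeaSearch account.
headers = basic_auth_header("admin", "xxx")
```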
SeaSearch manages API permissions through accounts. When the program is started for the first time, an administrator account needs to be configured through environment variables.
The following is an example of configuring the administrator account:
set ZINC_FIRST_ADMIN_USER=admin
set ZINC_FIRST_ADMIN_PASSWORD=xxx
Users can be created/updated via the API:
[POST] /api/user

{
    "_id": "prabhat",
    "name": "Prabhat Sharma",
    "role": "admin", // or user
    "password": "xxx"
}
Get all users:
[GET] /api/user

Delete a user:
[DELETE] /api/user/${userId}
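As a sketch, a create/update-user call can be assembled with Python's standard library; the server address and auth token below are placeholders for your deployment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:4080"  # placeholder SeaSearch address

payload = {
    "_id": "prabhat",
    "name": "Prabhat Sharma",
    "role": "admin",  # or "user"
    "password": "xxx",
}

# Build the request; urllib.request.urlopen(req) would send it to a live server.
req = urllib.request.Request(
    f"{BASE_URL}/api/user",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Basic <basic auth token>",
        "Content-Type": "application/json",
    },
    method="POST",
)
```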
Create a SeaSearch index; you can set both mappings and settings at the same time.
Settings or mappings can also be set directly through separate requests; if the index does not exist, it will be created automatically.
SeaSearch documentation: https://zincsearch-docs.zinc.dev/api/index/create/#update-a-exists-index
ES documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html
Mappings define the rules for fields in a document, such as type and format.
Mappings can be configured via a separate API:
SeaSearch API: https://zincsearch-docs.zinc.dev/api-es-compatible/index/update-mapping/
ES documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
Settings configure the index's analyzers, sharding, and other index-level options.
SeaSearch API: https://zincsearch-docs.zinc.dev/api-es-compatible/index/update-settings/
ES documentation:
Analyzer concepts: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-concepts.html
How to specify an analyzer: https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html
An analyzer can be configured as the default when creating an index, or set for a specific field. (Refer to the ES settings documentation in the previous section for the relevant concepts.)
The analyzers supported by SeaSearch can be found on this page: https://zincsearch-docs.zinc.dev/api/index/analyze/. Concepts such as tokenizers and token filters are consistent with ES, and most of the commonly used ES analyzers and tokenizers are supported.
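As an illustration, a default analyzer and a per-field analyzer could be set when creating an index. The index name "articles" and field "body" are hypothetical, and the body assumes the ES-compatible settings/mappings syntax:

```
[POST] /api/index

{
    "name": "articles",
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {"type": "standard"}
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "simple"}
        }
    }
}
```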
Supported general analyzers:
standard, the default analyzer. If none is specified, this analyzer is used; it splits text into words and lowercases them.
simple, splits on non-letters (symbols are filtered out) and lowercases.
keyword, no tokenization; the input is treated as a single token.
stop, lowercases and filters stop words (the, a, is, etc.).
web, implemented by Bluge; recognizes email addresses, URLs, etc.; lowercases and applies stop-word filtering.
regexp/pattern, splits by regular expression, default \W+ (split on non-word characters); supports lowercasing and stop words.
whitespace, splits on whitespace and does not lowercase.
Supported language analyzers:

| Language | Shortened form |
|---|---|
| arabic | ar |
| cjk (Chinese, Japanese, Korean) | cjk |
| sorani | ckb |
| danish | da |
| german | de |
| english | en |
| spanish | es |
| persian | fa |
| finnish | fi |
| french | fr |
| hindi | hi |
| hungarian | hu |
| italian | it |
| dutch | nl |
| norwegian | no |
| portuguese | pt |
| romanian | ro |
| russian | ru |
| swedish | sv |
| turkish | tr |
Chinese analyzers:
gse_standard, uses the shortest-path algorithm to segment words.
gse_search, the search-engine segmentation mode, which produces as many keywords as possible.
The Chinese analyzers use the gse library for word segmentation, a Go implementation of the Python jieba library. They are not enabled by default and must be enabled through environment variables:
ZINC_PLUGIN_GSE_ENABLE=true
# true: enable Chinese word segmentation support; default is false

ZINC_PLUGIN_GSE_DICT_EMBED=BIG
# BIG: use the gse built-in dictionary and stop words; otherwise use the SeaSearch built-in simple dictionary; default is small

ZINC_PLUGIN_GSE_ENABLE_STOP=true
# true: use stop words; default is true

ZINC_PLUGIN_GSE_ENABLE_HMM=true
# true: use HMM mode for search segmentation; default is true

ZINC_PLUGIN_GSE_DICT_PATH=./plugins/gse/dict
# To use a user-defined dictionary and stop words, put them in the configured path, naming the dictionary user.txt and the stop-word file stop.txt
Create document:
SeaSearch API: https://zincsearch-docs.zinc.dev/api-es-compatible/document/create/
ES API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
Update document:
SeaSearch API: https://zincsearch-docs.zinc.dev/api-es-compatible/document/update/
ES API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
Delete document:
SeaSearch API: https://zincsearch-docs.zinc.dev/api-es-compatible/document/delete/
ES API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html
Get document by id:
[GET] /api/${indexName}/_doc/${docId}
Batch operations should be used to update indexes whenever possible.
SeaSearch API: https://zincsearch-docs.zinc.dev/api-es-compatible/document/bulk/#request
ES API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
API examples:
https://zincsearch-docs.zinc.dev/api-es-compatible/search/search/
Full-text search uses the ES DSL. For usage, please refer to:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
delete-by-query: delete documents matching a query.
[POST] /es/${indexName}/_delete_by_query

{
    "query": {
        "match": {
            "name": "jack"
        }
    }
}

ES API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
multi-search: supports executing different queries on different indexes.
SeaSearch API: https://zincsearch-docs.zinc.dev/api-es-compatible/search/msearch/
ES API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html
We have extended multi-search so that the same statistics are used when searching different indexes, which makes score calculation more accurate. You can enable it by adding the query parameter unify_score=true to the request.
[POST] /es/_msearch?unify_score=true

{"index": "t1"}
{"query": {"bool": {"should": [{"match": {"filename": {"query": "test string", "minimum_should_match": "-25%"}}}, {"match": {"filename.ngram": {"query": "test string", "minimum_should_match": "80%"}}}], "minimum_should_match": 1}}, "from": 0, "size": 10, "_source": ["path", "repo_id", "filename", "is_dir"], "sort": ["_score"]}
{"index": "t2"}
{"query": {"bool": {"should": [{"match": {"filename": {"query": "test string", "minimum_should_match": "-25%"}}}, {"match": {"filename.ngram": {"query": "test string", "minimum_should_match": "80%"}}}], "minimum_should_match": 1}}, "from": 0, "size": 10, "_source": ["path", "repo_id", "filename", "is_dir"], "sort": ["_score"]}
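Because the _msearch body is NDJSON (one header line plus one query line per index), it can be convenient to build it programmatically. A minimal Python sketch, with hypothetical index names:

```python
import json

def msearch_body(queries):
    # queries: list of (index_name, query_dict) pairs.
    lines = []
    for index, query in queries:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(query))             # query line
    return "\n".join(lines) + "\n"  # NDJSON bodies end with a newline

body = msearch_body([
    ("t1", {"query": {"match": {"filename": "test string"}}, "size": 10}),
    ("t2", {"query": {"match": {"filename": "test string"}}, "size": 10}),
])
```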
We have developed vector search as a SeaSearch extension. The following is an introduction to the relevant APIs.
To use vector search, you need to create a vector index in advance, which can be done through the mapping.
In the following example we create an index whose documents contain a vector field named "vec", with index type flat and vector dimension 768:
[PUT] /es/${indexName}/_mapping

{
    "properties":{
        "vec":{
            "type":"vector",
            "dims":768,
            "m":64,
            "nbits":8,
            "vec_index_type":"flat"
        }
    }
}
Parameter description:
${indexName}, the zincIndex (index) name

type, fixed to vector, indicating a vector index
dims, the vector dimension
m, required for the ivf_pq index type; dims must be divisible by m
nbits, required for the ivf_pq index type; default is 8
vec_index_type, the index type; supports flat and ivf_pq
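Since the ivf_pq parameters are easy to get wrong, a small client-side sanity check before creating the mapping can help; the divisibility rule below reflects the constraint described above:

```python
def validate_ivf_pq(dims, m, nbits=8):
    # Product quantization splits the vector into m sub-vectors,
    # so dims must be divisible by m.
    if dims % m != 0:
        raise ValueError(f"dims ({dims}) must be divisible by m ({m})")
    return dims // m  # components per sub-vector

validate_ivf_pq(768, 64)  # 768 / 64 = 12 components per sub-vector
```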
There is no difference at the API level between writing a document containing a vector and writing a normal document; choose whichever method suits you.
The following takes the bulk API as an example:
[POST] /es/_bulk

body:

{ "index" : { "_index" : "index1" } }
{"name": "jack1","vec":[10.2,10.41,9.5,22.2]}
{ "index" : { "_index" : "index1" } }
{"name": "jack2","vec":[10.2,11.41,9.5,22.2]}
{ "index" : { "_index" : "index1" } }
{"name": "jack3","vec":[10.2,12.41,9.5,22.2]}
Note that the _bulk API has strict formatting requirements: each action or document must be on exactly one line. For details, please refer to the ES bulk documentation.
Modification and deletion can also be done via bulk. When a document is deleted, its corresponding vector data is deleted as well.
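To satisfy the one-line-per-document requirement, the bulk body can be generated with a JSON serializer, which never emits embedded newlines by default. A minimal Python sketch:

```python
import json

def bulk_index_body(index, docs):
    # One action line followed by one document line per document.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = bulk_index_body("index1", [
    {"name": "jack1", "vec": [10.2, 10.41, 9.5, 22.2]},
    {"name": "jack2", "vec": [10.2, 11.41, 9.5, 22.2]},
])
```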
By passing in a vector, we can search for the N most similar vectors in the system and return the corresponding document information:
[POST] /api/${indexName}/_search/vector

body:
{
    "query_field":"vec",
    "k":7,
    "return_fields":["name"],
    "vector":[10.2,10.40,9.5,22.2, ...],
    "_source":false
}
The API response format is the same as for full-text search.
The following is a description of the parameters:
${indexName}, the zincIndex (index) name

query_field, the field in the index to search; the field must be of vector type
k, the number of most similar vectors to return
return_fields, the names of the fields to return individually
vector, the query vector
nprobe, only effective for the ivf_pq index type; the number of clusters to probe, where higher values give more accurate results
_source, controls whether the _source field is returned; accepts a bool, or an array describing which fields to return
Rebuild the index immediately; useful when you do not want to wait for automatic background detection.
[POST] /api/:target/:field/_rebuild
For vector indexes of type ivf_pq, recall checks can be performed on their data.
[POST] /api/:target/_recall

{
    "field":"vec_001",     # the field to test
    "k":10,
    "nprobe":5,            # the nprobe value
    "query_count":1000     # the number of test queries to run
}
Next, we will demonstrate how to index a batch of papers. Each paper may contain multiple vectors that need to be indexed. Through vector search we want to retrieve the N most similar vectors and thus obtain their corresponding paper-ids.
The first step is to set the mapping for the vector index. When the mapping is set, the index and the vector index are created automatically.
Since paper-id is just a normal string that does not need to be analyzed, we set its type to keyword:
[PUT] /es/paper/_mapping

{
    "properties":{
        "title-vec":{
            "type":"vector",
            "dims":768,
            "vec_index_type":"flat",
            "m":1
        },
        "paper-id":{
            "type":"keyword"
        }
    }
}
Through the above request, we created an index named paper and established a flat vector index on its title-vec field.
We write the paper data to SeaSearch in batches through the _bulk API:
[POST] /es/_bulk

{ "index" : {"_index" : "paper" } }
{"paper-id": "001","title-vec":[10.2,10.40,9.5,22.2....]}
{ "index" : {"_index" : "paper" } }
{"paper-id": "002","title-vec":[10.2,11.40,9.5,22.2....]}
{ "index" : {"_index" : "paper" } }
{"paper-id": "003","title-vec":[10.2,12.40,9.5,22.2....]}
....
Now we can search with a query vector:
[POST] /api/paper/_search/vector

{
    "query_field":"title-vec",
    "k":10,
    "return_fields":["paper-id"],
    "vector":[10.2,10.40,9.5,22.2....]
}
The document corresponding to the most similar vector can be retrieved, and the paper-id can be obtained. Since a paper may contain multiple vectors, if multiple vectors of a paper are very similar to the query vector, then this paper-id may appear multiple times in the results.
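Deduplicating those results on the client is straightforward. A sketch, assuming each hit exposes the requested return_fields under a "fields" key (the exact response layout may differ):

```python
def unique_paper_ids(hits):
    # Collapse duplicate paper-ids while preserving similarity order.
    seen, ordered = set(), []
    for hit in hits:
        pid = hit["fields"]["paper-id"]
        if pid not in seen:
            seen.add(pid)
            ordered.append(pid)
    return ordered

ids = unique_paper_ids([
    {"fields": {"paper-id": "003"}},
    {"fields": {"paper-id": "001"}},
    {"fields": {"paper-id": "003"}},
])
# ids == ["003", "001"]
```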
After a document is successfully imported, SeaSearch returns its doc id. We can update a document directly by its doc id:
[POST] /es/_bulk

{ "update" : {"_id":"23gZX9eT6QM","_index" : "paper" } }
{"paper-id": "005","vec":[10.2,1.43,9.5,22.2...]}
If the returned doc id was not saved, you can first use SeaSearch's full-text search to query the documents corresponding to a paper-id:
[POST] /es/paper/_search

{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {"paper-id":"003"}
                }
            ]
        }
    }
}
Through the DSL we can directly retrieve the documents corresponding to a paper-id, along with their doc ids.
A paper contains multiple vectors. If a single vector needs to be updated, we can directly update the document corresponding to that vector. In practice, however, it is not easy to distinguish which contents of a paper are newly added and which are updated.
We can instead adopt a full-update approach:
1. Query all documents of the paper through the DSL.
2. Delete all of these documents.
3. Import the latest paper data.
Steps 2 and 3 can be performed in one batch operation.
The following (truncated) example demonstrates deleting the documents of paper 001 and re-importing them, while directly updating papers 005 and 006, which each contain only one vector:
[POST] /es/_bulk

{ "index" : {"_index" : "paper" } }
{"paper-id": "001","title-vec":[10.2,10.40,9.5,22.2....]}
{ "index" : {"_index" : "paper" } }
{"paper-id": "002","title-vec":[10.2,11.40,9.5,22.2....]}
{ "index" : {"_index" : "paper" } }
{"paper-id": "003","title-vec":[10.2,12.40,9.5,22.2....]}
....
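The delete-and-reimport flow (steps 2 and 3 above) can be expressed as a single _bulk body. A Python sketch; the doc ids passed in would come from the DSL query in the previous step:

```python
import json

def replace_paper_body(index, old_doc_ids, new_docs):
    # Step 2: delete the paper's old documents by doc id.
    lines = [json.dumps({"delete": {"_index": index, "_id": doc_id}})
             for doc_id in old_doc_ids]
    # Step 3: index the latest paper data.
    for doc in new_docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = replace_paper_body(
    "paper",
    ["23gZX9eT6QM"],  # doc ids found via the DSL query
    [{"paper-id": "001", "title-vec": [10.2, 10.40, 9.5, 22.2]}],
)
```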