From 5a1071a4b206828bf0ccab238e080c6eb4b89807 Mon Sep 17 00:00:00 2001
From: Kaniu
Date: Fri, 13 Dec 2024 16:43:01 +0800
Subject: [PATCH] update API and configuration document (#4)

* Update seasearch_api.md
* Update seasearch_api.md
* Update README.md
* Update seasearch_api.md
* Update README.md
* Update README.md
* Create overview.md
* Create authentication.md
* Create index_management.md
* Create docmuent_opreation.md
* Create search_document.md
* Update mkdocs.yml
* Delete manual/api/seasearch_api.md
* Create document_operation.md
* Delete manual/api/docmuent_opreation.md
* Update README.md
---
 manual/api/authentication.md     |  47 +++
 manual/api/document_operation.md |  32 ++
 manual/api/index_management.md   |  60 ++++
 manual/api/overview.md           |  11 +
 manual/api/search_document.md    |  28 ++
 manual/api/seasearch_api.md      | 496 -------------------------------
 manual/config/README.md          | 128 ++++++--
 mkdocs.yml                       |   6 +-
 8 files changed, 279 insertions(+), 529 deletions(-)
 create mode 100644 manual/api/authentication.md
 create mode 100644 manual/api/document_operation.md
 create mode 100644 manual/api/index_management.md
 create mode 100644 manual/api/overview.md
 create mode 100644 manual/api/search_document.md
 delete mode 100644 manual/api/seasearch_api.md

diff --git a/manual/api/authentication.md b/manual/api/authentication.md
new file mode 100644
index 0000000..cd6c93d
--- /dev/null
+++ b/manual/api/authentication.md
@@ -0,0 +1,47 @@
+# API Authentication
+SeaSearch uses HTTP Basic Auth for authentication. API requests must include the corresponding basic auth token in the header.
+
+To generate a basic auth token, combine the username and password with a colon (e.g., aladdin:opensesame), and then base64 encode the resulting string (e.g., YWxhZGRpbjpvcGVuc2VzYW1l).
+
+You can generate a token using the following command, for example with aladdin:opensesame:
+
+```
+echo -n 'aladdin:opensesame' | base64
+YWxhZGRpbjpvcGVuc2VzYW1l
+```
+Note: Basic auth by itself is not secure. If you need to access SeaSearch over the public internet, it is strongly recommended to use HTTPS (e.g., via a reverse proxy such as Nginx). The token is sent in the `Authorization` request header:
+```
+"Authorization": "Basic YWxhZGRpbjpvcGVuc2VzYW1l"
+```
+
+## Administrator User
+SeaSearch uses accounts to manage API permissions. When the program starts for the first time, an administrator account must be configured through environment variables.
+
+Here is an example of setting the administrator account via shell:
+```
+export ZINC_FIRST_ADMIN_USER=admin
+export ZINC_FIRST_ADMIN_PASSWORD=Complexpass#123
+```
+!!! tip
+    In most scenarios, you can use the administrator account to provide access for applications. Create regular users only when you need to integrate multiple applications with different permissions.
+
+
+## Regular Users
+You can create/update users via the API:
+```
+[POST] /api/user
+{
+  "_id": "prabhat",
+  "name": "Prabhat Sharma",
+  "role": "admin", // or user
+  "password": "Complexpass#123"
+}
+```
+To get all users:
+```
+[GET] /api/user
+```
+To delete a user:
+```
+[DELETE] /api/user/${userId}
+```
diff --git a/manual/api/document_operation.md b/manual/api/document_operation.md
new file mode 100644
index 0000000..e0f89bb
--- /dev/null
+++ b/manual/api/document_operation.md
@@ -0,0 +1,32 @@
+## Document Operations
+An index stores multiple documents. Users can perform CRUD operations (Create, Read, Update, Delete) on documents via the API. In SeaSearch, each document has a unique ID.
+
+!!! tip
+    Due to architectural design, SeaSearch’s performance for single-document CRUD operations is much lower than that of ElasticSearch. Therefore, we recommend using batch operations whenever possible.
+
+ElasticSearch Document APIs contain many additional parameters that are not meaningful to SeaSearch and are not supported. All query parameters are unsupported.
+
+### Create Document
+ElasticSearch API: [Index Document](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html)
+
+### Update Document
+ElasticSearch’s update API supports partial updates to fields. SeaSearch only supports full document updates and does not support updating data via script or detecting if an update is a no-op.
+
+If the document does not exist during an update, SeaSearch will create the corresponding document.
+
+ElasticSearch API: [Update Document](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html)
+
+### Delete Document
+Delete a document by its ID.
+
+ElasticSearch API: [Delete Document](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html)
+
+### Get Document by ID
+```
+[GET] /api/${indexName}/_doc/${docId}
+```
+
+### Batch Operations
+It is recommended to use batch operations to update indexes.
+
+ElasticSearch API: [Bulk Document API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html)
diff --git a/manual/api/index_management.md b/manual/api/index_management.md
new file mode 100644
index 0000000..fee13d4
--- /dev/null
+++ b/manual/api/index_management.md
@@ -0,0 +1,60 @@
+## Index Management
+In SeaSearch, users can create any number of indexes. An index is a collection of documents that can be searched, and a document can contain multiple searchable fields. Users specify the fields contained in the index via mappings and can customize the analyzers available to the index through settings. Each field can specify either a built-in or custom analyzer. The analyzer is used to split the content of a field into searchable tokens.
+
+### Create Index
+To create a SeaSearch index, you can configure the mappings and settings at the same time. For more details about mappings and settings, refer to the following sections.
+
+ElasticSearch API: [Create Index](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html)
+
+### Configure Mappings
+Mappings define the types and attributes of fields in a document. Users can configure the mapping via the API.
+
+SeaSearch supports the following field types:
+
+- text
+- keyword
+- numeric
+- bool
+- date
+- vector
+
+Other types, such as flattened, object, nested, etc., are not supported. Existing fields in a mapping cannot be modified, but new fields can be added.
+
+ElasticSearch Mappings API: [Put Mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html)
+
+ElasticSearch Mappings Explanation: [Mapping Types](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html)
+
+### Configure Settings
+Index settings control the properties of the index. The most commonly used property is `analysis`, which allows you to customize the analyzers for the index. The analyzers defined here can be used by fields in the mappings.
+
+ElasticSearch Settings API: [Update Settings](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html)
+
+ElasticSearch related explanation:
+- [Analyzer Concepts](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-concepts.html)
+- [Specifying Analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html)
+
+### Analyzer Support
+Analyzers can be configured as default when creating an index, or they can be set for specific fields. (See the previous section for related concepts from the ES documentation.)
+
+The analyzers supported by SeaSearch are listed here: [ZincSearch Documentation](https://zincsearch-docs.zinc.dev/api/index/analyze/). Concepts such as tokenizers and token filters are consistent with ES, and most of the commonly used ES analyzers and tokenizers are supported.
+
+### Chinese Analyzer
+To enable the Chinese analyzer in the system, set the environment variable `ZINC_PLUGIN_GSE_ENABLE=true`.
+
+If you need more comprehensive support for Chinese word dictionaries, set `ZINC_PLUGIN_GSE_DICT_EMBED=BIG`.
+
+Once enabled, `gse_standard` can be used like any other analyzer, so you can assign it directly to fields in the mappings:
+```
+PUT /es/my-index/_mappings
+{
+  "properties": {
+    "content": {
+      "type": "text",
+      "analyzer": "gse_standard"
+    }
+  }
+}
+```
+To use a custom dictionary, set the environment variable `ZINC_PLUGIN_GSE_DICT_PATH=${DICT_PATH}`, where `DICT_PATH` is the actual path to the dictionary files. The `user.txt` file contains the dictionary, and the `stop.txt` file contains stop words. Each line contains a single word.
+
+GSE will load the dictionary and stop words from this path and use the user-defined dictionary to segment Chinese sentences.
diff --git a/manual/api/overview.md b/manual/api/overview.md
new file mode 100644
index 0000000..5ffb4ba
--- /dev/null
+++ b/manual/api/overview.md
@@ -0,0 +1,11 @@
+# Overview
+SeaSearch is developed based on ZincSearch and is compatible with ElasticSearch (ES) APIs. The concepts used in the API are similar to those in ElasticSearch, so users can directly refer to the [ElasticSearch API documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html) and [ZincSearch API documentation](https://zincsearch-docs.zinc.dev/api-es-compatible/) for most API calls. This document introduces the commonly used APIs to help users quickly understand the main concepts and basic usage flow. It will also explain the modifications we made to the ZincSearch API and highlight the differences from the upstream API.
+
+The ES-compatible APIs provided by SeaSearch can be accessed by adding the `/es/` prefix to the URL path. For example, the ES API URL is:
+```
+GET /my-index-000001/_search
+```
+The corresponding SeaSearch API URL is:
+```
+GET /es/my-index-000001/_search
+```
diff --git a/manual/api/search_document.md b/manual/api/search_document.md
new file mode 100644
index 0000000..de9c780
--- /dev/null
+++ b/manual/api/search_document.md
@@ -0,0 +1,28 @@
+## Search Documents
+### Query DSL
+To perform full-text search, use the DSL. For usage, refer to:
+
+[Query DSL Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)
+
+SeaSearch does not support every query parameter provided by ES. Unsupported parameters include: indices_boost, knn, min_score, retriever, pit, runtime_mappings, seq_no_primary_term, stats, terminate_after, version.
+
+Search API: [Search API](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html)
+
+### Delete by Query
+To delete documents based on a query, use the delete-by-query operation. As with search, some ES parameters are not supported.
+
+ElasticSearch API: [Delete by Query](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html)
+
+### Multi-Search
+Multi-search supports searching multiple indexes and running different queries on each index.
+
+ElasticSearch API: [Multi-Search API Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html)
+
+We extended multi-search to support using the same scoring information across different indexes for more accurate score calculation. To enable this, add `unify_score=true` as a URL query parameter.
+
+`unify_score` is meaningful only when the same query is run across multiple indexes. For example, in Seafile, we create an index for each library. When globally searching across all accessible libraries, enabling `unify_score` ensures consistent scoring across different libraries, providing more accurate search results.
+```
+[POST] /es/_msearch?unify_score=true
+{"index": "t1"}
+{"query": {"bool": {"should": [{"match": {"filename": {"query": "数据库", "minimum_should_match": "-25%"}}}, {"match": {"filename.ngram": {"query": "数据库", "minimum_should_match": "80
+```
diff --git a/manual/api/seasearch_api.md b/manual/api/seasearch_api.md
deleted file mode 100644
index 5750b17..0000000
--- a/manual/api/seasearch_api.md
+++ /dev/null
@@ -1,496 +0,0 @@
-
-
-# API introduction
-
-SeaSearch uses Http Basic Auth for permission verification, and the API request needs to carry the corresponding token in the header.
-
-```
-# headers
-{
-    'Authorization': 'Basic '
-}
-```
-
-## User management
-
-### Administrator user
-
-SeaSearch manages API permissions through accounts. When the program is started for the first time, an administrator account needs to be configured through environment variables.
-
-The following is an example of an administrator account:
-
-```
-set ZINC_FIRST_ADMIN_USER=admin
-set ZINC_FIRST_ADMIN_PASSWORD=xxx
-```
-
-### Normal user
-
-Users can be created/updated via the API:
-
-```
-[POST] /api/user
-
-{
-    "_id": "prabhat",
-    "name": "Prabhat Sharma",
-    "role": "admin", // or user
-    "password": "xxx"
-}
-```
-
-get all users:
-
-```
-[GET] /api/user
-```
-
-delete user:
-
-```
-[DELETE] /api/user/${userId}
-```
-
-## Index related
-
-### create index
-
-Create a SeaSearch index, and you can set both mappings and settings at the same time.
-
-We can also set settings or mapping directly through other requests. If the index does not exist, it will be created automatically.
- -SeaSearch documentation:[https://zincsearch-docs.zinc.dev/api/index/create/#update-a-exists-index](https://zincsearch-docs.zinc.dev/api/index/create/#update-a-exists-index) - -ES documentation:[https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html) - -### Configure mappings - -Mappings define the rules for fields in a document, such as type, format, etc. - -Mapping can be configured via a separate API: - -SeaSearch api: [https://zincsearch-docs.zinc.dev/api-es-compatible/index/update-mapping/](https://zincsearch-docs.zinc.dev/api-es-compatible/index/update-mapping/) - -ES related instructions:[https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html) - - -### Configure settings - -Settings set the analyzer sharding and other related settings of the index. - -SeaSearch api: [https://zincsearch-docs.zinc.dev/api-es-compatible/index/update-settings/](https://zincsearch-docs.zinc.dev/api-es-compatible/index/update-settings/) - -ES related instructions: - - * analyzer related concepts:[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-concepts.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-concepts.html) - - * How to specify an analyzer:[https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html) - -### Analyzer support - -Analyzer can configure the default when creating an index, or set it for a specific field. (Refer to the settings ES documentation in the previous section to understand the relevant concepts.) 
- -The analyzers supported by SeaSearch can be found on this page: [https://zincsearch-docs.zinc.dev/api/index/analyze/](https://zincsearch-docs.zinc.dev/api/index/analyze/). The concepts such as tokenize and token filter are consistent with ES, and most of the commonly used analyzers and tokenizers in ES are supported. - -Supported general analyzers - - * standard, the default analyzer. If not specified, this analyzer is used to split words and lowercase them. - - * simple, split according to non-letters (symbols are filtered), lowercase - - * keyword, no word segmentation, directly treat input as output - - * stop, lowercase, stop word filter (the, a, is, etc.) - - * web, implemented by Bluge, matching email addresses, urls, etc. Handling lowercase, using stop word filters - - * regexp/pattern, regular expression, default is \W+ (non-character segmentation), supports lowercase and stop words - - * whitespace, split by space, do not convert to lowercase - - -### Luanguages analyzers - -| Country | Shortened form | -| -------------- | -------------- | -| arabic | ar | -| Asia Countries | cjk | -| sorani | ckb | -| danish | da | -| german | de | -| english | en | -| spanish | es | -| persian | fa | -| finnish | fi | -| french | fr | -| hindi | hi | -| hungarian | hu | -| italian | it | -| dutch | nl | -| norwegian | no | -| portuguese | pt | -| romanian | ro | -| russian | ru | -| swedish | sv | -| turkish | tr | - - -Chinese analyzer: - - * gse_standard, use the shortest path algorithm to segment words - - * gse_search, the search engine's word segmentation mode provides as many keywords as possible - -The Chinese analyzer uses the [gse](https://github.com/go-ego/gse) library to implement word segmentation. It is a Golang implementation of the Python stammer library. It is not enabled by default and needs to be enabled through environment variables. 
- -``` -ZINC_PLUGIN_GSE_ENABLE=true -# true: enable Chinese word segmentation support, default is false - -ZINC_PLUGIN_GSE_DICT_EMBED=BIG -# BIG: use the gse built-in vocabulary and stop words; otherwise, use the SeaSearch built-in simple vocabulary, the default is small - -ZINC_PLUGIN_GSE_ENABLE_STOP=true -# true: use stop words, default true - -ZINC_PLUGIN_GSE_ENABLE_HMM=true -# Use HMM mode for search word segmentation, default is true - -ZINC_PLUGIN_GSE_DICT_PATH=./plugins/gse/dict -# To use a user-defined word library and stop words, you need to put the content in the configured path, and name the word library user.txt and the stop words stop.txt -``` - - -## Full text search - -### document CRUD - -create document: - -SeaSearch API: [https://zincsearch-docs.zinc.dev/api-es-compatible/document/create/](https://zincsearch-docs.zinc.dev/api-es-compatible/document/create/) - -ES API:[https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html) - -update document: - -SeaSearch API: [https://zincsearch-docs.zinc.dev/api-es-compatible/document/update/](https://zincsearch-docs.zinc.dev/api-es-compatible/document/update/) - -ES API: [https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html) - -delete document: - -SeaSearch API: [https://zincsearch-docs.zinc.dev/api-es-compatible/document/delete/](https://zincsearch-docs.zinc.dev/api-es-compatible/document/delete/) - -ES API: [https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html) - -Get document by id: - -``` -[GET] /api/${indexName}/_doc/${docId} -``` - -### Batch Operation - -Batch operations should be used to update indexes whenever possible. 
- -SeaSearch API: [https://zincsearch-docs.zinc.dev/api-es-compatible/document/bulk/#request](https://zincsearch-docs.zinc.dev/api-es-compatible/document/bulk/#request) - -ES API:[https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) - - -### search - -API examples: - -[https://zincsearch-docs.zinc.dev/api-es-compatible/search/search/](https://zincsearch-docs.zinc.dev/api-es-compatible/search/search/) - -Full-text search uses DSL. For usage, please refer to: - -[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) - -delete-by-query:Delete based on query - -``` -[POST] /es/${indexName}/_delete_by_query - -{ - "query": { - "match": { - "name": "jack" - } - } -} -``` - -ES API: [https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html) - -multi-search,supports executing different queries on different indexes: - -SeaSearch API: [https://zincsearch-docs.zinc.dev/api-es-compatible/search/msearch/](https://zincsearch-docs.zinc.dev/api-es-compatible/search/msearch/) - -ES API: [https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html) - -We have extended multi-search to support using the same statistics when searching different indexes to make the score calculation more accurate. You can enable it by setting query: unify_score=true in the request. 
- -``` -[POST] /es/_msearch?unify_score=true - -{"index": "t1"} -{"query": {"bool": {"should": [{"match": {"filename": {"query": "test string", "minimum_should_match": "-25%"}}}, {"match": {"filename.ngram": {"query": "test string", "minimum_should_match": "80%"}}}], "minimum_should_match": 1}}, "from": 0, "size": 10, "_source": ["path", "repo_id", "filename", "is_dir"], "sort": ["_score"]} -{"index": "t2"} -{"query": {"bool": {"should": [{"match": {"filename": {"query": "test string", "minimum_should_match": "-25%"}}}, {"match": {"filename.ngram": {"query": "test string", "minimum_should_match": "80%"}}}], "minimum_should_match": 1}}, "from": 0, "size": 10, "_source": ["path", "repo_id", "filename", "is_dir"], "sort": ["_score"]} -``` - - -## Vector search - -We have developed a vector search function for the SeaSearch extension. The following is an introduction to the relevant API. - -### Create vector search - -To use the vector search function, you need to create a vector index in advance, which can be done through mapping. - -We create an index and set the vector field of the document data to be written to be called "vec", the index type is flat, and the vector dimension is 768 - -``` -[PUT] /es/${indexName}/_mapping - -{ -"properties":{ - "vec":{ - "type":"vector", - "dims":768, - "m":64, - "nbits":8, - "vec_index_type":"flat" - } - } -} -``` - -Parameter Description: - -``` -${indexName} zincIndex, index name - -type, fixed to vector, indicating vector index -dims, vector dimensions -m, ivf_pq index required parameters, need to be divisible by dims -nbits, ivf_pq index required parameter, default is 8 -vec_index_type, index type, supports two types: flat and ivf_pq -``` - -### Write a document containing a vector - - -There is no difference between writing a document containing a vector and writing a normal document at the API level. You can choose the appropriate method. 
- -The following takes the bluk API as an example - -``` -[POST] /es/_bulk - -body: - -{ "index" : { "_index" : "index1" } } -{"name": "jack1","vec":[10.2,10.41,9.5,22.2]} -{ "index" : { "_index" : "index1" } } -{"name": "jack2","vec":[10.2,11.41,9.5,22.2]} -{ "index" : { "_index" : "index1" } } -{"name": "jack3","vec":[10.2,12.41,9.5,22.2]} -``` - -Note that the _bulk API strictly requires the format of each line, and the data cannot exceed one line. For details, please refer to [ES bulk](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) - -Modification and deletion can also be done using bulk. After deleting a document, its corresponding vector data will also be deleted - - -### Retrieval vector - -By passing in a vector, we can search for N similar vectors in the system and return the corresponding document information: - -``` -[POST] /api/${indexName}/_search/vector - -body: -{ - { - "query_field":"vec", - "k":7, - "return_fields":["name"], - "vector":[10.2,10.40,9.5,22.2.......], - "_source":false - } -} -``` - -The API response format is the same as the full-text search format. - -The following is a description of the parameters: - -``` -${indexName} zincIndex, index name - -query_field, the field in the index to retrieve, the field must be of vector type -k, the number of K most similar vectors to return -return_fields, the name of the field to be returned individually -vector, the vector used for query -nprobe, only works for ivf_pq index type, the number of clusters to query, the higher the number, the more accurate -_source, it is used to control whether to return the _source field, supports bool or an array, describing which fields need to be returned - -``` - -### Rebuild index - -Rebuild the index immediately, suitable for situations where you don't need to wait for background automatic detection. 
- -``` -[POST] /api/:target/:field/_rebuild -``` - -### query recall - - -For vectors of type ivf_pq, recall checks can be performed on their data. - -``` -[POST] /api/:target/_recall -{ - "field":"vec_001", # Fields to test - "k":10, - "nprobe":5, # nprobe number - "query_count":1000 # Number of times the test was performed -} -``` - -# Vector search usage examples - -Next, we will demonstrate how to index a batch of papers. Each paper may contain multiple vectors that need to be indexed. We hope to obtain the most similar N vectors through vector retrieval, and thus obtain their corresponding paper-ids. - -## Creating SeaSearch indexes and vector indexes - -The first step is to set the mapping of the vector index. When setting the mapping, the index and vector index are automatically created. - -Since paper-id is just a normal string, we don't need to analyze it, so we set its type to keyword: - -``` -[PUT] /es/paper/_mapping - -{ -"properties":{ - "title-vec":{ - "type":"vector", - "dims":768, - "vec_index_type":"flat", - "m":1 - }, - "paper-id":{ - "type":"keyword" - } - } -} -``` - -Through the above request, we created an index named paper and established a flat vector index for the title-vec field of the index. - -## Index data - -We write these paper data to SeaSearch in batches through the _bulk API. - -``` -[POST] /es/_bulk - -{ "index" : {"_index" : "paper" } } -{"paper-id": "001","title-vec":[10.2,10.40,9.5,22.2....]} -{ "index" : {"_index" : "paper" } } -{"paper-id": "002","title-vec":[10.2,11.40,9.5,22.2....]} -{ "index" : {"_index" : "paper" } } -{"paper-id": "003","title-vec":[10.2,12.40,9.5,22.2....]} -.... - - -``` - -## Retrieving data - -Now we can retrieve it using the vector: - -``` -[POST] /api/paper/_search/vector - -{ - "query_field":"title-vec", - "k":10, - "return_fields":["paper-id"], - "vector":[10.2,10.40,9.5,22.2....] -} -``` - -The document corresponding to the most similar vector can be retrieved, and the paper-id can be obtained. 
Since a paper may contain multiple vectors, if multiple vectors of a paper are very similar to the query vector, then this paper-id may appear multiple times in the results. - -## Maintaining vector data - -### Update the document directly - -After a document is successfully imported, SeaSearch will return its doc id. We can directly update a document based on the doc id: - -``` -[POST] /es/_bulk - -{ "update" : {"_id":"23gZX9eT6QM","_index" : "paper" } } -{"paper-id": "005","vec":[10.2,1.43,9.5,22.2...]} -``` - -### Query first and then update - -If the returned doc id is not saved, you can first use SeaSearch's full-text search function to query the documents corresponding to paper-id: - -``` -[POST] /es/paper/_search - -{ - "query": { - "bool": { - "must": [ - { - "term": {"paper-id":"003"} - } - ] - } - } -} -``` - -Through DSL, we can directly retrieve the document corresponding to the paper-id and its doc id. - -### Fully updated paper - -A paper contains multiple vectors. If a vector needs to be updated, we can directly update the document corresponding to the vector. However, in actual applications, it is not easy to distinguish which contents of a paper are newly added and which are updated. - -We can adopt the method of full update: - - * First, query all documents of a paper through DSL - - * Delete all documents - - * Import the latest paper data - -Steps 2 and 3 can be performed in one batch operation. - -The following example will demonstrate deleting the document of paper 001 and re-importing it; at the same time, directly updating paper 005 and paper 006 because they only have one vector: - -``` -[POST] /es/_bulk - - -{ "index" : {"_index" : "paper" } } -{"paper-id": "001","title-vec":[10.2,10.40,9.5,22.2....]} -{ "index" : {"_index" : "paper" } } -{"paper-id": "002","title-vec":[10.2,11.40,9.5,22.2....]} -{ "index" : {"_index" : "paper" } } -{"paper-id": "003","title-vec":[10.2,12.40,9.5,22.2....]} -.... 
-
-
-```
-
diff --git a/manual/config/README.md b/manual/config/README.md
index ca0865c..de5cc59 100644
--- a/manual/config/README.md
+++ b/manual/config/README.md
@@ -1,42 +1,106 @@
 # SeaSearch Configuration
+For the official ZincSearch configuration, refer to: [ZincSearch Official Documentation](https://zincsearch-docs.zinc.dev/environment-variables/).
 
-## Single-Node Configurations
+The following configuration options are the ones we’ve extended. All configurations are set via environment variables.
 
-### Basic Configurations
+## Local Storage
+- `SS_DATA_PATH`: Local storage path (default ./data). This is a required option and will be used as the SeaSearch system storage path (replaces the original `ZINC_DATA_PATH`).
+
+## Object Storage
+- `SS_STORAGE_TYPE`: Storage backend type. Default is disk; the other options are s3 and oss.
+- `SS_MAX_OBJ_CACHE_SIZE`: Maximum local cache size when using object storage. Default is 10GB. Supports human-friendly sizes, e.g. `500MB`, `12GB`.
+- `SS_DATA_PATH`: Local storage path (default ./data). This is a required option and will be used for local cache storage when using object storage.
 
-```shell
-# log mode of gin framework,default release
-ZINC_WAL_ENABLE=true
+### S3
+These configurations are only effective when `SS_STORAGE_TYPE=s3`.
 
-# type of storage's engine, i.e., s3
-ZINC_STORAGE_TYPE=
+- `SS_S3_ACCESS_ID`: The access key ID required to authenticate you to S3. You can find it in the "security credentials" section on your AWS account page or get it from your storage provider.
+- `SS_S3_USE_V4_SIGNATURE`: There are two versions of authentication protocols that can be used with S3 storage: Version 2 (older, may still be supported by some regions) and Version 4 (current, used by most regions). If you don't set this option, SeaSearch will use the v2 protocol. Using the v4 protocol is recommended.
+- `SS_S3_ACCESS_SECRET`: The secret access key required to authenticate you to S3. You can find the key in the "security credentials" section on your AWS account page or get it from your storage provider.
+- `SS_S3_ENDPOINT`: (Optional) The endpoint by which you access the storage service. Usually it starts with the region name. You must provide the host address if you use a storage provider other than AWS; otherwise SeaSearch will use AWS's address (i.e., `s3.us-east-1.amazonaws.com`).
+- `SS_S3_BUCKET`: Bucket name for SeaSearch storage. Make sure it follows the [S3 naming rules](https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html#bucketnamingrules).
+- `SS_S3_USE_HTTPS`: Use https to connect to S3. It's recommended to use https.
+- `SS_S3_PATH_STYLE_REQUEST`: (Optional) This option asks SeaSearch to use URLs like `https://192.168.1.123:8080/bucketname/object` to access objects. In Amazon S3, the default URL format is virtual host style, such as `https://bucketname.s3.amazonaws.com/object`. That style relies on advanced DNS server setup, so most self-hosted storage systems only implement the path style format. We therefore recommend setting this option to true for self-hosted storage.
+- `SS_S3_AWS_REGION`: (Optional) If you use the v4 protocol and AWS S3, set this option to the region you chose when you created the bucket. If it's not set and you're using the v4 protocol, SeaSearch will use `us-east-1` as the default. This option is ignored if you use the v2 protocol.
+- `SS_S3_SSE_C_KEY`: (Optional) A random 32-character string used as the SSE-C key. One can be generated with `openssl rand -base64 24`. If you enable SSE-C, you must use the v4 authentication protocol and https.
 
-# the number of shards, since seaseach has one index per database, in order to improve loading efficiency, the default value is changed to 1
-ZINC_SHARD_NUM=1
-```
-
-### S3 Storage Configurations
-
-To enable s3 storage configurations, the term `ZINC_STORAGE_TYPE` has to be set as `ZINC_STORAGE_TYPE=s3`.
+## Logging
+- `SeaSearch_LOG_TO_STDOUT`: Whether to output logs to standard output as part of the SeaSearch component (default false).
+- `SEATABLE_LOG_TO_STDOUT`: Whether to output logs to standard output as part of the Seatable component (default false).
+- `SS_LOG_DIR`: Log directory (default is a log subdirectory in the current directory).
+- `SS_LOG_LEVEL`: Log level (default is debug).
 
-```shell
-# the maximum local cache file size
-ZINC_MAX_OBJ_CACHE_SIZE=
-
-# S3 relative informations
-ZINC_S3_ACCESS_ID=
-ZINC_S3_USE_V4_SIGNATURE=
-ZINC_S3_ACCESS_SECRET=
-ZINC_S3_ENDPOINT=
-ZINC_S3_USE_HTTPS=
-ZINC_S3_PATH_STYLE_REQUEST=
-ZINC_S3_AWS_REGION=
+## Example SeaSearch Configuration
+### Enabling Local Disk as Storage Backend
+``` sh
+ ZINC_FIRST_ADMIN_USER=admin
+ ZINC_FIRST_ADMIN_PASSWORD=password
+ SS_DATA_PATH=./data
 ```
 
-## Logs Configurations
-
-```shell
-ZINC_LOG_OUTPUT=true #whether to output logs to files, default yes
-ZINC_LOG_DIR=/opt/seasearch/data/log #log directory
-ZINC_LOG_LEVEL=debug #log level,default debug
+### Enabling S3 as Storage Backend
+=== "AWS"
+``` sh
+ ZINC_FIRST_ADMIN_USER=admin
+ ZINC_FIRST_ADMIN_PASSWORD=password
+ SS_DATA_PATH=./data
+ SS_STORAGE_TYPE=s3
+ SS_S3_ACCESS_ID=
+ SS_S3_ACCESS_SECRET=
+ SS_S3_BUCKET=
+ SS_S3_AWS_REGION=us-east-1
+ SS_S3_USE_HTTPS=true
+ SS_S3_USE_V4_SIGNATURE=true
+```
+=== "Exoscale"
+``` sh
+ ZINC_FIRST_ADMIN_USER=admin
+ ZINC_FIRST_ADMIN_PASSWORD=password
+ SS_DATA_PATH=./data
+ SS_STORAGE_TYPE=s3
+ SS_S3_ACCESS_ID=
+ SS_S3_ACCESS_SECRET=
+ SS_S3_BUCKET=
+ SS_S3_ENDPOINT=sos-de-fra-1.exo.io
+ SS_S3_PATH_STYLE_REQUEST=true
+```
+=== "Hetzner"
+``` sh
+ ZINC_FIRST_ADMIN_USER=admin
+ ZINC_FIRST_ADMIN_PASSWORD=password
+ SS_DATA_PATH=./data
+ SS_STORAGE_TYPE=s3
+ SS_S3_ACCESS_ID=
+ SS_S3_ACCESS_SECRET=
+ SS_S3_BUCKET=
+ SS_S3_ENDPOINT=fsn1.your-objectstorage.com
+ SS_S3_PATH_STYLE_REQUEST=true
+ SS_S3_USE_HTTPS=true
+```
+=== "Other Public Hosted S3 Storage"
+```sh
+ ZINC_FIRST_ADMIN_USER=admin
+ ZINC_FIRST_ADMIN_PASSWORD=password
+ SS_DATA_PATH=./data
+ SS_STORAGE_TYPE=s3
+ SS_S3_ACCESS_ID=
+ SS_S3_ACCESS_SECRET=
+ SS_S3_BUCKET=
+ SS_S3_ENDPOINT=
+ SS_S3_AWS_REGION=
+ SS_S3_USE_HTTPS=true
+```
+=== "Self-hosted S3 Storage"
+```sh
+ ZINC_FIRST_ADMIN_USER=admin
+ ZINC_FIRST_ADMIN_PASSWORD=password
+ SS_DATA_PATH=./data
+ SS_STORAGE_TYPE=s3
+ SS_S3_ACCESS_ID=
+ SS_S3_ACCESS_SECRET=
+ SS_S3_BUCKET=
+ SS_S3_ENDPOINT=:
+ SS_S3_USE_HTTPS=true
+ SS_S3_PATH_STYLE_REQUEST=true
+ SS_S3_USE_HTTPS=true
 ```
diff --git a/mkdocs.yml b/mkdocs.yml
index 3c02939..4193016 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -48,4 +48,8 @@ nav:
   - Introduction: README.md
   - Deploy: deploy/README.md
   - Configuration: config/README.md
-  - SeaSearch API: api/seasearch_api.md
+  - SeaSearch API: api/overview.md
+  - API Authentication: api/authentication.md
+  - Index management: api/index_management.md
+  - Document Operation: api/document_operation.md
+  - Search Document: api/search_document.md
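
The two rules this patch documents for API clients, the Basic auth token from manual/api/authentication.md and the `/es/` prefix from manual/api/overview.md, can be checked with a short Python sketch. The helper names `basic_auth_header` and `to_seasearch_path` are hypothetical and exist only for illustration; they are not part of SeaSearch, and the path mapping assumes nothing more than the plain prefix described in the overview.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP Basic Auth."""
    # Join the credentials with a colon, then base64-encode the UTF-8 bytes.
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def to_seasearch_path(es_path: str) -> str:
    """Map an ES API path to its SeaSearch form (assumption: plain /es/ prefix)."""
    return "/es/" + es_path.lstrip("/")


if __name__ == "__main__":
    # Same credentials as the documentation example.
    print(basic_auth_header("aladdin", "opensesame"))
    print(to_seasearch_path("/my-index-000001/_search"))
```

With the example credentials from the documentation, the first call yields the same token as the `echo -n 'aladdin:opensesame' | base64` command shown in authentication.md.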