This ticket is the result of two weeks of experiments.
I'll try to include all the information, because there might be something wrong with the way RestHighLevelClient performs deleteByQuery.
For two weeks I was betting it was a problem on my side or on the Elasticsearch side (performance, configuration), but after several experiments I need to bring this to you, because I have no explanation.
First of all, I have prior experience with Elasticsearch and I am aware that updates and deletes are expensive operations; this is not about that.
CONTEXT
This is a Spring Boot microservice running on Java 11, using spring-data-elasticsearch 4.2.11 to run operations against an Elasticsearch cluster.
We are in pre-launch experiments, and I have an environment that mirrors our production traffic but gives me total control over it.
We have a lot of ingest operations, a lot of query operations, a significant rate of update operations, and few delete operations.
We are using RestHighLevelClient configured like this:
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.data.elasticsearch.client.ClientConfiguration;
import org.springframework.data.elasticsearch.client.RestClients;
import org.springframework.http.HttpHeaders;

public RestHighLevelClient elasticsearchClient() {
    // Compatibility headers so the 7.x client can talk to an 8.x cluster
    final HttpHeaders compatibilityHeaders = new HttpHeaders();
    compatibilityHeaders.add("Accept", "application/vnd.elasticsearch+json;compatible-with=7");
    compatibilityHeaders.add("Content-Type", "application/vnd.elasticsearch+json;compatible-with=7");

    final ClientConfiguration clientConfiguration = ClientConfiguration.builder()
            .connectedTo(eshostname + ":" + esport)
            .usingSsl()
            .withBasicAuth(username, password)
            .withDefaultHeaders(compatibilityHeaders)
            .build();
    return RestClients.create(clientConfiguration).rest();
}
As said, we do many ingest and query operations. As an example:
final BoolQueryBuilder boolQuery = QueryBuilders
.boolQuery()
.filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
.filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2))
.filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lte(s3));
final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
nsq.addSort(Sort.by(Direction.DESC, CREATED_SEARCH_FIELD));
nsq.setMaxResults(size);
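For completeness, a query like this is then executed through the template; a minimal sketch of that step (MyDocument and "my-index" are placeholders, the original execution code is not included here):

// Sketch of executing the query above; MyDocument and "my-index" are placeholders.
final SearchHits<MyDocument> hits =
        elasticsearchOperations.search(nsq, MyDocument.class, IndexCoordinates.of("my-index"));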
We also do updateByQuery operations, like this:
final BoolQueryBuilder boolQuery = QueryBuilders.boolQuery()
.filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
.filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lt(s3))
.filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2));
final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
return UpdateQuery.builder(nsq)
.withScriptType(ScriptType.INLINE)
.withScript(UPDATE_SCRIPT)
.withParams(UPDATE_PARAMS)
.build();
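The actual UPDATE_SCRIPT and UPDATE_PARAMS are not shown above; purely as an illustration, an inline Painless script of this kind might look like this (field names and values are hypothetical):

// Hypothetical illustration only -- the real UPDATE_SCRIPT/UPDATE_PARAMS were not shared.
private static final String UPDATE_SCRIPT =
        "ctx._source.status = params.status; ctx._source.updatedAt = params.updatedAt;";
private static final java.util.Map<String, Object> UPDATE_PARAMS =
        java.util.Map.of("status", "processed", "updatedAt", "2023-01-01T00:00:00Z");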
Finally, we do deleteByQuery operations with the same query as the update operations.
Of course there is no script in that case.
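A sketch of the delete call itself (MyDocument and "my-index" are placeholders; the real call is the one that shows up in the stack trace below):

// Delete-by-query through the same template used for search and update.
// MyDocument and "my-index" are placeholders for the real type and index.
elasticsearchOperations.delete(nsq, MyDocument.class, IndexCoordinates.of("my-index"));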
ISSUE
All operations run like a charm except deleteByQuery. The moment deleteByQuery is enabled (even though deletes are just a fraction of the traffic, and even though there are far more UPDATE operations) the cluster starts to get into trouble. ALL delete operations time out, although the records are removed from the cluster. The fielddata cache starts to grow significantly, eventually causing GC usage and duration to spike, then the CPU to spike, and finally the [parent] circuit breaker to trip, at which point the cluster starts responding 429 Too Many Requests to our operations.
This happens regardless of the size of the delete query's result; delete queries matching just 1 or 2 documents cause the same effect.
Please remember that the number of delete queries is small.
This only happens on deletes. If I replace the deletes with updates (using the same query and a script that updates four fields) the cluster is stable. This alone is very weird to me, since updates are expected to be more expensive than deletes.
NOTE: If I bypass spring-data-elasticsearch and use a Feign client that sends the HTTP POST requests directly, without the RestHighLevelClient, for the delete operations, then the cluster is stable. This leads me to think that there might be something wrong with the deletes that RestHighLevelClient is sending. It feels like something is not being closed (connection timeout).
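Roughly, the Feign workaround is a plain interface posting the query JSON straight to the _delete_by_query endpoint (a simplified sketch, not the exact production code):

// Simplified sketch of the Feign workaround -- not the exact production interface.
// Posts the query JSON directly to _delete_by_query, bypassing RestHighLevelClient.
interface ElasticDeleteClient {
    @RequestLine("POST /{index}/_delete_by_query")
    @Headers("Content-Type: application/json")
    String deleteByQuery(@Param("index") String index, String queryJson);
}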
Here are some screenshots:
Timeout exception on ALL delete operations
org.springframework.dao.DataAccessResourceFailureException: 5,000 milliseconds timeout on connection http-outgoing-222 [ACTIVE]; nested exception is java.lang.RuntimeException: 5,000 milliseconds timeout on connection http-outgoing-222 [ACTIVE]
at org.springframework.data.elasticsearch.core.ElasticsearchExceptionTranslator.translateExceptionIfPossible(ElasticsearchExceptionTranslator.java:75)
at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.translateException(ElasticsearchRestTemplate.java:402)
at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.execute(ElasticsearchRestTemplate.java:385)
at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.delete(ElasticsearchRestTemplate.java:224)
at com.xxx.xxx.service.xxx.deleteByQuery(xxx.java:380)
Metrics when deletes are enabled
(we disable updates at the same time so 100% of the spikes are related to deletes)
It might be worth adding an intercepting proxy to the setup to capture the exact request that is sent out by the delete-by-query call.
Spring Data Elasticsearch 4.2 is outdated and has been out of maintenance for over a year now. The last version of the 4.x releases (4.4.x) reached EOL last week.
Looking at the code in the 5.0 branch, which still uses the already deprecated RestHighLevelClient, I can see that the refresh parameter for the delete request is set to true; that might be causing the problem.
Can you reproduce this in a setup using the maintained versions (5.1 or 5.0)? They both still allow the old client to be used. Or better, can you switch to a supported version and use the current Elasticsearch client?
Yeah ... if the refresh parameter is set, then I can understand that every delete request might be triggering an index refresh, which is very likely the reason for the overload. I don't know the relation between the index refresh operation and the fielddata cache, but that's something on the Elasticsearch side.
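For reference, at the plain client level the suspect call would look roughly like this (a sketch, not the Spring Data internals verbatim):

// Sketch only -- not the actual Spring Data Elasticsearch internals.
final DeleteByQueryRequest request = new DeleteByQueryRequest("my-index");
request.setQuery(boolQuery);
request.setRefresh(true); // reportedly set by Spring Data ES; refreshes the
                          // affected shards when the request completes
client.deleteByQuery(request, RequestOptions.DEFAULT);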
I will try the maintained versions, but if the refresh param is still there I would expect the same behavior. I will stay on Feign for deletes until I am ready to switch to the current Elasticsearch client.