Hi everyone. Thanks for your work on tikv-client. We have encountered a performance problem that seems peculiar; here are the details.
If our understanding is incorrect, we would appreciate your guidance and corrections.
Background
Our project utilizes tikv as a distributed KV engine to build the metadata service. We have observed a significant rise in P99 latency during peak usage. (tikv region num ~ 5,000,000)
We tried many optimizations on the tikv server and OS, including adjustments to grpc-related settings, raft thread control, and rocksdb configurations. However, the improvements were not satisfactory. We sought advice from the community, as mentioned in https://asktug.com/t/topic/1011036.
Then we accidentally discovered that scaling up the number of instances of our own service (i.e., tikv-client) significantly improved system throughput and latency.
However, we are puzzled: why does scaling horizontally prove effective despite seemingly low resource utilization? Is it possible that individual tikv-client instances have some sort of bottleneck (such as a lock) limiting their capacity?
We run 10 instances on 10 64-core bare-metal servers. Our tikv-client version is a bit older (2.0.0-rc), but we did not perceive any changes in this regard.
Source Code
Each batch of TSO (Timestamp Oracle) get requests has a maximum of 10,000 requests. The size of the tsoRequestCh channel is set to 20,000. There is only one goroutine in the handlerDispatcher that sequentially handles all requests with types 2, 3, 4, and 5.
When there is a large number of TSO get requests, this single goroutine may become a performance bottleneck due to:
Merging thousands of requests for sequential processing.
Synchronously waiting for stream send and recv operations.
Sequentially invoking callback functions for req.done.
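To make the above concrete, here is a minimal, self-contained sketch of the single-dispatcher batching pattern as we understand it. All identifiers and sizes (tsoRequest, dispatcher, fetchTSO, maxBatchSize) are illustrative only, not the actual pd-client code; the real client batches up to 10,000 requests from a 20,000-slot channel and performs a gRPC stream.Send/Recv where fetchTSO stands in below.

```go
// tso_dispatcher_sketch.go
//
// Illustrative sketch of a single-goroutine TSO batching dispatcher.
// It is NOT the real pd-client implementation; names and limits are made up.
package main

import (
	"fmt"
	"sync"
	"time"
)

// tsoRequest is one pending GetTS call; the caller blocks on done.
type tsoRequest struct {
	done chan int64
}

const (
	maxBatchSize = 4  // real client: 10,000
	chanSize     = 16 // real client: 20,000
)

// dispatcher is the single goroutine that drains the channel, forms a batch,
// issues one synchronous "RPC" per batch, and completes every callback in order.
func dispatcher(reqCh <-chan *tsoRequest, wg *sync.WaitGroup) {
	defer wg.Done()
	for first := range reqCh {
		batch := []*tsoRequest{first}
		// Collect whatever else is already queued, up to maxBatchSize.
	collect:
		for len(batch) < maxBatchSize {
			select {
			case req := <-reqCh:
				batch = append(batch, req)
			default:
				break collect
			}
		}
		// One synchronous round trip per batch (stands in for stream.Send/Recv).
		base := fetchTSO(len(batch))
		// Sequentially invoke the req.done callbacks.
		for i, req := range batch {
			req.done <- base + int64(i)
		}
	}
}

// fetchTSO simulates the ~1ms PD round trip and returns the first ts of the batch.
func fetchTSO(batchSize int) int64 {
	time.Sleep(time.Millisecond)
	return time.Now().UnixNano() // placeholder for a PD-allocated timestamp
}

func main() {
	reqCh := make(chan *tsoRequest, chanSize)
	var wg sync.WaitGroup
	wg.Add(1)
	go dispatcher(reqCh, &wg)

	// Issue a few concurrent GetTS-style calls.
	var callers sync.WaitGroup
	for i := 0; i < 8; i++ {
		callers.Add(1)
		go func(id int) {
			defer callers.Done()
			req := &tsoRequest{done: make(chan int64, 1)}
			reqCh <- req
			fmt.Printf("caller %d got ts %d\n", id, <-req.done)
		}(i)
	}
	callers.Wait()
	close(reqCh)
	wg.Wait()
}
```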
Discovery
We observe some metrics in tikv-client.
pd_client_request_handle_requests_duration_seconds_bucket{type="tso"}
This is the duration of the pure TSO (Timestamp Oracle) stream.send and stream.recv operations, i.e., the latency of a single RPC request to PD for TSO. It stays around 1ms regardless of scaling, so it can be used to rule out network fluctuations between the client and PD or high load on PD. It corresponds to the yellow section in the graph.
handle_cmds_duration: pd_client_cmd_handle_cmds_duration_seconds_bucket{type="wait"}
This is the time from when a request is handed to the dispatcher until the caller is unblocked by the response. It fluctuates significantly and decreases after scaling. It corresponds to the green section in the graph.
As the graph shows, we scaled our service (with tikv-client) to 20 instances. Scaling brings a significant improvement in the red (waiting for tso req) and purple (callback req.done) sections; we did not scale tikv-server/pd.
Here are some other metrics:
Questions
We would like to discuss:
Why is only one goroutine used to process all of the tikv-client TSO requests? Does the order of TSO requests collected in the Go channel need to be maintained? If so, why is it necessary to preserve the order?
TSO request collection and the req.done callback do not look like heavy work. Is there any idea why they show such high P99 latency? The maximum TSO QPS seems to be about 5,000 per tikv-client instance.
Is there any best practice for the deployment scale of tikv/tidb? For example, a single instance on a 64-core bare-metal server does not seem good enough, while 4 instances on a 64-core server seem better.
Please let us know if you need any further information. Thanks for your kind help.
Why is only one goroutine used to process all of the tikv-client TSO requests?
It's unusual to see the single dispatcher goroutine in the pd client become the bottleneck, which seems to be the reason it has not been expanded. After all, the majority of client-go usage is in TiDB, and when the request OPS is too high, some other part of TiDB is likely to hit a bottleneck first.
Recently we've added a feature that supports processing multiple TSO request batches in parallel, while still driven by a single dispatcher goroutine. If PD's TSO allocation is not the bottleneck and the client's CPU is sufficient, this can reduce the time spent waiting for batching for each GetTS call (the blue part in your timeline diagram). But it's only available since v8.5.
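To illustrate the idea only (this is a rough sketch under assumptions, not the actual client-go/pd implementation; all identifiers and the maxInflight limit are made up), the single dispatcher goroutine keeps collecting batches, but each batch's send/recv runs in its own in-flight slot, so several batches can be waiting on PD at the same time:

```go
// parallel_batches_sketch.go
//
// Illustrative sketch of "multiple TSO request batches in parallel, still
// driven by one dispatcher goroutine". NOT the real client-go/pd code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type tsoRequest struct {
	done chan int64
}

const (
	maxBatchSize = 4
	maxInflight  = 3 // how many batch RPCs may be outstanding at once (made-up value)
)

func dispatcher(reqCh <-chan *tsoRequest, wg *sync.WaitGroup) {
	defer wg.Done()
	inflight := make(chan struct{}, maxInflight) // semaphore bounding concurrent batches
	var batches sync.WaitGroup
	for first := range reqCh {
		batch := []*tsoRequest{first}
	collect:
		for len(batch) < maxBatchSize {
			select {
			case req := <-reqCh:
				batch = append(batch, req)
			default:
				break collect
			}
		}
		inflight <- struct{}{} // block only when maxInflight batches are already outstanding
		batches.Add(1)
		go func(batch []*tsoRequest) { // per-batch send/recv and callbacks run off the dispatcher
			defer batches.Done()
			base := fetchTSO(len(batch))
			for i, req := range batch {
				req.done <- base + int64(i)
			}
			<-inflight
		}(batch)
	}
	batches.Wait()
}

// fetchTSO simulates a ~1ms PD round trip. Global uniqueness and monotonicity
// across concurrent batches is PD's responsibility and is not modeled here.
func fetchTSO(batchSize int) int64 {
	time.Sleep(time.Millisecond)
	return time.Now().UnixNano()
}

func main() {
	reqCh := make(chan *tsoRequest, 64)
	var wg sync.WaitGroup
	wg.Add(1)
	go dispatcher(reqCh, &wg)

	var callers sync.WaitGroup
	for i := 0; i < 12; i++ {
		callers.Add(1)
		go func(id int) {
			defer callers.Done()
			req := &tsoRequest{done: make(chan int64, 1)}
			reqCh <- req
			fmt.Printf("caller %d got ts %d\n", id, <-req.done)
		}(i)
	}
	callers.Wait()
	close(reqCh)
	wg.Wait()
}
```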
Does the order of TSO requests collected in the Go channel need to be maintained?
If you are talking about the order of the requests collected in a batch, theoretically speaking, the order itself is not important. We actually only need to guarantee that the timestamps are allocated after the requests are sent and before the results are received, and that each ts in the batched result is globally unique. The way to collect the batch is not necessarily a channel. However, I haven't figured out whether it can be optimized by abandoning the order of the requests.
TSO request collection and the req.done callback do not look like heavy work. Is there any idea why they show such high P99 latency?
Sorry, but we haven't dug into the P99 problem yet. My speculation is that it's related to the synchronization operations on channels, context switches, and maybe the Go runtime.