It would be interesting to see the effect of scaling to multiple instances behind the same endpoint. How does inference latency change as the endpoint scales out automatically (we could also expose parameters for the scaling policy)? Can we sustain more transactions with auto-scaling instances while keeping latency below a threshold, and what are the cost implications of doing so? This needs to be fleshed out, but it is an interesting area. A sketch of how such a policy could be attached is included below.
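As a rough sketch of the scaling-policy parameters mentioned above: a SageMaker endpoint variant can be registered with Application Auto Scaling and given a target-tracking policy. The endpoint/variant names, capacity bounds, target value, and cooldowns below are hypothetical placeholders; the target value and cooldowns are exactly the knobs a benchmark could sweep.

```python
import boto3

# Hypothetical names for illustration only.
ENDPOINT_NAME = "my-llm-endpoint"
VARIANT_NAME = "AllTraffic"

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"

# Register the variant's instance count as a scalable target (1-4 instances assumed).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy: add instances when invocations per instance
# exceed the target. TargetValue and the cooldowns are the scaling-policy
# parameters the benchmark could expose.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # assumed: invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```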
This would also need to include support for the Inference Configuration feature that is now available with SageMaker.
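To check whether latency stays below a threshold while instances scale in and out, the per-variant `ModelLatency` metric (reported in microseconds) could be pulled from CloudWatch. A minimal sketch, again with placeholder endpoint/variant names:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# p99 ModelLatency (microseconds) for the variant over the last hour.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-llm-endpoint"},  # hypothetical
        {"Name": "VariantName", "Value": "AllTraffic"},        # hypothetical
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    ExtendedStatistics=["p99"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["ExtendedStatistics"]["p99"])
```

Joining this series against the scaling events and instance-hours would give the latency-versus-cost picture the issue asks about.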