
introduce WatchListLatencyPrometheus measurement #2315

Conversation

@p0lyn0mial (Contributor) commented Sep 11, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

The WatchListLatencyPrometheus measurement gathers the 50th, 90th, and 99th duration quantiles for watch-list requests, broken down by group, resource, and scope.

The new metric (kubernetes/kubernetes#120490) makes it possible to compare watch-list requests with standard list requests and to measure the performance of the new requests in general.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

xref: kubernetes/enhancements#3157
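
As an illustration, a minimal sketch of how the query template added by this PR expands for the three quantiles; the surrounding program and the 5m window are hypothetical, and only the constant mirrors the final version of the query in this PR:

package main

import "fmt"

// Mirrors the query template from this PR: the placeholders hold
// (1) the quantile and (2) the query window size.
const watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds{}[%v])) by (group, version, resource, scope, le))"

func main() {
	// The measurement gathers the 50th, 90th, and 99th percentiles.
	for _, quantile := range []float64{0.50, 0.90, 0.99} {
		fmt.Printf(watchListLatencyQuery+"\n", quantile, "5m")
	}
}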

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 11, 2023
@p0lyn0mial p0lyn0mial force-pushed the upstream-watch-list-latency-measurment branch 2 times, most recently from b9e3f4f to c61f100 on September 11, 2023 13:49
@p0lyn0mial p0lyn0mial changed the title WIP: introduce WatchListLatencyPrometheus measurment introduce WatchListLatencyPrometheus measurement Sep 11, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 11, 2023
@p0lyn0mial (Contributor, Author)

/hold

We should wait for kubernetes/kubernetes#120490.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2023
@dgrisonnet (Member) left a comment

You perhaps need to enable the feature gate somewhere, no?

watchListLatencyPrometheusMeasurementName = "WatchListLatencyPrometheus"

// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_cache_watch_list_duration_seconds{}[%v])) by (group, resource, scope, le))"
(Member)

let's not forget to add the version label here

(Contributor, Author)

yes, thanks!

@dgrisonnet (Member) commented Sep 19, 2023

@p0lyn0mial (Contributor, Author)

You perhaps need to enable the feature gate somewhere, no?

I don't think so. I'm planning to use this measurement in the watchlist perf tests (#2316), which already set up the cluster to speak the streaming API.

@dgrisonnet (Member)

Isn't #2316 just a way to have the measurement displayed on the perf dashboard? Maybe it is already set up in test-infra, but I would have expected some code there to enable API streaming.

@dgrisonnet (Member) commented Sep 20, 2023

Ah, I see. It seems you already did the work to set up watchlist: kubernetes/test-infra#29604

// watchListLatencyGatherer gathers 50th, 90th and 99th duration quantiles
// for watch list requests broken down by group, resource, scope.
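
For context, a simplified sketch of what such a gatherer can look like; the queryExecutor interface and fakeExecutor below are hypothetical stand-ins, not clusterloader2's real API:

package main

import (
	"fmt"
	"time"
)

// queryExecutor is a hypothetical stand-in for the Prometheus query
// executor that clusterloader2 hands to its gatherers.
type queryExecutor interface {
	Query(query string, queryTime time.Time) ([]float64, error)
}

const watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds{}[%v])) by (group, version, resource, scope, le))"

// gatherWatchListLatency issues one query per quantile; each result carries
// the per-(group, version, resource, scope) samples.
func gatherWatchListLatency(executor queryExecutor, window string, queryTime time.Time) (map[float64][]float64, error) {
	result := map[float64][]float64{}
	for _, quantile := range []float64{0.50, 0.90, 0.99} {
		query := fmt.Sprintf(watchListLatencyQuery, quantile, window)
		samples, err := executor.Query(query, queryTime)
		if err != nil {
			return nil, fmt.Errorf("query %q failed: %w", query, err)
		}
		result[quantile] = samples
	}
	return result, nil
}

// fakeExecutor just prints the query it would run.
type fakeExecutor struct{}

func (fakeExecutor) Query(query string, _ time.Time) ([]float64, error) {
	fmt.Println("would execute:", query)
	return nil, nil
}

func main() {
	if _, err := gatherWatchListLatency(fakeExecutor{}, "5m", time.Now()); err != nil {
		panic(err)
	}
}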
@p0lyn0mial p0lyn0mial force-pushed the upstream-watch-list-latency-measurment branch from c61f100 to bfc261b on September 25, 2023 13:30
@p0lyn0mial (Contributor, Author)

/hold cancel

This PR is ready for review.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 28, 2023
watchListLatencyPrometheusMeasurementName = "WatchListLatencyPrometheus"

// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds{}[%v])) by (group, version, resource, scope, le))"
(Member)

For consistency with https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go

can you suffix it with Simple?

Basically this is a simplified version of the SLO, and we should reflect that.


// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds{}[%v])) by (group, version, resource, scope, le))"
)
(Member)

Don't we need "_bucket" at the end of the metric name?
We're using it here:
https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L58

I would really prefer consistency between those two.

(Member)

Actually, I started to wonder more generically: why can't we just reuse that other measurement that I linked?

I think we're effectively reimplementing the exact same logic, and the only differences we have are:
(1) we're using a different metric name
(2) the verb is always LIST

I think it should be possible to slightly refactor that other measurement and simply register two measurements there:
https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L81C3-L82C1
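
A self-contained toy sketch of that suggestion: one gatherer type registered under two measurement names, with a flag pinning the watch-list variant to the new metric. The registry below only mimics the shape of clusterloader2's registration and is not its real API:

package main

import "fmt"

// gatherer stands in for the shared implementation; watchListOnly pins the
// queries to the watch-list metric instead of the standard request-latency one.
type gatherer struct{ watchListOnly bool }

var registry = map[string]*gatherer{}

// register mimics measurement registration: one name, one implementation.
func register(name string, g *gatherer) error {
	if _, ok := registry[name]; ok {
		return fmt.Errorf("measurement %q already registered", name)
	}
	registry[name] = g
	return nil
}

func main() {
	// Same implementation, two measurements.
	if err := register("APIResponsivenessPrometheus", &gatherer{}); err != nil {
		panic(err)
	}
	if err := register("WatchListLatencyPrometheus", &gatherer{watchListOnly: true}); err != nil {
		panic(err)
	}
	fmt.Printf("%d measurements registered\n", len(registry))
}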

(Contributor, Author)

The ApiResponsivenessGatherer differs in a few places.

First of all, it has two different modes for getting the latency metrics (simple and extended).
In addition to gathering the latency, it collects two additional metrics: count and countFast.
The internal data structures hold all three metrics.
Once the metrics are collected, it supports reading a custom threshold from the config, which is used for further validation.

I think the refactoring would boil down to creating "generic simple latency metrics," which could potentially be reused by both implementations.

Given that the internal data structures differ, the existing implementation would have to incorporate the generic latency metric and extend it.

Is this what you had in mind?
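
A minimal sketch of that embedding idea; all type names here are hypothetical:

package main

import "fmt"

// latencyMetric is the hypothetical "generic simple latency metric" that
// both gatherers could share: just the three duration quantiles.
type latencyMetric struct {
	Perc50, Perc90, Perc99 float64 // seconds
}

// apiCallMetric extends the generic latency metric with the extra counters
// the existing ApiResponsivenessGatherer already tracks.
type apiCallMetric struct {
	latencyMetric
	Count     int
	CountFast int
}

func main() {
	m := apiCallMetric{
		latencyMetric: latencyMetric{Perc50: 0.1, Perc90: 0.5, Perc99: 1.2},
		Count:         1000,
		CountFast:     990,
	}
	fmt.Printf("p99=%.1fs over %d calls\n", m.Perc99, m.Count)
}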

(Member)

Don't we need "_bucket" at the end of the metric name?

Yes, you should use the buckets with histogram_quantile.
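
In other words, assuming the metric follows the standard Prometheus histogram naming (histogram_quantile consumes the cumulative _bucket series), the constant would presumably become:

// watchListLatencyQuery placeholders must be replaced with (1) quantile (2) query window size
watchListLatencyQuery = "histogram_quantile(%.2f, sum(rate(apiserver_watch_list_duration_seconds_bucket{}[%v])) by (group, version, resource, scope, le))"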

(Member)

@p0lyn0mial, I'm a bit lost in your comment, so let me try to explain in a bit more depth what I had in mind:

  1. Yes, there are two modes (simple and "normal"), but the difference between the two is only how we sample the metrics. To be more specific, this is the only difference between the modes:
    https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/slos/api_responsiveness_prometheus.go#L176-L204

  2. From the e2e user's perspective, if I want to list my objects, it doesn't really matter whether the server underneath is using the list method or the new watchlist protocol. I care about the latency of getting the result.

  3. Because of (2), we generally don't want to introduce a separate measurement; it should actually be part of exactly the same measurement (same config, same threshold, ...).
    Although, initially we may want to split that a bit for debuggability reasons.

  4. So the way I think about what we should do is effectively:

Once we prove that, we should actually merge the Samples for list & watchlist together, but let's do that as a follow-up and just start by treating them as separate things.
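
Purely as an illustration of treating them as separate things first: the same template parameterized by metric name, so list and watch-list yield separate samples until they are merged. The standard-path metric name below is an assumption based on the linked api_responsiveness measurement:

package main

import "fmt"

// One template, two metrics: standard list requests and the new watch-list
// requests stay separate until the samples are merged in a follow-up.
const latencyQueryTemplate = "histogram_quantile(%.2f, sum(rate(%s{}[%v])) by (group, version, resource, scope, le))"

func main() {
	for _, metric := range []string{
		"apiserver_request_duration_seconds_bucket",    // standard LIST path (assumed name)
		"apiserver_watch_list_duration_seconds_bucket", // new watch-list path
	} {
		fmt.Printf(latencyQueryTemplate+"\n", 0.99, metric, "5m")
	}
}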

(Contributor, Author)

OK, I think I understand: the new metric (watchlist) will end up being reported as part of LoadResponsiveness_PrometheusSimple for all jobs! I like it. Thanks.

(Contributor, Author)

created #2764

@wojtek-t wojtek-t self-assigned this Nov 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@p0lyn0mial (Contributor, Author)

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Apr 23, 2024
@k8s-ci-robot (Contributor)

@p0lyn0mial: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: p0lyn0mial
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@p0lyn0mial (Contributor, Author)

/reopen
/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot reopened this May 23, 2024
@k8s-ci-robot (Contributor)

@p0lyn0mial: Reopened this PR.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 23, 2024
@wojtek-t
@wojtek-t (Member)

The new PR is much better. Closing in favor of #2764

@wojtek-t wojtek-t closed this Jul 26, 2024