Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Measure SLOs across whole suite, not single tests #779

Open
oxddr opened this issue Sep 6, 2019 · 16 comments
Open

Measure SLOs across whole suite, not single tests #779

oxddr opened this issue Sep 6, 2019 · 16 comments
Labels
area/slo lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@oxddr
Copy link
Contributor

oxddr commented Sep 6, 2019

Currently we measure all SLOs per-test. We think about measuring it across whole testing suite (density + load).

Measuring SLO across performance suite will increase number of windows (as defined in SLO description). Single bad request will have a smaller chances to sink the whole tests. Currently we see test flakiness cause by a single request (e.g. kubernetes/kubernetes#82377). Also this would put us closer to intention behind the two-level SLO, which is defined per cluster-day.

Implementation wise, this would involve merging density and load test into a single test and moving some measurements to the very end of it.

/area slo

@wojtek-t
Copy link
Member

wojtek-t commented Sep 6, 2019

Implementation wise, this would involve merging density and load test into a single test and moving some measurements to the very end of it.

Alternatively (I'm actually leaning towards saying that it would be better, but I'm happy to hear arguments against) we could:

  • add a concept of measurement that is run before the first test case and then after the last
    [conceptually BeforeSuite and AfterSuite from ginkgo]
  • start and gather all SLOs there

@wojtek-t
Copy link
Member

wojtek-t commented Sep 6, 2019

I think that we can even hardcode the set of things that we start there - it would also visibly simplify the configs...

@wojtek-t
Copy link
Member

wojtek-t commented Sep 6, 2019

@mm4tt - FYI

@oxddr
Copy link
Contributor Author

oxddr commented Sep 6, 2019

I like the new per-suite measurements you propose. This is indeed a better, long-term solution. Mine was suppose to be rather a hacky prototype to validate the concept.

However I'd rather avoid hard-coding any set of measurements, this would decrease the flexibility clusterloader2 gives us. But I'm open for debate.

@mm4tt
Copy link
Contributor

mm4tt commented Sep 6, 2019

@wojtek-t
Copy link
Member

wojtek-t commented Sep 6, 2019

However I'd rather avoid hard-coding any set of measurements, this would decrease the flexibility clusterloader2 gives us. But I'm open for debate.

I'm not going to push hard for this. My main goal is just to simplify the configs.
So the next iteration of that proposal is:

  1. opaque the SLO-based measurement (api-call-latencies, network-programming, ...) into single "SLO" measurement [we can't yet add pod-startup-time, but we hopefully will be in the future]
    [this is similar to what we did for TestMetrics]

  2. Make that a default measurement for Before/After suite.

@wojtek-t
Copy link
Member

wojtek-t commented Sep 6, 2019

Could it become part of the TestSuite - https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/test_suite_design.md ?

yes - it perfectly fits there

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2019
@wojtek-t
Copy link
Member

wojtek-t commented Dec 5, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2019
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 4, 2020
@mm4tt
Copy link
Contributor

mm4tt commented Mar 5, 2020

Closing this one in favor of #1007
While #1007 it's not addressing exactly the same problem and reasoning for doing it are different, it will make this FR obsolete.

/close

@k8s-ci-robot
Copy link
Contributor

@mm4tt: Closing this issue.

In response to this:

Closing this one in favor of #1007
While #1007 it's not addressing exactly the same problem and reasoning for doing it are different, it will make this FR obsolete.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@oxddr
Copy link
Contributor Author

oxddr commented Mar 5, 2020

/reopen

While we may not want to do this in the short-term, I'd like to keep this one open. The issues, you've mentioned means we avoid the problem, but not solving it. Where the problem i lack of functionality in clusterloader. We've added test for tokens recently, which is measured separately. We may add more in the future. We should be able to do measurements across the suite.

@k8s-ci-robot k8s-ci-robot reopened this Mar 5, 2020
@k8s-ci-robot
Copy link
Contributor

@oxddr: Reopened this issue.

In response to this:

/reopen

While we may not want to do this in the short-term, I'd like to keep this one open. The issues, you've mentioned means we avoid the problem, but not solving it. Where the problem i lack of functionality in clusterloader. We've added test for tokens recently, which is measured separately. We may add more in the future. We should be able to do measurements across the suite.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 4, 2020
@wojtek-t
Copy link
Member

wojtek-t commented Apr 4, 2020

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Apr 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/slo lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
None yet
Development

No branches or pull requests

5 participants