NodeKiller seems to be not working in 100 node 1.17 / master performance tests #1005

Open
mm4tt opened this issue Feb 3, 2020 · 19 comments
Labels: good first issue, help wanted, lifecycle/frozen

Comments

@mm4tt
Contributor

mm4tt commented Feb 3, 2020

Original debugging done by @jkaniuk:

In 100 nodes OSS performance tests of 1.16:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-1.16-scalability-100

NodeKiller is consistently failing:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability-stable1/1219228169567997957

W0120 12:25:21.234] I0120 12:25:21.234558   12979 nodes.go:105] NodeKiller: Rebooting "e2e-big-minion-group-tt6r" to repair the node
W0120 12:25:24.556] I0120 12:25:24.555774   12979 ssh.go:38] ssh to "e2e-big-minion-group-tt6r" finished with "External IP address was not found; defaulting to using IAP tunneling.\npacket_write_wait: Connection to UNKNOWN port 65535: Broken pipe\r\nERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].\n": exit status 255
W0120 12:25:24.556] E0120 12:25:24.555839   12979 nodes.go:108] NodeKiller: Error while rebooting node "e2e-big-minion-group-tt6r": exit status 255
@mm4tt
Contributor Author

mm4tt commented Feb 3, 2020

Looks like it transiently fails in 1.16, meaning that some of the ssh calls succeed and some do not (within a single run), e.g.

OK - W0129 12:36:05.537] I0129 12:36:05.536771 10910 ssh.go:38] ssh to "e2e-big-minion-group-5l39" finished with "External IP address was not found; defaulting to using IAP tunneling.\nWarning: Permanently added 'compute.691072012589517573' (ED25519) to the list of known hosts.\r\nWarning: Stopping docker.service, but it can still be activated by:\n docker.socket\n": <nil>

BAD - W0129 12:46:08.636] I0129 12:46:08.636013 10910 ssh.go:38] ssh to "e2e-big-minion-group-5l39" finished with "External IP address was not found; defaulting to using IAP tunneling.\npacket_write_wait: Connection to UNKNOWN port 65535: Broken pipe\r\nERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].\n": exit status 255

@mm4tt
Contributor Author

mm4tt commented Feb 3, 2020

I checked 3 runs of the 1.17 test and the problem doesn't occur there. It seems to be a 1.16-specific thing.

Maybe there is a different gcloud version used in 1.16 and 1.17?

@mm4tt
Contributor Author

mm4tt commented Feb 3, 2020

I'd try upgrading the gcloud version in the 1.16 test to see whether it helps.

@mm4tt
Contributor Author

mm4tt commented Feb 3, 2020

/assign

@mm4tt
Contributor Author

mm4tt commented Feb 4, 2020

kubernetes/test-infra#16103 doesn't seem to be helping, let's revert it.

I took a deeper look and have a new theory now. It looks like in the 1.17 runs there are no logs from the chaosmonkey components. I believe the errors we see in 1.16 are actually expected; they are returned for the reboot command, which terminates the ssh connection. We don't see them in 1.17 because chaosmonkey doesn't work properly there for some reason. The thing that stands out is that in 1.17 we have this commit and we don't have it in 1.16.
This commit is also present in master, and there we also don't have any chaosmonkey logs.

I'd suggest adding more logging to nodes.go in the master branch to see what is going on with the chaosmonkey there.
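
For reference, here is a minimal sketch (a hypothetical helper, not the real nodes.go/ssh.go code) of why a reboot issued over ssh can end with exit status 255 even when the reboot itself succeeded, and how that specific status could be tolerated explicitly:

```go
// Sketch only: the rebooting node tears down the ssh session itself, so
// gcloud/ssh reports exit status 255 even when the reboot worked.
package main

import (
	"errors"
	"log"
	"os/exec"
)

func rebootNode(node string) error {
	cmd := exec.Command("gcloud", "compute", "ssh", node, "--command", "sudo reboot")
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 255 {
		// Connection dropped by the rebooting node; treat as expected.
		log.Printf("ssh to %q closed with 255 (expected for reboot)", node)
		return nil
	}
	return err
}

func main() {
	if err := rebootNode("e2e-big-minion-group-example"); err != nil {
		log.Fatal(err)
	}
}
```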

@mm4tt mm4tt changed the title NodeKiller not working in 100 node 1.16 performance tests NodeKiller seems to be not working in 100 node 1.17 / master performance tests Feb 4, 2020
@mm4tt
Contributor Author

mm4tt commented Feb 4, 2020

/good-first-issue

@k8s-ci-robot
Contributor

@mm4tt:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue and help wanted labels Feb 4, 2020
@mm4tt
Contributor Author

mm4tt commented Feb 4, 2020

FTR, these are the chaosmonkey files that we could instrument better - https://github.com/kubernetes/perf-tests/tree/eb4fffb50d3caee11a57262b46286f051d9337fb/clusterloader2/pkg/chaos
Adding more verbose logging there (e.g. listing all the nodes chaosmonkey is operating on, logging when chaosmonkey attempts to kill a node, etc.) should help us debug this issue.
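
As a rough illustration, the kind of logging meant here could look like the sketch below; the function and variable names are invented for the example, not the actual clusterloader2 chaos API:

```go
// Sketch of the extra visibility suggested above: log the full candidate
// list once and every kill attempt, so an empty candidate set or a silently
// skipped node shows up directly in the run logs.
package main

import (
	"flag"

	"k8s.io/klog"
)

func killNodes(nodes []string) {
	klog.Infof("NodeKiller: operating on %d candidate nodes: %v", len(nodes), nodes)
	for _, node := range nodes {
		klog.Infof("NodeKiller: attempting to reboot node %q", node)
		// ... existing ssh / reboot logic would go here ...
	}
}

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()
	killNodes([]string{"e2e-minion-group-example-1", "e2e-minion-group-example-2"})
}
```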

@jprzychodzen
Contributor

/assign

@jprzychodzen
Contributor

There are two different issues:

  • Nodes are not actually selected at random here: with the current configuration, the selection mechanism picks no node to kill whenever (failure rate) * (number of eligible nodes) < 1. After filtering out the node running Prometheus we have fewer than 100 eligible nodes, and with the current failure rate of 0.01 the product is below one, so no node is ever selected (see the sketch below).
  • The 255 response from SSH is just the connection to the node being closed non-gracefully after the reboot command is executed.
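
A minimal, self-contained illustration of the first point (names invented for the sketch; the real selection logic lives in clusterloader2's chaos package):

```go
// With failureRate = 0.01 and ~99 eligible nodes (Prometheus node filtered
// out), failureRate * len(nodes) < 1 truncates to 0, so nothing is picked.
package main

import (
	"fmt"
	"math/rand"
)

func nodesToKill(nodes []string, failureRate float64) []string {
	numToKill := int(failureRate * float64(len(nodes))) // 0.01 * 99 = 0.99 -> 0
	rand.Shuffle(len(nodes), func(i, j int) { nodes[i], nodes[j] = nodes[j], nodes[i] })
	return nodes[:numToKill]
}

func main() {
	nodes := make([]string, 99) // 100-node cluster minus the Prometheus node
	for i := range nodes {
		nodes[i] = fmt.Sprintf("e2e-minion-group-%d", i)
	}
	fmt.Println(len(nodesToKill(nodes, 0.01))) // prints 0; no node gets killed
}
```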

@jprzychodzen
Contributor

Fixed.

@jprzychodzen
Contributor

/close

@k8s-ci-robot
Contributor

@jprzychodzen: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mm4tt
Contributor Author

mm4tt commented Mar 30, 2020

/reopen

#1140

@k8s-ci-robot k8s-ci-robot reopened this Mar 30, 2020
@k8s-ci-robot
Contributor

@mm4tt: Reopened this issue.

In response to this:

/reopen

#1140

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jun 28, 2020
@jkaniuk
Contributor

jkaniuk commented Jun 29, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label Jun 29, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Sep 27, 2020
@wojtek-t
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale labels Sep 28, 2020