NodeKiller seems to be not working in 100 node 1.17 / master performance tests #1005

Open
mm4tt opened this issue Feb 3, 2020 · 19 comments
Labels: good first issue, help wanted, lifecycle/frozen

Comments

@mm4tt
Contributor

mm4tt commented Feb 3, 2020

Original debugging done by @jkaniuk:

In 100 nodes OSS performance tests of 1.16:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-1.16-scalability-100

NodeKiller is consistently failing:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability-stable1/1219228169567997957

W0120 12:25:21.234] I0120 12:25:21.234558   12979 nodes.go:105] NodeKiller: Rebooting "e2e-big-minion-group-tt6r" to repair the node
W0120 12:25:24.556] I0120 12:25:24.555774   12979 ssh.go:38] ssh to "e2e-big-minion-group-tt6r" finished with "External IP address was not found; defaulting to using IAP tunneling.\npacket_write_wait: Connection to UNKNOWN port 65535: Broken pipe\r\nERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].\n": exit status 255
W0120 12:25:24.556] E0120 12:25:24.555839   12979 nodes.go:108] NodeKiller: Error while rebooting node "e2e-big-minion-group-tt6r": exit status 255
@mm4tt
Contributor Author

mm4tt commented Feb 3, 2020

Looks like it transiently fails in 1.16, meaning that some of the ssh calls succeed and some do not (within a single run), e.g.

OK - W0129 12:36:05.537] I0129 12:36:05.536771 10910 ssh.go:38] ssh to "e2e-big-minion-group-5l39" finished with "External IP address was not found; defaulting to using IAP tunneling.\nWarning: Permanently added 'compute.691072012589517573' (ED25519) to the list of known hosts.\r\nWarning: Stopping docker.service, but it can still be activated by:\n docker.socket\n": <nil>

BAD - W0129 12:46:08.636] I0129 12:46:08.636013 10910 ssh.go:38] ssh to "e2e-big-minion-group-5l39" finished with "External IP address was not found; defaulting to using IAP tunneling.\npacket_write_wait: Connection to UNKNOWN port 65535: Broken pipe\r\nERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].\n": exit status 255

@mm4tt
Contributor Author

mm4tt commented Feb 3, 2020

I checked 3 runs of the 1.17 test and the problem doesn't occur there. It seems to be a 1.16-specific thing.

Maybe there is a different gcloud version used in 1.16 and 1.17?

@mm4tt
Contributor Author

mm4tt commented Feb 3, 2020

I'd try upgrading the gcloud version in the 1.16 test to see whether it helps.

@mm4tt
Contributor Author

mm4tt commented Feb 3, 2020

/assign

@mm4tt
Contributor Author

mm4tt commented Feb 4, 2020

kubernetes/test-infra#16103 doesn't seem to be helping, let's revert it.

I took a deeper look and have a new theory now. It looks like in the 1.17 runs there are no logs from the chaosmonkey components. I believe the errors we see in 1.16 are actually expected; they are returned for the reboot command, which terminates the ssh connection. We don't see them in 1.17 because chaosmonkey doesn't work properly there for some reason. The thing that stands out is that in 1.17 we have this commit and we don't have it in 1.16.
This commit is also present in master, and there we also don't have any chaosmonkey logs.

I'd suggest adding more logging to nodes.go in the master branch to see what is going on with the chaosmonkey there.
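
For reference, here is a minimal sketch (a hypothetical helper, not the real nodes.go/ssh.go code) of why a reboot issued over ssh can end with exit status 255 even when the reboot itself succeeded, and how that specific status could be tolerated explicitly:

```go
// Sketch only: the rebooting node tears down the ssh session itself, so
// gcloud/ssh reports exit status 255 even when the reboot worked.
package main

import (
	"errors"
	"log"
	"os/exec"
)

func rebootNode(node string) error {
	cmd := exec.Command("gcloud", "compute", "ssh", node, "--command", "sudo reboot")
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 255 {
		// Connection dropped by the rebooting node; treat as expected.
		log.Printf("ssh to %q closed with 255 (expected for reboot)", node)
		return nil
	}
	return err
}

func main() {
	if err := rebootNode("e2e-big-minion-group-example"); err != nil {
		log.Fatal(err)
	}
}
```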

@mm4tt mm4tt changed the title NodeKiller not working in 100 node 1.16 performance tests NodeKiller seems to be not working in 100 node 1.17 / master performance tests Feb 4, 2020
@mm4tt
Contributor Author

mm4tt commented Feb 4, 2020

/good-first-issue

@k8s-ci-robot
Contributor

@mm4tt:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue and help wanted labels Feb 4, 2020
@mm4tt
Contributor Author

mm4tt commented Feb 4, 2020

FTR, these are the chaosmonkey files that we could instrument better - https://github.com/kubernetes/perf-tests/tree/eb4fffb50d3caee11a57262b46286f051d9337fb/clusterloader2/pkg/chaos
Adding more verbose logging there (e.g. listing all the nodes chaosmonkey is operating on, logging when chaosmonkey attempts to kill a node, etc.) should help us debug this issue.
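
As a rough illustration, the kind of logging meant here could look like the sketch below; the function and variable names are invented for the example, not the actual clusterloader2 chaos API:

```go
// Sketch of the extra visibility suggested above: log the full candidate
// list once and every kill attempt, so an empty candidate set or a silently
// skipped node shows up directly in the run logs.
package main

import (
	"flag"

	"k8s.io/klog"
)

func killNodes(nodes []string) {
	klog.Infof("NodeKiller: operating on %d candidate nodes: %v", len(nodes), nodes)
	for _, node := range nodes {
		klog.Infof("NodeKiller: attempting to reboot node %q", node)
		// ... existing ssh / reboot logic would go here ...
	}
}

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()
	killNodes([]string{"e2e-minion-group-example-1", "e2e-minion-group-example-2"})
}
```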

@jprzychodzen
Contributor

/assign

@jprzychodzen
Contributor

There are two different issues:

  • Nodes are not actually selected at random here: with the current configuration, the selection mechanism picks no node to kill whenever (failure rate) * (number of eligible nodes) < 1. After filtering out the node running Prometheus we have fewer than 100 eligible nodes, and with the current failure rate of 0.01 the product is below one, so no node is ever selected (see the sketch below).
  • The 255 response from SSH is just the connection to the node being closed non-gracefully after the reboot command is executed.
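
A minimal, self-contained illustration of the first point (names invented for the sketch; the real selection logic lives in clusterloader2's chaos package):

```go
// With failureRate = 0.01 and ~99 eligible nodes (Prometheus node filtered
// out), failureRate * len(nodes) < 1 truncates to 0, so nothing is picked.
package main

import (
	"fmt"
	"math/rand"
)

func nodesToKill(nodes []string, failureRate float64) []string {
	numToKill := int(failureRate * float64(len(nodes))) // 0.01 * 99 = 0.99 -> 0
	rand.Shuffle(len(nodes), func(i, j int) { nodes[i], nodes[j] = nodes[j], nodes[i] })
	return nodes[:numToKill]
}

func main() {
	nodes := make([]string, 99) // 100-node cluster minus the Prometheus node
	for i := range nodes {
		nodes[i] = fmt.Sprintf("e2e-minion-group-%d", i)
	}
	fmt.Println(len(nodesToKill(nodes, 0.01))) // prints 0; no node gets killed
}
```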

@jprzychodzen
Contributor

Fixed.

@jprzychodzen
Contributor

/close

@k8s-ci-robot
Contributor

@jprzychodzen: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mm4tt
Contributor Author

mm4tt commented Mar 30, 2020

/reopen

#1140

@k8s-ci-robot k8s-ci-robot reopened this Mar 30, 2020
@k8s-ci-robot
Contributor

@mm4tt: Reopened this issue.

In response to this:

/reopen

#1140

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jun 28, 2020
@jkaniuk
Contributor

jkaniuk commented Jun 29, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label Jun 29, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Sep 27, 2020
@wojtek-t
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale labels Sep 28, 2020