High cpu usage of powershell processes triggered by csi-proxy #193

Closed
dazhiw opened this issue Jan 13, 2022 · 22 comments · Fixed by #197 or #336 · May be fixed by #360
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@dazhiw

dazhiw commented Jan 13, 2022

What happened:
We are doing some performance tests for vsphere-csi-driver (https://github.com/kubernetes-sigs/vsphere-csi-driver) on a k8s cluster with 20 Windows worker nodes. Each node has 4 CPUs and 16GB of memory. The k8s version is 1.22.3 and the Windows version is Windows Server 2019. The vSphere CSI node plugin uses csi-proxy for privileged operations on host devices. The test repeatedly issues CreatePod followed by DeletePod. Each Pod starts one container with the image "mcr.microsoft.com/oss/kubernetes/pause:1.4.1" and mounts one persistent volume.

During the test we observed high CPU usage on the Windows worker nodes, and most of it came from the PowerShell processes triggered by csi-proxy. With 1.4 Pod creations per second, the average CPU usage on each node was about 75% of the total CPU capacity, of which the PowerShell processes triggered by csi-proxy accounted for about 45%. With 2.4 Pod creations per second, CPU usage on each node reached almost 100%, with the PowerShell processes accounting for about 68%. By comparison, for vsphere-csi-driver on Linux nodes, with 3 Pod creations per second the CPU usage on each node stayed constantly below 10%.

What you expected to happen:
Reduce the CPU cost of csi-proxy.

How to reproduce it:
Deploy vsphere-csi-driver on a k8s cluster with Windows nodes, create a Pod with a PV mount, then delete the Pod.

Anything else we need to know?:

Environment:

  • CSI Driver version: vsphere CSI 2.5
  • Kubernetes version (use kubectl version): 1.22.3
  • OS (e.g. from /etc/os-release): Windows Server 2019
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@divyenpatel

/reopen

@k8s-ci-robot
Contributor

@divyenpatel: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@divyenpatel

@dazhiw

@pradeep-hegde's change partially fixes the issue but does not bring the much-needed performance improvement.
Can we keep this issue open?

@k8s-ci-robot
Contributor

@wdazhi2020: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dazhiw
Author

dazhiw commented Jun 30, 2022

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jun 30, 2022
@k8s-ci-robot
Contributor

@dazhiw: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42
Contributor

msau42 commented Jul 20, 2022

@divyenpatel did we compare with in-tree Windows? It is known that Windows consumes more resources than Linux. The question is: do we think there's a regression between in-tree and CSI on Windows?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2022
@mauriciopoppe
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 21, 2022
@mauriciopoppe
Member

@alexander-ding please post the results of your experiments here

@alexander-ding
Contributor

@alexander-ding please post the results of your experiments here

Unfortunately, the experiments I ran were not related to resource usage; I was measuring latency and throughput for CSI Proxy. The resource-usage experiments were on Linux nodes.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 19, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 20, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mauriciopoppe mauriciopoppe reopened this Apr 15, 2024
@mauriciopoppe
Member

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Apr 15, 2024
@knabben
Contributor

knabben commented Apr 16, 2024

Circling back to this: I tracked it down to a large number of calls to (Get-Item -Path $Env:mountpath).Target that appear to hang in Process Explorer, even when all the pods with PVCs are deleted.

I0416 10:23:52.948472    9092 server.go:274] GetVolumeIDFromTargetPath: Request: &{TargetPath:c:\var\lib\kubelet\pods\83a8a951-2b74-4cee-bafc-b6cb6790ddbd\volumes\kubernetes.io~csi\pvc-45bca5d0-2244-43de-983a-90e6dbc3322e\mount}
I0416 10:23:52.948937    9092 server.go:274] GetVolumeIDFromTargetPath: Request: &{TargetPath:c:\var\lib\kubelet\pods\0d746ae0-6a44-4b78-9a5b-d05fa146fd06\volumes\kubernetes.io~csi\pvc-45bca5d0-2244-43de-983a-90e6dbc3322e\mount}
I0416 10:23:52.953077    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:52.955659    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.322869    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.483584    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.486178    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.519391    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:55.167002    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:56.050764    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:57.028808    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:58.569111    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:59.037196    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.052682    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.142338    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:59.402459    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.629165    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:24:00.037973    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"

Dumping the stack trace shows the same function calling itself over and over, which is very different from the normal behavior.

runtime/debug.Stack()
        /home/aknabben/.asdf/installs/golang/1.22.1/go/src/runtime/debug/stack.go:24 +0x5e
github.com/kubernetes-csi/csi-proxy/pkg/utils.RunPowershellCmd({0x1756b3d, 0x26}, {0xc0002e3c40, 0x1, 0x1?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/utils/utils.go:17 +0xee
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eb200?, 0xc000406400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:283 +0xa5
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eaf00?, 0xc0000d1c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218c00?, 0xc0001de400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eac00?, 0xc0000d1400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0003ca300?, 0xc00007d000?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218900?, 0xc000117c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000ea900?, 0xc0000d0c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218600?, 0xc000117400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
...
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.VolumeAPI.GetVolumeIDFromTargetPath({}, {0xc0000f3290, 0x85})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:271 +0x25
github.com/kubernetes-csi/csi-proxy/pkg/server/volume.(*Server).GetVolumeIDFromTargetPath(0xc0003b80a0, {0x0?, 0xc000161008?}, 0xc000415360, {0x1, 0x2, 0x0, {0x173b627, 0x2}})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/server/volume/server.go:281 +0x133
github.com/kubernetes-csi/csi-proxy/pkg/server/volume/impl/v1.(*versionedAPI).GetVolumeIDFromTargetPath(0xc0003b81c0, {0x180b870, 0xc0004f45d0}, 0xc0005c0c40)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/server/volume/impl/v1/server_generated.go:74 +0xbc
github.com/kubernetes-csi/csi-proxy/client/api/volume/v1._Volume_GetVolumeIDFromTargetPath_Handler({0x170c140, 0xc0003b81c0}, {0x180b870, 0xc0004f45d0}, 0xc00011ee00, 0x0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/github.com/kubernetes-csi/csi-proxy/client/api/volume/v1/api.pb.go:1799 +0x1a6
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003f1a40, {0x180b870, 0xc0004f4420}, {0x180f1c0, 0xc0003feb60}, 0xc0001417a0, 0xc0003b34d0, 0x1c09520, 0x0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:1343 +0xdd1
google.golang.org/grpc.(*Server).handleStream(0xc0003f1a40, {0x180f1c0, 0xc0003feb60}, 0xc0001417a0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:1737 +0xc47
google.golang.org/grpc.(*Server).serveStreams.func1.1()
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 122
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:997 +0x136
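
For reference, here is a minimal sketch of the pattern the trace shows: a resolver that shells out to a fresh powershell.exe for every level of indirection and recurses on the resolved path. The names and control flow are simplified for illustration (based only on the logged command line and the repeated api.go:289 frames); this is not the exact csi-proxy code.

// Simplified sketch, not the actual csi-proxy implementation. Windows-only at runtime.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// runPowershell spawns a new powershell.exe for every call, mirroring the
// command line in the logs: powershell -Mta -NoProfile -Command ...
func runPowershell(command string, env ...string) (string, error) {
	cmd := exec.Command("powershell", "-Mta", "-NoProfile", "-Command", command)
	cmd.Env = append(os.Environ(), env...)
	out, err := cmd.CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

// getTarget resolves a mount path to the backing volume. Every recursion
// level pays for one full PowerShell process, which is where the CPU time
// goes under load; when the chain of links is long (or never reaches a
// Volume{GUID} path), the calls pile up the way the stack trace above shows.
func getTarget(mountPath string) (string, error) {
	target, err := runPowershell(`(Get-Item -Path $Env:mountpath).Target`,
		fmt.Sprintf("mountpath=%s", mountPath))
	if err != nil || target == "" {
		return "", fmt.Errorf("could not resolve %q: %v", mountPath, err)
	}
	if strings.HasPrefix(target, "Volume{") {
		return `\\?\` + target, nil // reached the volume GUID, stop
	}
	// Not a volume yet: recurse on the resolved path, spawning yet another
	// powershell.exe (the repeated api.go:289 frames above).
	return getTarget(target)
}

func main() {
	// Usage sketch: go run . <mount-path-under-kubelet-pods>
	target, err := getTarget(os.Args[1])
	fmt.Println(target, err)
}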

@andyzhangx
Member

For VolumeStats we could use the windows.GetDiskFreeSpaceEx API instead, which would avoid this issue; check the refined PR here:
kubernetes-sigs/azuredisk-csi-driver#2287
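
As a rough illustration of that suggestion, a minimal sketch of a VolumeStats-style helper built on golang.org/x/sys/windows.GetDiskFreeSpaceEx; the helper name and output format are made up for the example and this is not the actual azuredisk-csi-driver change.

// Illustrative sketch only; compiles for Windows targets.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/windows"
)

// getVolumeStats returns capacity and used bytes for the volume containing
// path, with a single syscall and no powershell.exe process.
func getVolumeStats(path string) (capacity, used uint64, err error) {
	p, err := windows.UTF16PtrFromString(path)
	if err != nil {
		return 0, 0, err
	}
	var freeToCaller, total, totalFree uint64
	if err := windows.GetDiskFreeSpaceEx(p, &freeToCaller, &total, &totalFree); err != nil {
		return 0, 0, err
	}
	return total, total - totalFree, nil
}

func main() {
	capacity, used, err := getVolumeStats(os.Args[1])
	if err != nil {
		panic(err)
	}
	// Matches the shape of the "VolumeStats: returned: Capacity ... Used ..." log lines above.
	fmt.Printf("Capacity %d Used %d\n", capacity, used)
}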

@randomvariable

The real answer to this is to stop using PowerShell, and use Win32 APIs where possible, falling back to WMI invocation. I have POC code for some of this that could be ported. Will take a while to complete, but happy to start taking this on?

@mauriciopoppe
Member

Moving from PowerShell to the Win32 API would be a nice change, at the expense of making the code harder to read, e.g.

func (DiskAPI) GetDiskPage83ID(disk syscall.Handle) (string, error) {

is harder to read than a PowerShell command. For that case I think there was no suitable PowerShell command, which is why the team implemented it with the Win32 API directly.

For the current issue, @knabben did a great analysis in #193 (comment) of the current implementation and also opened a PR fixing it in #336, but we haven't been able to merge it because the presubmit tests are broken due to a Windows module that was disabled in GitHub Actions. I was trying to fix that in #348 but didn't have luck.

For the next steps, the plan of action was to:

The real answer to this is to stop using PowerShell, and use Win32 APIs where possible, falling back to WMI invocation. I have POC code for some of this that could be ported. Will take a while to complete, but happy to start taking this on?

Changes would be blocked by the presubmit error, so we should try to fix the SMB tests first.

@randomvariable

randomvariable commented Oct 10, 2024

If we're not keen on using the Win32 API directly because of readability, then we should strongly consider using the MSFT_Disk and Win32_DiskPartition WMI classes, which are what PowerShell wraps, but without the .NET overhead.
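
For a sense of what that could look like in Go, here is a rough sketch that queries Win32_DiskPartition through the github.com/StackExchange/wmi package; the library choice, struct fields, and query are assumptions for illustration, not a decision the project has made.

// Illustrative sketch only; Windows-only at runtime.
package main

import (
	"fmt"

	"github.com/StackExchange/wmi"
)

// win32DiskPartition mirrors the subset of the Win32_DiskPartition class
// used in this example.
type win32DiskPartition struct {
	DeviceID  string
	DiskIndex uint32
	Size      uint64
}

func main() {
	var partitions []win32DiskPartition
	// Query WMI directly (the same data the PowerShell storage cmdlets wrap),
	// without starting a powershell.exe or loading .NET.
	q := "SELECT DeviceID, DiskIndex, Size FROM Win32_DiskPartition"
	if err := wmi.Query(q, &partitions); err != nil {
		panic(err)
	}
	for _, p := range partitions {
		fmt.Printf("%s disk=%d size=%d\n", p.DeviceID, p.DiskIndex, p.Size)
	}
}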

@laozc

laozc commented Oct 14, 2024

You can see what switching to WMI classes would look like in this WIP branch:
laozc@b7483ac
