
bug(k8s): Trivy gets stuck when scanning a cluster with taints on nodes #8087

Open
afdesk opened this issue Dec 12, 2024 · Discussed in #5639 · 3 comments

Labels: bug, target/kubernetes (Issues relating to kubernetes cluster scanning)

afdesk (Contributor) commented Dec 12, 2024

Description

When using Trivy to scan a Kubernetes cluster, the scan process gets stuck if any node in the cluster has taints applied. For example, a control-plane node with the taint node-role.kubernetes.io/control-plane causes this issue.

2024-12-12T17:56:21+06:00	FATAL	Fatal error	get k8s artifacts with node info error: running node-collector job: runner received timeout

To improve usability, Trivy should handle such cases more gracefully. It could skip nodes that cannot be scanned without additional tolerations applied, instead of causing the scan to get stuck.

Desired behavior:

  • Skip nodes that require tolerations to scan (a rough sketch of the idea follows below this list).
  • Provide clear warnings or logs about the skipped nodes.
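
A very rough sketch of what the skipping could look like (this is not Trivy's actual implementation, and the package/function names are made up; it only illustrates the idea using the upstream k8s.io/api/core/v1 toleration helpers):

package collector // hypothetical package name, for illustration only

import (
	"log"

	corev1 "k8s.io/api/core/v1"
)

// skippableNodes returns the names of nodes whose NoSchedule/NoExecute taints
// are not covered by the configured tolerations. The node-collector could log
// these as warnings and scan the remaining nodes instead of hanging.
func skippableNodes(nodes []corev1.Node, tolerations []corev1.Toleration) []string {
	var skipped []string
	for _, node := range nodes {
		for i := range node.Spec.Taints {
			taint := &node.Spec.Taints[i]
			if taint.Effect != corev1.TaintEffectNoSchedule && taint.Effect != corev1.TaintEffectNoExecute {
				continue
			}
			tolerated := false
			for j := range tolerations {
				if tolerations[j].ToleratesTaint(taint) {
					tolerated = true
					break
				}
			}
			if !tolerated {
				log.Printf("WARN: skipping node %q: taint %s is not tolerated", node.Name, taint.ToString())
				skipped = append(skipped, node.Name)
				break
			}
		}
	}
	return skipped
}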

Workaround:

You can set up tolerations with the --tolerations flag:

$ trivy k8s --report summary --tolerations node-role.kubernetes.io/control-plane="":NoSchedule

Steps to Reproduce

$ kind delete cluster --name cilium && kind create cluster --config config.yaml
$ kubectl get nodes  
$ trivy k8s --report summary
config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: cilium
nodes:
  # control-plane nodes
- role: control-plane
  image: kindest/node:v1.31.2
- role: control-plane
  image: kindest/node:v1.31.2
- role: control-plane
  image: kindest/node:v1.31.2
  # worker nodes
- role: worker
  image: kindest/node:v1.31.2
- role: worker
  image: kindest/node:v1.31.2
- role: worker
  image: kindest/node:v1.31.2

Discussed in #5639 (comment)

@afdesk afdesk added target/kubernetes Issues relating to kubernetes cluster scanning bug labels Dec 12, 2024
@afdesk afdesk self-assigned this Dec 12, 2024

ak2766 commented Dec 13, 2024

I've discovered why I'm getting failures.

I'm running trivy ... commands on my laptop here in Melbourne, Australia, against a cluster in the AWS us-east-2 region. I believe the latency is playing havoc with the scan. I've just successfully finished a scan using an LXC container local to the cluster. After that, I ran another scan from my local laptop with --timeout 1h; it finished successfully, and this time I got to see the output from all scanner types.

I was starting to feel dense, thinking I just couldn't master such a simple task. I believe using --scanners vuln was helping in that it limited the scan to just the vulnerability checks.
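
For anyone else scanning a remote cluster, combining those two options looks something like this (both --scanners and --timeout are standard trivy flags):

$ trivy k8s --report summary --scanners vuln --timeout 1h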

@afdesk - Thanks muchly for your patience.

The only critique I have is that there's no indication anything is happening when the scan actually begins - i.e. after the downloads are done. A progress bar would go a long way to alleviate the urge to ctrl-c out of it thinking it is stuck...


ak2766 commented Dec 13, 2024

A full scan for me took ~18 minutes. I think the default of 5m is too short for a full cluster scan, especially when there are multiple deployments - I only have 25 deployments in this PoC cluster. Increasing the default to 30 minutes and adding a progress bar would go a long way in helping those new to Trivy who are running scans against remote clusters.

Alternatively, maybe implement something like how k8s does CrashLoopBackOff but in reverse - i.e. reset the timeout counter whenever the scan moves a step forward.
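
Just to make that idea concrete, here is a minimal sketch of an inactivity-style timeout in Go (not Trivy's code; package, function, and channel names are made up): the deadline only fires when nothing has progressed for the configured duration.

package scanwatch // hypothetical package name

import (
	"context"
	"time"
)

// runWithInactivityTimeout cancels the scan only when no progress event has
// arrived for `idle`, instead of enforcing one fixed deadline for the whole run.
func runWithInactivityTimeout(ctx context.Context, idle time.Duration, progress <-chan struct{}) error {
	timer := time.NewTimer(idle)
	defer timer.Stop()

	for {
		select {
		case _, ok := <-progress:
			if !ok {
				return nil // progress channel closed: the scan finished
			}
			// A resource/node finished scanning, so push the deadline out again.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(idle)
		case <-timer.C:
			return context.DeadlineExceeded // nothing happened for `idle`
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}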


ak2766 commented Dec 13, 2024

One other thing I forgot to mention was that my v1.30.4 cluster is still using master for the taints - i.e.:

$ kubectl get nodes -l node-role.kubernetes.io/control-plane -o jsonpath="{range .items[*]}{.metadata.name}{': '}{.spec.taints}{'\n'}{end}"
node1: [{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"}]
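
So on clusters like this one, I assume the --tolerations workaround from the description needs to target the legacy key instead, something like:

$ trivy k8s --report summary --tolerations node-role.kubernetes.io/master="":NoSchedule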
