[CONTP-60] Improved telemetry on cluster check configs dangling #32508

gabedos · 2024-12-24T19:45:10Z

What does this PR do?

Adds a new telemetry metric to keep track of cluster check configs that have gone an extended period of time without being scheduled.

Motivation

We would like to monitor the state of cluster-check scheduling so that we detect issues before application teams are paged for missing metrics.

Describe how you validated your changes

Existing unit tests cover the dispatching logic. An updated unit test verifies the `unscheduledCheck` flag behavior.

Possible Drawbacks / Trade-offs

N/A

Additional Notes

Previously, each rescheduling attempt began by deleting all dangling configs from the dangling store. Any config that failed to be rescheduled was then re-added to the dangling config store.

This PR changes this behavior because we want to track how long the config exists in the dangling state. Hence, we only remove the config from the dangling store once it's confirmed to be scheduled.
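
For illustration, here is a minimal sketch of the behavior described above. It is not the code from this PR: `danglingStore`, `danglingConfig`, `reschedule`, and `countExtended` are hypothetical names, and the 5-minute threshold is an arbitrary value chosen for the example. It shows a store where an entry keeps its original timestamp across rescheduling attempts and is removed only after scheduling succeeds.

```go
package main

import (
	"fmt"
	"time"
)

// danglingConfig is a hypothetical record of when a config first became dangling.
type danglingConfig struct {
	name          string
	danglingSince time.Time
}

// danglingStore is a hypothetical stand-in for the dispatcher's dangling store.
type danglingStore struct {
	configs map[string]*danglingConfig
}

func newDanglingStore() *danglingStore {
	return &danglingStore{configs: make(map[string]*danglingConfig)}
}

// add records a config as dangling. An existing entry keeps its original
// timestamp so the time spent dangling is not reset.
func (s *danglingStore) add(name string, now time.Time) {
	if _, ok := s.configs[name]; !ok {
		s.configs[name] = &danglingConfig{name: name, danglingSince: now}
	}
}

// reschedule attempts to schedule every dangling config and removes an entry
// only when the schedule callback reports success.
func (s *danglingStore) reschedule(schedule func(name string) bool) {
	for name := range s.configs {
		if schedule(name) {
			delete(s.configs, name) // deleting during range is safe in Go
		}
	}
}

// countExtended returns how many configs have been dangling longer than threshold.
func (s *danglingStore) countExtended(now time.Time, threshold time.Duration) int {
	n := 0
	for _, c := range s.configs {
		if now.After(c.danglingSince.Add(threshold)) {
			n++
		}
	}
	return n
}

// demo runs one rescheduling attempt in which only "check-a" can be scheduled,
// then reports the remaining and long-dangling config counts 10 minutes later.
func demo() (remaining, extended int) {
	start := time.Date(2024, 12, 24, 0, 0, 0, 0, time.UTC)
	store := newDanglingStore()
	store.add("check-a", start)
	store.add("check-b", start)

	store.reschedule(func(name string) bool { return name == "check-a" })

	later := start.Add(10 * time.Minute)
	return len(store.configs), store.countExtended(later, 5*time.Minute)
}

func main() {
	remaining, extended := demo()
	fmt.Println(remaining, extended)
}
```

In this sketch, `check-b` stays in the store with its original timestamp after the failed attempt, so it is counted as an extended dangling config once the threshold has passed.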

cit-pr-commenter · 2024-12-24T20:21:49Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 859646e6-8b04-413c-959e-826b75af9ecb

Baseline: 5bd602d
Comparison: ddcdf78
Diff

❌ Experiments with missing or malformed data

This is a critical error. No usable optimization goal data was produced by the listed experiments. This may be a result of misconfiguration. Ping #single-machine-performance and we can help out.

tcp_syslog_to_blackhole (Logs)

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_logs	% cpu utilization	+4.84	[+1.46, +8.22]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.17	[-0.53, +0.86]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	+0.04	[-0.60, +0.68]	1	Logs
➖	otel_to_otel_logs	ingress throughput	+0.04	[-0.61, +0.68]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	+0.01	[-0.10, +0.12]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.01	[-0.81, +0.84]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.01	[-0.76, +0.78]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.00	[-0.01, +0.01]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	-0.01	[-0.90, +0.89]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	-0.03	[-0.97, +0.92]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.03	[-0.72, +0.66]	1	Logs
➖	file_tree	memory utilization	-0.13	[-0.26, -0.00]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.17	[-0.97, +0.63]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	-0.18	[-0.64, +0.28]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.25	[-0.33, -0.16]	1	Logs bounds checks dashboard
➖	quality_gate_idle	memory utilization	-0.67	[-0.70, -0.63]	1	Logs bounds checks dashboard

Bounds Checks: ❌ Failed

perf	experiment	bounds_check_name	replicates_passed	links
❌	file_to_blackhole_0ms_latency_http1	lost_bytes	9/10
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
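
As a rough illustration of the three criteria above, here is a sketch that applies them to a few rows from the table. This is not the Regression Detector's implementation: the `experiment` type and `isRegression` function are hypothetical, and the `erratic` flag is assumed to be false for every row because the table does not report it.

```go
package main

import (
	"fmt"
	"math"
)

// experiment holds one row of the change-detection table (illustrative field names).
type experiment struct {
	name          string
	deltaMeanPct  float64
	ciLow, ciHigh float64
	erratic       bool
}

// effectSizeTolerance mirrors the stated threshold: |Δ mean %| ≥ 5.00%.
const effectSizeTolerance = 5.0

// isRegression returns true only when all three criteria hold: the change is
// large enough, the confidence interval excludes zero, and the experiment is
// not marked erratic.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.deltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && ciExcludesZero && !e.erratic
}

func main() {
	rows := []experiment{
		{"quality_gate_logs", 4.84, 1.46, 8.22, false},
		{"quality_gate_idle", -0.67, -0.70, -0.63, false},
		{"uds_dogstatsd_to_api_cpu", 0.17, -0.53, 0.86, false},
	}
	for _, r := range rows {
		fmt.Printf("%s: regression=%t\n", r.name, isRegression(r))
	}
}
```

Under these assumptions, `quality_gate_logs` is not flagged even though its interval excludes zero, because +4.84 is below the 5.00% tolerance.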

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.

agent-platform-auto-pr · 2024-12-26T16:21:12Z

Uncompressed package size comparison

Comparison with ancestor 5bd602d21849fb0e17bb3fb6c545c3e40a977dd9

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-x86_64-rpm	0.21MB	⚠️	1197.51MB	1197.29MB	140.00MB
datadog-agent-x86_64-suse	0.21MB	⚠️	1197.51MB	1197.29MB	140.00MB
datadog-agent-aarch64-rpm	0.20MB	⚠️	943.40MB	943.20MB	140.00MB
datadog-agent-amd64-deb	0.16MB	⚠️	1188.19MB	1188.03MB	140.00MB
datadog-heroku-agent-amd64-deb	0.15MB	⚠️	505.24MB	505.09MB	70.00MB
datadog-agent-arm64-deb	0.15MB	⚠️	934.11MB	933.96MB	140.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	55.77MB	55.77MB	10.00MB
datadog-iot-agent-x86_64-rpm	0.00MB	✅	113.41MB	113.41MB	10.00MB
datadog-iot-agent-x86_64-suse	0.00MB	✅	113.41MB	113.41MB	10.00MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	78.57MB	78.57MB	10.00MB
datadog-iot-agent-amd64-deb	0.00MB	✅	113.34MB	113.34MB	10.00MB
datadog-iot-agent-arm64-deb	0.00MB	✅	108.81MB	108.81MB	10.00MB
datadog-dogstatsd-x86_64-rpm	-0.00MB	✅	78.64MB	78.65MB	10.00MB
datadog-dogstatsd-x86_64-suse	-0.00MB	✅	78.64MB	78.65MB	10.00MB
datadog-iot-agent-aarch64-rpm	-0.00MB	✅	108.88MB	108.88MB	10.00MB

Decision

⚠️ Warning

agent-platform-auto-pr · 2024-12-26T16:22:38Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv aws.create-vm --pipeline-id=51849592 --os-family=ubuntu

Note: This applies to commit ddcdf78

pkg/clusteragent/clusterchecks/dangling_config.go

pkg/clusteragent/clusterchecks/dispatcher_main.go

pkg/clusteragent/clusterchecks/dispatcher_test.go

gabedos · 2024-12-30T16:33:59Z

/merge

dd-devflow · 2024-12-30T16:34:09Z

Devflow running: `/merge`

View all feedbacks in Devflow UI.

2024-12-30 16:34:08 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 34m.

2024-12-30 17:10:20 UTC ℹ️ MergeQueue: This merge request was merged

Base: track dangling configs with time

ed19ca8

gabedos added the qa/done QA done before merge and regressions are covered by tests label Dec 24, 2024

github-actions bot added the medium review PR review might take time label Dec 24, 2024

gabedos added team/container-platform The Container Platform Team changelog/no-changelog labels Dec 24, 2024

gabedos added this to the 7.63.0 milestone Dec 24, 2024

Convert to counting number of reschedule attempts

e7b7ccb

gabedos force-pushed the gabedos/extend-config-dangling branch from 66a1af8 to e7b7ccb December 24, 2024 21:03

gabedos added 4 commits December 26, 2024 14:43

Include count of extended dangling configs in agent status

84eebd1

safe dangling deletes

3c6c2f3

Disable addConfig force failure

d120b92

Lint fix on +=1

7c4d85e

Revert back to checking on time: base impl

eea7b33

gabedos force-pushed the gabedos/extend-config-dangling branch from 20a1bb3 to eea7b33 December 26, 2024 20:31

zhuminyi reviewed Dec 26, 2024

pkg/clusteragent/clusterchecks/dangling_config.go (resolved)

Leverage more time methods for comparing expected schedule time

785f509

zhuminyi reviewed Dec 26, 2024

pkg/clusteragent/clusterchecks/dispatcher_main.go (outdated, resolved)

Update outdate method comment

3a1c7af

zhuminyi reviewed Dec 26, 2024

pkg/clusteragent/clusterchecks/dispatcher_test.go (resolved)

gabedos force-pushed the gabedos/extend-config-dangling branch from b2fb193 to 3a1c7af December 27, 2024 14:41

Unit test for unscheduled check

ddcdf78

gabedos marked this pull request as ready for review December 27, 2024 18:21

gabedos requested review from a team as code owners December 27, 2024 18:21

gabedos requested a review from jeremy-hanna December 27, 2024 18:21

zhuminyi approved these changes Dec 27, 2024

clamoriniere approved these changes Dec 28, 2024

pgimalac approved these changes Dec 30, 2024

dd-mergequeue bot merged commit dc3b8fb into main Dec 30, 2024
221 checks passed

dd-mergequeue bot deleted the gabedos/extend-config-dangling branch December 30, 2024 17:10
