Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the Prometheus rule alerts acc to specified value & correct the default values in API desc #2816

Merged

Conversation

malayparida2000
Copy link
Contributor

@malayparida2000 malayparida2000 commented Sep 23, 2024

Update prometheusRule to use the fullRatio values from storageCluster CR

Earlier the cluster utilization alert rules (CephClusterNearFull,
CephClusterCriticallyFull, CephClusterReadOnly) and the osd alert rules
(CephOSDNearFull, CephOSDCriticallyFull) were hardcoded to use the
nearFullRatio 0.75, criticallyFullRatio 0.80, and fullRatio 0.85 values.

But these values are now configurable on the storageCluster CR. So the
prometheus rules for these alerts will now be updated to use the
specified values if provided in the storageCluster CR.

This also includes the refactor of the changing the prometheus rule
process. The function is now easier to read, maintain & expand.
Also add tests for prometheus rule changing process.

Correct the API desc about the defaults for Full Ratios in OCS-Operator

While adding the fields the description of the fields were directly
lifted off from rook-operator. The values of NearFull, BackfillFull &
Full in rook are 0.85, 0.90 & 0.95 respectively. But in OCS Operator
we set these values to 0.75, 0.80 & 0.85 respectively with the help
of the rook-config-override ConfigMap. So the description of the fields
in the API should reflect the actual values that are set in OCS.

Story-https://issues.redhat.com/browse/RHSTOR-6497
BZ-https://bugzilla.redhat.com/show_bug.cgi?id=2303342

@malayparida2000
Copy link
Contributor Author

/hold
will change more things

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2024
@malayparida2000 malayparida2000 changed the title Update the prometheus rules for Ceph OSD full alerts acc to value given & correct the values in API and rules Update the Prometheus rules for alerts(cluster utilization & osd full) acc to specified value & correct the default values in API desc Sep 23, 2024
@malayparida2000 malayparida2000 changed the title Update the Prometheus rules for alerts(cluster utilization & osd full) acc to specified value & correct the default values in API desc Update the Prometheus rules for alerts acc to specified value & correct the default values in API desc Sep 23, 2024
@malayparida2000 malayparida2000 force-pushed the prometheus_rule branch 2 times, most recently from fb64ae6 to d281e59 Compare September 24, 2024 08:40
@malayparida2000
Copy link
Contributor Author

/cherry-pick release-4.17

@openshift-cherrypick-robot

@malayparida2000: once the present PR merges, I will cherry-pick it on top of release-4.17 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@malayparida2000
Copy link
Contributor Author

PR is ready for review, removing hold
/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 24, 2024
@malayparida2000 malayparida2000 changed the title Update the Prometheus rules for alerts acc to specified value & correct the default values in API desc Update the Prometheus rule alerts acc to specified value & correct the default values in API desc Sep 24, 2024
@malayparida2000 malayparida2000 changed the title Update the Prometheus rule alerts acc to specified value & correct the default values in API desc Update the Prometheus rule alerts acc to specified value & correct the default values in API Sep 24, 2024
@malayparida2000 malayparida2000 changed the title Update the Prometheus rule alerts acc to specified value & correct the default values in API Update the Prometheus rule alerts acc to specified value & correct the default values in API desc Sep 24, 2024
@malayparida2000
Copy link
Contributor Author

/cc @travisn
/cc @aruniiird

@openshift-ci openshift-ci bot requested a review from travisn September 24, 2024 08:52
Copy link
Contributor

openshift-ci bot commented Sep 24, 2024

@malayparida2000: GitHub didn't allow me to request PR reviews from the following users: aruniiird.

Note that only red-hat-storage members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @travisn
/cc @aruniiird

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

controllers/storagecluster/cephcluster.go Outdated Show resolved Hide resolved
controllers/storagecluster/cephcluster.go Outdated Show resolved Hide resolved
Copy link
Member

@iamniting iamniting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold for Travis and Arun review

controllers/storagecluster/cephcluster.go Outdated Show resolved Hide resolved
controllers/storagecluster/cephcluster.go Outdated Show resolved Hide resolved
@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 24, 2024
@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 24, 2024
@iamniting iamniting removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 24, 2024
@malayparida2000
Copy link
Contributor Author

Full testing of the feature on this PR is documented here https://hackmd.io/@Yh4a4hAATcW2BNYBJVSx4w/SJaY57gCR

Copy link
Member

@iamniting iamniting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold for Travis and Arun review

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 24, 2024
controllers/storagecluster/cephcluster.go Outdated Show resolved Hide resolved
controllers/storagecluster/cephcluster.go Outdated Show resolved Hide resolved
controllers/storagecluster/cephcluster.go Outdated Show resolved Hide resolved
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2024
While adding the fields the description of the fields were directly
lifted off from rook-operator. The values of NearFull, BackfillFull &
Full in rook are 0.85, 0.90 & 0.95 respectively. But in OCS Operator
we set these values to 0.75, 0.80 & 0.85 respectively with the help
of the rook-config-override ConfigMap. So the description of the fields
in the API should reflect the actual values that are set in OCS.

Signed-off-by: Malay Kumar Parida <[email protected]>
Earlier the cluster utilization alert rules (CephClusterNearFull,
CephClusterCriticallyFull, CephClusterReadOnly) and the osd alert rules
(CephOSDNearFull, CephOSDCriticallyFull) were hardcoded to use the
nearFullRatio 0.75, criticallyFullRatio 0.80, and fullRatio 0.85 values.

But these values are now configurable on the storageCluster CR. So the
prometheus rules for these alerts will now be updated to use the
specified values if provided in the storageCluster CR.

This also includes the refactor of the changing the prometheus rule
process. The function is now easier to read, maintain & expand.
Also add tests for prometheus rule changing process.

Signed-off-by: Malay Kumar Parida <[email protected]>
Copy link
Contributor

@travisn travisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a small suggestion.

api/v1/storagecluster_types.go Show resolved Hide resolved
Copy link
Contributor

@travisn travisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2024
Copy link
Contributor

openshift-ci bot commented Sep 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iamniting, malayparida2000, travisn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@malayparida2000
Copy link
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 25, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit c90b932 into red-hat-storage:main Sep 26, 2024
11 checks passed
@openshift-cherrypick-robot

@malayparida2000: new pull request created: #2820

In response to this:

/cherry-pick release-4.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

@aruniiird aruniiird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding an extensive redesign of replaceTokens, please take a look.

@@ -1180,38 +1203,64 @@ func applyLabels(labels map[string]string, t *metav1.ObjectMeta) {
}
}

type exprReplaceToken struct {
type replaceToken struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change (or elaborate) the replaceToken struct? This will make our code much more cleaner

Suggested change
type replaceToken struct {
type replaceToken struct {
groupName string
recordOrAlertName string
wordToReplace string
replaceWith string
// wordInSection represents
// which field in the rule this change has to go
wordInSection RuleSection
// sectionKey is the map's key, if the above RuleSection is a map (like Annotations or Labels)
sectionKey string
}
// RuleSection determines which section in the Rule struct the change has to go in
type RuleSection int
const (
ExprRuleSection RuleSection = iota
AnnotationRuleSection
LabelRuleSection
)

replaceWith string
}

func changePromRuleExpr(promRules *monitoringv1.PrometheusRule, replaceTokens []exprReplaceToken) {
if promRules == nil {
func createReplaceToken(groupName, recordOrAlertName, wordToReplace, replaceWith string) replaceToken {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change createReplaceToken function (if we add the RuleSection part)

Suggested change
func createReplaceToken(groupName, recordOrAlertName, wordToReplace, replaceWith string) replaceToken {
func createReplaceToken(groupName, recordOrAlertName, wordToReplace, replaceWith string, wordInSection RuleSection, sectionKey string) replaceToken {

@@ -1148,16 +1148,39 @@ func createPrometheusRules(r *StorageClusterReconciler, sc *ocsv1.StorageCluster
return err
}
applyLabels(getCephClusterMonitoringLabels(*sc), &prometheusRule.ObjectMeta)
replaceTokens := []exprReplaceToken{

replaceTokens := []replaceToken{
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be consistent if we call createReplaceToken function here as well

Comment on lines +1165 to +1179
if specifiedNearFullRatio != nil {
replaceTokens = append(replaceTokens,
createReplaceToken("", "", "75%", fmt.Sprintf("%.2f%%", *specifiedNearFullRatio*100)),
createReplaceToken("", "", "0.75", fmt.Sprintf("%f", *specifiedNearFullRatio)))
}
if specifiedBackfillFullRatio != nil {
replaceTokens = append(replaceTokens,
createReplaceToken("", "", "80%", fmt.Sprintf("%.2f%%", *specifiedBackfillFullRatio*100)),
createReplaceToken("", "", "0.80", fmt.Sprintf("%f", *specifiedBackfillFullRatio)))
}
if specifiedFullRatio != nil {
replaceTokens = append(replaceTokens,
createReplaceToken("", "", "85%", fmt.Sprintf("%.2f%%", *specifiedFullRatio*100)),
createReplaceToken("", "", "0.85", fmt.Sprintf("%f", *specifiedFullRatio)))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't require a condition check and then append because we will check whether the given word and the replacing/new word are same or not in changePromRule function. So directly append the values, with a slight modification in the fmt.Sprintf format,

PS: don't be confused with the additional parameters like RuleSection and DescriptionKey part. Explained it in the below comments, please check

Suggested change
if specifiedNearFullRatio != nil {
replaceTokens = append(replaceTokens,
createReplaceToken("", "", "75%", fmt.Sprintf("%.2f%%", *specifiedNearFullRatio*100)),
createReplaceToken("", "", "0.75", fmt.Sprintf("%f", *specifiedNearFullRatio)))
}
if specifiedBackfillFullRatio != nil {
replaceTokens = append(replaceTokens,
createReplaceToken("", "", "80%", fmt.Sprintf("%.2f%%", *specifiedBackfillFullRatio*100)),
createReplaceToken("", "", "0.80", fmt.Sprintf("%f", *specifiedBackfillFullRatio)))
}
if specifiedFullRatio != nil {
replaceTokens = append(replaceTokens,
createReplaceToken("", "", "85%", fmt.Sprintf("%.2f%%", *specifiedFullRatio*100)),
createReplaceToken("", "", "0.85", fmt.Sprintf("%f", *specifiedFullRatio)))
}
const DescriptionKey = "description"
replaceTokens = append(replaceTokens,
createReplaceToken("cluster-utilization-alert.rules", "CephClusterNearFull", "75%", fmt.Sprintf("%0.0f%%", nearFullRatio*100), AnnotationRuleSection, DescriptionKey),
createReplaceToken("cluster-utilization-alert.rules", "CephClusterNearFull", "0.75", fmt.Sprintf("%0.2f", nearFullRatio), ExprRuleSection, ""),
createReplaceToken("cluster-utilization-alert.rules", "CephClusterCriticallyFull", "80%", fmt.Sprintf("%0.0f%%", criticallyFullRatio*100), AnnotationRuleSection, DescriptionKey),
createReplaceToken("cluster-utilization-alert.rules", "CephClusterCriticallyFull", "0.80", fmt.Sprintf("%0.0f", criticallyFullRatio), ExprRuleSection, ""),
createReplaceToken("cluster-utilization-alert.rules", "CephClusterReadOnly", "85%", fmt.Sprintf("%0.0f%%", fullRatio*100), AnnotationRuleSection, DescriptionKey),
createReplaceToken("cluster-utilization-alert.rules", "CephClusterReadOnly", "0.85", fmt.Sprintf("%0.2f", fullRatio), ExprRuleSection, ""),
)

Comment on lines +1245 to 1263
for ruleIdx := range group.Rules {
rule := &group.Rules[ruleIdx]
// If recordOrAlertName is specified, ensure it matches; otherwise, apply to all rules
if token.recordOrAlertName == "" || rule.Record == token.recordOrAlertName || rule.Alert == token.recordOrAlertName {
// Update the annotations in the rule
if rule.Annotations != nil {
// Update description if it exists
if description, exists := rule.Annotations["description"]; exists {
newDescription := strings.Replace(description, token.wordToReplace, token.replaceWith, -1)
rule.Annotations["description"] = newDescription
}
}
// Update the expression field in the rule
exprStr := rule.Expr.String()
if exprStr != "" {
newExpr := strings.Replace(exprStr, token.wordToReplace, token.replaceWith, -1)
rule.Expr = intstr.Parse(newExpr)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we apply the RuleSection (and other needed changes), changePromRule() function can be cleaned out as follows,

Suggested change
for ruleIdx := range group.Rules {
rule := &group.Rules[ruleIdx]
// If recordOrAlertName is specified, ensure it matches; otherwise, apply to all rules
if token.recordOrAlertName == "" || rule.Record == token.recordOrAlertName || rule.Alert == token.recordOrAlertName {
// Update the annotations in the rule
if rule.Annotations != nil {
// Update description if it exists
if description, exists := rule.Annotations["description"]; exists {
newDescription := strings.Replace(description, token.wordToReplace, token.replaceWith, -1)
rule.Annotations["description"] = newDescription
}
}
// Update the expression field in the rule
exprStr := rule.Expr.String()
if exprStr != "" {
newExpr := strings.Replace(exprStr, token.wordToReplace, token.replaceWith, -1)
rule.Expr = intstr.Parse(newExpr)
}
}
for ruleIdx, rule := range group.Rules {
// If recordOrAlertName is specified, ensure it matches; otherwise, apply to all rules
if token.recordOrAlertName == "" || rule.Record == token.recordOrAlertName || rule.Alert == token.recordOrAlertName {
switch token.wordInSection {
case ExprRuleSection:
// Update the expression field in the rule
newExpr := strings.ReplaceAll(rule.Expr.String(), token.wordToReplace, token.replaceWith)
promRule.Spec.Groups[groupIdx].Rules[ruleIdx].Expr = intstr.Parse(newExpr)
case AnnotationRuleSection:
if rule.Annotations == nil { // if no annotations, continue
continue
}
annotMap := rule.Annotations
annotMap[token.sectionKey] = strings.ReplaceAll(
annotMap[token.sectionKey], token.wordToReplace, token.replaceWith)
case LabelRuleSection:
if rule.Labels == nil { // if there are no labels, continue
continue
}
labelMap := rule.Labels
labelMap[token.sectionKey] = strings.ReplaceAll(
labelMap[token.sectionKey], token.wordToReplace, token.replaceWith)
}
}
}

var changeTokens = []exprReplaceToken{
{recordOrAlertName: "CephMgrIsAbsent", wordInExpr: "openshift-storage", replaceWith: "new-namespace"},
var changeTokens = []replaceToken{
{recordOrAlertName: "CephMgrIsAbsent", wordToReplace: "openshift-storage", replaceWith: "new-namespace"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use the createReplaceToken() function, rather than creating raw structs

@@ -1041,29 +1041,40 @@ func TestParsePrometheusRules(t *testing.T) {
}

func TestChangePrometheusExprFunc(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change the test function name as well, something like TestChangePrometheusRulesFunc ?

Comment on lines +1074 to +1077
assert.Assert(t,
strings.Contains(rule.Expr.String(), changeStr) ||
(rule.Annotations != nil && strings.Contains(rule.Annotations["description"], changeStr)),
fmt.Sprintf("Expected '%s' to be found in either Expr or Annotations for alert %s", changeStr, alertName))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed changes to the test function,

Suggested change
assert.Assert(t,
strings.Contains(rule.Expr.String(), changeStr) ||
(rule.Annotations != nil && strings.Contains(rule.Annotations["description"], changeStr)),
fmt.Sprintf("Expected '%s' to be found in either Expr or Annotations for alert %s", changeStr, alertName))
// below is a small hack to determine whether we have to check 'Annotation' or 'Expr'
// if it contains a %-age sign, then check the annotation description
if strings.Contains(changeStr, "%") {
assert.Assert(t,
strings.Contains(rule.Annotations[DescriptionKey], changeStr),
"Annotation description doesn't mach", "Received Annotation Description",
rule.Annotations[DescriptionKey], "Doesn't contain", changeStr)
} else { // check the expression
assert.Assert(t,
strings.Contains(rule.Expr.String(), changeStr), "Expression doesn't mach",
"Received Expression", rule.Expr, "Doesn't have", changeStr)
}

Copy link
Contributor

openshift-ci bot commented Sep 26, 2024

@aruniiird: changing LGTM is restricted to collaborators

In response to this:

Adding an extensive redesign of replaceTokens, please take a look.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@malayparida2000
Copy link
Contributor Author

Acc to discussion, @aruniiird would raise a follow-up to refactor PR to clean up the flow a bit more, but that would be only on the main & will not be backported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants