added metrics to keep count of "unread columns" in updater component. #1139
base: main
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: moki1202. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @moki1202. Thanks for your PR. I'm waiting for a GoogleCloudPlatform member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@michelle192837 I tried my best here. These changes still have errors that need fixing, but am I headed in the right direction? Also, the names I chose for the metric field might not be right 😅
/ok-to-test
I think this looks good so far. ^^ I agree another name might be a bit clearer ('IncompleteUpdates' or something similar, for instance?).
@michelle192837 done!
pkg/updater/updater.go
Outdated
@@ -107,7 +108,8 @@ func GCS(poolCtx context.Context, colClient gcs.Client, groupTimeout, buildTimeo
	defer cancel()
	gcsColReader := gcsColumnReader(colClient, buildTimeout, readResult, enableIgnoreSkip)
	reprocess := 20 * time.Minute // allow 20m for prow to finish uploading artifacts
	return InflateDropAppend(ctx, log, client, tg, gridPath, write, gcsColReader, reprocess)
	mets := CreateMetrics(prometheus.NewFactory())
Ah! So actually, this doesn't need to be created here, the metrics get created in the corresponding 'main.go' file under cmd/, e.g. https://github.com/GoogleCloudPlatform/testgrid/blob/master/cmd/updater/main.go#L168. (I believe you're already good to go and can revert these two lines.)
@michelle192837 Understood! Just one more doubt. Here, we pass mets as a param to InflateDropAppend, so the return statement needs that mets param. What should I pass here?
Bubbling it down from update.Update(), refactoring functions along the way, is probably the way to go. So, in this case, adding it to func GCS(...) would be my approach.
Ah sorry, I missed adding this to InflateDropAppend's call somehow. +1 to what Sean said!
@chases2 I don't understand this. 😅 What exactly should I add in the GCS function?
Try changing its signature to "func GCS(poolCtx context.Context, colClient gcs.Client, mets *Metrics, groupTimeout, buildTimeout time.Duration, concurrency int, write bool, enableIgnoreSkip bool) GroupUpdater"
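For illustration, here is a minimal, self-contained sketch of that kind of refactor: the caller creates the metrics once and passes them down, instead of GCS creating its own. It uses simplified stand-in types rather than the real testgrid ones. Counter, the trimmed-down Metrics struct, the shortened GCS parameter list, the lowercase inflateDropAppend helper, and the unreadByGroup map are all assumptions made for the example.

```go
package main

import "fmt"

// Counter is a hypothetical stand-in for a metrics counter.
type Counter struct {
	name  string
	count int
}

// Add increases the counter by n.
func (c *Counter) Add(n int) { c.count += n }

// Metrics is a simplified stand-in for the updater's metrics struct.
type Metrics struct {
	IncompleteUpdates *Counter
}

// GroupUpdater is a simplified stand-in: it updates one named group.
type GroupUpdater func(group string) error

// inflateDropAppend stands in for InflateDropAppend. The unread argument
// is hypothetical: it says how many columns this update left unread.
func inflateDropAppend(group string, mets *Metrics, unread int) error {
	if unread > 0 {
		// Count the update attempt as incomplete, not the columns.
		mets.IncompleteUpdates.Add(1)
	}
	return nil
}

// GCS receives mets from its caller and hands it on to the inner call,
// rather than constructing a new Metrics value itself.
func GCS(mets *Metrics, unreadByGroup map[string]int) GroupUpdater {
	return func(group string) error {
		return inflateDropAppend(group, mets, unreadByGroup[group])
	}
}

func main() {
	// Created once at the top level (in the real code, under cmd/).
	mets := &Metrics{IncompleteUpdates: &Counter{name: "incomplete-updates"}}
	update := GCS(mets, map[string]int{"a": 0, "b": 3})
	for _, g := range []string{"a", "b"} {
		if err := update(g); err != nil {
			fmt.Println("update failed:", err)
		}
	}
	fmt.Println(mets.IncompleteUpdates.count) // prints 1
}
```

Only group "b" leaves columns unread, so the shared counter ends at 1. Because every updater shares the one Metrics value, the counts end up in a single place.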
OK, I get it now! Thank you Sean.
Force-pushed from ad22f18 to 9d47c6b
@michelle192837 @chases2 does this look good now?
pkg/updater/updater.go
Outdated
	DelaySeconds: factory.NewDuration("delay", "Seconds updater is behind schedule", "component"),
	UpdateState: factory.NewCyclic(componentName),
	DelaySeconds: factory.NewDuration("delay", "Seconds updater is behind schedule", "component"),
	IncompleteUpdates: factory.NewCounter("counter", "number of unread columns"),
Please name this counter something more descriptive, such as "incomplete-updates".
Amend the description also, please. This counts the number of update attempts that don't complete, not the number of columns that were skipped over.
Note here that the arguments for NewCounter are (name, description, ...any additional fields to capture)
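To make the argument order concrete, here is a rough sketch of a counter constructor with the same (name, description, ...fields) shape. This is not testgrid's metrics package: the Counter type, its Add and Value methods, and the "component" field are hypothetical and only illustrate how a descriptive name, an accurate description, and optional fields fit together.

```go
package main

import (
	"fmt"
	"strings"
)

// Counter is a hypothetical counter keyed by one value per declared field.
type Counter struct {
	Name        string
	Description string
	Fields      []string
	counts      map[string]int64
}

// NewCounter mirrors the (name, description, ...fields) argument order.
func NewCounter(name, description string, fields ...string) *Counter {
	return &Counter{
		Name:        name,
		Description: description,
		Fields:      fields,
		counts:      map[string]int64{},
	}
}

// Add increments the counter for one combination of field values.
// It returns an error unless exactly one value is given per field.
func (c *Counter) Add(n int64, values ...string) error {
	if len(values) != len(c.Fields) {
		return fmt.Errorf("%s: got %d field values, want %d",
			c.Name, len(values), len(c.Fields))
	}
	c.counts[strings.Join(values, ",")] += n
	return nil
}

// Value returns the current count for one combination of field values.
func (c *Counter) Value(values ...string) int64 {
	return c.counts[strings.Join(values, ",")]
}

func main() {
	incomplete := NewCounter(
		"incomplete-updates",
		"Number of update attempts that did not complete",
		"component", // hypothetical extra field
	)
	_ = incomplete.Add(1, "updater")
	_ = incomplete.Add(1, "updater")
	fmt.Println(incomplete.Name, incomplete.Value("updater")) // prints incomplete-updates 2
}
```

With a name like this, whoever reads the metric later can tell what it counts without opening the source.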
Looks good overall! One readability note on how the metrics are named, though.
We need a bit more documentation on this (#1120), but you can also check the way your new metric will look in prometheus if you'd like:
|
@moki1202 Are you interested in working on this still? I think this would be a serious improvement, and it seems like it's most of the way complete with what you have currently.
Nooooo! I totally forgot about this! @chases2 Thanks for the reminder. Will definitely complete this ASAP.
/retest
@chases2 I've pushed the remaining changes. Please let me know where this needs improvement. I'll try to fix it ASAP.
Changes look good to me! The only issue is that I'm not sure why the unit test is failing. I'm going to retest one more time, but if the failure is consistent, it might need another look to see if something from this PR is affecting it.
/retest
You can also check if this is consistent with your change, |
Signed-off-by: Shashank <[email protected]>
Signed-off-by: Shashank <[email protected]>
@moki1202: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@michelle192837 I've rebased my branch. The unit tests still fail for some reason.
From these logs: https://oss.gprow.dev/view/gs/oss-prow/pr-logs/pull/GoogleCloudPlatform_testgrid/1139/test-testgrid-all/1663927888350547968#1:build-log.txt%3A189-190 Something is going on with these code changes that is causing the update function to upload to files... a different number of times? It's not clear to me why that's important for this test. I do wonder if it's flaky or exactly what's going on, but I don't think this should be blocking. @michelle192837 Has this updater test flaked before? I feel like I remember something like that in the past. Consider modifying the test to expect the new generation for now.
I think it has flaked before, though not in this specific way (there's a couple other cases that I think are flaky, but they aren't replicating for me with 100 runs from head).
@michelle192837 @chases2 is there something that I can do to help here?
Fixes #1101