Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added metrics to keep count of "unread columns" in updater component. #1139

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

moki1202
Copy link
Contributor

Fixes #1101

@google-oss-prow
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moki1202
Once this PR has been reviewed and has the lgtm label, please assign sultan-duisenbay for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow
Copy link
Contributor

Hi @moki1202. Thanks for your PR.

I'm waiting for a GoogleCloudPlatform member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@moki1202
Copy link
Contributor Author

@michelle192837 tried my best here. There are errors for these changes that need fixing but am I headed in the right direction? Also the names that I've decided for the metric field might just not be right 😅

@michelle192837
Copy link
Collaborator

/ok-to-test

@michelle192837
Copy link
Collaborator

michelle192837 commented Feb 23, 2023

@michelle192837 tried my best here. There are errors for these changes that need fixing but am I headed in the right direction? Also the names that I've decided for the metric field might just not be right 😅

I think this looks good so far. ^^ I agree another name might be a bit clearer ('IncompleteUpdates' or something similar, for instance?).

@google-oss-prow google-oss-prow bot added size/S and removed size/XS labels Feb 26, 2023
@moki1202
Copy link
Contributor Author

@michelle192837 tried my best here. There are errors for these changes that need fixing but am I headed in the right direction? Also the names that I've decided for the metric field might just not be right 😅

I think this looks good so far. ^^ I agree another name might be a bit clearer ('IncompleteUpdates' or something similar, for instance?).

@michelle192837 done!

@moki1202 moki1202 marked this pull request as ready for review February 27, 2023 22:00
@@ -107,7 +108,8 @@ func GCS(poolCtx context.Context, colClient gcs.Client, groupTimeout, buildTimeo
defer cancel()
gcsColReader := gcsColumnReader(colClient, buildTimeout, readResult, enableIgnoreSkip)
reprocess := 20 * time.Minute // allow 20m for prow to finish uploading artifacts
return InflateDropAppend(ctx, log, client, tg, gridPath, write, gcsColReader, reprocess)
mets := CreateMetrics(prometheus.NewFactory())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! So actually, this doesn't need to be created here, the metrics get created in the corresponding 'main.go' file under cmd/, e.g. https://github.com/GoogleCloudPlatform/testgrid/blob/master/cmd/updater/main.go#L168. (I believe you're already good to go and can revert these two lines.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michelle192837 Understood! just 1 more doubt. Here, we pass mets as a param to InflateDropAppend, so the return statement needs that mets param. What should I pass here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bubbling it down from update.Update(), refactoring functions along the way, is probably the way to go. So, in this case, adding it to func GCS(...) would be my approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I missed adding this to InflateDropAppend's call somehow. +1 to what Sean said!

Copy link
Contributor Author

@moki1202 moki1202 Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chases2 I don't understand this. 😅 what exactly should I add in GCS function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try changing its signature to "func GCS(poolCtx context.Context, colClient gcs.Client, mets *Metrics, groupTimeout, buildTimeout time.Duration, concurrency int, write bool, enableIgnoreSkip bool) GroupUpdater"

Copy link
Contributor Author

@moki1202 moki1202 Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I get it now! Thank you Sean.

@moki1202 moki1202 force-pushed the add_metric branch 2 times, most recently from ad22f18 to 9d47c6b Compare March 2, 2023 12:48
@moki1202
Copy link
Contributor Author

moki1202 commented Mar 2, 2023

@michelle192837 @chases2 does this look good now?
Edit- there's still an instance in a test function that I will fix once I get the confirmation on the above changes.

DelaySeconds: factory.NewDuration("delay", "Seconds updater is behind schedule", "component"),
UpdateState: factory.NewCyclic(componentName),
DelaySeconds: factory.NewDuration("delay", "Seconds updater is behind schedule", "component"),
IncompleteUpdates: factory.NewCounter("counter", "number of unread columns"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please name this counter something more descriptive, such as "incomplete-updates".
Amend the description also, please. This counts the number of update attempts that don't complete, not the number of columns that were skipped over.

Note here that the arguments for NewCounter are (name, description, ...any additional fields to capture)

@chases2
Copy link
Collaborator

chases2 commented Mar 2, 2023

@michelle192837 @chases2 does this look good now? Edit- there's still an instance in a test function that I will fix once I get the confirmation on the above changes.

Looks good overall! One readability note on how the metrics are named, though

@chases2
Copy link
Collaborator

chases2 commented Mar 2, 2023

We need a bit more documentation on this (#1120), but you can also check the way your new metric will look in prometheus if you'd like:

  1. Run the updater with bazel run //cmd/updater -- --wait=1h
    • The "--wait" flag makes it rerun continuously
  2. Go to localhost:2112, the metrics port for prometheus

@chases2
Copy link
Collaborator

chases2 commented May 2, 2023

@moki1202 Are you interested in working on this still? I think this would be a serious improvement, and it seems like it's most of the way complete with what you have currently.

@moki1202
Copy link
Contributor Author

moki1202 commented May 2, 2023

Nooooo! I totally forgot about this! @chases2 Thanks for the reminder. Will definitely complete this ASAP.

@michelle192837
Copy link
Collaborator

/retest

@moki1202
Copy link
Contributor Author

@chases2 I've pushed the remaining changes. Please let me know where this needs improvement. I'll try to fix it asap.

@michelle192837
Copy link
Collaborator

Changes look good to me! Only issue is I'm not sure why the unit test is failing, going to retest one more time but if it's consistent it might need another look to see if something from this PR is affecting it.

@michelle192837
Copy link
Collaborator

/retest

@michelle192837
Copy link
Collaborator

You can also check if this is consistent with your change, bazel test //pkg/... --runs_per_test=100.
That is weirdly consistent for no reason I can tell, but it's not failing like this at HEAD. Might be worth rebasing if you haven't already?

Signed-off-by: Shashank <[email protected]>
Signed-off-by: Shashank <[email protected]>
@google-oss-prow
Copy link
Contributor

@moki1202: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
test-testgrid-all d335ff6 link true /test test-testgrid-all

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@moki1202
Copy link
Contributor Author

moki1202 commented Jun 1, 2023

@michelle192837 I've rebased my branch. The unit tests still fail for some reason.

@chases2
Copy link
Collaborator

chases2 commented Jun 2, 2023

From these logs: https://oss.gprow.dev/view/gs/oss-prow/pr-logs/pull/GoogleCloudPlatform_testgrid/1139/test-testgrid-all/1663927888350547968#1:build-log.txt%3A189-190

Something is going on with these code changes that is causing the update function to upload to files... a different number of times? It's not clear to me why that's important for this test. I do wonder if it's flaky or exactly what's going on, but I don't think this should be blocking.

@michelle192837 Has this updater test flaked before? I feel like I remember something like that in the past.

Consider modifying the test to expect the new generation for now.

@michelle192837
Copy link
Collaborator

From these logs: https://oss.gprow.dev/view/gs/oss-prow/pr-logs/pull/GoogleCloudPlatform_testgrid/1139/test-testgrid-all/1663927888350547968#1:build-log.txt%3A189-190

Something is going on with these code changes that is causing the update function to upload to files... a different number of times? It's not clear to me why that's important for this test. I do wonder if it's flaky or exactly what's going on, but I don't think this should be blocking.

@michelle192837 Has this updater test flaked before? I feel like I remember something like that in the past.

Consider modifying the test to expect the new generation for now.

I think it has flaked before, though not in this specific way (there's a couple other cases that I think are flaky, but they aren't replicating for me with 100 runs from head).

@moki1202
Copy link
Contributor Author

moki1202 commented Jun 7, 2023

@michelle192837 @chases2 is there something that I can do to help here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Track a 'more' metric for the Updater
3 participants