Create separate worker usage data collection and move hardware emit there #1293

timl3136 · 2023-11-14T23:39:12Z

What changed?
Create a dedicated worker usage collector
Move hardware usage emitting functionality from base worker to the worker usage collector

Why?
We want to create a separate component responsible for collecting worker usage rather than a huge code block in the base worker.

How did you test it?
Tested locally and in the staging environment to confirm metrics consistency and no goroutine leaks.

Potential risks
Instead of using sync.Once to ensure the goroutine runs once per host, we move hardware emitting to the decision worker only, because sync.Once might cause test timeouts: it keeps other goroutines waiting until the current one returns. So if a host does not have a decision worker (impossible at the moment), its hardware metrics won't be emitted.
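
To illustrate the blocking behavior described in this risk, here is a minimal sketch. It assumes nothing about the SDK; secondCallerBlocked and its channels are hypothetical names. A second caller of Do on the same sync.Once does not return until the function passed to the first call has returned, which is what stalls shutdown when that function is a long-running collector loop.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// secondCallerBlocked reports whether a second call to once.Do is still
// blocked while the function passed to the first call has not returned.
func secondCallerBlocked() bool {
	var once sync.Once
	release := make(chan struct{})    // closed to let the first Do return
	started := make(chan struct{})    // closed once the first Do is running
	secondDone := make(chan struct{}) // closed when the second Do returns

	// First caller: stands in for a long-running collector loop.
	go once.Do(func() {
		close(started)
		<-release
	})
	<-started

	// Second caller: its function is never run, but Do still waits for
	// the first call to complete before returning.
	go func() {
		once.Do(func() {})
		close(secondDone)
	}()

	blocked := false
	select {
	case <-secondDone:
	case <-time.After(100 * time.Millisecond):
		blocked = true
	}

	close(release) // let the first call finish
	<-secondDone   // the second caller is released only now
	return blocked
}

func main() {
	fmt.Println("second caller blocked:", secondCallerBlocked())
}
```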

3vilhamster · 2023-11-16T18:20:12Z

internal/internal_worker_usage_collector.go

+
+func (w *workerUsageCollector) Start() {
+	w.wg.Add(1)
+	go func() {


Do we need to spawn a goroutine per worker? Why not ensure only 1 running?

Only the hardware emitting is once per host; all other metrics will be worker-specific (e.g., activity poll response vs. decision poll response).

For now I see only w.collectHardwareUsage(), which will just spawn a bunch of data into the same scope. I would suggest separating the hardware emitter and the worker-specific metrics.

That's the current design: for each type of metric, based on its origin, I will create a separate goroutine. But they would all be contained under a single workerUsageCollector so that their results can be collected and sent in one place.

taylanisikdemir · 2023-11-16T18:59:12Z

internal/internal_worker_usage_collector.go

+			case <-ticker.C:
+				// Given that decision worker and activity worker are running in the same host, we only need to collect
+				// hardware usage from one of them.
+				if w.workerType == "DecisionWorker" {


This might not be future-proof. Also, if a customer is running separate processes for decision and activity workers, we will not have the hardware usage of hosts that only run activity workers. We should also not create no-op workerUsageCollectors if only one of them will do the work.
@Groxx what would be your recommendation for host-level metric reporting on the client side? I would like to avoid global static variables, but this use case probably requires one.

We tried sync.Once before, but that caused issues with unit testing: it waits indefinitely for this routine to stop while blocking all other goroutines from closing.

You can override it in the unit tests:

type once interface { Do(func()) }
var collectHardwareUsageOnce once

In typical startup this would be set to a *sync.Once (Do has a pointer receiver, so the pointer is what satisfies the interface):

collectHardwareUsageOnce = &sync.Once{}

In test code you can initialize this to a fake implementation:

collectHardwareUsageOnce = myFakeOnce{} // myFakeOnce implements Do(func())

Thank you for your suggestion. I have implemented that in the latest commit.

I don't see EmitOnce being used in workerUsageCollector. We should only have one (singleton) instance of workerUsageCollector, which would be lazily created by the first worker instance. The rest of the workers would create a noOpUsageCollector. This lazy initialization logic should be hidden from workers: a worker just calls newWorkerUsageCollector(), and that function should determine whether it's the first time or not. Let's discuss offline if more clarification is needed.
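
For reference, the reviewer's proposal might look something like the sketch below. It is an illustration of the suggestion only, not what this PR implements (the author's reply argues for per-worker collectors). collectorFactory, hardwareCollector, and noOpCollector are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// usageCollector is a hypothetical interface shared by the real and no-op collectors.
type usageCollector interface {
	Start()
	Stop()
}

// hardwareCollector stands in for the single real collector on a host.
type hardwareCollector struct{}

func (hardwareCollector) Start() {}
func (hardwareCollector) Stop()  {}

// noOpCollector is handed to every worker after the first.
type noOpCollector struct{}

func (noOpCollector) Start() {}
func (noOpCollector) Stop()  {}

// collectorFactory hides the "first caller wins" logic from workers.
type collectorFactory struct {
	mu      sync.Mutex
	created bool
}

// New returns the real collector on the first call and a no-op afterwards.
func (f *collectorFactory) New() usageCollector {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.created {
		return noOpCollector{}
	}
	f.created = true
	return hardwareCollector{}
}

func main() {
	factory := &collectorFactory{}
	first := factory.New()
	second := factory.New()
	fmt.Printf("first: %T, second: %T\n", first, second)
}
```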

In our use case, only the hardware info is collected once per host. Other worker types (decision worker and activity worker) should have different workerUsageCollectors, as they track different task type behaviors.

What type of information are you planning to collect on a per-worker basis in this workerUsageCollector?

Tasklist backlog and poll response, since decision and activity workers have their own pollers that need to be scaled independently.

taylanisikdemir · 2023-11-21T01:05:32Z

internal/internal_worker_usage_collector.go

+					zap.String(tagPanicStack, st))
+			}
+		}()
+		defer w.wg.Done()


There are a few problematic things about this goroutine closure:

This wg.Done() will be called as soon as the goroutine for go w.runHardwareCollector() is started. The WaitGroup shouldn't be marked as done until runHardwareCollector() terminates, so the call should be moved there.

There is no need for a panic recovery here.

There is no need for a goroutine to invoke runHardwareCollector.
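
The first and third points can be sketched as follows, assuming a simplified collector type with hypothetical fields: wg.Add happens in Start, the deferred wg.Done lives inside the loop function itself, and Start launches that function directly instead of wrapping it in another goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// collector is a simplified, hypothetical stand-in for workerUsageCollector.
type collector struct {
	wg         sync.WaitGroup
	shutdownCh chan struct{}
	stopped    bool // set by the loop just before it returns
}

// Start registers the goroutine with the WaitGroup before launching it.
func (c *collector) Start() {
	c.wg.Add(1)
	go c.runHardwareCollector()
}

// runHardwareCollector marks the WaitGroup as done only when the loop exits.
func (c *collector) runHardwareCollector() {
	defer c.wg.Done()
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-c.shutdownCh:
			c.stopped = true
			return
		case <-ticker.C:
			// Hardware usage would be collected and emitted here.
		}
	}
}

// Stop returns only after runHardwareCollector has terminated.
func (c *collector) Stop() {
	close(c.shutdownCh)
	c.wg.Wait()
}

func main() {
	c := &collector{shutdownCh: make(chan struct{})}
	c.Start()
	c.Stop()
	fmt.Println("loop terminated before Stop returned:", c.stopped)
}
```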

taylanisikdemir · 2023-11-21T01:07:09Z

internal/internal_worker_usage_collector.go

+			case <-ticker.C:
+				// Given that decision worker and activity worker are running in the same host, we only need to collect
+				// hardware usage from one of them.
+				if w.workerType == "DecisionWorker" {


What type of information are you planning to collect on a per-worker basis in this workerUsageCollector?

taylanisikdemir · 2023-11-23T00:29:46Z

internal/worker.go

+
+		// Optional: This implementation ensures that a specific function is executed only once per instance.
+		// The mechanism can be overridden by other interfaces that implement the 'Do()' method.
+		//
+		// default: nil, that would ensure some functions are executed only once
+		Sync oncePerHost


These are user-visible worker options. We shouldn't expose oncePerHost here.

If we do not expose it here, we might need to add it to workerExecutionParameters and pass the sync.Once as a parameter to the functions "NewWorker" and "newAggregatedWorker". What do you think about that idea?

It makes sense to add it to workerExecutionParameters, as that is not exposed outside. I'd also recommend looking at the WithSomeOption pattern in Go: https://www.sohamkamani.com/golang/options-pattern/.
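
A minimal sketch of the options pattern applied to an unexported parameter struct. All names here (executionParams, paramOption, withHardwareOnce, alwaysRun) are hypothetical and only stand in for workerExecutionParameters; the real struct and constructors may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// once matches the interface discussed earlier in the review.
type once interface {
	Do(func())
}

// executionParams is a hypothetical stand-in for workerExecutionParameters.
// It is unexported, so the once field never appears in user-facing options.
type executionParams struct {
	hardwareOnce once
}

// paramOption mutates executionParams during construction.
type paramOption func(*executionParams)

// withHardwareOnce overrides the default once, intended for tests.
func withHardwareOnce(o once) paramOption {
	return func(p *executionParams) { p.hardwareOnce = o }
}

// newExecutionParams applies the defaults first, then any overrides.
func newExecutionParams(opts ...paramOption) *executionParams {
	p := &executionParams{hardwareOnce: &sync.Once{}}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

// alwaysRun is a test double that never suppresses the call.
type alwaysRun struct{}

func (alwaysRun) Do(f func()) { f() }

func main() {
	count := 0
	inc := func() { count++ }

	prod := newExecutionParams()
	prod.hardwareOnce.Do(inc)
	prod.hardwareOnce.Do(inc)
	fmt.Println("calls with default once:", count)

	test := newExecutionParams(withHardwareOnce(alwaysRun{}))
	test.hardwareOnce.Do(inc)
	test.hardwareOnce.Do(inc)
	fmt.Println("calls after test override:", count)
}
```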

codecov · 2024-05-02T19:28:39Z

Codecov Report

Attention: Patch coverage is 67.01031%, with 32 lines in your changes missing coverage. Please review.

Project coverage is 73.23%. Comparing base (7f81710) to head (822564b).

Additional details and impacted files

Files	Coverage Δ
internal/internal_worker.go	`79.65% <100.00%> (+0.08%)`	⬆️
internal/worker.go	`14.28% <ø> (ø)`
internal/internal_worker_base.go	`75.54% <89.65%> (+9.04%)`	⬆️
internal/internal_worker_usage_collector.go	`53.22% <53.22%> (ø)`

Move hardware emit to separate component.

f02ecae

timl3136 force-pushed the worker-utilization branch from 8973aec to f02ecae Compare November 14, 2023 23:46

timl3136 and others added 4 commits November 14, 2023 15:46

Merge branch 'master' into worker-utilization

e57ce8d

add npe check

e8abca3

remove Sync.Once

51cb988

remove Sync.Once and add workertype field

4d28be1

3vilhamster reviewed Nov 16, 2023

timl3136 added 2 commits November 16, 2023 10:26

Calculate workflow history size and count and expose that to client (u…

ffd2d75

…ber-go#1270) Enable client side estimated history size exposure via API

Merge branch 'master' into worker-utilization

21fe267

timl3136 marked this pull request as ready for review November 16, 2023 18:44

taylanisikdemir reviewed Nov 16, 2023

timl3136 added 8 commits November 16, 2023 11:09

Resolve comments and add a new workerUsageCollectorPanic metric

51f7207

Add Sync.once back and change test so that it won't block testing

e304034

further testing

24b4a84

more

a89107f

Change to shutdownCh instead of ctx.cancel

a9c526f

add ctx.canel back

fa1e190

remove cancel and add logger

3628eb9

move ticker

96e4267

taylanisikdemir reviewed Nov 21, 2023

timl3136 added 3 commits November 21, 2023 20:50

add sync.Once into a worker option so that test code will not be bloc…

5092606

…ked by sync.Once

minor change

1a52b18

minor change

7f8a165

taylanisikdemir reviewed Nov 23, 2023

Merge branch 'master' into worker-utilization

d0dac1c

timl3136 requested review from Groxx, shijiesheng, agautam478, jakobht and dkrotx as code owners November 25, 2023 21:43

timl3136 requested a review from demirkayaender as a code owner November 25, 2023 21:43

timl3136 and others added 3 commits December 4, 2023 14:31

Merge branch 'master' into worker-utilization

64cddbc

Merge branch 'master' into worker-utilization

cde3ba4

Merge branch 'master' into worker-utilization

822564b
