
GC JVM runtime metrics proposal #6362

Closed

Conversation

jack-berg (Member)

Garbage collection is one of the important remaining items for the JVM runtime working group. Hoping we can make some progress async, and using this draft PR to collect feedback on a proposal.

The current GC prototype metrics follow the schema:

| Name                 | Unit         | Unit (UCUM)     | Instrument Type | Value Type | Attribute Key | Attribute Values          |
|----------------------|--------------|-----------------|-----------------|------------|---------------|---------------------------|
| runtime.jvm.gc.time  | Milliseconds | `ms`            | Counter         | Int64      | gc            | Name of garbage collector |
| runtime.jvm.gc.count | Collections  | `{collections}` | Counter         | Int64      | gc            | Name of garbage collector |

This is ok, but we can do better by hooking into garbage collection notifications, which grants access to details of each garbage collection event. I propose collecting a single histogram with measurements representing the durations of individual gc events:

| Name                        | Unit         | Unit (UCUM) | Instrument Type | Value Type | Attribute Key | Attribute Values                |
|-----------------------------|--------------|-------------|-----------------|------------|---------------|---------------------------------|
| process.runtime.jvm.gc.time | Milliseconds | `ms`        | Histogram       | Int64      | gc            | Name of garbage collector       |
|                             |              |             |                 |            | cause         | The cause of the collection     |
|                             |              |             |                 |            | action        | The collection action performed |
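
As a rough sketch of the notification hookup this implies (illustrative only, not this PR's actual code; it assumes an OpenTelemetry `Meter` is available, and the class and method names are placeholders):

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

class GcDurationInstrumentation {

  static void register(Meter meter) {
    LongHistogram gcTime =
        meter.histogramBuilder("process.runtime.jvm.gc.time").ofLongs().setUnit("ms").build();

    NotificationListener listener =
        (notification, handback) -> {
          if (!notification
              .getType()
              .equals(GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION)) {
            return;
          }
          GarbageCollectionNotificationInfo info =
              GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
          // Record the duration of this individual collection event (milliseconds),
          // attributed by collector name, action, and cause.
          gcTime.record(
              info.getGcInfo().getDuration(),
              Attributes.of(
                  AttributeKey.stringKey("gc"), info.getGcName(),
                  AttributeKey.stringKey("action"), info.getGcAction(),
                  AttributeKey.stringKey("cause"), info.getGcCause()));
        };

    // On HotSpot, each GarbageCollectorMXBean is also a NotificationEmitter.
    for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
      if (gcBean instanceof NotificationEmitter) {
        ((NotificationEmitter) gcBean).addNotificationListener(listener, null, null);
      }
    }
  }
}
```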

Some of the types of analysis that are possible with the histogram include:

  • Total time spent on garbage collection
  • Count of garbage collection events
  • Min, max, average, percentiles of garbage collection time events
  • The attributes allow you to characterize GC time by the type of event and its cause. With a little prior knowledge you can break down GC time spent concurrently vs. stop-the-world.

I've done some testing locally to try to better understand the series that are produced using the attributes described above when different garbage collectors are used. The list isn't exhaustive as my test setup surely didn't exercise all the things that can trigger different types of gc.

  • G1 Garbage Collector (default for many jvms)
    • name: G1 Young Generation, action: end of minor GC, cause: G1 Evacuation Pause
    • name: G1 Young Generation, action: end of minor GC, cause: G1 Preventive Collection
    • name: G1 Young Generation, action: end of minor GC, cause: Metadata GC Threshold
    • name: G1 Old Generation, action: end of major GC, cause: System.gc()
  • Serial Garbage Collector -Xlog:gc -XX:+UseSerialGC
    • name: Copy, action: end of minor GC, cause: Allocation Failure
    • name: MarkSweepCompact, action: end of major GC, cause: Metadata GC Threshold
    • name: MarkSweepCompact, action: end of major GC, cause: System.gc()
  • Parallel Garbage Collector -Xlog:gc -XX:+UseParallelGC
    • name: PS Scavenge, action: end of minor GC, cause: Allocation Failure
    • name: PS Scavenge, action: end of minor GC, cause: Metadata GC Threshold
    • name: PS Scavenge, action: end of minor GC, cause: System.gc()
    • name: PS MarkSweep, action: end of major GC, cause: Metadata GC Threshold
    • name: PS MarkSweep, action: end of major GC, cause: System.gc()
  • Z Garbage Collector -Xlog:gc -XX:+UseZGC
    • name: ZGC Cycles, action: end of GC cycle, cause: Proactive
    • name: ZGC Cycles, action: end of GC cycle, cause: Warmup
    • name: ZGC Cycles, action: end of GC cycle, cause: System.gc()
    • name: ZGC Cycles, action: end of GC cycle, cause: Metadata GC Threshold
    • name: ZGC Pauses, action: end of GC pause, cause: Metadata GC Threshold
    • name: ZGC Pauses, action: end of GC pause, cause: Proactive
    • name: ZGC Pauses, action: end of GC pause, cause: System.gc()
    • name: ZGC Pauses, action: end of GC pause, cause: Warmup
  • Shenandoah -Xlog:gc -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
    • name: Shenandoah Cycles, action: end of GC cycle, cause: Concurrent GC
    • name: Shenandoah Cycles, action: end of GC cycle, cause: System.gc()
    • name: Shenandoah Pauses, action: end of GC pause, cause: Concurrent GC
    • name: Shenandoah Pauses, action: end of GC pause, cause: System.gc()
  • CMS Garbage Collector -XX:+PrintGCDetails -XX:+UseParNewGC
    • name: ParNew, action: end of minor GC, cause: Allocation Failure
    • name: MarkSweepCompact, action: end of major GC, cause: System.gc()
    • name: MarkSweepCompact, action: end of major GC, cause: Metadata GC Threshold

I've punted on a couple of things that could provide additional value, but IMO aren't needed initially:

  • Include an additional attribute indicating whether the collection was concurrent, stop the world, or unknown. This would make it simpler to build dashboards showing the time spent in stop-the-world collections.
  • Include an additional metric capturing the size of memory before and after. The GC notifications give access to GcInfo, which includes information about memory usage before and after collection. We could process this information and record measurements to an instrument (e.g. a histogram) representing how much memory was reclaimed by the collection (a rough sketch follows this list).
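
For that second item, a minimal sketch of how reclaimed memory could be derived from GcInfo inside the same notification listener as above (the helper class is hypothetical, and whichever instrument it would feed is still to be agreed on):

```java
import com.sun.management.GcInfo;
import java.lang.management.MemoryUsage;
import java.util.Map;

class GcReclaimedMemory {

  // Sum of (used before - used after) across all memory pools touched by this collection.
  // A positive value means the collection freed memory overall.
  static long reclaimedBytes(GcInfo gcInfo) {
    Map<String, MemoryUsage> before = gcInfo.getMemoryUsageBeforeGc();
    Map<String, MemoryUsage> after = gcInfo.getMemoryUsageAfterGc();
    long reclaimed = 0;
    for (Map.Entry<String, MemoryUsage> entry : before.entrySet()) {
      MemoryUsage post = after.get(entry.getKey());
      if (post != null) {
        reclaimed += entry.getValue().getUsed() - post.getUsed();
      }
    }
    return reclaimed;
  }
}
```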

Will hold off on a full PR until we reach some sort of consensus. Let me know what you think!

@sfriberg (Contributor)

For the histogram process.runtime.jvm.gc.time I wonder if the cause attribute is needed here. I suspect the analysis will be across all GC independent of cause anyway, and dropping it would significantly reduce the number of series generated. Having an attribute indicating whether the action was concurrent or a pause would be much more critical to separate on (I guess the action covers this in part via its name?).

Cause is good to track, but perhaps making it part of a separate counter (process.runtime.jvm.gc.count) instead would be enough?

The memory usage before and after would be awesome to track.

@jack-berg (Member, Author)

For the histogram process.runtime.jvm.gc.time I wonder if the cause attribute is needed here

I'm happy enough to omit the cause, though I didn't see a significant number of different causes in testing.

Having an attribute indicating whether the action was concurrent or a pause would be much more critical

Yes, I agree. Currently that's possible with some prior knowledge of the types of actions your particular garbage collector takes. It would be valuable to users if we could relieve them of that burden by including the categorization of concurrency in the instrumentation.

@mateuszrzeszutek linked an issue Sep 5, 2022 that may be closed by this pull request
@rapphil (Contributor)

rapphil commented Sep 14, 2022

I think that including additional metrics capturing the size of memory before and after GC is as useful as the GC time, and it is worth including them from the start. With the total MemoryUsageAfterGc it is possible to check whether the selected max heap size is appropriate for the application. When operating a service, it is handy to create an alarm on this metric so that you know when your application is trending toward an out-of-memory crash, there is a regression, or the heap size needs to be adjusted.

MemoryUsageAfterGc and MemoryUsageBeforeGc should ideally be emitted as a percentage of the max heap size, so that customers don't need to adjust alarms after the heap size is changed.
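
As a small illustration of that idea (the helper name is made up; it only shows the arithmetic, guarding against pools that report no maximum):

```java
import java.lang.management.MemoryUsage;

class HeapUtilization {

  // Fraction of the pool's configured maximum still in use after collection.
  // MemoryUsage.getMax() returns -1 when no limit is defined, so guard against it.
  static double usedFractionAfterGc(MemoryUsage afterGc) {
    long max = afterGc.getMax();
    return max > 0 ? (double) afterGc.getUsed() / max : Double.NaN;
  }
}
```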

@fstab (Member)

fstab commented Sep 15, 2022

There was an interesting bug report recently regarding the GC metrics collector in the Prometheus Java library. This should not apply if this instrumentation is only used by the Java agent, but if anyone uses the instrumentation as a library in their web application the issue applies here as well: prometheus/client_java#809

    GC_KEY,
    notificationInfo.getGcName(),
    CAUSE_KEY,
    notificationInfo.getGcCause(),


probably trim the string to the first N chars only?

@jack-berg (Member, Author)

MemoryUsageAfterGc and MemoryUsageBeforeGc should be ideally emitted as percentage of the max heap size so that customers don't need to adjust alarms after changes in the heap size are performed.

@rapphil do you have ideas on what types of instruments you would use to capture this information, and what dimensions would be included? The GcInfo class gives access to a map of memory used before and after, where the keys are memory pool names, the same as those provided by MemoryPoolMXBean#getName() (e.g. G1 Old Gen, G1 Eden space, G1 Survivor space). What's the best way to record to an instrument to enable the type of analysis you're envisioning? I think for it to be useful we need to filter out gc events that aren't stop the world, and record to async up down counters. The non-monotonic cumulative sums emitted would represent the "latest" memory used before and after GC. If the latest memory used after GC is trending upwards, it might be an indication that you have a memory leak.

There was an interesting bug report recently regarding the GC metrics collector in the Prometheus Java library.

Thanks for the heads up @fstab! Will think about how to avoid this.

@PeterF778 (Contributor)

I think for it to be useful we need to filter out gc events that aren't stop the world, and record to async up down counters. The non-monotonic cumulative sums emitted would represent the "latest" memory used before and after GC. If the latest memory used after GC is trending upwards, it might be an indication that you have a memory leak.

I agree that MemoryUsageAfterGc is very useful. As @rapphil and @jack-berg noticed, it allows detecting two unhealthy conditions:

  1. Memory leak
  2. Inappropriate maximum heap size

To address these cases, showing MemoryUsageAfterGc as a percentage of the maximum heap size is both sufficient and convenient (as @rapphil proposed). I do not see much value in reporting these numbers after each GC event. Tapping into MemoryPoolMXBean.getCollectionUsage() should provide sufficient information. The main benefit of this API is that it frees us from having to filter out lightweight GC events that might not make a serious effort to clear the heap (such as GC-ing only the Survivor or New memory pool, or otherwise making a partial GC effort).

On the other hand, I do not see a good use for MemoryUsageBeforeGc other than calculating total allocated bytes.

@jack-berg (Member, Author)

To address these cases, showing MemoryUsageAfterGc as a percentage of the maximum heap size is both sufficient and convenient (as @rapphil proposed). I do not see much value in reporting these numbers after each GC event. Tapping into MemoryPoolMXBean.getCollectionUsage() should provide sufficient information.

That seems reasonable to me @PeterF778. And MemoryPoolMXBean.getCollectionUsage() does seem to fit the bill. I would propose we extend the MemoryPools instrumentation to collect the post-GC usage of each pool, and report it as follows:

| Name                                       | Description                                              | Unit    | Unit ([UCUM](README.md#instrument-units)) | Instrument Type ([*](README.md#instrument-types)) | Value Type | Attribute Key | Attribute Values      |
|--------------------------------------------|----------------------------------------------------------|---------|-------------------------------------------|---------------------------------------------------|------------|---------------|-----------------------|
| process.runtime.jvm.memory.usage_after_gc           | Measure of memory used after the most recent garbage collection event on this pool | Bytes   | `By`                                      | UpDownCounter                                     | Int64      | type          | `"heap"`, `"nonheap"` |
|                                            |                                                          |         |                                           |                                                   |            | pool          | Name of pool [1]      |

This would be in line with the existing JVM memory metric conventions.

You'd be able to divide it by process.runtime.jvm.memory.limit to compute the percentage as a function of the pool's limit.
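
For illustration, a rough sketch of what that extension to the MemoryPools instrumentation might look like against the OpenTelemetry Java API (the metric name and attribute keys come from the table above; everything else is an assumption, not the actual instrumentation):

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.List;

class MemoryUsageAfterGc {

  static void register(Meter meter) {
    List<MemoryPoolMXBean> pools = ManagementFactory.getMemoryPoolMXBeans();
    meter
        .upDownCounterBuilder("process.runtime.jvm.memory.usage_after_gc")
        .setUnit("By")
        .buildWithCallback(
            measurement -> {
              for (MemoryPoolMXBean pool : pools) {
                // getCollectionUsage() reflects usage after the most recent collection of
                // this pool; it is null for pools the JVM does not collect.
                MemoryUsage usage = pool.getCollectionUsage();
                if (usage == null) {
                  continue;
                }
                measurement.record(
                    usage.getUsed(),
                    Attributes.of(
                        AttributeKey.stringKey("type"),
                        pool.getType() == MemoryType.HEAP ? "heap" : "nonheap",
                        AttributeKey.stringKey("pool"),
                        pool.getName()));
              }
            });
  }
}
```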

@trask (Member)

trask commented Oct 23, 2022

That seems reasonable to me @PeterF778. And MemoryPoolMXBean.getCollectionUsage() does seem to fit the bill. I would propose we extend the MemoryPools instrumentation to collect the post-GC usage of each pool, and report it as follows:


this sounds good to me 👍

@jack-berg (Member, Author)

Will open a separate PR for process.runtime.jvm.memory.usage_after_gc, and a corresponding PR to the spec.

@jack-berg (Member, Author)

Replaced by #6964 and #6963.

@jack-berg closed this Oct 24, 2022
trask pushed a commit that referenced this pull request Nov 8, 2022
trask pushed a commit that referenced this pull request Nov 15, 2022
….runtime.jvm.gc.duration histogram (#6964)

Replaces #6362.

I've reduced the attributes to only record the gc name and the action
that was taken (i.e. I've removed the gc cause). If needed we can add
the cause later, but for now this should be sufficient to determine
total time spent in GC, and categorize time spent as stop the world or
parallel.