
GC JVM runtime metrics proposal #6362

Closed

Conversation

jack-berg (Member)

Garbage collection is one of the important remaining items for the JVM runtime working group. Hoping we can make some progress async, and using this draft PR to collect feedback on a proposal.

The current GC prototype metrics follow the schema:

| Name                 | Unit         | Unit (UCUM)     | Instrument Type | Value Type | Attribute Key | Attribute Values          |
|----------------------|--------------|-----------------|-----------------|------------|---------------|---------------------------|
| runtime.jvm.gc.time  | Milliseconds | `ms`            | Counter         | Int64      | gc            | Name of garbage collector |
| runtime.jvm.gc.count | Collections  | `{collections}` | Counter         | Int64      | gc            | Name of garbage collector |

This is ok, but we can do better by hooking into garbage collection notifications, which grants access to details of each garbage collection event. I propose collecting a single histogram with measurements representing the durations of individual gc events:

| Name                        | Unit         | Unit (UCUM) | Instrument Type | Value Type | Attribute Key | Attribute Values                |
|-----------------------------|--------------|-------------|-----------------|------------|---------------|---------------------------------|
| process.runtime.jvm.gc.time | Milliseconds | `ms`        | Histogram       | Int64      | gc            | Name of garbage collector       |
|                             |              |             |                 |            | cause         | The cause of the collection     |
|                             |              |             |                 |            | action        | The collection action performed |
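
As a rough sketch of the notification hookup this implies (illustrative only, not this PR's actual code; it assumes an OpenTelemetry `Meter` is available, and the class and method names are placeholders):

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

class GcDurationInstrumentation {

  static void register(Meter meter) {
    LongHistogram gcTime =
        meter.histogramBuilder("process.runtime.jvm.gc.time").ofLongs().setUnit("ms").build();

    NotificationListener listener =
        (notification, handback) -> {
          if (!notification
              .getType()
              .equals(GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION)) {
            return;
          }
          GarbageCollectionNotificationInfo info =
              GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
          // Record the duration of this individual collection event (milliseconds),
          // attributed by collector name, action, and cause.
          gcTime.record(
              info.getGcInfo().getDuration(),
              Attributes.of(
                  AttributeKey.stringKey("gc"), info.getGcName(),
                  AttributeKey.stringKey("action"), info.getGcAction(),
                  AttributeKey.stringKey("cause"), info.getGcCause()));
        };

    // On HotSpot, each GarbageCollectorMXBean is also a NotificationEmitter.
    for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
      if (gcBean instanceof NotificationEmitter) {
        ((NotificationEmitter) gcBean).addNotificationListener(listener, null, null);
      }
    }
  }
}
```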

Some of the types of analysis that are possible with the histogram include:

  • Total time spent on garbage collection
  • Count of garbage collection events
  • Min, max, average, percentiles of garbage collection time events
  • The attributes allow you to characterize GC time by the type of event and its cause. With a little prior knowledge you can break down GC time spent concurrently vs. stop-the-world.

I've done some testing locally to try to better understand the series that are produced using the attributes described above when different garbage collectors are used. The list isn't exhaustive as my test setup surely didn't exercise all the things that can trigger different types of gc.

  • G1 Garbage Collector (default for many jvms)
    • name: G1 Young Generation, action: end of minor GC, cause: G1 Evacuation Pause
    • name: G1 Young Generation, action: end of minor GC, cause: G1 Preventive Collection
    • name: G1 Young Generation, action: end of minor GC, cause: Metadata GC Threshold
    • name: G1 Old Generation, action: end of major GC, cause: System.gc()
  • Serial Garbage Collector -Xlog:gc -XX:+UseSerialGC
    • name: Copy, action: end of minor GC, cause: Allocation Failure
    • name: MarkSweepCompact, action: end of major GC, cause: Metadata GC Threshold
    • name: MarkSweepCompact, action: end of major GC, cause: System.gc()
  • Parallel Garbage Collector -Xlog:gc -XX:+UseParallelGC
    • name: PS Scavenge, action: end of minor GC, cause: Allocation Failure
    • name: PS Scavenge, action: end of minor GC, cause: Metadata GC Threshold
    • name: PS Scavenge, action: end of minor GC, cause: System.gc()
    • name: PS MarkSweep, action: end of major GC, cause: Metadata GC Threshold
    • name: PS MarkSweep, action: end of major GC, cause: System.gc()
  • Z Garbage Collector -Xlog:gc -XX:+UseZGC
    • name: ZGC Cycles, action: end of GC cycle, cause: Proactive
    • name: ZGC Cycles, action: end of GC cycle, cause: Warmup
    • name: ZGC Cycles, action: end of GC cycle, cause: System.gc()
    • name: ZGC Cycles, action: end of GC cycle, cause: Metadata GC Threshold
    • name: ZGC Pauses, action: end of GC pause, cause: Metadata GC Threshold
    • name: ZGC Pauses, action: end of GC pause, cause: Proactive
    • name: ZGC Pauses, action: end of GC pause, cause: System.gc()
    • name: ZGC Pauses, action: end of GC pause, cause: Warmup
  • Shenandoah -Xlog:gc -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
    • name: Shenandoah Cycles, action: end of GC cycle, cause: Concurrent GC
    • name: Shenandoah Cycles, action: end of GC cycle, cause: System.gc()
    • name: Shenandoah Pauses, action: end of GC pause, cause: Concurrent GC
    • name: Shenandoah Pauses, action: end of GC pause, cause: System.gc()
  • CMS Garbage Collector -XX:+PrintGCDetails -XX:+UseParNewGC
    • name: ParNew, action: end of minor GC, cause: Allocation Failure
    • name: MarkSweepCompact, action: end of major GC, cause: System.gc()
    • name: MarkSweepCompact, action: end of major GC, cause: Metadata GC Threshold

I've punted on a couple of things that could provide additional value, but IMO aren't needed initially:

  • Include an additional attribute indicating whether the collection was concurrent, stop the world, or unknown. This would make it simpler to build dashboards showing the time spent in stop-the-world collections.
  • Include an additional metric capturing the size of memory before and after. The GC notifications give access to GcInfo, which includes information about memory usage before and after collection. We could process this information and record measurements to an instrument (e.g. a histogram) representing how much memory was reclaimed by the collection (a rough sketch follows this list).
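
For that second item, a minimal sketch of how reclaimed memory could be derived from GcInfo inside the same notification listener as above (the helper class is hypothetical, and whichever instrument it would feed is still to be agreed on):

```java
import com.sun.management.GcInfo;
import java.lang.management.MemoryUsage;
import java.util.Map;

class GcReclaimedMemory {

  // Sum of (used before - used after) across all memory pools touched by this collection.
  // A positive value means the collection freed memory overall.
  static long reclaimedBytes(GcInfo gcInfo) {
    Map<String, MemoryUsage> before = gcInfo.getMemoryUsageBeforeGc();
    Map<String, MemoryUsage> after = gcInfo.getMemoryUsageAfterGc();
    long reclaimed = 0;
    for (Map.Entry<String, MemoryUsage> entry : before.entrySet()) {
      MemoryUsage post = after.get(entry.getKey());
      if (post != null) {
        reclaimed += entry.getValue().getUsed() - post.getUsed();
      }
    }
    return reclaimed;
  }
}
```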

Will hold off on a full PR until we reach some sort of consensus. Let me know what you think!

@sfriberg (Contributor)

For the histogram process.runtime.jvm.gc.time I wonder if the cause attribute is needed here. I suspect the analysis will be across all GC independent of cause anyway, and dropping it would significantly reduce the number of series generated. Having an attribute indicating whether the action was concurrent or a pause would be much more critical to separate on (I guess the action covers this in part via its name?).

Cause is good to track, but perhaps making it part of a separate counter (process.runtime.jvm.gc.count) instead would be enough?

The memory usage before and after would be awesome to track.

@jack-berg (Member, Author)

For the histogram process.runtime.jvm.gc.time I wonder if the cause attribute is needed here

I'm happy enough to omit the cause, though I didn't see a significant number of different causes in testing.

Having an attribute indicating whether the action was concurrent or a pause would be much more critical

Yes, I agree. Currently that's possible with some prior knowledge of the types of actions your particular garbage collector takes. It would be valuable to users if we could relieve them of that burden by including the categorization of concurrency in the instrumentation.

@mateuszrzeszutek linked an issue Sep 5, 2022 that may be closed by this pull request
@rapphil (Contributor)

rapphil commented Sep 14, 2022

I think that including additional metrics capturing the size of memory before and after GC is as useful as the GC time, and it is worth including them from the start. With the total MemoryUsageAfterGc it is possible to check whether the selected max heap size is appropriate for the application. When operating a service, it is handy to create an alarm on this metric so that you know when your application is trending toward an out-of-memory crash, there is a regression, or the heap size needs to be adjusted.

MemoryUsageAfterGc and MemoryUsageBeforeGc should ideally be emitted as a percentage of the max heap size, so that customers don't need to adjust alarms after the heap size is changed.
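
As a small illustration of that idea (the helper name is made up; it only shows the arithmetic, guarding against pools that report no maximum):

```java
import java.lang.management.MemoryUsage;

class HeapUtilization {

  // Fraction of the pool's configured maximum still in use after collection.
  // MemoryUsage.getMax() returns -1 when no limit is defined, so guard against it.
  static double usedFractionAfterGc(MemoryUsage afterGc) {
    long max = afterGc.getMax();
    return max > 0 ? (double) afterGc.getUsed() / max : Double.NaN;
  }
}
```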

@fstab (Member)

fstab commented Sep 15, 2022

There was an interesting bug report recently regarding the GC metrics collector in the Prometheus Java library. This should not apply if this instrumentation is only used by the Java agent, but if anyone uses the instrumentation as a library in their web application the issue applies here as well: prometheus/client_java#809

    GC_KEY,
    notificationInfo.getGcName(),
    CAUSE_KEY,
    notificationInfo.getGcCause(),


probably trim the string to the first N chars only?

@jack-berg (Member, Author)

MemoryUsageAfterGc and MemoryUsageBeforeGc should be ideally emitted as percentage of the max heap size so that customers don't need to adjust alarms after changes in the heap size are performed.

@rapphil do you have ideas on what types of instruments you would use to capture this information, and what dimensions would be included? The GcInfo class gives access to a map of memory used before and after, where the keys are memory pool names, the same as those provided by MemoryPoolMXBean#getName() (e.g. G1 Old Gen, G1 Eden space, G1 Survivor space). What's the best way to record to an instrument to enable the type of analysis you're envisioning? I think for it to be useful we need to filter out gc events that aren't stop the world, and record to async up down counters. The non-monotonic cumulative sums emitted would represent the "latest" memory used before and after GC. If the latest memory used after GC is trending upwards, it might be an indication that you have a memory leak.

There was an interesting bug report recently regarding the GC metrics collector in the Prometheus Java library.

Thanks for the heads up @fstab! Will think about how to avoid this.

@PeterF778 (Contributor)

I think for it to be useful we need to filter out gc events that aren't stop the world, and record to async up down counters. The non-monotonic cumulative sums emitted would represent the "latest" memory used before and after GC. If the latest memory used after GC is trending upwards, it might be an indication that you have a memory leak.

I agree that MemoryUsageAfterGc is very useful. As @rapphil and @jack-berg noticed, it allows detecting two unhealthy conditions:

  1. Memory leak
  2. Inappropriate maximum heap size

To address these cases, showing MemoryUsageAfterGc as a percentage of the maximum heap size is both sufficient and convenient (as @rapphil proposed). I do not see much value in reporting these numbers after each GC event. Tapping into MemoryPoolMXBean.getCollectionUsage() should provide sufficient information. The main benefit of this API is that it frees us from having to filter out lightweight GC events that might not make a serious effort to clear the heap (such as GC-ing only the Survivor or New memory pool, or otherwise making a partial GC effort).

On the other hand, I do not see a good use for MemoryUsageBeforeGc other than calculating total allocated bytes.

@jack-berg (Member, Author)

To address these cases, showing MemoryUsageAfterGc as a percentage of the maximum heap size is both sufficient and convenient (as @rapphil proposed). I do not see much value in reporting these numbers after each GC event. Tapping into MemoryPoolMXBean.getCollectionUsage() should provide sufficient information.

That seems reasonable to me @PeterF778. And MemoryPoolMXBean.getCollectionUsage() does seem to fit the bill. I would propose we extend the MemoryPools instrumentation to collect the post-GC usage of each pool, and report it as follows:

| Name                                       | Description                                              | Unit    | Unit ([UCUM](README.md#instrument-units)) | Instrument Type ([*](README.md#instrument-types)) | Value Type | Attribute Key | Attribute Values      |
|--------------------------------------------|----------------------------------------------------------|---------|-------------------------------------------|---------------------------------------------------|------------|---------------|-----------------------|
| process.runtime.jvm.memory.usage_after_gc           | Measure of memory used after the most recent garbage collection event on this pool | Bytes   | `By`                                      | UpDownCounter                                     | Int64      | type          | `"heap"`, `"nonheap"` |
|                                            |                                                          |         |                                           |                                                   |            | pool          | Name of pool [1]      |

This would be in line with the existing JVM memory metric conventions.

You'd be able to divide it by process.runtime.jvm.memory.limit to compute the percentage as a function of the pool's limit.
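
For illustration, a rough sketch of what that extension to the MemoryPools instrumentation might look like against the OpenTelemetry Java API (the metric name and attribute keys come from the table above; everything else is an assumption, not the actual instrumentation):

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.List;

class MemoryUsageAfterGc {

  static void register(Meter meter) {
    List<MemoryPoolMXBean> pools = ManagementFactory.getMemoryPoolMXBeans();
    meter
        .upDownCounterBuilder("process.runtime.jvm.memory.usage_after_gc")
        .setUnit("By")
        .buildWithCallback(
            measurement -> {
              for (MemoryPoolMXBean pool : pools) {
                // getCollectionUsage() reflects usage after the most recent collection of
                // this pool; it is null for pools the JVM does not collect.
                MemoryUsage usage = pool.getCollectionUsage();
                if (usage == null) {
                  continue;
                }
                measurement.record(
                    usage.getUsed(),
                    Attributes.of(
                        AttributeKey.stringKey("type"),
                        pool.getType() == MemoryType.HEAP ? "heap" : "nonheap",
                        AttributeKey.stringKey("pool"),
                        pool.getName()));
              }
            });
  }
}
```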

@trask (Member)

trask commented Oct 23, 2022

That seems reasonable to me @PeterF778. And MemoryPoolMXBean.getCollectionUsage() does seem to fit the bill. I would propose we extend the MemoryPools instrumentation to collect the post-GC usage of each pool, and report it as follows:


this sounds good to me 👍

@jack-berg (Member, Author)

Will open a separate PR for process.runtime.jvm.memory.usage_after_gc, and a corresponding PR to the spec.

@jack-berg (Member, Author)

Replaced by #6964 and #6963.

@jack-berg closed this Oct 24, 2022
trask pushed a commit that referenced this pull request Nov 8, 2022
trask pushed a commit that referenced this pull request Nov 15, 2022
….runtime.jvm.gc.duration histogram (#6964)

Replaces #6362.

I've reduced the attributes to only record the gc name and the action
that was taken (i.e. I've removed the gc cause). If needed we can add
the cause later, but for now this should be sufficient to determine
total time spent in GC, and categorize time spent as stop the world or
parallel.