GC JVM runtime metrics proposal #6362
For the histogram, Cause is good to track, but perhaps it could be part of a separate counter ( The memory usage before and after would be awesome to track.
I'm happy enough to omit the
Yes, I agree. Currently that's possible with some prior knowledge of the types of actions your particular garbage collector takes. It would be valuable to users if we could relieve them of that burden by including the categorization of concurrency in the instrumentation.
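As a rough illustration of what such a categorization could look like, here is a minimal sketch that maps the action strings observed in the test runs listed in this PR description to a coarse category. The class name, the enum, and the mapping itself are assumptions made for illustration, not part of the proposal. Collectors that were not exercised in those runs may report the same action strings for work that is partly concurrent, so anything unrecognized falls through to UNKNOWN.

```java
/** Hypothetical sketch: coarse categorization of a GC event by its action string. */
final class GcConcurrencySketch {

  enum Kind {
    PAUSE,            // assumed to be stop-the-world work
    CONCURRENT_CYCLE, // assumed to be a cycle that runs mostly alongside the application
    UNKNOWN
  }

  /** Maps an action string (as reported by the GC notification) to a category. */
  static Kind classify(String action) {
    if (action == null) {
      return Kind.UNKNOWN;
    }
    switch (action) {
      case "end of minor GC":
      case "end of major GC":
      case "end of GC pause":
        return Kind.PAUSE; // assumption based on the collectors tested in this thread
      case "end of GC cycle":
        return Kind.CONCURRENT_CYCLE; // seen for "ZGC Cycles" and "Shenandoah Cycles"
      default:
        return Kind.UNKNOWN; // do not guess for strings that were never observed
    }
  }

  public static void main(String[] args) {
    System.out.println(classify("end of GC pause"));
    System.out.println(classify("end of GC cycle"));
  }
}
```

Keeping the mapping in one place like this means a new collector only needs a new case, and dashboards can group by the category instead of by collector-specific names.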
I think additional metrics capturing the size of memory before and after GC are as useful as the GC time, and I think it is worth including them from the start. With the total MemoryUsageAfterGc it is possible to check whether the selected max heap size is appropriate for the application. When operating a service, it is handy to create an alarm on this metric so that you know when your application is trending toward an out-of-memory crash, when there are regressions, or when the heap size needs to be adjusted. MemoryUsageAfterGc and MemoryUsageBeforeGc should ideally be emitted as a percentage of the max heap size so that customers don't need to adjust alarms after the heap size changes.
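To make the percentage idea concrete, here is a minimal sketch of the arithmetic. It assumes the per-pool maps come from `GcInfo.getMemoryUsageAfterGc()` (or `getMemoryUsageBeforeGc()`) and that the max heap size comes from `MemoryMXBean.getHeapMemoryUsage().getMax()`. The class name, method names, and pool names below are hypothetical, and the sample values are invented purely to exercise the code.

```java
import java.lang.management.MemoryUsage;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: heap usage around a GC event as a percentage of max heap. */
final class HeapAfterGcPercentSketch {

  /** Sums used bytes over the given heap pools; non-heap pools in the map are ignored. */
  static long usedBytes(Map<String, MemoryUsage> usageByPool, Set<String> heapPools) {
    long total = 0;
    for (Map.Entry<String, MemoryUsage> entry : usageByPool.entrySet()) {
      if (heapPools.contains(entry.getKey())) {
        total += entry.getValue().getUsed();
      }
    }
    return total;
  }

  /** Returns used as a percentage of max heap, or NaN when the max is undefined. */
  static double percentOfMax(long usedBytes, long maxHeapBytes) {
    if (maxHeapBytes <= 0) {
      return Double.NaN; // getMax() returns -1 when no maximum is defined
    }
    return 100.0 * usedBytes / maxHeapBytes;
  }

  public static void main(String[] args) {
    // Invented sample data standing in for GcInfo.getMemoryUsageAfterGc().
    Map<String, MemoryUsage> after = new LinkedHashMap<>();
    after.put("old", new MemoryUsage(0, 300, 400, 800));
    after.put("eden", new MemoryUsage(0, 100, 200, -1));
    after.put("metaspace", new MemoryUsage(0, 50, 60, -1)); // non-heap, excluded below
    Set<String> heapPools = new HashSet<>(Arrays.asList("old", "eden"));

    System.out.println(percentOfMax(usedBytes(after, heapPools), 800)); // prints 50.0
  }
}
```

Because the value is relative to the max heap, an alarm threshold such as "above 90 for several collections in a row" would not need to change when the heap is resized.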
There was an interesting bug report recently regarding the GC metrics collector in the Prometheus Java library. This should not apply if this instrumentation is only used by the Java agent, but if anyone uses the instrumentation as a library in their web application, the issue applies here as well: prometheus/client_java#809
GC_KEY,
notificationInfo.getGcName(),
CAUSE_KEY,
notificationInfo.getGcCause(),
Probably trim the string to its first N chars only?
@rapphil do you have ideas on what types of instruments you would use to capture this information, and what dimensions would be included? The
Thanks for the heads up @fstab! Will think about how to avoid this.
I agree that MemoryUsageAfterGc is very useful. As @rapphil and @jack-berg noticed, it makes it possible to detect two unhealthy conditions:
To address these cases, showing MemoryUsageAfterGc as a percentage of the maximum heap size is both sufficient and convenient (as @rapphil proposed). I do not see much value in reporting these numbers after each GC event. Tapping into MemoryPoolMXBean.getCollectionUsage() should provide sufficient information. The main benefit of this API is that it frees us from filtering out all the lightweight GC events that might not make a good enough effort to clear the heap (such as GC-ing only the Survivor or New memory pool, or otherwise making a partial GC effort). On the other hand, I do not see a good use for MemoryUsageBeforeGc other than calculating total allocated bytes.
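A minimal sketch of the `getCollectionUsage()` approach might look like the following. The class and method names are hypothetical. Note that each pool reports usage as of the most recent collection that affected that pool, so the sum is only an approximation of heap usage after GC. Pools that do not support the call return null and are skipped.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical sketch: post-collection heap usage read from the memory pool MXBeans. */
final class CollectionUsageSketch {

  /** Sums post-collection used bytes across all heap pools that report it. */
  static long heapUsedAfterLastGc() {
    long total = 0;
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      if (pool.getType() != MemoryType.HEAP) {
        continue; // skip Metaspace, code cache, and other non-heap pools
      }
      MemoryUsage usage = pool.getCollectionUsage(); // null if unsupported for this pool
      if (usage != null) {
        total += usage.getUsed();
      }
    }
    return total;
  }

  /** Returns used as a percentage of max heap, or NaN when the max is undefined. */
  static double percentOfMaxHeap(long usedBytes, long maxHeapBytes) {
    if (maxHeapBytes <= 0) {
      return Double.NaN;
    }
    return 100.0 * usedBytes / maxHeapBytes;
  }

  public static void main(String[] args) {
    long max = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getMax();
    double percent = percentOfMaxHeap(heapUsedAfterLastGc(), max);
    System.out.printf("heap used after last GC: %.1f%% of max%n", percent);
  }
}
```

Since this reads a current value on demand rather than reacting to each event, it would fit an asynchronous (observable) instrument better than a per-event recording.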
That seems reasonable to me @PeterF778. And
This would be in line with the existing JVM memory metric conventions. You'd be able to divide it by |
This sounds good to me 👍
Will open a separate PR for |
….runtime.jvm.gc.duration histogram (#6964) Replaces #6362. I've reduced the attributes to only record the gc name and the action that was taken (i.e. I've removed the gc cause). If needed we can add the cause later, but for now this should be sufficient to determine total time spent in GC, and categorize time spent as stop the world or parallel.
Garbage collection is one of the important remaining items for the JVM runtime working group. Hoping we can make some progress async, and using this draft PR to collect feedback on a proposal.
The current GC prototype metrics follow the schema:
ms
{collections}
This is ok, but we can do better by hooking into garbage collection notifications, which grants access to details of each garbage collection event. I propose collecting a single histogram with measurements representing the durations of individual gc events:
ms
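For readers unfamiliar with GC notifications, here is a minimal, dependency-free sketch of the hook. It subscribes to each `GarbageCollectorMXBean` that emits notifications and captures the name, action, cause, and duration of every event. The class name, the `GcEvent` holder, and the in-memory list are assumptions for illustration. The list stands in for the histogram, which a real implementation would record into instead.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

/** Hypothetical sketch: collects per-event GC durations from JMX notifications. */
final class GcDurationSketch {

  /** Stand-in for one histogram measurement plus its attributes. */
  static final class GcEvent {
    final String gc;
    final String action;
    final String cause;
    final long durationMs;

    GcEvent(String gc, String action, String cause, long durationMs) {
      this.gc = gc;
      this.action = action;
      this.cause = cause;
      this.durationMs = durationMs;
    }

    @Override
    public String toString() {
      return gc + ", action:" + action + ", cause:" + cause + ", " + durationMs + " ms";
    }
  }

  /** Recorded events; thread-safe because notifications arrive on another thread. */
  static final List<GcEvent> EVENTS = new CopyOnWriteArrayList<>();

  /** Subscribes to every collector that emits notifications; returns how many were hooked. */
  static int register() {
    NotificationListener listener = (notification, handback) -> {
      if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
          .equals(notification.getType())) {
        return; // ignore anything that is not a GC notification
      }
      GarbageCollectionNotificationInfo info =
          GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
      EVENTS.add(new GcEvent(
          info.getGcName(), info.getGcAction(), info.getGcCause(),
          info.getGcInfo().getDuration())); // duration is reported in milliseconds
    };
    int hooked = 0;
    for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
      if (bean instanceof NotificationEmitter) {
        ((NotificationEmitter) bean).addNotificationListener(listener, null, null);
        hooked++;
      }
    }
    return hooked;
  }

  public static void main(String[] args) throws InterruptedException {
    register();
    System.gc();       // request a collection so that an event is likely to occur
    Thread.sleep(500); // notifications are delivered asynchronously
    EVENTS.forEach(System.out::println);
  }
}
```

Running this with different collector flags is one way to reproduce lists of series like the ones in this description, although which events appear depends on the JVM version and workload.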
Some of the types of analysis that are possible with the histogram include:
I've done some testing locally to try to better understand the series that are produced using the attributes described above when different garbage collectors are used. The list isn't exhaustive as my test setup surely didn't exercise all the things that can trigger different types of gc.
G1 Young Generation, action:end of minor GC, cause:G1 Evacuation Pause
G1 Young Generation, action:end of minor GC, cause:G1 Prevention Collection
G1 Young Generation, action:end of minor GC, cause:Metadata GC Threshold
G1 Old Generation, action:end of major GC, cause:System.gc()
-Xlog:gc -XX:+UseSerialGC
Copy, action:end of minor GC, cause:Allocation Failure
MarkSweepCompact, action:end of major GC, cause:Metadata GC Threshold
MarkSweepCompact, action:end of major GC, cause:System.gc()
-Xlog:gc -XX:+UseParallelGC
PS Scavenge, action:end of minor GC, cause:Allocation Failure
PS Scavenge, action:end of minor GC, cause:Metadata GC Threshold
PS Scavenge, action:end of minor GC, cause:System.gc()
PS MarkSweep, action:end of major GC, cause:Metadata GC Threshold
PS MarkSweep, action:end of major GC, cause:System.gc()
-Xlog:gc -XX:+UseZGC
ZGC Cycles, action:end of GC cycle, cause:Proactive
ZGC Cycles, action:end of GC cycle, cause:Warmup
ZGC Cycles, action:end of GC cycle, cause:System.gc()
ZGC Cycles, action:end of GC cycle, cause:Metadata GC Threshold
ZGC Pauses, action:end of GC pause, cause:Metadata GC Threshold
ZGC Pauses, action:end of GC pause, cause:Proactive
ZGC Pauses, action:end of GC pause, cause:System.gc()
ZGC Pauses, action:end of GC pause, cause:Warmup
-Xlog:gc -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
Shenandoah Cycles, action:end of GC cycle, cause:Concurrent GC
Shenandoah Cycles, action:end of GC cycle, cause:System.gc()
Shenandoah Pauses, action:end of GC pause, cause:Concurrent GC
Shenandoah Pauses, action:end of GC pause, cause:System.gc()
-XX:+PrintGCDetails -XX:+UseParNewGC
ParNew, action:end of minor GC, cause:Allocation Failure
MarkSweepCompact, action:end of major GC, cause:System.gc
MarkSweepCompact, action:end of major GC, cause:Metadata GC Threshold
I've punted on a couple of things that could provide additional value, but IMO aren't needed initially:
Will hold off on a full PR until we reach some sort of consensus. Let me know what you think!