Otel Java Agent Causing Heap Memory Leak Issue #12303

Open
vanilla-sundae opened this issue Sep 20, 2024 · 2 comments
Labels
bug (Something isn't working), needs triage (New issue that requires triage)

@vanilla-sundae

Describe the bug

Context

My service uses the OTel Java agent published by this library, https://github.com/aws-observability/aws-otel-java-instrumentation, with the annotations @WithSpan and @SpanAttribute (https://opentelemetry.io/docs/zero-code/java/agent/annotations/) in the code to get traces for our requests.

Problem Statement

The OTel Java agent was set up correctly, and there was no memory issue with the initial setup. However, after we added the @WithSpan and @SpanAttribute annotations to the service code, we started to see a periodic memory increase (the JVM metric HeapMemoryAfterGCUse climbed to almost 100%) with a large number of OTel objects created on the heap, and we have to bounce our hosts to mitigate it.

The OTel objects we saw are mainly io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap$WeakKey and io.opentelemetry.javaagent.bootstrap.executors.PropagatedContext, along with the JDK objects java.util.concurrent.ConcurrentHashMap$Node and java.lang.ref.WeakReference.

We added @WithSpan to methods executed by child threads and virtual threads; we are not sure whether that is a concern, but we are able to view traces for these methods correctly.

Here's our heap dump result:

Histogram: (screenshot attached)

Memory Leak Suspect Report: (screenshots attached)

Ask

Can anyone help with this issue and let us know what the root cause could be?

Steps to reproduce

We set up the Java agent in our service's Dockerfile:

ADD https://github.com/aws-observability/aws-otel-java-instrumentation/releases/latest/download/aws-opentelemetry-agent.jar /opt/aws-opentelemetry-agent.jar
RUN chmod 644 /opt/aws-opentelemetry-agent.jar
ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/aws-opentelemetry-agent.jar"
ENV OTEL_RESOURCE_ATTRIBUTES="service.name=XXX,service.namespace=XXX"
ENV OTEL_PROPAGATORS="tracecontext,baggage,xray"
ENV OTEL_TRACES_SAMPLER="traceidratio"
ENV OTEL_TRACES_SAMPLER_ARG="0.00001"
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
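
For reference, the agent's own debug logging can also be enabled while reproducing this; we believe the upstream agent's otel.javaagent.debug switch applies to the ADOT build as well (a diagnostic aid only, not part of the normal setup):

ENV OTEL_JAVAAGENT_DEBUG="true"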

And we add @WithSpan to methods and @SpanAttribute to one of the arguments:

@WithSpan
public void myMethod(@SpanAttribute SomeClass someObject) {
      // ...
}
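
Roughly, the way these annotated methods are invoked looks like the sketch below. It is simplified and illustrative: the real argument type is one of our own classes (replaced here with String), and the loop just stands in for our request volume.

import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MyWorker {

    @WithSpan
    public void myMethod(@SpanAttribute("someObject") String someObject) {
        // business logic elided
    }

    public static void main(String[] args) {
        MyWorker worker = new MyWorker();
        // Tasks run on virtual threads; the agent's executor instrumentation
        // propagates the current context into each submitted task, which is
        // where the PropagatedContext objects we see appear to come from.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1_000_000; i++) {
                String payload = "request-" + i;
                executor.submit(() -> worker.myMethod(payload));
            }
        }
    }
}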

Expected behavior

No or minimal impact on heap memory usage.

Actual behavior

Heap memory usage after GC increases to almost 100% if we don't bounce the hosts.

Javaagent or library instrumentation version

v1.32.3

Environment

JDK: JDK21
OS: Linux x86_64

Additional context

No response

@vanilla-sundae added the bug and needs triage labels on Sep 20, 2024
@leon4652 commented Sep 20, 2024

Hello,

I have experienced a similar issue and have been troubleshooting it myself. I’ve compiled my resolution experience in the hope that it may be helpful. Like you, I encountered the problem where memory was continuously allocated to the Old Gen during batch processing and was not being released.

Situation

In my case, I collected additional Spans and Attributes through the Agent’s Extension, aside from the Spans automatically instrumented by the Java Agent. As data collection occurred, memory leakage gradually increased, and after the server went into production, most of the allocated heap was occupied by the Old Gen, causing the server to crash with an OutOfMemoryError (OOM). Although it seemed like the issue was resolved temporarily by garbage collection (GC), newly generated Spans were immediately allocated to the Old Gen.

Cause

The cause and solution I identified are as follows (the issue was resolved in my case, but the accuracy is not guaranteed, so you should verify it yourself). If you observe a continuous increase in the Old Gen in the metrics you are collecting, it might be a similar case to mine.

  1. The size of the 'Span' objects collected by the SpanProcessor had grown significantly, and the Spans stored between batch cycles were causing the issue.
  2. The JVM’s Eden Space was unable to accommodate these objects, leading to early promotion, which resulted in memory leakage.
  3. This issue likely arose because of the large size of the collected Spans. (In my case, when converted to JSON, each Span was approximately 100KB or larger.) The memory allocated to Eden was insufficient to handle this, causing premature promotion to the Old Gen.
  4. I recommend reviewing the data collection associated with this, especially if you are using @WithSpan.

I hope this helps you resolve your issue quickly.
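
For what it's worth, here is a sketch of the standard SDK settings that bound span size and the batch queue. I am assuming the usual OpenTelemetry autoconfiguration environment variables apply to this agent; the values shown are the documented defaults except for the attribute value length limit, which is unlimited by default, so treat them as illustrative rather than recommendations:

OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT=128     # maximum attributes per span
OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT=512   # truncate long attribute values (unlimited by default)
OTEL_BSP_MAX_QUEUE_SIZE=2048            # spans buffered by the BatchSpanProcessor between exports
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512      # spans per export batch
OTEL_BSP_SCHEDULE_DELAY=5000            # export interval in milliseconds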

@laurit (Contributor) commented Sep 20, 2024

@vanilla-sundae thanks for reporting; unfortunately, the information provided is not enough to understand and fix the issue. You should examine the heap dump and try to answer the following:

  • which map it is that is large
  • what kind of keys it has
  • why there are so many of them
  • whether the keys can be collected
  • if not, what keeps them alive

If you are not able to analyze the heap dump yourself, you could turn to your OTel vendor and see if they can help you. Alternatively, if you can provide a minimal application that reproduces the issue, that would also be great.
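
As a hedged starting point for that analysis (the pid and paths are placeholders; this assumes the JDK command-line tools on the host and Eclipse MAT for offline inspection):

jcmd <pid> GC.heap_dump /tmp/heap.hprof   # capture a full heap dump for offline analysis
jmap -histo:live <pid> | head -n 30       # quick class histogram after forcing a full GC

In Eclipse MAT, open the histogram, select the AbstractWeakConcurrentMap$WeakKey (or PropagatedContext) entries, and run "Merge Shortest Paths to GC Roots" excluding weak/soft references to see which map retains them and what keeps the keys themselves reachable.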
