[Performance] Observing higher memory spikes in C++ when running multiple Inference `Run()` executions on CPU #22920

martinkorelic · 2024-11-21T19:52:54Z

Describe the issue

Description:

I am observing high memory spikes after each inference run when I pass the outputs of one run as inputs to the next. This happens when the input values change on each iteration of the generation loop. Memory usage increases significantly after every Run() invocation, and the allocations grow larger each time. In contrast, the Python version of my code does not exhibit these spikes; its memory usage grows roughly linearly. This suggests a memory-management issue in the C++ version, possibly tied to the memory arena, memory patterns, or session configuration.

Environment:

ONNX Runtime API version: 18
Execution Provider: ONNX Runtime CPU Execution Provider
Platform: Android (debugging in Android environment)
Language: C++

What I've Tried:

Memory Arena & Arena Configuration:
- Disabled memory arena and memory patterns (but the issue persists).
- Created ArenaCfg and registered a custom allocator, but this did not solve the problem.
Session Options:
- Set inter_op_num_threads and intra_op_num_threads to 1.
- Used session.use_device_allocator_for_initializers = 1 in session options.
- Attempted to enable memory arena shrinkage on cpu:0 for session run, but the high memory allocation continues.
Input & Output Memory Size:
- I’ve checked the input and output tensors and confirmed that they are not taking up excessive space.
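For reference, the settings listed above might look roughly like this in the C++ API. This is only a sketch under assumptions: the issue does not show the actual code, so the helper names `make_session_options` and `make_run_options` are hypothetical, and the custom ArenaCfg allocator is left out. It needs the ONNX Runtime headers to build.

```cpp
#include <onnxruntime_cxx_api.h>

// Hypothetical helper: session-level settings from the list above.
Ort::SessionOptions make_session_options() {
  Ort::SessionOptions opts;
  opts.SetIntraOpNumThreads(1);  // single intra-op thread
  opts.SetInterOpNumThreads(1);  // single inter-op thread
  opts.DisableCpuMemArena();     // turn off the CPU memory arena
  opts.DisableMemPattern();      // turn off memory pattern planning
  // Allocate initializers with the device allocator instead of the arena.
  opts.AddConfigEntry("session.use_device_allocator_for_initializers", "1");
  return opts;
}

// Hypothetical helper: per-run request to shrink the CPU arena after Run().
// Shrinkage is only expected to matter while the arena is enabled.
Ort::RunOptions make_run_options() {
  Ort::RunOptions run_opts;
  run_opts.AddConfigEntry("memory.enable_memory_arena_shrinkage", "cpu:0");
  return run_opts;
}
```

None of these settings change who owns the tensors returned by Run(), so they would not be expected to help if the growth comes from outputs being kept alive in application code.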

Additional Information:

I have tried several configurations and approaches, including modifying session options and attempting to handle memory release explicitly, but the issue remains unresolved.

Question:

How can I better manage memory usage during inference sessions when using dynamically changing inputs with the ONNX Runtime C++ API? Are there specific settings or techniques for reducing memory spikes that I may have missed? The same problem does not occur in the Python version.

To reproduce

Call InferenceSession.Run() multiple times in a loop, for example with dynamically increasing input shapes. The model is a quantized ONNX model exported from torch.

Observations:

The inputs and outputs of the inference session themselves are not excessively large. The high memory allocation seems to occur after each Run().

Expected Behavior:

Memory usage should remain roughly stable across inference runs, with at most a linear increase after each Run().

Actual Behavior:

Memory usage increases significantly after every Run() invocation. The allocated memory grows larger with each iteration, and the growth is not linear.

Urgency

No response

Platform

Android

OS Version

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

ONNX Runtime 18

ONNX Runtime API

C++

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes


skottmckay · 2024-11-21T22:56:29Z

Are you sure you're managing memory correctly in your code? You mention passing outputs from one run to be inputs to the next. How is that memory freed once used as inputs?

We do this kind of execution for LLMs on Android without seeing constant memory growth.

Python will automatically reference count for you and handle the memory. I believe the Java OnnxTensor needs close() to be called.

onnxruntime/java/src/main/java/ai/onnxruntime/OnnxTensor.java

Lines 273 to 278 in f6e1d44

    /**
     * Closes the tensor, releasing its underlying memory (if it's not backed by an NIO buffer). If it
     * is backed by a buffer then the memory is released when the buffer is GC'd.
     */
    @Override
    public synchronized void close() {

martinkorelic · 2024-11-22T09:37:24Z

@skottmckay
I double-checked what happens with the outputs, and it seems I was not taking ownership of them, which I think made them accumulate over time. I was following a tutorial where the outputs were extracted like so:

float *output = output_values.front().GetTensorMutableData<float>();

and then returned. It seems I have fixed the issue by taking ownership myself, like so:

std::unique_ptr<Ort::Value> output = std::make_unique<Ort::Value>(std::move(output_values.front()));

which solved the "memory leak" issue. I was also clearing the outputs and inputs vectors with .clear(), which I think is an appropriate way to release their memory.
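The ownership fix described above can be illustrated without ONNX Runtime. The sketch below is not the code from this issue: `FakeTensor`, `fake_run`, and `generate` are hypothetical stand-ins for a move-only tensor handle such as Ort::Value, for Session::Run(), and for the generation loop. It assumes the growth came from each run's outputs staying alive while a copy was fed to the next run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical move-only tensor that owns its buffer and counts live buffers.
class FakeTensor {
 public:
  explicit FakeTensor(std::size_t n)
      : data_(std::make_unique<float[]>(n)), size_(n) { ++counter(); }
  FakeTensor(FakeTensor&& o) noexcept
      : data_(std::move(o.data_)), size_(o.size_) { o.size_ = 0; }
  FakeTensor& operator=(FakeTensor&& o) noexcept {
    if (this != &o) {
      if (data_) --counter();  // the buffer we held is released by the move
      data_ = std::move(o.data_);
      size_ = o.size_;
      o.size_ = 0;
    }
    return *this;
  }
  FakeTensor(const FakeTensor&) = delete;
  FakeTensor& operator=(const FakeTensor&) = delete;
  ~FakeTensor() { if (data_) --counter(); }

  float* data() { return data_.get(); }
  std::size_t size() const { return size_; }
  static int live() { return counter(); }

 private:
  static int& counter() { static int c = 0; return c; }
  std::unique_ptr<float[]> data_;
  std::size_t size_;
};

// Stand-in for Run(): returns one output that is one element longer.
std::vector<FakeTensor> fake_run(const std::vector<FakeTensor>& inputs) {
  assert(!inputs.empty());
  std::vector<FakeTensor> outputs;
  outputs.emplace_back(inputs.front().size() + 1);
  return outputs;
}

// Runs `steps` iterations and returns how many buffers are alive at the end.
int generate(int steps, bool keep_outputs) {
  std::vector<FakeTensor> inputs;
  inputs.emplace_back(1);
  std::vector<std::vector<FakeTensor>> kept;  // used only when keep_outputs
  for (int i = 0; i < steps; ++i) {
    std::vector<FakeTensor> outputs = fake_run(inputs);
    if (keep_outputs) {
      // Copy through the raw pointer and keep the owning outputs around.
      FakeTensor next(outputs.front().size());
      std::copy(outputs.front().data(),
                outputs.front().data() + outputs.front().size(), next.data());
      inputs.clear();
      inputs.push_back(std::move(next));
      kept.push_back(std::move(outputs));
    } else {
      // Hand the output buffer itself to the next iteration.
      inputs.clear();
      inputs.push_back(std::move(outputs.front()));
    }
  }
  return FakeTensor::live();
}
```

With five steps, `generate(5, false)` ends with a single live buffer, while `generate(5, true)` ends with six: one per retained output plus the current input. Moving the output also avoids the copy through the raw pointer.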

One thing I noticed: after a few iterations with the CPU memory arena and memory pattern disabled, I got a segmentation fault when extracting this output. Re-enabling them avoided the crash. I am not sure why exactly this happens; maybe with the CPU memory arena disabled there was eventually not enough memory allocated for the outputs. If you have an explanation for that, it would be nice to hear; otherwise I think we can close this issue for now.

martinkorelic added the performance issues related to performance regressions label Nov 21, 2024

github-actions bot added the platform:mobile issues related to ONNX Runtime mobile; typically submitted using template label Nov 21, 2024
