Memory leak in FileInputPlugin with handling of thousands of small files #21
Replies: 5 comments 1 reply
-
@gtnao0219 Thanks for your investigation and report! Yeah, it has been a known unsolved issue, but it hardly reproduced. Therefore, we were not able to decide how it should be handled -- "something that should be handled on the Plugin side", or an issue to be fixed on the core side.
I've not dived deeply into your report yet, but it seems like a very informative investigation. I believe it would help to make decisions, and to fix the problems finally. I'll have a deeper look next week. Thanks again! |
Beta Was this translation helpful? Give feedback.
-
@dmikurube, thank you for your response! I have prepared a repository that can reproduce the issue: https://github.com/gtnao0219/embulk-memory-leak-debug. I understand you're likely quite busy, so please feel free to respond at your convenience. |
Beta Was this translation helpful? Give feedback.
-
Sorry to be late. Looked deeper into the investitation. Your assumption seems correct to me. Yeah, plugins seem to be expected to follow conventions about releasing It'd be a historical problem which came before I took over the maintainance of Embulk. I have no conclusions nor decisions so far then, but I expect we have to rebuild or re-define an explicit convention for plugins, and probably to change some parts of the Embulk core to support the convention. I guess this would be a long-term effort, but we'll have to do it. Anyway, thanks! It's a good progress to identify this. I may ask you for another help when we investigate the issue much deeper. |
Beta Was this translation helpful? Give feedback.
-
Thanks for reading my assumption deeply. I understood that it is probably a problem on the plugin side as I guessed and additional convention is needed to solve the problem fundamentally.
I'm honored. Please let me know in that case. Thanks again for your time! |
Beta Was this translation helpful? Give feedback.
-
I am once again working on this memory leak issue as part of the Embulk core team. Initially, I am investigating the hypothesis previously proposed that Output/FilterPlugins should close its PageReader at the end. However, looking at the heap dump, there are many of the objects whose root of GC is TaskReport. So, I tried again commenting out the part of Embulk core's codes that generates TaskReport, rebuilt it, and run. Next, I will investigate whether it is appropriate to inform a convention on plugin developers that Output/FilterPlugins should close its PageReader. I'll also check whether this doesn't cause any degradations. |
Beta Was this translation helpful? Give feedback.
-
Hi Embulk Core Team,
I've encountered a memory leak issue in a FileInputPlugin-based plugin handling thousands of small files.
It appears that the implementation of FilterPlugin and OutputPlugin can sometimes lead to the netty ByteBuf not being released.
I've seen this kind of problem in previous issues, so sorry if this is a familiar topic.
While it seems like something that should be handled on the Plugin side, I'm curious about the Embulk core team's stance on this.
Here's my understanding of Embulk core and plugins.
Data is exchanged between adjacent plugin tasks via Pages.
While it's not a strict requirement, the common practice among plugins is to use PageBuilder for creating Pages and PageReader for reading them.
The buffer that seems to be causing the leak is allocated by PageBuilder and passed on to next plugin through Pages.
The buffer should be released by either PageBuilder or PageReader.
Following are the code parts where release is called in PageBuilder and PageReader.
PageBuilder
https://github.com/embulk/embulk/blob/8e88e2ee009f5547db7e1558aa4267312d4e97d6/embulk-core/src/main/java/org/embulk/spi/PageBuilderImpl.java#L261-L263
PageReader
https://github.com/embulk/embulk/blob/8e88e2ee009f5547db7e1558aa4267312d4e97d6/embulk-core/src/main/java/org/embulk/spi/PageReaderImpl.java#L35-L36
PageReader
https://github.com/embulk/embulk/blob/8e88e2ee009f5547db7e1558aa4267312d4e97d6/embulk-core/src/main/java/org/embulk/spi/PageReaderImpl.java#L233-L234
Depending on the plugin, finish is usually called before close in PageBuilder at the end of a task.
As a result, the buffer variable becomes null, and the buffer.release() in PageBuilder often doesn't get called.
So, it becomes the responsibility of PageReader (the next plugin) to release the buffer.
However, when I looked at many filter and output plugins, I noticed that very few call close on PageReader at the end of a task.
If there are subsequent Pages, the buffer is released when setPage is called.
But without close call, the buffer for the last Page remains unreleased.
In my case, there were a lot of input files.
Each file was small, but the 32KB buffer for each last Page remained unreleased per file.
Additionally, the more filter plugins that don't call close I insert, the more buffers remain unreleased, increasing with each filter plugin.
Does the core team recommend calling close on PageReader when using it in plugins?
I'm asking because many plugins aren't doing this, I'm curious if my assumption is correct.
Thanks for taking the time to read my message. Looking forward to your response.
Beta Was this translation helpful? Give feedback.
All reactions