
Batch writes in the file sink #20784

Open
jszwedko opened this issue Jul 3, 2024 · 4 comments
Labels
sink: file Anything `file` sink related type: feature A value-adding code addition that introduces new functionality.

Comments

@jszwedko
Member

jszwedko commented Jul 3, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

The file sink's performance, especially when compressing, appears to be particularly poor, as evidenced by the investigation in #20739. A likely culprit is that the file sink writes (and compresses) one event at a time. I think throughput would improve by batching writes.

Attempted Solutions

No response

Proposal

Add batch configuration to the file sink and batch writes.
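As a hypothetical sketch of what this could look like, the keys below borrow the `batch.max_events` / `batch.timeout_secs` options that other Vector sinks already expose; the file sink does not accept a `batch` table today, and the sink names and path are made up for illustration:

```toml
# Hypothetical config: the file sink currently has no batch settings.
# The [sinks.out.batch] table below mirrors the batch options found on
# other Vector sinks (e.g. the http sink).
[sinks.out]
type = "file"
inputs = ["in"]
path = "/var/log/vector/out-%Y-%m-%d.log"
compression = "gzip"

[sinks.out.batch]
max_events = 1000   # flush after this many events have accumulated
timeout_secs = 1    # or after this long, whichever comes first
```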

References

Version

v0.39.0

@jszwedko jszwedko added sink: file Anything `file` sink related type: feature A value-adding code addition that introduces new functionality. labels Jul 3, 2024
@johnhtodd

It would seem that one event at a time with compression would lower the symbol set massively, leading to very poor compression values. Isn't this what the "buffer" keyword implies? My expectation is that the number of events indicated by "buffer" would be put into a queue, and then compressed/written when that number was exceeded. (This begs the question of "why not a time- or bytes-based buffer flush setting as well?")

@jszwedko
Member Author

jszwedko commented Jul 3, 2024

> It would seem that one event at a time with compression would lower the symbol set massively, leading to very poor compression values. Isn't this what the "buffer" keyword implies? My expectation is that the number of events indicated by "buffer" would be put into a queue, and then compressed/written when that number was exceeded. (This begs the question of "why not a time- or bytes-based buffer flush setting as well?")

The buffer actually controls the buffer of input events to the sink. I think what you are referring to would commonly be controlled by the batch settings on the sink, but the file sink has no batch settings. There is a diagram that tries to distinguish between how the buffer is independent of sink batching here: https://vector.dev/docs/reference/configuration/sinks/http/#buffers-and-batches. It's admittedly a bit confusing.

@johnhtodd

Thanks for the clarification @jszwedko - I didn't quite understand those differences and that is useful for other things I'm working on.

So what you're suggesting is actually adding "batch" parameters to the "file" sink? Does compression happen after buffering, or after batching, and is that consistent in all the sinks?

@jszwedko
Member Author

jszwedko commented Jul 3, 2024

> So what you're suggesting is actually adding "batch" parameters to the "file" sink? Does compression happen after buffering, or after batching, and is that consistent in all the sinks?

Yeah, that's correct, I'd like to add batch parameters to the file sink to batch the writes. I believe compression happens after batching. I'm pretty sure that's consistent across sinks. It definitely doesn't happen as part of buffering.
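To illustrate why compressing one event at a time produces poor ratios, here is a small Python sketch (not Vector code; the event shape is invented for illustration). It compares gzip applied to each event individually against gzip applied to the whole batch: per-event compression pays the gzip header overhead on every event and cannot share repeated substrings across events, while one stream over the batch can.

```python
import gzip

# Invented log events: repetitive, like real structured logs.
events = [
    f'{{"ts": {i}, "msg": "request handled", "status": 200}}\n'.encode()
    for i in range(1000)
]

# Per-event: each call emits a full gzip member (header + trailer) and
# builds its dictionary from one tiny event, so cross-event redundancy
# is never exploited.
per_event = sum(len(gzip.compress(e)) for e in events)

# Batched: a single gzip stream over the concatenated events encodes
# repeated substrings across events once.
batched = len(gzip.compress(b"".join(events)))

print(f"per-event: {per_event} bytes, batched: {batched} bytes")
```

Running this shows the batched output is a small fraction of the per-event total, which matches the poor compression observed when the sink compresses events individually.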
