
fix(s3): reducing array copying on S3 output stream #549

Merged: 5 commits merged into main from jeqo/s3-memleak on Aug 6, 2024

Conversation

@jeqo (Contributor) commented on May 22, 2024:

Use only ByteBuffer slices to pass bytes to S3 client operations and remove array copying.

Benchmarking the current implementation shows that Arrays.copyOfRange dominates memory allocation:

[allocation profile screenshot: Arrays.copyOfRange dominating allocations]

By switching to the ByteBuffer approach, this copying is removed:

[allocation profile screenshot: the copying no longer appears]

@jeqo force-pushed the jeqo/s3-memleak branch 2 times, most recently from d7f6530 to 54ae8de on May 23, 2024
@jeqo changed the title from "jeqo/s3 memleak" to "fix(s3): close upload part InputStream" on May 23, 2024
@jeqo marked this pull request as ready for review on May 23, 2024
@jeqo requested a review from a team as a code owner on May 23, 2024
@funky-eyes (Contributor) left a review comment:

LGTM

@funky-eyes (Contributor) commented:

However, in Java, the close() method of ByteArrayInputStream has no effect: the methods of this class can be called after the stream has been closed without generating an IOException. The data of a ByteArrayInputStream is stored in memory, unlike file or network streams that require actual resource cleanup, so the missing close() may not actually be the source of an out-of-memory problem.
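For illustration, a minimal stand-alone example of that behavior (not from the PR):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class CloseNoOpDemo {
    public static void main(final String[] args) throws IOException {
        final ByteArrayInputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3});
        in.close(); // documented no-op: the stream holds no external resources
        System.out.println(in.read()); // prints 1; no IOException after close()
    }
}
```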

@jeqo (Author) commented on May 24, 2024:

@funky-eyes good catch! I'm adding more changes on how bytes are passed to the request, using byte buffers only instead of array copying. PTAL

@jeqo changed the title from "fix(s3): close upload part InputStream" to "fix(s3): reducing array copying on S3 output stream" on May 26, 2024
@funky-eyes (Contributor) left a review comment:

LGTM

Comment on lines -100 to +104:

```diff
-            // TODO: get rid of this array copying
-            partBuffer.put(source.array(), offset, transferred);
-            processedBytes += transferred;
-            source.position(source.position() + transferred);
+            final ByteBuffer inputBuffer = ByteBuffer.wrap(b, off, len);
+            while (inputBuffer.hasRemaining()) {
+                // copy batch to part buffer
+                final int inputLimit = inputBuffer.limit();
+                final int toCopy = Math.min(partBuffer.remaining(), inputBuffer.remaining());
+                final int positionAfterCopying = inputBuffer.position() + toCopy;
+                inputBuffer.limit(positionAfterCopying);
+                partBuffer.put(inputBuffer.slice());
```
@jeqo (Author) commented on the diff:

Main change: remove array copying, and reuse the input buffer.
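The hunk above is truncated; a self-contained sketch of the slice-based copy, with a hypothetical flushBuffer() standing in for the part upload, could look like this:

```java
import java.nio.ByteBuffer;

// Sketch only: copies b[off, off+len) into partBuffer via slices instead of
// Arrays.copyOfRange; names beyond those in the diff are hypothetical.
class PartBufferingSketch {
    private final ByteBuffer partBuffer = ByteBuffer.allocate(5 * 1024 * 1024);

    void write(final byte[] b, final int off, final int len) {
        final ByteBuffer inputBuffer = ByteBuffer.wrap(b, off, len);
        while (inputBuffer.hasRemaining()) {
            final int inputLimit = inputBuffer.limit();
            final int toCopy = Math.min(partBuffer.remaining(), inputBuffer.remaining());
            final int positionAfterCopying = inputBuffer.position() + toCopy;
            inputBuffer.limit(positionAfterCopying);    // bound the slice to toCopy bytes
            partBuffer.put(inputBuffer.slice());        // copy without an intermediate array
            inputBuffer.position(positionAfterCopying); // advance past the copied bytes
            inputBuffer.limit(inputLimit);              // restore the original limit
            if (!partBuffer.hasRemaining()) {
                flushBuffer();                          // upload the full part, then reuse it
            }
        }
    }

    private void flushBuffer() {
        partBuffer.flip();
        // ... upload partBuffer contents as one multipart part ...
        partBuffer.clear();
    }
}
```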

Comment on lines -172 to +188:

```diff
-        final ByteArrayInputStream in = new ByteArrayInputStream(partBuffer.array(), offset, actualPartSize);
-        uploadPart(in, actualPartSize);
-        partBuffer.clear();
+        try (final InputStream in = new ByteBufferInputStream(buffer)) {
+            processedBytes += actualPartSize;
+            uploadPart(in, actualPartSize);
+        } catch (final IOException e) {
+            throw new RuntimeException(e);
+        }
```
@jeqo (Author) commented on the diff:

Also, provide an InputStream directly from the buffer instead of creating a new byte array.
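For context, an InputStream view over a ByteBuffer can be as small as the following sketch; the PR's actual ByteBufferInputStream may differ:

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

// Minimal InputStream over a ByteBuffer: reads consume the buffer directly,
// so no backing array is copied (sketch; not necessarily the PR's class).
class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buffer;

    ByteBufferInputStream(final ByteBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public int read() {
        return buffer.hasRemaining() ? buffer.get() & 0xFF : -1;
    }

    @Override
    public int read(final byte[] b, final int off, final int len) {
        if (len == 0) {
            return 0;
        }
        if (!buffer.hasRemaining()) {
            return -1;
        }
        final int toRead = Math.min(len, buffer.remaining());
        buffer.get(b, off, toRead);
        return toRead;
    }
}
```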

@jeqo (Author) commented on Aug 2, 2024:

For reference, similar improvements have been implemented in Kafka core: apache/kafka#15589

```diff
@@ -50,8 +50,8 @@ public class S3StorageTest extends BaseStorageTest {

     @BeforeAll
     static void setUpClass() {
-        final var clientBuilder = S3Client.builder();
-        clientBuilder.region(Region.of(LOCALSTACK.getRegion()))
+        s3Client = S3Client.builder()
```
A Contributor commented on the diff:

nit: it probably doesn't matter for a test, but I think that s3Client needs to be closed in an @AfterAll method to clean up eventual connections/resources.
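A minimal sketch of the suggested cleanup, assuming the s3Client field from the diff above:

```java
import org.junit.jupiter.api.AfterAll;

// Sketch: AWS SDK v2 clients implement SdkAutoCloseable, and close()
// releases pooled HTTP connections.
@AfterAll
static void tearDownClass() {
    if (s3Client != null) {
        s3Client.close();
    }
}
```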

Comment on lines 57 to 64:

```diff
-        try (final var out = s3OutputStream(key)) {
+        final var out = s3OutputStream(key);
+        try (out) {
             inputStream.transferTo(out);
-            return out.processedBytes();
         } catch (final IOException e) {
             throw new StorageBackendException("Failed to upload " + key, e);
         }
+        return out.processedBytes();
```
A Contributor commented:

I do not understand why you did this 😅

@jeqo (Author) replied:

Added a comment to clarify a bit, but the main reason this is now outside the try-with-resources is that processedBytes in the output stream is now counted as part of flushBuffer rather than in the write method. This is more accurate, as bytes are counted after upload instead of when written. Before, the processed-byte count was known before closing; now, since the last flushBuffer may occur when closing the stream, the result is not known until after close, hence the change.
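A schematic sketch of that ordering (hypothetical names, not the PR's code):

```java
import java.io.OutputStream;

// Bytes are counted in flushBuffer(), and the final flushBuffer() happens in
// close(), so processedBytes() is only accurate after the stream is closed.
class CountingOutputStream extends OutputStream {
    private long processedBytes = 0;
    private int buffered = 0;

    @Override
    public void write(final int b) {
        buffered++; // buffered, not yet counted as processed
    }

    @Override
    public void close() {
        flushBuffer(); // the last (possibly partial) part is flushed here
    }

    private void flushBuffer() {
        // ... upload the buffered part ...
        processedBytes += buffered; // counted only after upload
        buffered = 0;
    }

    long processedBytes() {
        return processedBytes;
    }
}
```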

Follow-up commits:
- writeAllMessages is not validating the request bodies properly, as writesTailMessages does.
- Define 2 buffers: 1 for input read data and the other for part-size upload. Slice them to pass data to IO (part building and part upload).
@biggusdonzus (Contributor) left a review comment:

looks good 👍

@biggusdonzus merged commit 23792ed into main on Aug 6, 2024 (9 checks passed).
@biggusdonzus deleted the jeqo/s3-memleak branch on August 6, 2024.