Mem limiter (#368)

Co-authored-by: Michael Graeb <[email protected]>

DmitriyMusatkin and graebm authored Nov 21, 2023
1 parent 5fe1c3b commit f961971

Showing 19 changed files with 1,152 additions and 78 deletions.
76 changes: 76 additions & 0 deletions docs/memory_aware_request_execution.md
@@ -0,0 +1,76 @@
The CRT S3 client was designed with throughput as the primary goal. As such, the
client scales resource usage, such as the number of parallel requests in flight,
to achieve the target throughput. The client creates buffers to hold the data it
is sending or receiving for each request, so scaling requests in flight has a
direct impact on memory used. In practice, setting a high target throughput or a
larger part size can lead to high observed memory usage.

To mitigate high memory usage, memory reuse improvements were recently added to
the client, along with options to limit the maximum memory used. The following
sections go into more detail on those changes and how they affect the client.

### Memory Reuse
At a basic level, the CRT S3 client starts with a meta request for an operation
like put or get, breaks it into smaller part-sized requests, and executes those
in parallel. The client used to allocate a part-sized buffer for each of those
requests and release it right after the request was done. That approach resulted
in a lot of very short-lived allocations and allocator thrashing, overall
leading to memory use spikes considerably higher than what is needed. To address
that, the client is switching to a pooled buffer approach, discussed below.

Note: the approach described below is a work in progress and concentrates on
improving the common cases (the default 8mb part size and part sizes smaller
than 64mb).

Several observations about the client's usage of buffers:
- The client does not automatically switch to buffers above the default 8mb for
  uploads until the upload passes 10,000 parts (~80 GB).
- Get operations always use either the configured part size or the default of
  8mb. The part size for gets is not adjusted, since there is no 10,000-part
  limitation.
- Both put and get operations go through fill and drain phases. For example,
  for a put, the client first schedules a number of reads to 'fill' the buffers
  from the source, and as those reads complete, the buffers are sent over to
  the networking layer and 'drained'.
- Individual UploadPart or ranged get operations typically have a similar
  lifespan (with some caveats). In practice, part buffers are acquired/released
  in bulk at the same time.

The buffer pooling takes advantage of those allocation patterns and works as
follows. Memory is split into primary and secondary areas. The secondary area
is used for requests with a part size bigger than a predefined value (currently
4 times the part size); allocations from it go directly to the allocator and
are effectively the old way of doing things.

The primary memory area is split into blocks of a fixed size (16 times the part
size if one is configured, or 16 times 8mb otherwise). Blocks are allocated on
demand. Each block is logically subdivided into part-sized chunks. The pool
allocates and releases in chunk sizes only, and supports acquiring several
chunks (up to 4) at once, as in the sizing sketch below.
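
To make the sizing rules concrete, here is a minimal sketch of the arithmetic
described above; the helper names are hypothetical, not the client's internals:

```c
#include <stdbool.h>
#include <stddef.h>

#define MB ((size_t)1024 * 1024)

/* Primary blocks hold 16 part-sized chunks; part size defaults to 8mb. */
static size_t s_block_size(size_t part_size) {
    size_t chunk_size = part_size != 0 ? part_size : 8 * MB;
    return 16 * chunk_size;
}

/* Requests bigger than the predefined threshold (currently 4 times the part
 * size) bypass the primary area and go straight to the base allocator. */
static bool s_goes_to_secondary(size_t request_size, size_t part_size) {
    return request_size > 4 * part_size;
}
```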

Blocks are kept around while there are ongoing requests and are released
asynchronously when there is low pressure on memory.

### Scheduling
Running out of memory is a terminal condition within CRT, and in general it is
not practical to try to set an overall memory limit on all allocations, since
doing so dramatically increases the complexity of the code that deals with
cases where only part of the memory was allocated for a task.

Comparatively, the majority of memory usage within the S3 client comes from
buffers allocated for put/get parts. So to control memory usage, the client
concentrates on controlling the number of buffers allocated. Effectively, this
boils down to a back-pressure mechanism that limits the number of parts
scheduled as memory gets closer to the limit. Memory used for other resources,
e.g. HTTP connection data and various supporting structures, is not actively
controlled; instead, some memory is set aside out of the overall limit.

Overall, scheduling does best-effort memory limiting. At the time of
scheduling, the client reserves memory using the buffer pool's ticketing
mechanism. A buffer is acquired from the pool using the ticket as close to the
point of use as possible (this approach peaks at lower memory usage than
preallocating all memory upfront, because buffers cannot all be used right
away; e.g. reading from a file fills buffers more slowly than they are sent,
leading to a decent amount of buffer reuse). The reservation mechanism is
approximate and in some cases can lead to actual memory usage being higher once
tickets are redeemed. The client sets aside some memory to mitigate overflows
like that. A sketch of the flow follows.
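
As a rough sketch of that flow, using the buffer pool API this commit
introduces (declared in s3_buffer_pool.h below); the surrounding scheduling
logic here is simplified and hypothetical:

```c
#include <aws/s3/private/s3_buffer_pool.h>
#include <stdbool.h>

/* Simplified sketch; the real logic lives in the client's update/prepare
 * paths (see s3_auto_ranged_get.c further down). */
static bool s_try_schedule_part(struct aws_s3_buffer_pool *pool, size_t part_size) {
    /* Reserve at scheduling time. This only marks memory as spoken for;
     * nothing is allocated yet. */
    struct aws_s3_buffer_pool_ticket *ticket = aws_s3_buffer_pool_reserve(pool, part_size);
    if (ticket == NULL) {
        /* Limit hit: a reservation hold is now on the pool, so back off and
         * retry after in-flight parts release their buffers. */
        return false;
    }

    /* Later, as close to actual use as possible, trade the ticket for a
     * buffer. This call never fails, even if it goes over the limit. */
    struct aws_byte_buf buf = aws_s3_buffer_pool_acquire_buffer(pool, ticket);
    (void)buf; /* ... fill from the source or drain to the network ... */

    /* Buffer lifetime is tied to the ticket; releasing the ticket returns
     * the memory to the pool. */
    aws_s3_buffer_pool_release_ticket(pool, ticket);
    return true;
}
```
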
133 changes: 133 additions & 0 deletions include/aws/s3/private/s3_buffer_pool.h
@@ -0,0 +1,133 @@
#ifndef AWS_S3_BUFFER_ALLOCATOR_H
#define AWS_S3_BUFFER_ALLOCATOR_H

/**
* Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
* SPDX-License-Identifier: Apache-2.0.
*/

#include <aws/s3/s3.h>

/*
* S3 buffer pool.
* A buffer pool used for pooling part-sized buffers for Put/Get operations.
* Provides additional functionality for limiting overall memory used.
* High-level buffer pool usage flow:
* - Create a buffer pool with an overall memory limit and a common buffer size,
*   aka chunk size (typically the part size configured on the client).
* - For each request:
* -- Call reserve to acquire a ticket for future buffer acquisition. This will
*    mark memory reserved, but will not allocate it. If the reserve call hits
*    the memory limit, it fails and a reservation hold is put on the whole
*    buffer pool. (aws_s3_buffer_pool_remove_reservation_hold can be used to
*    remove the reservation hold.)
* -- Once a request needs memory, it can exchange its ticket for a buffer using
*    aws_s3_buffer_pool_acquire_buffer. This operation never fails, even if it
*    ends up going over the memory limit.
* -- Buffer lifetime is tied to the ticket, so once a request is done with the
*    buffer, the ticket is released and the buffer returns to the pool.
*/

AWS_EXTERN_C_BEGIN

struct aws_s3_buffer_pool;
struct aws_s3_buffer_pool_ticket;

struct aws_s3_buffer_pool_usage_stats {
/* Effective max memory limit. Memory limit value provided during construction
* minus memory reserved for overhead of the pool. */
size_t mem_limit;

/* How much mem is used in primary storage. Includes memory used by blocks
* that are waiting on all allocs to release before being put back in circulation. */
size_t primary_used;
/* Overall memory allocated for blocks. */
size_t primary_allocated;
/* Reserved memory. Does not account for how that memory will map into
* blocks and in practice can be lower than used memory. */
size_t primary_reserved;
/* Number of blocks allocated in primary. */
size_t primary_num_blocks;

/* Secondary mem used. Accurate, maps directly to base allocator. */
size_t secondary_used;
/* Secondary mem reserved. Accurate, maps directly to base allocator. */
size_t secondary_reserved;
};

/*
* Create new buffer pool.
* chunk_size - specifies the size of memory that will most commonly be acquired
* from the pool (typically part size).
* mem_limit - limit on how much memory the buffer pool can use. Once the limit
* is hit, buffers can no longer be reserved (a reservation hold is placed on the pool).
* Returns buffer pool pointer on success and NULL on failure.
*/
AWS_S3_API struct aws_s3_buffer_pool *aws_s3_buffer_pool_new(
struct aws_allocator *allocator,
size_t chunk_size,
size_t mem_limit);

/*
* Destroys buffer pool.
* Does nothing if buffer_pool is NULL.
*/
AWS_S3_API void aws_s3_buffer_pool_destroy(struct aws_s3_buffer_pool *buffer_pool);

/*
* Reserves memory from the pool for later use.
* Best effort and can potentially reserve memory slightly over the limit.
* Reservation takes some memory out of the available pool, but does not
* allocate it right away.
* On success a ticket is returned.
* On failure NULL is returned, an error is raised and a reservation hold is
* placed on the pool. Any further reservations while the hold is active will fail.
* Remove reservation hold to unblock reservations.
*/
AWS_S3_API struct aws_s3_buffer_pool_ticket *aws_s3_buffer_pool_reserve(
struct aws_s3_buffer_pool *buffer_pool,
size_t size);

/*
* Whether pool has a reservation hold.
*/
AWS_S3_API bool aws_s3_buffer_pool_has_reservation_hold(struct aws_s3_buffer_pool *buffer_pool);

/*
* Remove reservation hold on pool.
*/
AWS_S3_API void aws_s3_buffer_pool_remove_reservation_hold(struct aws_s3_buffer_pool *buffer_pool);

/*
* Trades in the ticket for a buffer.
* Cannot fail, and can over-allocate above the mem limit if the reservation was not accurate.
* Using the same ticket twice will return the same buffer.
* Buffer is only valid until the ticket is released.
*/
AWS_S3_API struct aws_byte_buf aws_s3_buffer_pool_acquire_buffer(
struct aws_s3_buffer_pool *buffer_pool,
struct aws_s3_buffer_pool_ticket *ticket);

/*
* Releases the ticket.
* Any buffers associated with the ticket are invalidated.
*/
AWS_S3_API void aws_s3_buffer_pool_release_ticket(
struct aws_s3_buffer_pool *buffer_pool,
struct aws_s3_buffer_pool_ticket *ticket);

/*
* Get pool memory usage stats.
*/
AWS_S3_API struct aws_s3_buffer_pool_usage_stats aws_s3_buffer_pool_get_usage(struct aws_s3_buffer_pool *buffer_pool);

/*
* Trims all unused mem from the pool.
* Warning: fairly slow operation, do not use in critical path.
* TODO: partial trimming? ex. only trim down to 50% of max?
*/
AWS_S3_API void aws_s3_buffer_pool_trim(struct aws_s3_buffer_pool *buffer_pool);

AWS_EXTERN_C_END

#endif /* AWS_S3_BUFFER_ALLOCATOR_H */
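
For the monitoring side, a short sketch built on the stats and trim calls
declared above; the idle-ratio policy is made up for illustration:

```c
#include <aws/s3/private/s3_buffer_pool.h>

/* Hypothetical policy: trim pooled-but-idle blocks when less than half of
 * the allocated primary memory is actually handed out. */
static void s_maybe_trim(struct aws_s3_buffer_pool *pool) {
    struct aws_s3_buffer_pool_usage_stats stats = aws_s3_buffer_pool_get_usage(pool);

    /* primary_allocated counts whole blocks; primary_used counts chunks in
     * use, so the difference is reusable-but-idle memory. */
    if (stats.primary_used * 2 < stats.primary_allocated) {
        aws_s3_buffer_pool_trim(pool); /* fairly slow; keep off the hot path */
    }
}
```
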
8 changes: 8 additions & 0 deletions include/aws/s3/private/s3_client_impl.h
@@ -196,6 +196,8 @@ struct aws_s3_upload_part_timeout_stats {
struct aws_s3_client {
struct aws_allocator *allocator;

struct aws_s3_buffer_pool *buffer_pool;

struct aws_s3_client_vtable *vtable;

struct aws_ref_count ref_count;
@@ -340,6 +342,9 @@
/* Task for processing requests from meta requests on connections. */
struct aws_task process_work_task;

/* Task for trimming buffer pool. */
struct aws_task trim_buffer_pool_task;

/* Number of endpoints currently allocated. Used during clean up to know how many endpoints are still in
* memory.*/
uint32_t num_endpoints_allocated;
@@ -378,6 +383,9 @@

/* Number of requests currently being prepared. */
uint32_t num_requests_being_prepared;

/* Whether or not a buffer pool trim is currently scheduled. */
uint32_t trim_buffer_pool_task_scheduled : 1;
} threaded_data;
};

9 changes: 8 additions & 1 deletion include/aws/s3/private/s3_request.h
@@ -12,6 +12,7 @@
#include <aws/common/thread.h>
#include <aws/s3/s3.h>

#include <aws/s3/private/s3_buffer_pool.h>
#include <aws/s3/private/s3_checksums.h>

struct aws_http_message;
@@ -22,6 +23,7 @@ enum aws_s3_request_flags {
AWS_S3_REQUEST_FLAG_RECORD_RESPONSE_HEADERS = 0x00000001,
AWS_S3_REQUEST_FLAG_PART_SIZE_RESPONSE_BODY = 0x00000002,
AWS_S3_REQUEST_FLAG_ALWAYS_SEND = 0x00000004,
AWS_S3_REQUEST_FLAG_PART_SIZE_REQUEST_BODY = 0x00000008,
};

/**
@@ -112,6 +114,8 @@ struct aws_s3_request {
* retried.*/
struct aws_byte_buf request_body;

struct aws_s3_buffer_pool_ticket *ticket;

/* Beginning range of this part. */
/* TODO currently only used by auto_range_get, could be hooked up to auto_range_put as well. */
uint64_t part_range_start;
@@ -184,7 +188,10 @@ struct aws_s3_request {
uint32_t record_response_headers : 1;

/* When true, the response body buffer will be allocated in the size of a part. */
uint32_t part_size_response_body : 1;
uint32_t has_part_size_response_body : 1;

/* When true, the request body buffer will be allocated in the size of a part. */
uint32_t has_part_size_request_body : 1;

/* When true, this request is being tracked by the client for limiting the amount of in-flight-requests/stats. */
uint32_t tracked_by_client : 1;
1 change: 1 addition & 0 deletions include/aws/s3/s3.h
@@ -41,6 +41,7 @@ enum aws_s3_errors {
AWS_ERROR_S3_INCORRECT_CONTENT_LENGTH,
AWS_ERROR_S3_REQUEST_TIME_TOO_SKEWED,
AWS_ERROR_S3_FILE_MODIFIED,
AWS_ERROR_S3_EXCEEDS_MEMORY_LIMIT,
AWS_ERROR_S3_END_RANGE = AWS_ERROR_ENUM_END_RANGE(AWS_C_S3_PACKAGE_ID)
};

3 changes: 3 additions & 0 deletions include/aws/s3/s3_client.h
@@ -344,6 +344,9 @@ struct aws_s3_client_config {
/* Throughput target in Gbps that we are trying to reach. */
double throughput_target_gbps;

/* Limit on how much memory the client is allowed to use. */
size_t memory_limit_in_bytes;

/* Retry strategy to use. If NULL, a default retry strategy will be used. */
struct aws_retry_strategy *retry_strategy;

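As a usage sketch, opting into the new limit from the client config might look
like the following; the creation call and the required fields omitted here
(bootstrap, region, signing) are assumptions kept out for brevity:

```c
#include <aws/s3/s3_client.h>

static struct aws_s3_client *s_make_capped_client(struct aws_allocator *allocator) {
    struct aws_s3_client_config config = {
        .throughput_target_gbps = 10.0,
        /* Cap buffer memory at 1 GiB; assumption: leaving this 0 keeps the
         * client's default behavior. */
        .memory_limit_in_bytes = (size_t)1024 * 1024 * 1024,
    };
    return aws_s3_client_new(allocator, &config);
}
```
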
1 change: 1 addition & 0 deletions source/s3.c
@@ -41,6 +41,7 @@ static struct aws_error_info s_errors[] = {
AWS_DEFINE_ERROR_INFO_S3(AWS_ERROR_S3_INCORRECT_CONTENT_LENGTH, "Request body length must match Content-Length header."),
AWS_DEFINE_ERROR_INFO_S3(AWS_ERROR_S3_REQUEST_TIME_TOO_SKEWED, "RequestTimeTooSkewed error received from S3."),
AWS_DEFINE_ERROR_INFO_S3(AWS_ERROR_S3_FILE_MODIFIED, "The file was modified during upload."),
AWS_DEFINE_ERROR_INFO_S3(AWS_ERROR_S3_EXCEEDS_MEMORY_LIMIT, "Request was not created due to used memory exceeding memory limit."),
};
/* clang-format on */

25 changes: 22 additions & 3 deletions source/s3_auto_ranged_get.c
@@ -177,13 +177,21 @@ static bool s_s3_auto_ranged_get_update(
meta_request,
AWS_S3_AUTO_RANGE_GET_REQUEST_TYPE_HEAD_OBJECT,
0,
AWS_S3_REQUEST_FLAG_RECORD_RESPONSE_HEADERS | AWS_S3_REQUEST_FLAG_PART_SIZE_RESPONSE_BODY);
AWS_S3_REQUEST_FLAG_RECORD_RESPONSE_HEADERS);

request->discovers_object_size = true;

auto_ranged_get->synced_data.head_object_sent = true;
}
} else if (auto_ranged_get->synced_data.num_parts_requested == 0) {

struct aws_s3_buffer_pool_ticket *ticket =
aws_s3_buffer_pool_reserve(meta_request->client->buffer_pool, meta_request->part_size);

if (ticket == NULL) {
goto has_work_remaining;
}

/* If we aren't using a head object, then discover the size of the object while trying to get the
* first part. */
request = aws_s3_request_new(
@@ -192,6 +200,7 @@
1,
AWS_S3_REQUEST_FLAG_RECORD_RESPONSE_HEADERS | AWS_S3_REQUEST_FLAG_PART_SIZE_RESPONSE_BODY);

request->ticket = ticket;
request->part_range_start = 0;
request->part_range_end = meta_request->part_size - 1; /* range-end is inclusive */
request->discovers_object_size = true;
@@ -253,12 +262,21 @@
auto_ranged_get->synced_data.read_window_warning_issued = 0;
}

struct aws_s3_buffer_pool_ticket *ticket =
aws_s3_buffer_pool_reserve(meta_request->client->buffer_pool, meta_request->part_size);

if (ticket == NULL) {
goto has_work_remaining;
}

request = aws_s3_request_new(
meta_request,
AWS_S3_AUTO_RANGE_GET_REQUEST_TYPE_PART,
auto_ranged_get->synced_data.num_parts_requested + 1,
AWS_S3_REQUEST_FLAG_PART_SIZE_RESPONSE_BODY);

request->ticket = ticket;

aws_s3_get_part_range(
auto_ranged_get->synced_data.object_range_start,
auto_ranged_get->synced_data.object_range_end,
@@ -412,10 +430,11 @@ static struct aws_future_void *s_s3_auto_ranged_get_prepare_request(struct aws_s
/* Success! */
AWS_LOGF_DEBUG(
AWS_LS_S3_META_REQUEST,
"id=%p: Created request %p for part %d",
"id=%p: Created request %p for part %d part sized %d",
(void *)meta_request,
(void *)request,
request->part_number);
request->part_number,
request->has_part_size_response_body);

success = true;
