Between version 1.11.159 and 1.11.305, the GetObjectAsync method of S3CrtClient has become very very slow. #2922
Comments
Can you please provide more detailed reproduction steps and include the code sample you are running for your tests? |
Thank you for your comment. The test code used is as follows.
|
This is an issue with the CRT memory-limiter feature that was added around re:Invent and that the C++ SDK has since picked up. I created awslabs/aws-c-s3#425 to follow up on it on the CRT side. Quick summary of the issue: you are configuring the part size to be 999 MB, but all the files you are dealing with are 32 MB. CRT does not know how big the result of a get will be, so it uses heuristics to predict how much memory will be needed to buffer the gets. Part size is part of that heuristic, and in this case a part size of 999 MB causes CRT to be overly cautious and only schedule a couple of requests at a time to avoid blowing up memory. Lowering the part size should improve throughput. Also note that with a part size much higher than the object size, you would not see much benefit from CRT, as it will revert to making one request per object. We recommend keeping the part size in the 8-16 MB range. |
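(For reference, lowering the part size is a one-line configuration change. A minimal sketch, reusing the same fields that appear in the benchmark code later in this thread; credentials and region are assumed to come from the environment.)
#include <aws/core/Aws.h>
#include <aws/s3-crt/S3CrtClient.h>
Aws::S3Crt::ClientConfiguration MakeRecommendedConfig()
{
    Aws::S3Crt::ClientConfiguration config;
    config.partSize = 16 * 1024 * 1024;  // 16 MB, within the recommended 8-16 MB range
    config.throughputTargetGbps = 10;    // tune to your network
    return config;
}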
Thank you for your explanation. It used to work, but setting the part size to 999 MB in my code was not a good choice; I understand now. When I changed the part size to 32 MB, the GET completed in the same time as before. Thank you!
I had thought that a part size smaller than the object size would increase the number of requests, which would be bad. |
Do you have any specific concerns with CRT using multiple connections when the part size is configured lower? CRT is optimized toward higher throughput, and the recommended way to achieve that with S3 is to parallelize requests across multiple connections. The downside of that approach is that S3 charges for each individual request. We currently do not have a way to turn off automated splitting for GETs; that is a potential feature we can consider. Do you have any logs for the INTERNAL_FAILURE (setting the log level to trace should dump the relevant info for CRT as well)? CRT sets the part size to 8 MB by default, and looking at your code I don't see anything that might cause an internal error. |
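(For reference, trace-level logging is enabled through SDKOptions before InitAPI. A minimal sketch under default logging settings, where the default CRT log system inherits the same level so CRT output is included too.)
#include <aws/core/Aws.h>
#include <aws/core/utils/logging/LogLevel.h>
int main()
{
    Aws::SDKOptions options;
    options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Trace;
    Aws::InitAPI(options);
    // ... run the failing requests here; with default settings the trace
    // output lands in aws_sdk_<date>.log files in the working directory ...
    Aws::ShutdownAPI(options);
    return 0;
}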
After enabling log output and checking the logs, I found that the cause was my program pre-calculating the checksum.
Log:
There were no issues when the checksum specification was omitted. Thank you.
If it improves performance, I have no concerns with CRT using multiple connections. However, as shown below, the execution time approximately doubles when the checksum pre-calculation is skipped, which is not acceptable.
|
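(For comparison with the pre-calculation approach: a PutObjectRequest can instead be told which checksum algorithm to use, so the SDK computes the value during the upload rather than in a separate pass over the body. A minimal sketch; the relative cost is an assumption to verify against your own workload.)
#include <aws/s3-crt/model/PutObjectRequest.h>
#include <aws/s3-crt/model/ChecksumAlgorithm.h>
#include <aws/core/utils/HashingUtils.h>
void SetChecksum(Aws::S3Crt::Model::PutObjectRequest& request, bool precalculate)
{
    if (precalculate) {
        // Extra pass over the body: compute CRC32 up front and set the header value.
        request.SetChecksumCRC32(Aws::Utils::HashingUtils::Base64Encode(
            Aws::Utils::HashingUtils::CalculateCRC32(*request.GetBody())));
    } else {
        // Let the SDK compute the CRC32 while the body is uploaded.
        request.SetChecksumAlgorithm(Aws::S3Crt::Model::ChecksumAlgorithm::CRC32);
    }
}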
Hey @MaZvnGpGm9, I spent some time digging into this, wrote some benchmark tests based on your code, and came up with some insights into what you are seeing. For reference, here is the benchmark code I was running (note I was using google/benchmark to run the tests):
main.cpp:
#include <benchmark/benchmark.h>
#include <aws/core/Aws.h>
#include <aws/core/utils/HashingUtils.h>
#include <fstream>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/PutObjectRequest.h>
using namespace Aws;
using namespace Aws::Utils;
using namespace Aws::S3Crt;
using namespace Aws::S3Crt::Model;
const char *ALLOCATION_TAG = "checksum_benchmark";
const char *BUCKET = "your_bucket";
const char *KEY = "your_key";
constexpr int REQUESTS_TO_MAKE = 10;
static SDKOptions s_options;
static void DoSetup(const benchmark::State &state)
{
InitAPI(s_options);
}
static void DoTeardown(const benchmark::State &state)
{
ShutdownAPI(s_options);
}
static std::shared_ptr<S3CrtClient> CreateClient()
{
S3Crt::ClientConfiguration config;
config.throughputTargetGbps = 10;
config.partSize = 32 * 1024 * 1024;
return Aws::MakeShared<S3CrtClient>(ALLOCATION_TAG, config);
}
// read a file from a dir named something like "32mb"
static std::shared_ptr<IOStream> CreateStream(size_t fileSize)
{
return Aws::MakeShared<FStream>(ALLOCATION_TAG,
"/path/to/test/file/" + std::to_string(fileSize) + "mb",
std::ios_base::in | std::ios_base::binary); // binary mode so the body is read byte-for-byte
}
static PutObjectRequest CreateRequest(size_t fileSize)
{
auto stream = CreateStream(fileSize);
auto request = PutObjectRequest().WithBucket(BUCKET).WithKey(KEY);
request.SetBody(stream);
return request;
}
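// Kick off one PutObjectAsync; the completion callback decrements the shared
// in-flight counter under the mutex and wakes the waiting benchmark thread.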
static void StartAsyncRequest(const std::shared_ptr<S3CrtClient>& client,
int& totalReqs,
std::condition_variable& cv,
std::mutex& requestMutex,
const PutObjectRequest& request)
{
client->PutObjectAsync(request,
[&totalReqs, &requestMutex, &cv](const S3CrtClient*,
const Model::PutObjectRequest&,
const Model::PutObjectOutcome& response,
const std::shared_ptr<const Aws::Client::AsyncCallerContext>&) -> void {
std::unique_lock lock(requestMutex);
assert(response.IsSuccess());
if (!response.IsSuccess()) {
std::cerr << "benchmark saw error: " << response.GetError().GetMessage() << "\n";
} else {
// Copy away the const so DoNotOptimize can observe the response.
auto mutResp = response;
benchmark::DoNotOptimize(mutResp);
}
totalReqs--;
cv.notify_all();
});
}
static void BM_MD5Checksum(benchmark::State &state) {
auto stream = CreateStream(state.range(0));
for (auto _: state) {
auto hash = HashingUtils::CalculateMD5(*stream);
benchmark::DoNotOptimize(hash);
}
}
static void BM_CRC32Checksum(benchmark::State &state) {
auto stream = CreateStream(state.range(0));
for (auto _: state) {
auto hash = HashingUtils::CalculateCRC32(*stream);
benchmark::DoNotOptimize(hash);
}
}
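// Synchronous PutObject with no pre-calculated checksum; the SDK/CRT handles any checksumming itself.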
static void BM_S3PutObjectWithoutPrecalcChecksum(benchmark::State &state) {
const auto client = CreateClient();
for (auto _: state) {
auto request = CreateRequest(state.range(0));
auto response = client->PutObject(request);
assert(response.IsSuccess());
if (!response.IsSuccess()) {
std::cerr << "benchmark saw error: " << response.GetError().GetMessage() << "\n";
}
benchmark::DoNotOptimize(response);
}
}
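// Same as above, but a CRC32 is computed up front (an extra pass over the body) and set on the request.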
static void BM_S3PutObjectWithPrecalcChecksum(benchmark::State &state) {
const auto client = CreateClient();
for (auto _: state) {
auto request = CreateRequest(state.range(0));
request.SetChecksumCRC32(HashingUtils::Base64Encode(HashingUtils::CalculateCRC32(*(request.GetBody()))));
auto response = client->PutObject(request);
assert(response.IsSuccess());
if (!response.IsSuccess()) {
std::cerr << "benchmark saw error: " << response.GetError().GetMessage() << "\n";
}
benchmark::DoNotOptimize(response);
}
}
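// Async variants: start REQUESTS_TO_MAKE PutObjectAsync calls, then block until all callbacks have fired.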
static void BM_S3PutObjectAsyncWithoutPrecalcChecksum(benchmark::State &state) {
const auto client = CreateClient();
for (auto _: state) {
int totalReqs = REQUESTS_TO_MAKE;
std::condition_variable cv;
std::mutex requestMutex;
for (int i = 0; i < totalReqs; ++i) {
auto request = CreateRequest(state.range(0));
StartAsyncRequest(client, totalReqs, cv, requestMutex, request);
}
std::unique_lock lock(requestMutex);
cv.wait(lock, [&totalReqs]() -> bool {
return totalReqs == 0;
});
}
}
static void BM_S3PutObjectAsyncWithPrecalcChecksum(benchmark::State &state) {
const auto client = CreateClient();
for (auto _: state) {
int totalReqs = REQUESTS_TO_MAKE;
std::condition_variable cv;
std::mutex requestMutex;
for (int i = 0; i < totalReqs; ++i) {
auto request = CreateRequest(state.range(0));
request.SetChecksumCRC32(HashingUtils::Base64Encode(HashingUtils::CalculateCRC32(*(request.GetBody()))));
StartAsyncRequest(client, totalReqs, cv, requestMutex, request);
}
std::unique_lock lock(requestMutex);
cv.wait(lock, [&totalReqs]() -> bool {
return totalReqs == 0;
});
}
}
BENCHMARK(BM_MD5Checksum)
->Setup(DoSetup)
->ArgName("File Size in MB")
->Arg(32)
->Iterations(10)
->Unit(benchmark::kMillisecond)
->Teardown(DoTeardown)
->MeasureProcessCPUTime()
->UseRealTime();
BENCHMARK(BM_CRC32Checksum)
->Setup(DoSetup)
->ArgName("File Size in MB")
->Arg(32)
->Iterations(10)
->Unit(benchmark::kMillisecond)
->Teardown(DoTeardown)
->MeasureProcessCPUTime()
->UseRealTime();
BENCHMARK(BM_S3PutObjectWithoutPrecalcChecksum)
->Setup(DoSetup)
->ArgName("File Size in MB")
->Arg(32)
->Iterations(10)
->Unit(benchmark::kSecond)
->Teardown(DoTeardown)
->MeasureProcessCPUTime()
->UseRealTime();
BENCHMARK(BM_S3PutObjectWithPrecalcChecksum)
->Setup(DoSetup)
->ArgName("File Size in MB")
->Arg(32)
->Iterations(10)
->Unit(benchmark::kSecond)
->Teardown(DoTeardown)
->MeasureProcessCPUTime()
->UseRealTime();
BENCHMARK(BM_S3PutObjectAsyncWithoutPrecalcChecksum)
->Setup(DoSetup)
->ArgName("File Size in MB")
->Arg(32)
->Iterations(5)
->Unit(benchmark::kSecond)
->Teardown(DoTeardown)
->MeasureProcessCPUTime()
->UseRealTime();
BENCHMARK(BM_S3PutObjectAsyncWithPrecalcChecksum)
->Setup(DoSetup)
->ArgName("File Size in MB")
->Arg(32)
->Iterations(5)
->Unit(benchmark::kSecond)
->Teardown(DoTeardown)
->MeasureProcessCPUTime()
->UseRealTime();
BENCHMARK_MAIN();
CMakeLists.txt
Note that I only run 10 requests in parallel, not 500; with 500, testing was taking far too long. What kind of hardware are you testing on? I'm running on a MacBook Pro, where the benchmark header ends up looking like:
Run on (10 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x10)
Load Average: 3.34, 4.07, 3.57
The results of the benchmark tests show:
Head:
1.11.159
From this we can draw several conclusions.
Let me know what you think and whether you can replicate the same results with the benchmark I have provided. Of course, if you would like to update the benchmark to reflect something else, please go ahead and do so, and let me know what hardware you are running on. |
Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one. |
Describe the bug
After upgrading the AWS SDK for C++, performance issues occurred. To confirm the behavior, I created a simple program that PUTs 500 objects of 32MB each using S3CrtClient's PutObjectAsync and then GETs them using GetObjectAsync.
As a result, with version 1.11.159, the GET operation for the 500 objects completed in approximately 11 seconds. However, with version 1.11.305, even after 2 minutes, the operation did not finish.
Upon checking the CPU usage, I observed that the CPU was barely being utilized. Therefore, it is suspected that some change between versions 1.11.159 and 1.11.305 has caused a degradation in GetObjectAsync.
Expected Behavior
The performance of GetObjectAsync does not change between SDK version updates.
Current Behavior
When retrieving a large number of objects using GetObjectAsync, the CPU is not utilized efficiently, resulting in significantly longer execution times.
Reproduction Steps
This occurs when PUTting 500 objects of 32MB each using S3CrtClient's PutObjectAsync and then GETting them using GetObjectAsync.
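A minimal sketch of that reproduction, since the original test program is not included in the thread; the bucket name, key scheme, and request count follow the description above and are otherwise placeholders:
#include <aws/core/Aws.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/GetObjectRequest.h>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <string>
int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::S3Crt::ClientConfiguration config;
        config.partSize = 999ULL * 1024 * 1024; // the 999 MB setting discussed above
        Aws::S3Crt::S3CrtClient client(config);
        std::mutex m;
        std::condition_variable cv;
        int remaining = 500;
        for (int i = 0; i < 500; ++i) {
            Aws::S3Crt::Model::GetObjectRequest request;
            request.SetBucket("your_bucket");              // placeholder
            request.SetKey("object_" + std::to_string(i)); // hypothetical key scheme
            client.GetObjectAsync(request,
                [&](const Aws::S3Crt::S3CrtClient*,
                    const Aws::S3Crt::Model::GetObjectRequest&,
                    Aws::S3Crt::Model::GetObjectOutcome outcome,
                    const std::shared_ptr<const Aws::Client::AsyncCallerContext>&) {
                    if (!outcome.IsSuccess()) {
                        std::cerr << "GET failed: " << outcome.GetError().GetMessage() << "\n";
                    }
                    std::lock_guard<std::mutex> lock(m);
                    --remaining; // count completions regardless of outcome
                    cv.notify_all();
                });
        }
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return remaining == 0; });
    }
    Aws::ShutdownAPI(options);
    return 0;
}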
Possible Solution
No response
Additional Information/Context
No response
AWS CPP SDK version used
1.11.305
Compiler and Version used
gcc 11.4.1 20230605 (Red Hat 11.4.1-2)
Operating System and version
RHEL9.3