7 tests fail on aarch64 when building for Fedora 39, 40 and rawhide #473
Comments
Are you consistently seeing those 7 tests failing for all the platforms you listed (Fedora 39, 40 and rawhide on aarch64)? From the stack trace, the crash comes from memory allocation, which is mostly caused by running out of memory. |
Yeah, in that stacktrace it's failing to allocate memory during the Are you running all 600+ tests in parallel? I could imagine launching 600+ executables simultaneously might cause some of the later ones to experience malloc failure? But that's really the only guess I can come up with. PS: The reason the stacktrace is enormous is due to an infinite loop: The memory allocation failure triggers an assert, the assert tries to print the backtrace, this needs to perform allocations, which fail and trigger an assert... loop... loop... loop. But this isn't the cause of the test failures, just some expected behavior when an assertion fails. |
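To make the loop described above concrete, here is a toy model in C. It is only a sketch under stated assumptions: `failing_alloc`, `on_assert_failure` and the two globals are hypothetical names invented for illustration, not aws-c-common code, and the re-entrancy flag exists only so that the toy terminates instead of recursing forever.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters for this toy model only. */
int g_handler_calls = 0;   /* how often the assert handler was entered */
bool g_in_handler = false; /* re-entrancy flag */

void on_assert_failure(void);

/* Simulated allocator that always fails, and "asserts" when it does. */
void *failing_alloc(size_t size) {
    (void)size;
    on_assert_failure(); /* allocation failure triggers the assert */
    return NULL;
}

/* Simulated assert handler: printing a backtrace needs memory. */
void on_assert_failure(void) {
    g_handler_calls++;
    if (g_in_handler) {
        return; /* already reporting a failure: stop here instead of looping */
    }
    g_in_handler = true;
    (void)failing_alloc(128); /* backtrace buffer; fails and re-enters the handler */
    g_in_handler = false;
}
```

Without the flag, each failed allocation would enter the handler again, which would try to allocate again, and so on. That would match the repeated frames in the attached stacktrace, and it is a consequence of the first failure rather than its cause.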
Yes, as far as I can see it's always the same 7 tests, nothing random.
I use the Fedora build infrastructure (https://copr.fedoraproject.org)
Based on the logs they seem to fail after
No, they run in sequence, one after another, as far as I can tell from the logs. I ran the cmake tests without any custom parameters, at least. I can run manual tests on an EC2 Graviton instance if that helps. But given that the tests are fine on RHEL 8 and RHEL 9 in the same build infra on aarch64, but fail on F39, F40 and rawhide, I get the impression it's not a general problem with the build infra / resources. The major difference I see is the OpenSSL version that's installed and used. My guess is that it could therefore be related to those issues? But why would it then only affect aarch64 but not x86_64 :-/ |
yeah ... it is suspicious that it's just those specific tests, all of which share some helper functions. Maybe there's an uninitialized variable in there? Could you give me steps I could use to reproduce and iterate on this? I'm not familiar with COPR. I can't find a way to run Fedora 40 directly. I tried to run it on an EC2 Graviton machine, but couldn't find a free Amazon Machine Image (AMI) for Fedora 39, 40, or rawhide. So instead I ran Amazon Linux 2023 on a Graviton machine, and within that I used docker to run the
|
I dug into what makes those 7 tests different from all the other tests defined in the same file. One difference that stands out is that they all have a large error_tester struct defined in the function, which means it's created on the stack. This struct is 403KiB. The tests in this file that pass all use a similar s_tester struct, except that one's defined as a global variable, instead of a stack variable. I've confirmed that the test crashes with a coredump if I set the stack size to 256KiB via Maybe when COPR runs aarch64 Fedora 39, 40, rawhide, the default stack size is smaller than other configurations? Can you check the default stack size on these COPR runs via Also, did this crash start when you updated aws-c-http? Or when you updated the Fedora versions you're testing? Or is this the first time you're ever testing aws-c-http? |
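One way to check the stack-size theory from inside a process is to compare the soft stack limit against the size of the struct. The following is a minimal sketch, assuming a POSIX system and assuming the test body runs on the main thread (other threads get their stack size from thread attributes instead). The helper names `fits_in_stack_limit` and `current_stack_can_hold` are hypothetical and not part of aws-c-http.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper: does `needed` bytes plus `headroom` bytes for other
 * frames fit within a stack limit of `limit` bytes? */
bool fits_in_stack_limit(rlim_t limit, size_t needed, size_t headroom) {
    if (limit == RLIM_INFINITY) {
        return true; /* no soft limit configured */
    }
    return (rlim_t)needed + (rlim_t)headroom <= limit;
}

/* Hypothetical helper: query the soft RLIMIT_STACK of this process and check
 * whether a stack object of `needed` bytes would fit. */
bool current_stack_can_hold(size_t needed, size_t headroom) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0) {
        return false; /* could not query the limit; be conservative */
    }
    return fits_in_stack_limit(lim.rlim_cur, needed, headroom);
}
```

With the numbers from this thread, `fits_in_stack_limit(256 * 1024, 403 * 1024, 0)` is false, which is consistent with the coredump seen at a 256KiB stack. A test could call `current_stack_can_hold(sizeof(struct error_tester), 64 * 1024)` before creating the struct; the 64KiB headroom is an arbitrary illustrative value.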
@graebm thank you for looking into this and doing all the testing. This problem seems to be tricky to reproduce. I'm packaging
|
The Fedora Project has another build environment, called Koji, which is used for building the actual distribution packages that become part of the Fedora repos. There was no issue building the package, including running the tests, in this environment. It looks like it uses on-premises hosts with somewhat more resources than the build machines in Copr.
You can go to https://fedoraproject.org/cloud/download#cloud_launch and launch an instance from there. It opens the AWS Console with the pre-filled AMI ID. I launched a
Everything went fine, I checked the spec files of aws-c-http and its dependencies that I maintain. aws-c-common (no I don't use Next test, I installed the Fedora Packager Tools (https://docs.fedoraproject.org/en-US/package-maintainers/Installing_Packager_Tools/).
Doing a mockbuild with AlmaLinux for EPEL9:
And at this point I'm a bit clueless.
The mock configs and templates define which container image to use, which repos to attach, which packages to install and things like that. BUT why are we fine on a Fedora 40 aarch64 host when compiling and running the tests manually, while it fails in an F40 container on the same system O_o. |
I updated the package today to the latest release and the behaviour was the same as last time. So the issue is pretty much limited to a specific mock environment and hard to reproduce. Overall it works, and in the build environment "that counts" the problem doesn't show up either. I'm closing this issue for now. Thanks for your support! |
Describe the bug
Hi,
I'm packaging aws-c-http for Fedora and EPEL. I run the unit tests as part of the build process. 7 tests fail on Fedora (39, 40, rawhide) on aarch64. They run successfully on x86_64. EPEL8 (RHEL8) and EPEL9 (RHEL9) are also fine on both architectures.
I suspect that the problem is related to the OpenSSL version and a behaviour specific to aarch64. I'm looking for help with how to further troubleshoot and fix the issue. I'm not an expert in C and have a hard time making sense of the stacktrace.
I also packaged the dependencies like aws-c-common (https://src.fedoraproject.org/rpms/aws-c-common) or s2n-tls (https://src.fedoraproject.org/rpms/s2n-tls). All of them use BUILD_SHARED_LIBS=ON. I don't use aws-lc, I build against the system crypto OpenSSL instead.
Expected Behavior
Unit tests pass on Fedora 39, 40 and rawhide on aarch64, as they do on x86_64
Current Behavior
Works
EPEL8:
EPEL9:
Fails
F39:
F40:
F41:
The latest copr build for all architectures and OS versions: https://copr.fedorainfracloud.org/coprs/wombelix/aws-c-libs/build/7722282/
The relevant log outputs from fedora-rawhide-x86_64:
Mapping of the failed tests to their location in the aws-c-http source:
The following tests FAILED:
601 - h1_server_close_before_message_is_sent (Failed)
aws-c-http/tests/CMakeLists.txt
Line 593 in 079ccfd
aws-c-http/tests/test_h1_server.c
Line 1591 in 079ccfd
aws-c-http/tests/CMakeLists.txt
Line 594 in 079ccfd
aws-c-http/tests/test_h1_server.c
Line 1629 in 079ccfd
aws-c-http/tests/CMakeLists.txt
Line 595 in 079ccfd
aws-c-http/tests/test_h1_server.c
Line 1635 in 079ccfd
aws-c-http/tests/CMakeLists.txt
Line 596 in 079ccfd
aws-c-http/tests/test_h1_server.c
Line 1641 in 079ccfd
aws-c-http/tests/CMakeLists.txt
Line 597 in 079ccfd
aws-c-http/tests/test_h1_server.c
Line 1653 in 079ccfd
aws-c-http/tests/CMakeLists.txt
Line 598 in 079ccfd
aws-c-http/tests/test_h1_server.c
Line 1647 in 079ccfd
aws-c-http/tests/CMakeLists.txt
Line 599 in 079ccfd
aws-c-http/tests/test_h1_server.c
Line 1658 in 079ccfd
All seem to have in common that they use:
s_test_error_from_callback
aws-c-http/tests/test_h1_server.c
Line 1523 in 079ccfd
Stacktrace attached, too long to add it inline:
stacktrace_aws-c-http_f41_rawhide.txt
Reproduction Steps
Build and run the unit tests on a Fedora 39, 40 or rawhide aarch64 system.
Possible Solution
No response
Additional Information/Context
No response
aws-c-http version used
0.8.2
Compiler and version used
13.3.1-1.fc39, 14.1.1-7.fc40, 14.1.1-7.fc41
Operating System and version
Fedora 39, 40, rawhide aarch64