client.detach_test failing: kill cannot find target process #6536
Comments
Xref client.attach_test failing with a similar error; its failure frequency may have increased with PR #6513. There may not be enough data to say for sure, but it failed 2 out of 4 times in recent PRs. |
@onroadmuwl hoping you can help eliminate the flakiness in this new test that you added |
I'm sorry to hear that. I'm not able to reproduce this problem in my local testing, and I didn't see similar problems when I submitted the code in the previous PR. I guess there may be a conflict with newly submitted code. I'm trying to debug the test suite step by step in a new PR, #6537. If I delete the test code for Linux's detach, the attach test works well, so I think the problem may be in the detach test rather than in the detach code itself. |
The attach test was failing every once in a while with the same output: #6452. So it may be the same underlying pre-existing issue that is just more frequent in the detach test. From the output it looks like the attach never happened b/c it never prints "thank you for testing..."? |
@onroadmuwl did you try running many times inside the ctest framework to see if it reproduces non-deterministically on your local machine?
|
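For anyone trying to reproduce a non-deterministic failure like this, a small repeat-until-fail loop can help. The sketch below is only an illustration: `repeat_until_fail` is a hypothetical helper, not part of the DynamoRIO test suite, and it simply re-runs an arbitrary command until the first non-zero exit code. ctest also has a built-in `--repeat-until-fail <n>` option that serves the same purpose.

```python
import subprocess
import sys


def repeat_until_fail(cmd, max_runs):
    """Run cmd up to max_runs times.

    Returns (run_index, output) for the first failing run (1-based),
    or (None, "") if every run exited with status 0.
    """
    for run_index in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep stderr interleaved with stdout
            text=True,
        )
        if result.returncode != 0:
            return run_index, result.stdout
    return None, ""


if __name__ == "__main__":
    # Stand-in command that always succeeds; replace with the real test invocation.
    failed_at, output = repeat_until_fail([sys.executable, "-c", "pass"], 5)
    print("first failing run:", failed_at)
```

Assuming it is run from a build directory, the command could be something like `["ctest", "-R", "client.detach_test", "--output-on-failure"]`; the captured output of the first failing run can then be compared against the CI log.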
The
Is it possible to solve this problem by letting the detach test execute by launching instead of attaching? |
Thank you, this method is quite useful. But it seems that the error output in local testing doesn't match the output of GitHub's test suite.
|
After completing next week's exams, I will attempt to address the above questions. |
Sure, thank you for your efforts. |
It seems that the flakiness of attach and detach has disappeared inexplicably in PR 6537. In the current code, the |
You mean you can't reproduce the failures on the Github Actions (GA) VM's? I thought you were able to reproduce locally by running |
|
Hmm, it did repro on my machine every so often as shown at #6536 (comment). Not sure when I will have time to try to debug it though. |
The error of my local testing is
|
Does your timeout fix solve the timeouts that we see in these tests in the |
The timeouts turn out to be just missing ptrace privs: #6558 |
I tested again in PR 6537, and the detach test now passes without problems. |
I think it failed every single time on the aarch64 machine due to the ptrace_scope. |
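A quick way to check the ptrace_scope setting on a machine is to read the Yama sysctl file. This is a minimal sketch, assuming a Linux system that exposes `/proc/sys/kernel/yama/ptrace_scope`; the helper names are hypothetical, and the value meanings follow the kernel's Yama documentation.

```python
from pathlib import Path

# Standard location of the Yama ptrace_scope sysctl on Linux.
YAMA_PATH = "/proc/sys/kernel/yama/ptrace_scope"

# Meanings as described in the kernel's Yama documentation.
SCOPE_MEANING = {
    0: "classic ptrace permissions",
    1: "restricted: only descendants (or declared tracers) can be attached to",
    2: "admin-only attach (CAP_SYS_PTRACE required)",
    3: "no attach allowed",
}


def read_ptrace_scope(path=YAMA_PATH):
    """Return the ptrace_scope value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def attach_may_be_restricted(scope):
    """True if attaching to an unrelated process may be refused by Yama."""
    return scope is not None and scope >= 1


if __name__ == "__main__":
    scope = read_ptrace_scope()
    print(scope, SCOPE_MEANING.get(scope, "unknown or Yama not present"))
```

A value of 1 or higher would be consistent with an attach test failing every time for lack of privileges, though this check alone does not prove that is the cause.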
Oh you mean you figured out the error about not finding the process to kill and having no output? |
All right, I haven't found a solution to this problem. I had mistakenly assumed that the flakiness on x64 was also caused by the ptrace privilege. |
This reproduces in release build on a local machine. Success:
Failure:
The file:
|
That kill that failed I believe is sending SIGTERM which is the expected method for the infloop app to exit. So I would assume that the detach crashed the process (after printing "detach") and so it's gone by the time the SIGTERM is sent. |
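The reasoning here can be illustrated outside the test suite: sending SIGTERM to a process that has already gone fails with "no such process", while sending it to a live process succeeds. The sketch below is not the code from runall.cmake; it uses short-lived Python child processes as hypothetical stand-ins for the infloop app, and `terminate` is an illustrative helper.

```python
import os
import signal
import subprocess
import sys


def terminate(pid):
    """Send SIGTERM to pid; return False if the target process no longer exists."""
    try:
        os.kill(pid, signal.SIGTERM)
        return True
    except ProcessLookupError:  # ESRCH: the process is already gone
        return False


# Stand-in for an app that died before the signal was sent.
dead = subprocess.Popen([sys.executable, "-c", "pass"])
dead.wait()  # reap the child so its pid no longer names a live process
dead_result = terminate(dead.pid)

# Stand-in for an app that is still looping when the signal arrives.
alive = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
alive_result = terminate(alive.pid)
alive.wait()  # collect the exit status of the signalled child

print("kill on dead process succeeded:", dead_result)
print("kill on live process succeeded:", alive_result)
```

Under this reading, a failed kill at the end of the test is a symptom that the application exited earlier than expected, not a problem with the kill step itself.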
If the detach sometimes crashes the process on x86-64, the crash may happen while setting the nudged thread's context. On x86-64 I did not previously use sigreturn because dcontext lacks the fpstate information. But the fpstate information can also be obtained from the sc (sigcontext), so I created a new PR, #6579, that detaches the process via sigreturn on x86-64. It works well on my local machine (with repeat-until-fail). Could you check whether it runs stably on your machine? |
It still fails locally and on the x86-64 job (after a re-run; passed 1st time): #6579 (comment) We are going to have to mark this test as flaky to get our suite green. Please continue working on it and when the detach feature is fully working we can un-mark the test. |
I'll work on it and submit the related code if feasible solutions are found. |
So you still cannot reproduce on any of your local machines, and you couldn't reproduce under tmate on the Github Actions runners here? It is failing every other time so you'd think it would reproduce under tmate on the Actions runners. |
I have reproduced the timeout failure on another, new x86-64 machine. I fixed it by printing "done" in detach_test.dll.c instead of infloop.c, and it now works well on the new machine. I re-ran the GA action more than five times and the detach_test failure has not appeared (PR #6579). Could you please test it on your machine again? By the way, all of my modifications are based on a previous version of DynamoRIO. I get the following error when I try to build the newest DynamoRIO. Could you tell me why this error occurs, if you have ever encountered it?
|
Did you initialize the elfutils submodule? I.e., is the directory empty? It needs to be initialized. |
But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach? |
Line 317 of suite/tests/runall.cmake has a "kill_background_process(OFF)" command. It is executed after detaching and before printing done. If the program crashed after detaching, it prints an error like "kill cannot find target process", so this confirms that the program has not crashed. This modification follows the detach test on the Windows platform. |
It works for me, thanks. |
It's also unstable sometimes. https://github.com/DynamoRIO/dynamorio/actions/runs/7698709453/job/20978621263?pr=6579 |
The Linux client.detach_test was just added in PR #6513 and it is failing (non-deterministically) on x86-64.
https://github.com/DynamoRIO/dynamorio/actions/runs/7381905578/job/20080993968?pr=6531