-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fabtests/multi_ep: add more stressing of resource binding to multi_ep test #10249
Conversation
aingerson
commented
Jul 30, 2024
- adds more MRs
- adds RMA testing
- adds closing MRs, reregistering, retesting
- adds threading level toggling for all tests
- adds more combinations of multi_ep testing to runfabtests.sh
- adds multi_ep testing back in for tcp provider
fabtests/functional/multi_ep.c
Outdated
cq_read_idx = 0; | ||
else | ||
cq_read_idx = i; | ||
while (recv_bufs[i][0] == 'A') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For EFA, I think this wait strategy is not safe unless NIC deliver all the data in the buffer once in sequence. We have some problem for it before and currently we only use this method when NIC supports RDMA write with in-order & write-once delivery. Why not use fi_writedata here so you can rely on the CQ poll to know data is delivered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that makes sense, I can change it. Thanks!
Is intel CI failure related? |
@shijin-aws Yes, it was. UCX doesn't support CQ data and also uses FI_MR_RAW so I had to add MR_RAW support and use an alternative solution to the CQ data (just used a regular sync). Let me know if efa would be ok with this solution. |
I don't think efa supports FI_MR_RAW, will look at your new solution |
AWS CI failure: It seems the remote key is wrong for fi_write
|
1a9b2bf
to
e8fa590
Compare
Same error
|
@shijin-aws I pushed some changes to make sure we weren't corrupting the key. Is the AWS failure the same key verification error? |
Yes, same failure
May I know does your patch respect |
@shijin-aws It should, yes. I put all the existing key exchange into a common function that we call in this test. That includes getting they key for the MR and sending that one to the peer instead of the requested key. See here |
@shijin-aws Can you send the AWS log whenever you get chance? No rush |
@aingerson sure
xpmem warning can be ignored. |
We run the same test on multiple OSes/VM types. Here is another failed run on different OS/VM
|
It can be a bug in efa provider though. I have seen the same test passed for single node which finally use shm, but not on double node. |
Can you remind me what changed in your latest version that makes the test failed? I remember the same test passed earlier when you use the cq data mechanism |
bot:aws:retest |
@shijin-aws Added some more fixes that were affecting other providers. Still same error with efa? Any luck with the logs? |
@aingerson This times seems the fabtests parser is broken ... How can that happen?
|
@shijin-aws Agh. Sorry I made a dumb mistake. I had to change the multi_ep test to use -A and -Q for --shared-av and --shared-cq instead of the long opts because it interfered with the common long opts. Sorry I missed that because I didn't run all the combinations of tests. Made sure all of the arg parsing works this time! |
@aingerson AWS CI is not using runfabtests.sh, it's using runfabtests.py, so if you change
What common long opts they interfere with? |
@shijin-aws multi_ep.c was overriding the common long opts with its own (shared-av and shared-cq) so it wasn't picking up the common code ones. So I had to remove the test-specific long opts and replace them with short ops
I repushed with the pytest fixes. Next week we can see if it still gets the same key error as before (I left the patch with the prints in there to debug). |
Spotted an ordering issue with the key exchange. Refactored the key exchange to eliminate that issue and repushed but still with the prints just in case (and with a few more as suggested) |
259e973
to
48174dd
Compare
8d07f52
to
93a21f3
Compare
bot:aws:retest |
@aingerson This time AWS CI passed except for the known issue I mentioned in #10362 (comment). It shouldn't block this PR |
@shijin-aws Sorry we had CI issues but I think everything is passing now on our end. CI still ok on your end or do you want to rerun? I had to update the raw key path but I don't think that should affect your path. Let me know if you think it's good to merge! |
bot:aws:retest |
@aingerson I thought you already merged it... Yeah feel free to go ahead |
@shijin-aws Ok awesome. I'll let this one run just in case and then merge. Thanks! |
@aingerson test pass, but it seems you need to rebase |
Instead of using one allocation and MR, separate into separate regions to test multiple MRs with multiple EPs Use common hmem alloc interfaces to properly use device support Signed-off-by: Alexia Ingerson <[email protected]>
Pull some of the mr key/addr exchange into separate functions that fill in and convert rma info before being exchanged so that separate tests can call this function and support FI_MR_RAW and FI_MR_VIRT_ADDR more easily Signed-off-by: Alexia Ingerson <[email protected]>
To test RMA in addition to FI_MSG, make the following changes: - Register all MRs for RMA use - Use existing message test to exchange RMA information (key, address) - Add RMA test with data validation after the message exchange test Also includes renaming remote_addr to remote_fiaddr to distinguish between fiaddr and RMA addr Signed-off-by: Alexia Ingerson <[email protected]>
Add extra stress testing on multiple EPs/MRs by closing all the MRs, re-registering them, and re-running the whole test sequence Signed-off-by: Alexia Ingerson <[email protected]>
Change the --shared-av and --shared-cq into short opts -A and -Q, respectively. This allows the multi_ep test to make sure of the common long opts Signed-off-by: Alexia Ingerson <[email protected]>
Add long option --threading for tests to use to allow caller to set threading level This removes default setting of domain_attr->threading in various tests since the default is now universally set to FI_THREAD_DOMAIN Signed-off-by: Alexia Ingerson <[email protected]>
As the most OFI resource intensive test, this test is a good test for testing different combinations of resource binding, especially with FI_THREAD_COMPLETION turned on. Applications that utilize threads will likely use a combination of separate domains, EPs, CQs, and AVs. Even though this test does not use threads and cannot test the protection against these resources, it can test different optimization paths within providers that may be triggered based on the threading level requested. Signed-off-by: Alexia Ingerson <[email protected]>
Signed-off-by: Alexia Ingerson <[email protected]>