prov/efa: Detect unsolicited write recv support status on both sides #10657

Merged: 1 commit, Jan 1, 2025

shijin-aws
Contributor

Currently, when the local side supports unsolicited write recv but the peer does not, the peer crashes because it expects a valid wr_id for the IBV_WC_RECV_RDMA_WITH_IMM opcode. The crash can in turn produce confusing error messages on the sender's CQ while the sender is still sending data to that peer. This patch makes the initiator of an RDMA write with immediate data detect the unsolicited write recv support status on both sides. If the two sides are inconsistent, the initiator returns an error with a clear message that describes the mitigation.

@shijin-aws shijin-aws requested a review from a team December 21, 2024 00:47
@shijin-aws shijin-aws force-pushed the unsolicited_recv_handshake branch 2 times, most recently from a6a0596 to f28e40a Compare December 22, 2024 02:10
"recv feature by setting environment variable FI_EFA_USE_UNSOLICITED_WRITE_RECV=0\n",
efa_rdm_use_unsolicited_write_recv(), efa_rdm_peer_support_unsolicited_write_recv(txe->peer),
ep->err_msg);
return -FI_EINVAL;
Contributor

Wouldn't FI_EOPNOTSUPP be a better error code?

Contributor Author

Thank you, updated.
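
For readers following the thread, the check under discussion can be sketched roughly as below. This is a minimal illustration, not the code merged in this PR: the helper name `check_unsolicited_write_recv`, its parameters, and the message wording are hypothetical, and the standard `EOPNOTSUPP` errno value stands in for `FI_EOPNOTSUPP` so the snippet compiles without libfabric headers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (assumption, not the provider's actual function).
 * Compares the local and peer support status for unsolicited write recv.
 * Returns 0 when both sides agree, or -EOPNOTSUPP with a mitigation hint
 * written to err_msg when they disagree.
 */
static int check_unsolicited_write_recv(bool local_support, bool peer_support,
                                        char *err_msg, size_t err_msg_len)
{
    /* Both enabled or both disabled: the RDMA write with imm can proceed. */
    if (local_support == peer_support)
        return 0;

    /* Mismatch in either direction: fail early at the initiator. */
    snprintf(err_msg, err_msg_len,
             "Inconsistent support status of unsolicited write recv: "
             "local %d, peer %d. Consider disabling the feature by setting "
             "FI_EFA_USE_UNSOLICITED_WRITE_RECV=0",
             local_support, peer_support);
    return -EOPNOTSUPP;
}
```

The point of failing at the initiator is that both mismatch directions are caught before any work request is posted, so neither side has to interpret an unexpected completion.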

@shijin-aws shijin-aws force-pushed the unsolicited_recv_handshake branch 2 times, most recently from 716c5ed to 3d9c9b4 Compare December 30, 2024 23:58
Currently, when the local side supports unsolicited write recv but the
peer does not, the peer crashes because it expects a valid wr_id for
the IBV_WC_RECV_RDMA_WITH_IMM opcode. The crash can in turn produce
confusing error messages on the sender's CQ while the sender is still
sending data to that peer. In the opposite case, when the local side
does not support unsolicited write recv but the peer does, the local
side gets a CQ error of "Unexpected status" for the RDMA operation.

This patch makes the initiator of an RDMA write with immediate data
detect the unsolicited write recv support status on both sides. If
the two sides are inconsistent, the initiator returns an error with
a clear message that describes the mitigation.

Signed-off-by: Shi Jin <[email protected]>
@shijin-aws shijin-aws force-pushed the unsolicited_recv_handshake branch from 3d9c9b4 to 95d7665 Compare December 31, 2024 00:58
@shijin-aws shijin-aws merged commit ed5560a into ofiwg:main Jan 1, 2025
13 checks passed