-
Notifications
You must be signed in to change notification settings - Fork 389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/efa: Implement the cq progress #10659
base: main
Are you sure you want to change the base?
Conversation
89e563a
to
5481222
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments, will take a closer look later
prov/efa/src/efa_cq.c
Outdated
|
||
addr = efa_av_reverse_lookup(base_ep->av, | ||
ibv_wc_read_slid(ibv_cq_ex), | ||
ibv_wc_read_src_qp(ibv_cq_ex)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong, ibv_wc_read_src_qp
only applies to the src addr, which means it only applies to the rx operations (recv or recv imm). We need to determine the destination qpn for a send call. If there is an rdma-core API that does it, we should use it. If it doesn't exist, we need a plan to how to get such addr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find one API for this. Can we save the msg->addr to be part of fi_context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can, but that doesn't work for inject calls where you don't have a fi_context.
|
||
void efa_cq_progress(struct util_cq *cq) | ||
{ | ||
efa_cq_poll_ibv_cq(efa_env.efa_cq_read_size, cq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned earlier, we may need a domain level lock here to protect the qp_table
which is also accessed by the ep_enable
and ep_close
calls. But I am ok with making it as part of a follow up change when we implement the ep
interface for efa raw.
prov/efa/src/efa_cq.c
Outdated
} | ||
|
||
if (cq_entry->flags != 0) | ||
efa_cntr_report_rx_completion(&base_ep->util_ep, cq_entry->flags); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments, IIRC cntr should always be updated no matter a cq entry is written or not. (Double check man page please)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Man page says Counters record the number of requested operations that have completed.
Does that mean update cntr for no completion(null wr_id) requests too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, look at the rdm code, it update cntr for inject calls as well where we don't write cqe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like rdm does not update cntr if ofi_cq_write fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a bug then
prov/efa/src/efa_cq.h
Outdated
return -FI_ENOBUFS; | ||
} | ||
|
||
*buf = err_msg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this will work? Your err_msg
will be released when the function returns as it is declared within the stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it might cause undefined behavior. I will move err_msg to be a field of base_ep then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have to use a persistent err_msg
in base_ep
? If you read the ofi_cq_write_err
code, you will find it will mem_dup
the err_data
you filled in entry
. In other words, the err_data
you passed can be safely released after ofi_cq_write_err
returns. Just use a temporary char array in your code block that write the error cq.
8beb420
to
5797b52
Compare
Move efa_use_unsolicited_write_recv to efa.h so they can be used outside rdm. Signed-off-by: Jessie Yang <[email protected]>
prov/efa/src/efa_base_ep.h
Outdated
@@ -65,6 +66,7 @@ struct efa_base_ep { | |||
/* Only used by RDM ep type */ | |||
struct efa_qp *user_recv_qp; /* Separate qp to receive pkts posted by users */ | |||
struct efa_recv_wr *user_recv_wr_vec; | |||
char err_msg[EFA_ERROR_MSG_BUFFER_LENGTH]; /* A large enough buffer to store CQ/EQ error data used by e.g. fi_cq_readerr */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paste my earlier comment here again
Do you have to use a persistent err_msg
in base_ep
? If you read the ofi_cq_write_err
code, you will find it will mem_dup
the err_data
you filled in entry
. In other words, the err_data
you passed can be safely released after ofi_cq_write_err
returns. Just use a temporary char array in your code block that write the error cq.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you prefer to remove err_msg from rdm_ep too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
Move efa_rdm_ep_raw_addr_str, efa_rdm_ep_get_peer_raw_addr, efa_rdm_ep_get_peer_raw_addr_str to base_ep.h so they can be used by efa-raw. Signed-off-by: Jessie Yang <[email protected]>
prov/efa/src/rdm/efa_rdm_cq.c
Outdated
@@ -348,11 +348,11 @@ static void efa_rdm_cq_handle_recv_completion(struct efa_ibv_cq *ibv_cq, struct | |||
* QP and we cannot cancel that. | |||
*/ | |||
if (OFI_UNLIKELY(ep->use_zcpy_rx && efa_rdm_pkt_type_is_rtm(pkt_type))) { | |||
void *errbuf; | |||
void *errbuf = {0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be a new err_msg[EFA_ERROR_MSG_BUFFER_LENGTH]
here?
Use a temporary char array to hold error message. err_data can be safely released after ofi_cq_write_err returns. No need to use a persistent field in ep. Signed-off-by: Jessie Yang <[email protected]>
Rename efa_dgram_cq.c to efa_cq.c and move it to prov/efa/src as a common CQ interface for efa-raw’s rdm and dgram ep type. Create efa_cq_progress and poll ibv cq directly instead of using efa_dgram_ep_progress. Remove rcq and scq, write completion to util_ep.rx_cq and tx_cq. Construct the error message for cq_err_entry.err_data. Signed-off-by: Jessie Yang <[email protected]>
Signed-off-by: Jessie Yang <[email protected]>
return -FI_ENOMEM; | ||
|
||
err = ofi_cq_init(&efa_prov, domain_fid, attr, &cq->util_cq, | ||
&efa_cq_progress, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need a similar change for efa_cntr: https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/efa_cntr.c#L200-L202. Since you make dgram_ep_progress as a no op, ofi_cntr_progress
will be a no-op too as it call ep progress in the same way as ofi_cq_progress
does. You need a efa_cntr_progress
that polls ibv_cq as well
@@ -53,7 +53,7 @@ struct efa_conn *efa_av_addr_to_conn(struct efa_av *av, fi_addr_t fi_addr) | |||
struct util_av_entry *util_av_entry; | |||
struct efa_av_entry *efa_av_entry; | |||
|
|||
if (OFI_UNLIKELY(fi_addr == FI_ADDR_UNSPEC)) | |||
if (OFI_UNLIKELY(fi_addr == FI_ADDR_UNSPEC || fi_addr == FI_ADDR_NOTAVAIL)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is called inefa_base_ep_get_peer_raw_addr
. We temporarily return FI_ADDR_NOTAVAIL as peer addr for TX operation in efa_cq_handle_error
.
cq_entry->buf, cq_entry->data, | ||
cq_entry->tag, FI_ADDR_NOTAVAIL); | ||
else | ||
ret = ofi_cq_write(tx_cq, cq_entry->op_context, cq_entry->flags, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are still two copies one after the other in the CQ read path. Can we eliminate one of the copies?
The CQ read path is (correct me if I'm wrong)
- Application calls
fi_cq_read
which we set toofi_cq_read
ofi_cq_read
->ofi_cq_readfrom
ofi_cq_readfrom
first callscq->progress(cq)
which we set toefa_cq_progress
->efa_cq_poll_ibv_cq
-> poll IBV CQ ->ofi_cq_write
and we copy the IBV CQ completion into the Libfabric CQ circular queueofi_cq_readfrom
then callsofi_cq_read_entries
and we copy the CQ entry from Libfabric CQ circular queue to the application CQ entry buffer
If the progress was more complicated (progress other CQs or EPs for protocol reasons), then we need the circular queue. But in this case, we really don't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Application calls fi_cq_read
with a count
. We can stop polling after we get count
completions. rdma-core is already buffering the CQ entries, so we can just take advantage of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the progress was more complicated (progress other CQs or EPs for protocol reasons), then we need the circular queue. But in this case, we really don't.
I don't think it matters with how complicated the protocol is. For protocol case, we only write ofi cq when an operation has completed, which may involve multiple ibv cqes. You can always customize a function that propagate the completion to the buf
in the fi_cq_read directly to achieve best performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the current dgram code was kind of designed in this direction, see https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/dgram/efa_dgram_cq.c#L161. But then it rewrite the cq entries in buf
back to ofi cq https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/dgram/efa_dgram_ep.c#L319-L338. We may reuse the dgram code pattern but remove that ofi_cq_write*
phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem I think of is that, if you want to get rid of libfabric's cirque, there is no way for you to make a progress engine while staging the cqes. IIRC fi_cntr_read
needs to progress the ep for manual progress model, and that progress function needs to poll ibv cq to update the cntr while staging the cqes for another fi_cq_read
to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://ofiwg.github.io/libfabric/main/man/fi_domain.3.html
FI_PROGRESS_MANUAL
This progress model indicates that the provider requires the use of an application thread to complete an asynchronous request. When manual progress is set, the provider will attempt to advance an asynchronous operation forward when the application attempts to wait on or read an event queue, completion queue, or counter where the completed operation will be reported. Progress also occurs when the application processes a poll or wait set that has been associated with the event or completion queue.
If we want to support the case where application use both fi_cq_read
and fi_cntr_read
to progress the operations, we still need to use cirque in util cq.
wr->num_sge = msg->iov_count; | ||
wr->sg_list = base_ep->efa_recv_wr_vec[wr_index].sge; | ||
wr->wr_id = (uintptr_t) ((flags & FI_COMPLETION) ? msg->context : NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed with Jessie. According to the docs for FI_CONTEXT
This structure should be treated as opaque to the application. For performance reasons, this structure must be allocated by the user, but may be used by the fabric provider to track the operation.
So we can use this buffer however we want. I would suggest setting it up as a bitmask and have FI_COMPLETION be one of the bits. That way, we can use the other bits for something else in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not against using fi_context
buffer to store some metadata like op flags and remote addr, but I am not sure how useful it is to store FI_COMPLETION
, as the man page said
If an operation does not generate a completion (i.e. the endpoint was configured with FI_SELECTIVE_COMPLETION and the operation was not initiated with the FI_COMPLETION flag) then the context parameter is ignored by the fabric provider
Rename efa_dgram_cq.c to efa_cq.c and move it to prov/efa/src as a common CQ interface for efa-raw’s rdm and dgram ep type.
Create efa_cq_progress and poll ibv cq directly instead of using efa_dgram_ep_progress.
Remove rcq and scq, write completion to util_ep.rx_cq and tx_cq.
Construct the error message for cq_err_entry.err_data.