Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/efa: Implement the cq progress #10659

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

jiaxiyan
Copy link
Contributor

Rename efa_dgram_cq.c to efa_cq.c and move it to prov/efa/src as a common CQ interface for efa-raw’s rdm and dgram ep type.
Create efa_cq_progress and poll ibv cq directly instead of using efa_dgram_ep_progress.
Remove rcq and scq, write completion to util_ep.rx_cq and tx_cq.
Construct the error message for cq_err_entry.err_data.

@jiaxiyan jiaxiyan force-pushed the efa_cq branch 8 times, most recently from 89e563a to 5481222 Compare December 27, 2024 01:08
Copy link
Contributor

@shijin-aws shijin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments, will take a closer look later

prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved

addr = efa_av_reverse_lookup(base_ep->av,
ibv_wc_read_slid(ibv_cq_ex),
ibv_wc_read_src_qp(ibv_cq_ex));
Copy link
Contributor

@shijin-aws shijin-aws Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong, ibv_wc_read_src_qp only applies to the src addr, which means it only applies to the rx operations (recv or recv imm). We need to determine the destination qpn for a send call. If there is an rdma-core API that does it, we should use it. If it doesn't exist, we need a plan to how to get such addr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find one API for this. Can we save the msg->addr to be part of fi_context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can, but that doesn't work for inject calls where you don't have a fi_context.

prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved

void efa_cq_progress(struct util_cq *cq)
{
efa_cq_poll_ibv_cq(efa_env.efa_cq_read_size, cq);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned earlier, we may need a domain level lock here to protect the qp_table which is also accessed by the ep_enable and ep_close calls. But I am ok with making it as part of a follow up change when we implement the ep interface for efa raw.

}

if (cq_entry->flags != 0)
efa_cntr_report_rx_completion(&base_ep->util_ep, cq_entry->flags);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments, IIRC cntr should always be updated no matter a cq entry is written or not. (Double check man page please)

Copy link
Contributor Author

@jiaxiyan jiaxiyan Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Man page says Counters record the number of requested operations that have completed. Does that mean update cntr for no completion(null wr_id) requests too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, look at the rdm code, it update cntr for inject calls as well where we don't write cqe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like rdm does not update cntr if ofi_cq_write fails

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a bug then

@shijin-aws shijin-aws self-requested a review December 27, 2024 06:36
prov/efa/src/efa_msg.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.h Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.h Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.h Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.h Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.h Outdated Show resolved Hide resolved
prov/efa/src/efa_cq.h Outdated Show resolved Hide resolved
return -FI_ENOBUFS;
}

*buf = err_msg;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this will work? Your err_msg will be released when the function returns as it is declared within the stack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it might cause undefined behavior. I will move err_msg to be a field of base_ep then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have to use a persistent err_msg in base_ep ? If you read the ofi_cq_write_err code, you will find it will mem_dup the err_data you filled in entry. In other words, the err_data you passed can be safely released after ofi_cq_write_err returns. Just use a temporary char array in your code block that write the error cq.

@jiaxiyan jiaxiyan force-pushed the efa_cq branch 2 times, most recently from 8beb420 to 5797b52 Compare December 30, 2024 21:02
Move efa_use_unsolicited_write_recv to efa.h
so they can be used outside rdm.

Signed-off-by: Jessie Yang <[email protected]>
@@ -65,6 +66,7 @@ struct efa_base_ep {
/* Only used by RDM ep type */
struct efa_qp *user_recv_qp; /* Separate qp to receive pkts posted by users */
struct efa_recv_wr *user_recv_wr_vec;
char err_msg[EFA_ERROR_MSG_BUFFER_LENGTH]; /* A large enough buffer to store CQ/EQ error data used by e.g. fi_cq_readerr */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paste my earlier comment here again

Do you have to use a persistent err_msg in base_ep ? If you read the ofi_cq_write_err code, you will find it will mem_dup the err_data you filled in entry. In other words, the err_data you passed can be safely released after ofi_cq_write_err returns. Just use a temporary char array in your code block that write the error cq.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you prefer to remove err_msg from rdm_ep too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

prov/efa/src/efa_cq.c Outdated Show resolved Hide resolved
Move efa_rdm_ep_raw_addr_str, efa_rdm_ep_get_peer_raw_addr,
efa_rdm_ep_get_peer_raw_addr_str to base_ep.h so they can be used
by efa-raw.

Signed-off-by: Jessie Yang <[email protected]>
@@ -348,11 +348,11 @@ static void efa_rdm_cq_handle_recv_completion(struct efa_ibv_cq *ibv_cq, struct
* QP and we cannot cancel that.
*/
if (OFI_UNLIKELY(ep->use_zcpy_rx && efa_rdm_pkt_type_is_rtm(pkt_type))) {
void *errbuf;
void *errbuf = {0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be a new err_msg[EFA_ERROR_MSG_BUFFER_LENGTH] here?

Use a temporary char array to hold error message. err_data can be
safely released after ofi_cq_write_err returns. No need to use a
persistent field in ep.

Signed-off-by: Jessie Yang <[email protected]>
Rename efa_dgram_cq.c to efa_cq.c and move it to prov/efa/src as a
common CQ interface for efa-raw’s rdm and dgram ep type.
Create efa_cq_progress and poll ibv cq directly instead of using
efa_dgram_ep_progress.
Remove rcq and scq, write completion to util_ep.rx_cq and tx_cq.
Construct the error message for cq_err_entry.err_data.

Signed-off-by: Jessie Yang <[email protected]>
return -FI_ENOMEM;

err = ofi_cq_init(&efa_prov, domain_fid, attr, &cq->util_cq,
&efa_cq_progress, context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need a similar change for efa_cntr: https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/efa_cntr.c#L200-L202. Since you make dgram_ep_progress as a no op, ofi_cntr_progress will be a no-op too as it call ep progress in the same way as ofi_cq_progress does. You need a efa_cntr_progress that polls ibv_cq as well

@@ -53,7 +53,7 @@ struct efa_conn *efa_av_addr_to_conn(struct efa_av *av, fi_addr_t fi_addr)
struct util_av_entry *util_av_entry;
struct efa_av_entry *efa_av_entry;

if (OFI_UNLIKELY(fi_addr == FI_ADDR_UNSPEC))
if (OFI_UNLIKELY(fi_addr == FI_ADDR_UNSPEC || fi_addr == FI_ADDR_NOTAVAIL))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called inefa_base_ep_get_peer_raw_addr. We temporarily return FI_ADDR_NOTAVAIL as peer addr for TX operation in efa_cq_handle_error.

cq_entry->buf, cq_entry->data,
cq_entry->tag, FI_ADDR_NOTAVAIL);
else
ret = ofi_cq_write(tx_cq, cq_entry->op_context, cq_entry->flags,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are still two copies one after the other in the CQ read path. Can we eliminate one of the copies?

The CQ read path is (correct me if I'm wrong)

  • Application calls fi_cq_read which we set to ofi_cq_read
  • ofi_cq_read -> ofi_cq_readfrom
    • ofi_cq_readfrom first calls cq->progress(cq) which we set to efa_cq_progress -> efa_cq_poll_ibv_cq -> poll IBV CQ -> ofi_cq_write and we copy the IBV CQ completion into the Libfabric CQ circular queue
    • ofi_cq_readfrom then calls ofi_cq_read_entries and we copy the CQ entry from Libfabric CQ circular queue to the application CQ entry buffer

If the progress was more complicated (progress other CQs or EPs for protocol reasons), then we need the circular queue. But in this case, we really don't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Application calls fi_cq_read with a count. We can stop polling after we get count completions. rdma-core is already buffering the CQ entries, so we can just take advantage of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the progress was more complicated (progress other CQs or EPs for protocol reasons), then we need the circular queue. But in this case, we really don't.

I don't think it matters with how complicated the protocol is. For protocol case, we only write ofi cq when an operation has completed, which may involve multiple ibv cqes. You can always customize a function that propagate the completion to the buf in the fi_cq_read directly to achieve best performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the current dgram code was kind of designed in this direction, see https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/dgram/efa_dgram_cq.c#L161. But then it rewrite the cq entries in buf back to ofi cq https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/dgram/efa_dgram_ep.c#L319-L338. We may reuse the dgram code pattern but remove that ofi_cq_write* phase.

Copy link
Contributor

@shijin-aws shijin-aws Jan 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem I think of is that, if you want to get rid of libfabric's cirque, there is no way for you to make a progress engine while staging the cqes. IIRC fi_cntr_read needs to progress the ep for manual progress model, and that progress function needs to poll ibv cq to update the cntr while staging the cqes for another fi_cq_read to read.

Copy link
Contributor

@shijin-aws shijin-aws Jan 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://ofiwg.github.io/libfabric/main/man/fi_domain.3.html

FI_PROGRESS_MANUAL
This progress model indicates that the provider requires the use of an application thread to complete an asynchronous request. When manual progress is set, the provider will attempt to advance an asynchronous operation forward when the application attempts to wait on or read an event queue, completion queue, or counter where the completed operation will be reported. Progress also occurs when the application processes a poll or wait set that has been associated with the event or completion queue.

If we want to support the case where application use both fi_cq_read and fi_cntr_read to progress the operations, we still need to use cirque in util cq.

wr->num_sge = msg->iov_count;
wr->sg_list = base_ep->efa_recv_wr_vec[wr_index].sge;
wr->wr_id = (uintptr_t) ((flags & FI_COMPLETION) ? msg->context : NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed with Jessie. According to the docs for FI_CONTEXT

This structure should be treated as opaque to the application. For performance reasons, this structure must be allocated by the user, but may be used by the fabric provider to track the operation.

So we can use this buffer however we want. I would suggest setting it up as a bitmask and have FI_COMPLETION be one of the bits. That way, we can use the other bits for something else in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not against using fi_context buffer to store some metadata like op flags and remote addr, but I am not sure how useful it is to store FI_COMPLETION, as the man page said

If an operation does not generate a completion (i.e. the endpoint was configured with FI_SELECTIVE_COMPLETION and the operation was not initiated with the FI_COMPLETION flag) then the context parameter is ignored by the fabric provider

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants