prov/efa: Implement the cq progress #10659

jiaxiyan · 2024-12-23T21:12:42Z

Rename efa_dgram_cq.c to efa_cq.c and move it to prov/efa/src as a common CQ interface for efa-raw’s rdm and dgram ep type.
Create efa_cq_progress and poll ibv cq directly instead of using efa_dgram_ep_progress.
Remove rcq and scq, write completion to util_ep.rx_cq and tx_cq.
Construct the error message for cq_err_entry.err_data.

shijin-aws

Some initial comments, will take a closer look later

prov/efa/src/efa_cq.c

shijin-aws · 2024-12-27T06:14:32Z

prov/efa/src/efa_cq.c

+
+	addr = efa_av_reverse_lookup(base_ep->av,
+				     ibv_wc_read_slid(ibv_cq_ex),
+				     ibv_wc_read_src_qp(ibv_cq_ex));


This is wrong, ibv_wc_read_src_qp only applies to the src addr, which means it only applies to the rx operations (recv or recv imm). We need to determine the destination qpn for a send call. If there is an rdma-core API that does it, we should use it. If it doesn't exist, we need a plan to how to get such addr.

I can't find one API for this. Can we save the msg->addr to be part of fi_context?

we can, but that doesn't work for inject calls where you don't have a fi_context.

prov/efa/src/efa_cq.c

shijin-aws · 2024-12-27T06:27:41Z

prov/efa/src/efa_cq.c

+
+void efa_cq_progress(struct util_cq *cq)
+{
+	efa_cq_poll_ibv_cq(efa_env.efa_cq_read_size, cq);


As I mentioned earlier, we may need a domain level lock here to protect the qp_table which is also accessed by the ep_enable and ep_close calls. But I am ok with making it as part of a follow up change when we implement the ep interface for efa raw.

shijin-aws · 2024-12-27T06:34:11Z

prov/efa/src/efa_cq.c

+	}
+
+	if (cq_entry->flags != 0)
+		efa_cntr_report_rx_completion(&base_ep->util_ep, cq_entry->flags);


Same comments, IIRC cntr should always be updated no matter a cq entry is written or not. (Double check man page please)

Man page says Counters record the number of requested operations that have completed. Does that mean update cntr for no completion(null wr_id) requests too?

Yes, look at the rdm code, it update cntr for inject calls as well where we don't write cqe

Looks like rdm does not update cntr if ofi_cq_write fails

That is a bug then

prov/efa/src/efa_cq.c

prov/efa/src/efa_msg.c

prov/efa/src/efa_cq.c

prov/efa/src/efa_cq.h

prov/efa/src/efa_cq.c

prov/efa/src/efa_cq.h

shijin-aws · 2024-12-29T23:43:22Z

prov/efa/src/efa_cq.h

+		return -FI_ENOBUFS;
+	}
+
+	*buf = err_msg;


Are you sure this will work? Your err_msg will be released when the function returns as it is declared within the stack.

I guess it might cause undefined behavior. I will move err_msg to be a field of base_ep then.

Do you have to use a persistent err_msg in base_ep ? If you read the ofi_cq_write_err code, you will find it will mem_dup the err_data you filled in entry. In other words, the err_data you passed can be safely released after ofi_cq_write_err returns. Just use a temporary char array in your code block that write the error cq.

Move efa_use_unsolicited_write_recv to efa.h so they can be used outside rdm. Signed-off-by: Jessie Yang <[email protected]>

shijin-aws · 2024-12-30T22:38:59Z

prov/efa/src/efa_base_ep.h

@@ -65,6 +66,7 @@ struct efa_base_ep {
 	/* Only used by RDM ep type */
 	struct efa_qp *user_recv_qp; /* Separate qp to receive pkts posted by users */
 	struct efa_recv_wr *user_recv_wr_vec;
+	char err_msg[EFA_ERROR_MSG_BUFFER_LENGTH]; /* A large enough buffer to store CQ/EQ error data used by e.g. fi_cq_readerr */


Paste my earlier comment here again

Do you have to use a persistent err_msg in base_ep ? If you read the ofi_cq_write_err code, you will find it will mem_dup the err_data you filled in entry. In other words, the err_data you passed can be safely released after ofi_cq_write_err returns. Just use a temporary char array in your code block that write the error cq.

Do you prefer to remove err_msg from rdm_ep too?

prov/efa/src/efa_cq.c

Move efa_rdm_ep_raw_addr_str, efa_rdm_ep_get_peer_raw_addr, efa_rdm_ep_get_peer_raw_addr_str to base_ep.h so they can be used by efa-raw. Signed-off-by: Jessie Yang <[email protected]>

shijin-aws · 2024-12-31T01:42:56Z

prov/efa/src/rdm/efa_rdm_cq.c

@@ -348,11 +348,11 @@ static void efa_rdm_cq_handle_recv_completion(struct efa_ibv_cq *ibv_cq, struct
 	 * QP and we cannot cancel that.
 	 */
 	if (OFI_UNLIKELY(ep->use_zcpy_rx && efa_rdm_pkt_type_is_rtm(pkt_type))) {
-		void *errbuf;
+		void *errbuf = {0};


Shouldn't it be a new err_msg[EFA_ERROR_MSG_BUFFER_LENGTH] here?

prov/efa/src/rdm/efa_rdm_ope.c

Use a temporary char array to hold error message. err_data can be safely released after ofi_cq_write_err returns. No need to use a persistent field in ep. Signed-off-by: Jessie Yang <[email protected]>

Rename efa_dgram_cq.c to efa_cq.c and move it to prov/efa/src as a common CQ interface for efa-raw’s rdm and dgram ep type. Create efa_cq_progress and poll ibv cq directly instead of using efa_dgram_ep_progress. Remove rcq and scq, write completion to util_ep.rx_cq and tx_cq. Construct the error message for cq_err_entry.err_data. Signed-off-by: Jessie Yang <[email protected]>

Signed-off-by: Jessie Yang <[email protected]>

shijin-aws · 2024-12-31T17:59:51Z

prov/efa/src/efa_cq.c

+		return -FI_ENOMEM;
+
+	err = ofi_cq_init(&efa_prov, domain_fid, attr, &cq->util_cq,
+			  &efa_cq_progress, context);


You need a similar change for efa_cntr: https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/efa_cntr.c#L200-L202. Since you make dgram_ep_progress as a no op, ofi_cntr_progress will be a no-op too as it call ep progress in the same way as ofi_cq_progress does. You need a efa_cntr_progress that polls ibv_cq as well

sunkuamzn · 2024-12-31T19:54:08Z

prov/efa/src/efa_av.c

@@ -53,7 +53,7 @@ struct efa_conn *efa_av_addr_to_conn(struct efa_av *av, fi_addr_t fi_addr)
 	struct util_av_entry *util_av_entry;
 	struct efa_av_entry *efa_av_entry;

-	if (OFI_UNLIKELY(fi_addr == FI_ADDR_UNSPEC))
+	if (OFI_UNLIKELY(fi_addr == FI_ADDR_UNSPEC || fi_addr == FI_ADDR_NOTAVAIL))


Can you explain this change?

This function is called inefa_base_ep_get_peer_raw_addr. We temporarily return FI_ADDR_NOTAVAIL as peer addr for TX operation in efa_cq_handle_error.

sunkuamzn · 2024-12-31T20:22:23Z

prov/efa/src/efa_cq.c

+				       cq_entry->buf, cq_entry->data,
+				       cq_entry->tag, FI_ADDR_NOTAVAIL);
+	else
+		ret = ofi_cq_write(tx_cq, cq_entry->op_context, cq_entry->flags,


Looks like there are still two copies one after the other in the CQ read path. Can we eliminate one of the copies?

The CQ read path is (correct me if I'm wrong)

Application calls fi_cq_read which we set to ofi_cq_read

ofi_cq_read -> ofi_cq_readfrom

ofi_cq_readfrom first calls cq->progress(cq) which we set to efa_cq_progress -> efa_cq_poll_ibv_cq -> poll IBV CQ -> ofi_cq_write and we copy the IBV CQ completion into the Libfabric CQ circular queue

ofi_cq_readfrom then calls ofi_cq_read_entries and we copy the CQ entry from Libfabric CQ circular queue to the application CQ entry buffer

If the progress was more complicated (progress other CQs or EPs for protocol reasons), then we need the circular queue. But in this case, we really don't.

Application calls fi_cq_read with a count. We can stop polling after we get count completions. rdma-core is already buffering the CQ entries, so we can just take advantage of that.

If the progress was more complicated (progress other CQs or EPs for protocol reasons), then we need the circular queue. But in this case, we really don't.

I don't think it matters with how complicated the protocol is. For protocol case, we only write ofi cq when an operation has completed, which may involve multiple ibv cqes. You can always customize a function that propagate the completion to the buf in the fi_cq_read directly to achieve best performance.

It seems the current dgram code was kind of designed in this direction, see https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/dgram/efa_dgram_cq.c#L161. But then it rewrite the cq entries in buf back to ofi cq https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/dgram/efa_dgram_ep.c#L319-L338. We may reuse the dgram code pattern but remove that ofi_cq_write* phase.

One problem I think of is that, if you want to get rid of libfabric's cirque, there is no way for you to make a progress engine while staging the cqes. IIRC fi_cntr_read needs to progress the ep for manual progress model, and that progress function needs to poll ibv cq to update the cntr while staging the cqes for another fi_cq_read to read.

See https://ofiwg.github.io/libfabric/main/man/fi_domain.3.html

FI_PROGRESS_MANUAL
This progress model indicates that the provider requires the use of an application thread to complete an asynchronous request. When manual progress is set, the provider will attempt to advance an asynchronous operation forward when the application attempts to wait on or read an event queue, completion queue, or counter where the completed operation will be reported. Progress also occurs when the application processes a poll or wait set that has been associated with the event or completion queue.

If we want to support the case where application use both fi_cq_read and fi_cntr_read to progress the operations, we still need to use cirque in util cq.

sunkuamzn · 2024-12-31T20:37:43Z

prov/efa/src/efa_msg.c

 	wr->num_sge = msg->iov_count;
 	wr->sg_list = base_ep->efa_recv_wr_vec[wr_index].sge;
+	wr->wr_id = (uintptr_t) ((flags & FI_COMPLETION) ? msg->context : NULL);


I discussed with Jessie. According to the docs for FI_CONTEXT

This structure should be treated as opaque to the application. For performance reasons, this structure must be allocated by the user, but may be used by the fabric provider to track the operation.

So we can use this buffer however we want. I would suggest setting it up as a bitmask and have FI_COMPLETION be one of the bits. That way, we can use the other bits for something else in the future.

I am not against using fi_context buffer to store some metadata like op flags and remote addr, but I am not sure how useful it is to store FI_COMPLETION, as the man page said

If an operation does not generate a completion (i.e. the endpoint was configured with FI_SELECTIVE_COMPLETION and the operation was not initiated with the FI_COMPLETION flag) then the context parameter is ignored by the fabric provider

jiaxiyan force-pushed the efa_cq branch 8 times, most recently from 89e563a to 5481222 Compare December 27, 2024 01:08