
## **Buffers**

MPI distinguishes different types of buffers:

- variables
- user-level buffers
- hardware/system buffers

MPI implementations are excellent at tuning communication, i.e. at avoiding copies, but we have to assume that a message runs through all buffers on the sender side, then through the network, and then bottom-up through all buffers again on the receiver side. This means that Send and Recv are expensive operations.

Even worse, two concurrent sends might deadlock (but only for massive message counts or extremely large messages).

⇒ One way to deal with this is to allow MPI to optimize the messaging by giving both Send and Recv commands simultaneously — this is `MPI_Sendrecv`.



```cpp
int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status);
```

- Shortcut for a send followed by a receive
- Allows MPI to optimise aggressively
- Anticipates that many applications have dedicated compute and data exchange phases

⇒ Does not really solve our efficiency concerns, just weakens them



### `MPI_Sendrecv` example

We have a program which **sends an `nentries`-length buffer** between two processes:
```cpp
if (rank == 0) {
MPI_Send(sendbuf, nentries, MPI_INT, 1, 0, ...);
  // ...
}
```

- Recall that `MPI_Send`
  - behaves like `MPI_Bsend` when buffer space is available,
  - behaves like `MPI_Ssend` when it is not.


```cpp
if (rank == 0) {
  // ...
}
```
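
For comparison, a minimal sketch of the same exchange written with `MPI_Sendrecv` (it assumes the two-rank setup and the `sendbuf`, `recvbuf`, `nentries` names from above; not the original lecture listing):

```cpp
// Each rank posts its send and its receive in a single call,
// so MPI can order the transfers and no deadlock can occur.
int partner = (rank == 0) ? 1 : 0;
MPI_Sendrecv(sendbuf, nentries, MPI_INT, partner, 0,
             recvbuf, nentries, MPI_INT, partner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```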

- Non-blocking commands start with I (immediate return), e.g. `MPI_Isend`, `MPI_Irecv`
- Non-blocking means that the operation returns immediately even though MPI might not have transferred the data yet (it might not even have started)
- The buffer is thus still in use and we may not overwrite it
- We explicitly have to check whether the message transfer has completed before we reuse or delete the buffer
```cpp
// Create helper variable (handle)
int a = 1;
// ...
a = 2;
```
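
A slightly fuller sketch of this pattern (destination rank and tag are illustrative; the point is that `a` may only be overwritten once the request has completed):

```cpp
int a = 1;
MPI_Request request;

// Post the non-blocking send; MPI may not even have started the transfer yet.
MPI_Isend(&a, 1, MPI_INT, /* dest */ 1, /* tag */ 0, MPI_COMM_WORLD, &request);

// ... do other work here, as long as it does not touch 'a' ...

// Complete the transfer before the buffer is reused.
MPI_Wait(&request, MPI_STATUS_IGNORE);
a = 2;   // safe now: the send has completed
```
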
## **Why non-blocking...?**

- Added flexibility of separating posting messages from receiving them.
⇒ MPI libraries often have optimisations to complete sends quickly if the matching receive already exists.

- Sending many messages to one process, which receives them all... (see the sketch below)



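A sketch of the "many messages to one process" pattern with non-blocking sends (the `data` array, the `receiver` rank and the message count are illustrative):

```cpp
const int N = 100;                       // illustrative message count
int data[N];                             // (fill data[] with the values to send)
MPI_Request requests[N];

// Post all sends without waiting in between ...
for (int k = 0; k < N; k++) {
  MPI_Isend(&data[k], 1, MPI_INT, receiver, /* tag */ k,
            MPI_COMM_WORLD, &requests[k]);
}
// ... and complete them in one go; only now may 'data' be reused.
MPI_Waitall(N, requests, MPI_STATUSES_IGNORE);
```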


## `Isend` **&** `Irecv`
- Non-blocking variants of `MPI_Send` and `MPI_Recv`
- Return immediately, but *the buffer is not safe to reuse*

```cpp
int MPI_Isend(const void *buffer, int count, MPI_Datatype dtype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request);
int MPI_Irecv(void *buffer, int count, MPI_Datatype dtype,
              int source, int tag, MPI_Comm comm, MPI_Request *request);
```
- Note the `request` in the send, and the lack of status in recv
- We need to process that `request` before we can reuse the buffers
```cpp
int MPI_Wait(MPI_Request *request, MPI_Status *status);
```

- Pass an additional pointer to an object of type `MPI_Request`.
- Non-blocking, i.e. the operation returns immediately.
- Check for send completion with `MPI_Wait` or `MPI_Test`.
- `MPI_Irecv` is analogous.
- The status object is not required for the receive call, as we have to hand it over to **wait or test later**.



```cpp
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
```
- `flag` will be true (an `int` of value 1) if the provided request has been completed, and false otherwise.
  - true: the request has completed
  - false: not yet completed
- If we don’t want to (repeatedly) test for completion, we can instead call `MPI_Wait`...
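
A typical use of `MPI_Test` is to poll for completion while doing other work in between (a sketch; `request` is assumed to come from an earlier `MPI_Isend`/`MPI_Irecv`, and `do_other_work()` is a hypothetical helper):

```cpp
int flag = 0;
MPI_Status status;
while (!flag) {
  // Check whether the pending operation has finished ...
  MPI_Test(&request, &flag, &status);
  // ... and overlap communication with computation while it has not.
  if (!flag) {
    do_other_work();
  }
}
```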
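
For reference, a sketch of what Variants A–C of the ring exchange look like (a reconstruction, not the original listings; `rank`, `size`, the neighbour computation and the buffer names are illustrative):

```cpp
int left   = (rank - 1 + size) % size;   // illustrative neighbour ranks
int right  = (rank + 1) % size;
int buffer1[10], buffer2[10];
MPI_Status status;

// The three variants below are alternatives, not meant to run back-to-back.

// Variant A: blocking receive first, then blocking send.
MPI_Recv(buffer1, 10, MPI_INT, left,  0, MPI_COMM_WORLD, &status);
MPI_Send(buffer2, 10, MPI_INT, right, 0, MPI_COMM_WORLD);

// Variant B: blocking send first, then blocking receive.
MPI_Send(buffer2, 10, MPI_INT, right, 0, MPI_COMM_WORLD);
MPI_Recv(buffer1, 10, MPI_INT, left,  0, MPI_COMM_WORLD, &status);

// Variant C: non-blocking, with the waits (deliberately) left commented out.
MPI_Request request1, request2;
MPI_Irecv(buffer1, 10, MPI_INT, left,  0, MPI_COMM_WORLD, &request1);
MPI_Isend(buffer2, 10, MPI_INT, right, 0, MPI_COMM_WORLD, &request2);
//MPI_Wait(&request1, &status);
//MPI_Wait(&request2, &status);
```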

- Does Variant A deadlock? *Yes!* `MPI_Recv` is always blocking (every rank waits in the receive).
- Does Variant B deadlock? Not for only 10 integers (as long as not too many messages were sent before); for large message counts or sizes it would.
- Does Variant C deadlock? Is it correct? Is it fast? May we add additional operations before the first wait? It does not deadlock and it is fast, but it is not correct until the commented-out waits are added back.



## **Definition: collective**

> Collective operation: A collective (MPI) operation is an operation involving many/all nodes/ranks.

- In MPI, a collective operation involves all **ranks** of one communicator (introduced later)
- For `MPI_COMM_WORLD`, a collective operation involves all ranks
- Collectives are **blocking** (though the newer (>=3.1) MPI standard introduces non-blocking collectives)
- Blocking collectives always **synchronise all ranks**, i.e. all ranks have to enter the same collective instruction before any rank proceeds



```cpp
// ...
MPI_Send(&a,1,MPI_DOUBLE,0, ...);
```
What type of collective operation is realised here? (a reduction)
```cpp
double globalSum;
MPI_Reduce(&a, &globalSum, 1,
           MPI_DOUBLE, MPI_SUM, /* root */ 0, MPI_COMM_WORLD);
```

## **Flavours of collective operations in MPI**

| Type of collective | One-to-all | All-to-one | All-to-all |
| ------------------ | ------------------ | ---------- | ---------- |
| Synchronisation | Barrier | | |
| Communication | Broadcast, Scatter | Gather | Allgather |
| Computation | | Reduce | Allreduce |

Insert the following MPI operations into the table (MPI prefix and signature neglected):

- Barrier
- Broadcast
- Reduce
- Allgather
- Scatter
- Scatter
- Gather
- Allreduce


![image-20241127021620338](https://wichaiblog-1316355194.cos.ap-hongkong.myqcloud.com/image-20241127021620338.png)

- Simplicity of code
- Performance through specialised implementations
- Support through dedicated hardware (cf. BlueGene’s three network topologies: clique, fat tree, ring)





## `MPI_Barrier`
- Simplest form of collective operation — synchronisation of all ranks in the communicator.

- Rarely used:

⇒ `MPI_Barrier` doesn’t synchronise non-blocking calls

⇒ Really meant for telling MPI about calls *outside* MPI, like I/O

```cpp
int rank, size;
// ...
for ( int ii = 0; ii < size; ++ii ) {
  // ...
}
```
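
A minimal self-contained sketch of this idiom, assuming the loop above is used to sequentialise the (non-MPI) output calls:

```cpp
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  for (int ii = 0; ii < size; ++ii) {
    // Everyone meets at the barrier; only the rank whose turn it is prints.
    MPI_Barrier(MPI_COMM_WORLD);
    if (ii == rank) {
      printf("Hello from rank %d of %d\n", rank, size);
    }
  }
  MPI_Finalize();
  return 0;
}
```

Note that MPI itself gives no guarantee about how the ranks' stdout streams are merged, so the printed ordering is only best effort.
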
## `MPI_Bcast` **&** `MPI_Scatter`
- `MPI_Bcast` sends the contents of a buffer from root to all other processes.
- `MPI_Scatter` sends *parts* of a buffer from root to different processes.
- `MPI_Bcast` is the inverse of `MPI_Reduce`
- `MPI_Scatter` is the inverse of `MPI_Gather`
```cpp
// ...
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);
```
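
For comparison, a broadcast of the full buffer would look roughly like this (a sketch reusing the `sendbuf`, `root` and `comm` names from the snippet above):

```cpp
// root fills sendbuf; after the call every rank holds the same 100 integers.
MPI_Bcast(sendbuf, 100, MPI_INT, root, comm);
```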



## `MPI_Reduce` **&** `MPI_Gather`
- `MPI_Reduce` reduces a value across ranks to a single value on root using a prescribed reduction operator.
- `MPI_Gather` concatenates the array pieces from all processes onto the root process.


```cpp
// ...
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);
```
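
And a sketch of the corresponding `MPI_Reduce` call (array names are illustrative; `root` and `comm` as above):

```cpp
int values[100], sums[100];
// Element-wise sum over all ranks; the result ends up on root only.
MPI_Reduce(values, sums, 100, MPI_INT, MPI_SUM, root, comm);
```
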
## `MPI_Allgather` **&** `MPI_Allreduce`
- `MPI_Allgather` is an `MPI_Gather` which concatenates the array pieces on all processes.
- `MPI_Allreduce` is an `MPI_Reduce` which reduces on all processes.
```cpp
MPI_Comm comm;
// ...
```
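
A minimal `MPI_Allreduce` sketch (variable names are illustrative):

```cpp
double local = 1.0, global = 0.0;
MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
// 'global' now holds the sum of 'local' over all ranks, on every rank.
```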
