Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specific communicator for neighborhood communication #1588

Open
wants to merge 14 commits into
base: index-map-pgm
Choose a base branch
from

Conversation

MarcelKoch
Copy link
Member

@MarcelKoch MarcelKoch commented Apr 4, 2024

This PR adds a communicator that only handles neighborhood all-to-all communication. It implements the new interface collective_communicator, which provides different implementations for a selected set of collective mpi routines. Currently, this only includes the non-blocking all-to-all.

The communication uses a fixed pattern, i.e. the send/recv sizes are fixed when the neighborhood communicator is constructed. I would have liked to decouple that, but this would require some knowledge of how the sizes are stored at the interface level. If someone has an idea for that, please let me know.

This is the first part of splitting up #1546.

The neighborhood all-to-all has a bug in OpenMPI < v4.1.0, which makes it necessary to disable the neighborhood communicator in this case. As replacement, there is also a dense all-to-all communicator.

Todo:

  • documentation

PR Stack:

@MarcelKoch MarcelKoch self-assigned this Apr 4, 2024
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. reg:testing This is related to testing. mod:core This is related to the core module. labels Apr 4, 2024
@MarcelKoch MarcelKoch requested a review from pratikvn April 4, 2024 10:41
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 6acf7c4 to 8aa6ab9 Compare April 4, 2024 11:00
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from b42ab92 to 8f104fd Compare April 4, 2024 11:00
@MarcelKoch MarcelKoch modified the milestone: Ginkgo 1.8.0 Apr 5, 2024
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 8aa6ab9 to 77398bd Compare April 17, 2024 16:28
@MarcelKoch MarcelKoch requested a review from upsj April 19, 2024 09:19
@MarcelKoch MarcelKoch mentioned this pull request Apr 19, 2024
7 tasks
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 77398bd to d278cad Compare April 19, 2024 14:39
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from 8f104fd to a0824a8 Compare April 19, 2024 14:39
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from d278cad to d6112ef Compare April 22, 2024 11:11
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from a0824a8 to 8ad3f2f Compare April 22, 2024 11:11
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from d6112ef to 1582673 Compare April 25, 2024 07:16
@MarcelKoch MarcelKoch mentioned this pull request Apr 30, 2024
6 tasks
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 1582673 to db9b48a Compare April 30, 2024 13:41
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from 8ad3f2f to 26678b3 Compare April 30, 2024 13:41
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from db9b48a to 72eafff Compare April 30, 2024 15:20
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from 26678b3 to 006d67d Compare April 30, 2024 15:20
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 72eafff to 3c70106 Compare May 2, 2024 10:04
@MarcelKoch MarcelKoch force-pushed the read-distributed-with-index-map branch from 006d67d to b295b11 Compare May 2, 2024 10:04
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 3c70106 to a1567b8 Compare May 2, 2024 10:06
@MarcelKoch MarcelKoch requested a review from upsj August 27, 2024 12:05
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 1f49b91 to 4db050c Compare October 7, 2024 13:06
Copy link
Member

@upsj upsj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, mostly interface and implementation

  • What is the moved-from state of a Communicator? Should it match that of an MPI communicator wrapper, and be MPI_NULL, or preserve the MPI communicator?

core/distributed/dense_communicator.cpp Outdated Show resolved Hide resolved
core/distributed/dense_communicator.cpp Outdated Show resolved Hide resolved
core/distributed/neighborhood_communicator.cpp Outdated Show resolved Hide resolved
@MarcelKoch
Copy link
Member Author

@upsj I think the moved-from that should be using MPI_COMM_NULL as communicator. Then trying to launch a communication will fail quite badly.

@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 4db050c to 1ebe59f Compare October 23, 2024 13:32
@MarcelKoch MarcelKoch force-pushed the index-map-pgm branch 2 times, most recently from 604a6e9 to ba0982e Compare October 23, 2024 15:13
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from 1ebe59f to e7d32a1 Compare October 30, 2024 15:10
@MarcelKoch MarcelKoch modified the milestones: Ginkgo 1.9.0, Ginkgo 1.10.0 Dec 9, 2024
Comment on lines -111 to +127
#cmakedefine GINKGO_FORCE_SPMV_BLOCKING_COMM
#cmakedefine01 GINKGO_HAVE_OPENMPI_PRE_4_1_X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public interface break?
but I might also consider it is under experimental feature

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add the old macro again, but I'm not really sure how we usually consider macros for interface stability. IMO we either should state which macros are public, and which are private. Because this macro should definitely be private.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of them I might consider it is public like version and propabably mixed precision.
This is only used in the experimental feature, so I also think it is free to change

Comment on lines 139 to 135
} // namespace gko

#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} // namespace gko
#endif
} // namespace gko
#endif

Comment on lines 46 to 49
* Default constructor with empty communication pattern
* @param base the base communicator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Default constructor with empty communication pattern
* @param base the base communicator
* Default constructor with empty communication pattern
*
* @param base the base communicator

Comment on lines 44 to 48
* @tparam RecvType the type of the elements to receive
* @param exec the executor for the communication
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @tparam RecvType the type of the elements to receive
* @param exec the executor for the communication
* @tparam RecvType the type of the elements to receive
*
* @param exec the executor for the communication

Comment on lines 42 to 45
* Default constructor with empty communication pattern
* @param base the base communicator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Default constructor with empty communication pattern
* @param base the base communicator
* Default constructor with empty communication pattern
*
* @param base the base communicator

Comment on lines 55 to 59
* @tparam GlobalIndexType the global index type of the map
* @param base the base communicator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @tparam GlobalIndexType the global index type of the map
* @param base the base communicator
* @tparam GlobalIndexType the global index type of the map
*
* @param base the base communicator

Comment on lines +94 to +89
* Equality is defined as having identical or congruent communicators and
* their communication pattern is equal. No communication is done, i.e.
* there is no reduction over the local equality check results.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Equality is defined as having identical or congruent communicators and
* their communication pattern is equal. No communication is done, i.e.
* there is no reduction over the local equality check results.
* Equality is defined as having identical or congruent communicators and
* their communication pattern is equal. There is no communication need in this function, i.e.
* there is no reduction over the local equality check results.

core/test/mpi/distributed/dense_communicator.cpp Outdated Show resolved Hide resolved
MarcelKoch and others added 13 commits December 18, 2024 10:27
- fix include guards
- update docs
- implement copy/move constructors/assignment with tests
- add equality test for collective communicators (needed for testing)
- always enable neighborhood comm, just throw if openmpi is too old
- define moved-from state as MPI_COMM_NULL + empty sizes/offsets
- remove unnecessary namespace
- make virtual function protected

Co-authored-by: Pratik Nayak <[email protected]>
Co-authored-by: Tobias Ribizel <[email protected]>
- documentation
- formatting

Co-authored-by: Yu-Hsiang M. Tsai <[email protected]>
Signed-off-by: Marcel Koch <[email protected]>
@MarcelKoch MarcelKoch force-pushed the neighborhood-communicator branch from ceb6f2e to 807118c Compare December 18, 2024 09:27
- merge tests

Co-authored-by: Yu-Hsiang M. Tsai <[email protected]>
Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general.

Comment on lines +108 to +109
* This implementation uses the neighborhood communication
* MPI_Ineighbor_alltoallv. See MPI documentation for more details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong one

Comment on lines +50 to +51
* @param recv_buffer the receive buffer
* @return a request handle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param recv_buffer the receive buffer
* @return a request handle
* @param recv_buffer the receive buffer
*
* @return a request handle

Comment on lines +56 to +59
std::partial_sum(send_sizes_.begin(), send_sizes_.end(),
send_offsets_.begin() + 1);
std::partial_sum(recv_sizes_.begin(), recv_sizes_.end(),
recv_offsets_.begin() + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remind me this behavior?
the constructor with the default value T() is until c++11, but the others will rely on the allocator, which gives the uninitialized storage?

Comment on lines +41 to +43
* The send_buffer must have allocated at least get_send_size number of
* elements, and the recv_buffer must have allocated at least get_recv_size
* number of elements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it only accepts the plain pointer.
Should we use the gko::array to ensure the size fulfill the requirement?

include/ginkgo/core/base/mpi.hpp Show resolved Hide resolved
Comment on lines +170 to +171
ASSERT_TRUE(moved == spcomm);
ASSERT_TRUE(moved_from == empty_comm);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ASSERT_TRUE(moved == spcomm);
ASSERT_TRUE(moved_from == empty_comm);
ASSERT_EQ(moved, spcomm);
ASSERT_EQ(moved_from, empty_comm);

Comment on lines +185 to +186
ASSERT_TRUE(moved == spcomm);
ASSERT_TRUE(moved_from == empty_comm);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ASSERT_TRUE(moved == spcomm);
ASSERT_TRUE(moved_from == empty_comm);
ASSERT_EQ(moved, spcomm);
ASSERT_EQ(moved_from, empty_comm);

using communicator_type = typename TestFixture::communicator_type;
using part_type = typename TestFixture::part_type;
using map_type = typename TestFixture::map_type;
auto part = gko::share(part_type::build_from_global_size_uniform(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use create_default_comm()?

using communicator_type = typename TestFixture::communicator_type;
using part_type = typename TestFixture::part_type;
using map_type = typename TestFixture::map_type;
auto part = gko::share(part_type::build_from_global_size_uniform(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use create_default_comm()?


auto req = spcomm.i_all_to_all_v(this->ref, static_cast<int*>(nullptr),
static_cast<int*>(nullptr));
req.wait();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
req.wait();
ASSERT_NO_THROW(req.wait());

maybe?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:ready-for-review This PR is ready for review mod:core This is related to the core module. reg:build This is related to the build system. reg:testing This is related to testing.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants