[WIP] Implement IPC for pyspark. #11564
base: branch-22.10
Conversation
{
  if (res != CUDA_SUCCESS) {
    char const* msg;
    if (cuGetErrorString(res, &msg) != CUDA_SUCCESS) {
Just FYI, calls to the CUDA driver library will have to be via dlopen after #11370.
Thanks for the reference, will look into creating a dlopen wrapper.
This might be a little bit trickier than I thought. With dlopen, I need to initialize the CUDA context myself.
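For context, a minimal sketch of what a dlopen-based driver wrapper could look like, restricted to the calls touched in this diff; the soname, the symbol set, and the error handling here are illustrative assumptions, not the design this PR will necessarily use:

```cpp
// Sketch of a dlopen-based wrapper for CUDA driver calls (illustrative only).
// Resolves the driver entry points at runtime instead of linking against libcuda.
#include <cuda.h>
#include <dlfcn.h>
#include <stdexcept>

struct cuda_driver {
  CUresult (*cuInit)(unsigned int)                     = nullptr;
  CUresult (*cuGetErrorString)(CUresult, char const**) = nullptr;

  cuda_driver()
  {
    handle_ = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
    if (handle_ == nullptr) { throw std::runtime_error("failed to dlopen libcuda.so.1"); }
    cuInit           = reinterpret_cast<decltype(cuInit)>(dlsym(handle_, "cuInit"));
    cuGetErrorString = reinterpret_cast<decltype(cuGetErrorString)>(dlsym(handle_, "cuGetErrorString"));
    // cuInit only initializes the driver API. A CUDA context still has to be
    // current on the calling thread before memory-related driver calls (e.g. the
    // runtime's primary context, or one retained via cuDevicePrimaryCtxRetain),
    // which is the extra step mentioned in the comment above.
    if (cuInit == nullptr || cuInit(0) != CUDA_SUCCESS) {
      throw std::runtime_error("failed to initialize the CUDA driver");
    }
  }

  ~cuda_driver()
  {
    if (handle_ != nullptr) { dlclose(handle_); }
  }

 private:
  void* handle_ = nullptr;
};
```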
I'm trying to wrap my head around the big picture of why this code is needed in the first place.
There is a lot of new code here for IPC, but we've been supporting CUDA IPC for years already in Python/UCX without needing any of the code here. So what has changed?
This is for an optimization of pyspark user-defined functions. When running a UDF, Spark needs to copy the data from device to host and from the Java process to the Python process. But both processes are using the same device, hence we want to bypass all the copies to save as much memory and time as possible. Ideally we would patch Arrow to handle these requirements, but that involves a long chain of actions (DataFrame, Arrow table, batches, CUDA buffer, IPC handles, then another sequence of chunked readers) and a couple of data copies to achieve the transfer. Also, Arrow GPU would be required.
Nothing about Arrow should require data copies here unless you want to use the Arrow IPC format. A libcudf table should be able to be reinterpreted as an Arrow RecordBatch zero copy. Regardless, if you want to IPC each buffer separately, it doesn't matter whether you have a libcudf table or an Arrow RecordBatch: there isn't existing machinery to do this, nor does it probably make sense, as the overhead of opening all of the IPC handles to different buffers (we can't assume you have a memory pool, or that all buffers are backed by a single allocation in a pool) would be pretty significant.
Hi @trivialfis - before making further progress here, could we please conclude the discussion in #11514? My sense is that UCX can help abstract away some of the details relating to CUDA IPC, and could present an alternative solution to supporting IPC with pyspark. If not, we can always come back to this PR and work to get it merged.
@kkraus14 Thank you for the comment.
Yes, I want to use the IPC format along with Arrow functions to read and write it. This way I can avoid reinventing the wheel, as Arrow is the underlying format used by Spark/pyspark during the transfer.
The problem is mostly memory consumption and device <-> host memory copies. From benchmark results by our Spark team, the transfer time can be quite significant: 36 million rows with 37 columns of ints and floats (4 bytes per entry) takes 70 to 80 seconds. In comparison, using CUDA IPC handles is much cheaper and scales better with data size. As for the memory usage concern, that's mostly about using pyspark with XGBoost. In the past, memory usage has been the primary issue we are trying to address, so every single removed data copy is deeply appreciated.
UCX is a pretty large requirement for what really is the transfer of metadata that is opaque to cuDF and a single IPC handle. In addition to that, it imposes requirements and extra configs on the user, especially in Java land, where we have to pass environment variables to make sure it won't crash when loading (we need to set UCX_ERROR_SIGNALS to empty, or else the JVM's use of signals collides with UCX's error handlers, which leads to segfaults pretty regularly).
Maybe I am confused, but shouldn't this code call pack/unpack? @trivialfis did you try this already? It also means that ideally we shouldn't need to change cuDF at all: the thing that sends/receives this metadata buffer and the serialized IPC handle treats them as opaque data.
@wbo4958 tried pack and unpack without the RMM fix. I went with this per-column IPC to avoid the data concatenation and copy, due to memory constraints with XGBoost (an ML library that requires the full dataset to be available in GPU memory).
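For reference, a rough sketch of the pack/unpack route discussed above, assuming cudf::pack / cudf::unpack from <cudf/contiguous_split.hpp> (this is not what the PR implements); it concatenates all column buffers into one device allocation, which is exactly the copy the per-column approach tries to avoid:

```cpp
// Sketch of the pack/unpack route: serialize a table into one contiguous device
// buffer plus a host metadata blob, so a single IPC handle can cover everything.
#include <cudf/contiguous_split.hpp>
#include <cudf/table/table_view.hpp>
#include <cstdint>

// Producer side: copies all column buffers into a single device allocation.
cudf::packed_columns serialize(cudf::table_view const& input)
{
  return cudf::pack(input);  // packed_columns holds host metadata + one device buffer
}

// Consumer side: reconstruct a non-owning view over the received blobs.
// `metadata` (host) and `gpu_data` (device) are assumed to be the transferred buffers.
cudf::table_view deserialize(std::uint8_t const* metadata, std::uint8_t const* gpu_data)
{
  return cudf::unpack(metadata, gpu_data);
}
```

The upside of that route is that one IPC handle covers the whole table; the cost is a full-table copy on the producer side, which is what the memory constraint above rules out.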
Issue: #11514
Related: NVIDIA/spark-rapids#5561
This PR implements a pair of methods for exporting and importing the CUDA IPC memory handle. As described in the issue, this feature is mostly for pyspark, where data needs to be copied between two processes with different language environments.
This is still a work-in-progress PR seeking comments from maintainers. Please ignore unrelated changes in the build script.
In the implementation, I call into the CUDA driver API to work around the issue with IPC memory handles mentioned in NVIDIA/spark-rapids#5561 (comment). To briefly summarize the issue with RMM: the CUDA IPC functions return the same IPC handle for any memory region that shares the same base pointer (the pointer returned by cudaMalloc). For instance, ptr and ptr + 4 have the same IPC memory handle, which breaks the pool allocator. In this PR, I calculate the offset upon creating the IPC handle and export it as part of the IPC message. Lastly, the IPC message in the current implementation is simply a binary buffer instead of a more complicated JSON document.
We discussed the possibility of using Arrow IPC. It doesn't work for our purpose for a couple of reasons. Firstly, Arrow doesn't handle the pointer offset described above. Secondly, this PR ensures that there's no data copy inside C++, which cannot easily be done with Arrow structures like a record batch.
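To make the offset workaround concrete, here is an illustrative sketch of the export/import steps; the function names, the struct layout, and the omitted error handling are assumptions for illustration, not the API added by this PR (and the driver call would go through the dlopen wrapper discussed above):

```cpp
// Sketch of exporting/importing an interior device pointer as (IPC handle, offset).
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cstddef>
#include <cstdint>

struct ipc_export {
  cudaIpcMemHandle_t handle;  // handle for the allocation's *base* pointer
  std::size_t offset;         // where this buffer starts inside that allocation
};

// Export: with a pool allocator, `data` is usually an interior pointer into a
// larger cudaMalloc'd allocation, so the handle alone is ambiguous.
// Error checking omitted for brevity.
ipc_export export_buffer(void const* data)
{
  CUdeviceptr base{};
  std::size_t size{};
  // Driver API: find the base address of the allocation containing `data`.
  cuMemGetAddressRange(&base, &size, reinterpret_cast<CUdeviceptr>(data));

  ipc_export out{};
  out.offset = reinterpret_cast<std::uintptr_t>(data) - static_cast<std::uintptr_t>(base);
  cudaIpcGetMemHandle(&out.handle, reinterpret_cast<void*>(base));
  return out;
}

// Import: map the whole allocation in the receiving process, then apply the offset.
void const* import_buffer(ipc_export const& in)
{
  void* base{};
  cudaIpcOpenMemHandle(&base, in.handle, cudaIpcMemLazyEnablePeerAccess);
  return static_cast<std::uint8_t const*>(base) + in.offset;
}
```

The importing process is responsible for eventually calling cudaIpcCloseMemHandle on the mapped base pointer once it is done with the buffer.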
Supported features
to-do