
Io_uring integration #736

Open

wants to merge 45 commits into base: main
Conversation

Contributor

@CPestka CPestka commented May 28, 2024

Basic io_uring integration for async i/o

This PR introduces a simple multi-threaded wrapper around io_uring. It aims to hide some of the intricacies of io_uring in general, and specifically of using it as a shared resource from multiple threads: distributing requests, tracking request metadata, handling out-of-order arrival, request re-submission, thread-safe access to the actual rings, and doing all of that with at least acceptable performance.

Goal

This PR is intended to be the base of, and sit alongside, a couple of follow-up PRs that will introduce Zarr support, "async support" for a few kernels (like the Aggregate kernel), and the wiring through the compiler and the DSL, to finally show a sensible async I/O + processing example accessible from the DSL level.
This PR thus provides the async I/O part of that goal. The scope of this implementation is accordingly limited mostly to the io_uring features needed for it (i.e., mostly just simple read/write ops); you won't find any of the fancier io_uring features here, although most could be added relatively easily if somebody needs them. The same goes for the multi-threading part: it is rather basic and could definitely be improved, but I think it is OK for now.

One thing that will need to be addressed before this could be merged

How to address the liburing dependency. For now I just added the linking of the lib to the CMake setup, but did not add any way to automatically fetch and build the dependency. The last time I needed liburing as a dependency for a C++ project, I did not find a "clean", "CMake-native" way to add it (but I did not look too closely, and that was some time ago, so this could have changed in the meantime). What would be the preferred way to add this dependency (submodule, simply adding it to the dependency list, looking for a "native" CMake solution, etc.)? I vaguely remember that liburing already came up in the CSD context. I had a quick look at the fork, but did not find the place where liburing got handled there. How has it been resolved there?

Other possible things to be discussed:

  1. C++20. Now actually "needed" for the notify() / wait() calls on atomic variables.
  2. Thread management. Generally I think it would be better if the threads used here were also provided by the scheduler when needed, so that we don't have a bunch of different, independent thread pools lying around that do not know of each other. However, I think sensibly integrating that here could be pretty involved: the compiler would need to be made aware of when these threads must be supplied, based not only on which operations were requested in general (a compile-time question) but also on when they are actually needed and no longer needed (a runtime question, based on the current progress). The current solution does not include such a compiler integration; as a compromise, it simply puts the submission/completion user-space threads to sleep if there are currently no requests to be submitted or in flight. Again, I'm not super happy with this, but it works, at least as a temporary solution.

CPestka and others added 22 commits April 17, 2024 15:35
compilation issue caused by fully generic free == operator
- Fixed minor conflicts that could not be resolved automatically:
  - src/runtime/local/datastructures/Matrix.h
    - operator<< was added upstream
  - src/runtime/local/kernels/CheckEq.h
    - kernel specialization for superclass Matrix was added upstream
    - I moved the implementation to a new operator== in Matrix, and replaced the checkEq implementation by a `*lhs == *rhs`, just like it was done for the other data structures in this PR
- Removed CSRMatrix::getNumItems() since it's inherited from Matrix.
- Made the constructors and destructors of ContiguousTensor/ChunkedTensor private.
  - For consistency with DenseMatrix, CSRMatrix, and Frame.
  - DataObjectFactory is meant to be the only point where the constructors and destructors of DAPHNE data objects are called.
- Throw an exception instead of returning 0 in ContiguousTensor/ChunkedTensor::serialize().
  - Otherwise, the error could easily be ignored.
- Throw an exception instead of calling std::abort() in Tensor::slice()/sliceRow()/sliceCol().
  - Otherwise, it's not clear what consequences an uncontrolled program termination could have (e.g., no clean-up of resources).
- Some more minor changes.
  - E.g., related to comments, year in license headers of new files, etc.
@niclashedam niclashedam self-requested a review May 31, 2024 07:06
@CPestka CPestka mentioned this pull request Jun 3, 2024
@CPestka
Contributor Author

CPestka commented Jun 5, 2024

Maybe a couple small clarifications

i did not find a "clean" "CMake-native" way to add liburing as a dependency

Here I meant that you cannot just use e.g. CMake's FetchContent_ functions, as liburing does not have a CMake-based build.

As for handling it via a submodule, or simply wget-ing it, and building it via the build.sh, as is the case for most of the other deps: the default install step uses sudo, since the default install location is /usr. That location can be changed, but the install still fails without sudo even if the paths are set to somewhere where sudo is not needed. I haven't looked into what exactly is failing in the script yet, since it is a bunch of sed hieroglyphics, but it definitely does not work out of the box without sudo, even when permissions are not an issue. So the options here would be to skip the default install step and manually move the header and static lib to where they belong in the daphne repo (i.e., thirdparty/installed/bin and lib) in the build.sh, or to rely on the default install but then require sudo for the build.sh.

Since this is about the build system and IMO none of the options are great, I did not go ahead and simply pick one. @pdamme, what would be the preference here?

  1. Add it to the dependency list and let the user worry about it
  2. Fetch and build via the build.sh, use the default system wide install and from now on require sudo for the build.sh
  3. Fetch and build via the build.sh + even more bash in the already pretty large build.sh to install the header and static lib manually

@niclashedam niclashedam requested a review from pdamme June 6, 2024 13:11
Collaborator

@niclashedam niclashedam left a comment


In general this looks good.

But before we merge, we need to clarify a few things:
a) Is the C++20 standard OK? It seems risky to just change it without understanding the consequences.
b) io_uring must be automatically fetched and linked. This will also become relevant in #656.
c) There are many magic constants in your code. I would like you to comment, at least, on why they are what they are, to increase maintainability.
d) Can you provide a sample .daphne script that utilises your io_uring implementation?

I would like @pdamme's review as well.

CMakeLists.txt (outdated)

  @@ -29,7 +29,7 @@ project(daphne-prototype LANGUAGES CXX C)

   set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
   set(CMAKE_BUILD_WITH_INSTALL_NAME_DIR ON)
  -set(CMAKE_CXX_STANDARD 17 CACHE STRING "C++ standard to conform to")
  +set(CMAKE_CXX_STANDARD 20 CACHE STRING "C++ standard to conform to")
Collaborator

Do we understand the consequences of changing the C++ standard for the DAPHNE use-cases?

Contributor

There was already some initial discussion with @pdamme in #694 regarding this. I think moving to a C++20-compliant compiler is fine as long as the documentation is clear about this. It might be a good idea to separate a potential switch to C++20 out of this PR and push it in isolation, so that we can (a) bring it in quicker and (b) revert if we realize that things break. Accordingly, the CI would need to be updated as well.

Re: use cases: Do we know anything about the environments where Daphne is run? Looking at our current dependency management, I doubt that we do.

src/runtime/local/io/io_uring/Container.h (outdated)
src/runtime/local/io/io_uring/IO_Threadpool.cpp (outdated)
src/runtime/local/io/io_uring/IO_Threadpool.cpp (outdated)
src/runtime/local/io/io_uring/IO_URing.cpp
@pdamme
Collaborator

pdamme commented Jun 7, 2024

Thanks for this PR, @CPestka and @MarcusParadies, and thanks for your helpful review, @niclashedam. I will look at the code by today EOD and share my comments. In general I'd say if we need C++20 for a certain component, we should upgrade; if that's problematic in certain relevant cases/environment, introducing a switch like @MarcusParadies suggests sounds like a good idea to me.

@CPestka
Contributor Author

CPestka commented Jun 7, 2024

c) There are many magic constants in your code. I would like you to comment, at least, on why they are what they are, to increase maintainability.

While I don't agree that any of these, except the loop bound in IO_Threadpool.cpp, are magic numbers, since they are (mostly, imo) sensibly named constants, I moved some of them to (defaulted) function/constructor parameters and added comments to the rest.


Can you provide a sample .daphne script that utilizes your io_uring implementation?

No. As stated in the beginning.

Goal
This PR is intended to be the base of, and sit alongside, a couple of follow-up PRs that will introduce Zarr support, "async support" for a few kernels (like the Aggregate kernel), and the wiring through the compiler and the DSL, to finally show a sensible async I/O + processing example accessible from the DSL level.

As was discussed in the meetings, the entire work for the tensor / Zarr / io_uring integration was supposed to be divided into smaller PRs. This PR only provides the async I/O capabilities + unit tests. #694, #742, #741 + this PR + some more follow-up PRs will be the base for the final PR, which will introduce the DSL integration.


b) io_uring must be automatically fetched and linked. This will also become relevant in #656.

The linking part is already done via CMake here. As for fetching and building liburing, again I first want to know which way of addressing the dep would be OK with everybody. I could just go ahead and simply add the 4 lines needed for option 2 to the build script, but that comes with the caveats mentioned earlier.

  git clone / submodule / wget, whatever
  ./configure --cc="$CC" --cxx="$CXX"
  make -j$(nproc)
  sudo make install

C++20: @pdamme suggested in #694 that I should go ahead and use it in the follow-up PRs, which I did here. I agree though that it may be better to do the switch in a separate PR, which should also update the minimum compiler versions in the docs as well as the CI setup.


And speaking of the CI: maybe, while it is being updated anyways, we could add some sanitizer runs. I added switches to toggle the different sanitizers here, but did not wire them up to the CI. However, running daphne with any of the -fsanitize options currently simply crashes for me. Running the daphne tests with valgrind does not crash, but unfortunately takes a very long time, as it is much slower than the built-in sanitizers in my experience. As running valgrind over all tests just took too long, I ran it only over a small subset of the tests (upstream/main; ./test.sh [kernels]), which resulted in 19.9k lines of error output (see attached log file), which some may argue may be a bit of an issue.
valgrind_log_run_tests_kernels.txt

@CPestka
Contributor Author

CPestka commented Jun 12, 2024

b) the ability to override some aspects of the implementation. In this implementation, the queue seems to be tailored specifically for I/Os judging from the URingRead, URingReadInternal and the concrete URing struct. Assume we want to enqueue custom SQE structs to a device fd -- I don't think that is possible with this implementation.

When I wrote this I assumed other op types (e.g., all the networking or utility stuff) would likely never be needed here, so what is currently there only supports simple read and write ops. Adding support for another op would not be very hard though: essentially it would mean adding another member to the ops enum, adding a metadata struct, and adding the SubmitYourOpTypeHere() / Handle...() functions. I haven't looked at the actual code in the CSD integration, but I would assume you need IORING_OP_URING_CMD for NVMe passthrough (right?). Again, since I assumed other ops would not be needed here, adding new ops is a bit verbose in the current setup; switching to, e.g., unions tagged with the op type would probably be a good idea then, as that would reduce code duplication.

Generally though, as I don't really know what your requirements are, I'm not sure this implementation gives you any advantages at all. If you "only" submit one request, block till you receive the CQE, and then carry on, and don't want to share your io_uring instance(s), then having more than one instance is pointless, and having a simpler completion interface that does not care about out-of-order CQE arrival does not give you any benefit either, as you don't share your instance and thus only ever have one request in flight at once. So adding the capabilities you need to this may not give you any real benefit and may be more code than simply writing what you need without integrating it here. On the other hand, it may still be nice to have all io_uring-related code in one place. I don't really have strong feelings about either approach.

a) ease of integration with different system areas (in this case, the accelerator backends).

Not sure what you mean here.

@CPestka
Contributor Author

CPestka commented Jun 12, 2024

I assume the current build failure is due to the CI machine pulling the Docker image rather than using the now-updated build.sh.

@corepointer
Collaborator

That took some time to catch up on this conversation 😰

  • sudo is a no-go for people on large installations; it would restrict most users to systems they control themselves (e.g., no more supercomputer usage)
  • Switching to C++20 sounds feasible after the 0.3 release. We'll upgrade a lot of deps after that anyways.
  • The GitHub CI works without compiling dependencies (they come from the precompiled Docker image). So changes to build.sh that concern dependencies are disregarded there and need to be incorporated in a separate step.

@niclashedam
Collaborator

(quoting CPestka's reply above on adding further op types, sharing the io_uring instances, and "Not sure what you mean here.")

Yes, we would submit one or a few SQEs to a private queue in CS. We would not block but instead check for the CQE at a later point in time.

As for a), integration with other areas of the system is not necessarily straightforward, as it can cause dependency issues or other unwanted side effects. However, your implementation seems to be quite isolated, so I doubt this is a problem.

corepointer added a commit to corepointer/daphne that referenced this pull request Sep 18, 2024
Adding dependency to the build system that is required by some upcoming PRs.

Co-authored-by: Constantin Pestka <[email protected]>
corepointer added a commit that referenced this pull request Sep 18, 2024
Adding dependency to the build system that is required by some upcoming PRs.

Co-authored-by: Constantin Pestka <[email protected]>
CPestka and others added 15 commits September 19, 2024 00:58
@corepointer
Collaborator

I fixed the conflicts and rebased on main. Will force push after running tests again.

Moving the linking to libAllKernels fixes linker errors when building binaries that depend on libAllKernels.
@philipportner
Collaborator

Since the liburing dependency has already been committed, I added the flags suggested by @pdamme to make this dependency optional. See 941b768.

@corepointer
Collaborator

Since the liburing dependency has already been committed, I added the flags suggested by @pdamme to make this dependency optional. See 941b768.

To be honest, I overlooked that request to make it optional. But as sudo is not required anymore, I regard this optionality as optional 😆 scnr. Even more so as this dependency is very lightweight and compiles just fine.

@corepointer
Collaborator

Two more things:

  • The tests only fail in the container, but not when run without it; I did not investigate more closely.
  • The default behavior of the option in build.sh breaks the build process of this feature, as USE_IO_URING is not introduced in the CMake or C++ code.
