Taking allocators seriously (#182)
* Start allocator implementations

* Replace `istemp` with Val in allocators

* Add tensorfree calls in base implementations

* Fix typo

* Stricter typing in allocation functions

* Refactor allocations

* slightly modify BaseCopy implementation

* Fix some missing Vals

* Add @no_escape block in blas_contract

* Move Bumper to package extension

* Specify unexported macro for julia 1.8

* Update docs

* Remove extraneous arguments

* Fix docstring

* Link TBLIS in docs

* add ncon backend support and tests

* improved backend tests

* address first set of comments

* fix cuda issue

* update ncon and improve extra macros; more tests part 1

* extensions on julia v1.8

* Formatter [no ci]

* fix and test cuTENSORExt

* add bumper tests

* fix and test zero-dimensional bumper allocations

---------

Co-authored-by: Jutho <[email protected]>
lkdvos and Jutho authored Jul 12, 2024
1 parent 625e111 commit 754aa96
Showing 19 changed files with 757 additions and 353 deletions.
7 changes: 7 additions & 0 deletions Project.toml
@@ -13,23 +13,27 @@ ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
LRUCache = "8ac3fa9e-de4c-5943-b1dc-09c6b5f20637"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
PackageExtensionCompat = "65ce6f38-6b18-4e1d-a461-8949797d7930"
PtrArrays = "43287f4e-b6f4-7ad1-bb20-aadabca52c3d"
Strided = "5e0ebb24-38b0-5f93-81fe-25c709ecae67"
StridedViews = "4db3bf67-4bd7-4b4e-b153-31dc3fb37143"
TupleTools = "9d95972d-f1c8-5527-a6e0-b4b365fa01f6"
VectorInterface = "409d34a3-91d5-4945-b6ec-7529ddf182d8"
cuTENSOR = "011b41b2-24ef-40a8-b3eb-fa098493e9e1"

[weakdeps]
Bumper = "8ce10254-0962-460f-a3d8-1f77fea1446e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
cuTENSOR = "011b41b2-24ef-40a8-b3eb-fa098493e9e1"

[extensions]
TensorOperationsBumperExt = "Bumper"
TensorOperationsChainRulesCoreExt = "ChainRulesCore"
TensorOperationscuTENSORExt = ["cuTENSOR", "CUDA"]

[compat]
Aqua = "0.6, 0.7, 0.8"
Bumper = "0.6"
CUDA = "5.4.0"
ChainRulesCore = "1"
ChainRulesTestUtils = "1"
@@ -38,6 +42,7 @@ LRUCache = "1"
LinearAlgebra = "1.6"
Logging = "1.6"
PackageExtensionCompat = "1"
PtrArrays = "1.2"
Random = "1"
Strided = "2.0.4"
StridedViews = "0.3"
@@ -49,6 +54,7 @@ julia = "1.8"

[extras]
Aqua = "4c88cf16-eb10-579e-8560-4a9242c79595"
Bumper = "8ce10254-0962-460f-a3d8-1f77fea1446e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
ChainRulesTestUtils = "cdddcdb0-9152-4a09-a978-84456f9df70a"
DynamicPolynomials = "7c1d4256-1411-5781-91ec-d7bc3513ac07"
@@ -67,4 +73,5 @@ test = [
"cuTENSOR",
"Aqua",
"Logging",
"Bumper",
]
1 change: 1 addition & 0 deletions docs/make.jl
@@ -9,6 +9,7 @@ makedocs(; modules=[TensorOperations],
"Manual" => ["man/indexnotation.md",
"man/functions.md",
"man/interface.md",
"man/backends.md",
"man/autodiff.md",
"man/implementation.md"],
"Index" => "index/index.md"])
4 changes: 1 addition & 3 deletions docs/src/index.md
@@ -5,7 +5,7 @@
## Table of contents

```@contents
Pages = ["index.md", "man/indexnotation.md", "man/functions.md", "man/autodiff.md", "man/interface.md", "man/implementation.md"]
Pages = ["index.md", "man/indexnotation.md", "man/functions.md", "man/interface.md", "man/backends.md", "man/autodiff.md", "man/implementation.md"]
Depth = 4
```

@@ -82,7 +82,5 @@ complicated tensor expression is deconstructed.

## To do list

- Add more backends, e.g. using pure Julia Base functionality, or using
[LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl)
- Make it easier to modify the contraction order algorithm or its cost function (e.g. to
optimize based on memory footprint) or to splice in runtime information.
131 changes: 131 additions & 0 deletions docs/src/man/backends.md
@@ -0,0 +1,131 @@
# Backends and Allocators

The `TensorOperations` package is designed to provide powerful tools for performing tensor computations efficiently.
In advanced use cases, it can be desirable to squeeze the last drops of performance out of the library, either by experimenting with different micro-optimized implementations of the same operation, or by altering the memory management system.
Here, we detail how to access these functionalities.

## Backends

### Backend Selection

`TensorOperations` supports multiple backends for tensor contractions, allowing users to choose different implementations based on their specific needs.
While special care is taken to ensure good defaults, we also provide the flexibility to select a backend manually.
This can be achieved in a variety of ways:

1. **Global setting**: The default backend can be set globally, on a per-type basis as well as on a per-function basis. This is achieved by hooking into the implementation of the default backend selection procedure. In particular, this procedure ends up calling [`select_backend`](@ref), which can be overloaded to return a different backend; an illustrative sketch is given below, after the example.

2. **Local setting**: Alternatively, the backend can be set locally for a specific call to either [`@tensor`](@ref), [`ncon`](@ref) or the function-based interface. Both `@tensor` and `ncon` accept a keyword argument `backend`, which will locally override the default backend selection mechanism. The result is that the specified backend will be inserted as a final argument to all calls of the primitive tensor operations. In the function-based interface, the backend is likewise passed as the final argument.

```julia
using TensorOperations
mybackend = StridedNative()

# inserting a backend into the @tensor macro
@tensor backend = mybackend A[i,j] := B[i,k] * C[k,j]

# inserting a backend into the ncon function
D = ncon([A, B, C], [[1, 2], [2, 3], [3, 1]]; backend=mybackend)

# inserting a backend into the function-based interface
tensoradd(A, pA, conjA, B, pB, conjB, α, β, mybackend)
```
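
For the first option (global setting), a minimal sketch of such a hook could look as follows. The exact arguments that `select_backend` receives should be checked against the interface documentation; here we assume it is called with the primitive operation and the tensors involved, and `BigFloat` merely serves as an example of an element type without BLAS support.

```julia
using TensorOperations
import TensorOperations as TO

# hypothetical global override: always use the BaseCopy backend when adding
# arrays of BigFloat, dispatching on both the operation and the argument types
function TO.select_backend(::typeof(TO.tensoradd!), C::AbstractArray{BigFloat},
                           A::AbstractArray{BigFloat})
    return TO.BaseCopy()
end
```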

### Available Backends

`TensorOperations` provides a number of backends out of the box.
In particular, the following backends are available:

```@docs
DefaultBackend
BaseCopy
BaseView
StridedNative
StridedBLAS
cuTENSORBackend
```

Here, arrays that are strided are typically handled most efficiently by the `Strided.jl`-based backends.
By default, the `StridedBLAS` backend is used for element types that support BLAS operations, as the performance gains from using BLAS typically outweigh the overhead of sometimes having to allocate intermediate permuted arrays.

On the other hand, the `BaseCopy` and `BaseView` backends are used for arrays that are not strided.
These are designed to be as general as possible, and as a result are not as performant as specific implementations.
Nevertheless, they can be useful for debugging purposes or for working with custom tensor types that have limited support for methods outside of `Base`.
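
For example, forcing one of these generic backends for a specific expression is simply a matter of passing it explicitly, as in the following minimal sketch with ordinary dense arrays:

```julia
using TensorOperations

A = rand(2, 3)
B = rand(3, 2)

# use the generic view-based implementation instead of the Strided/BLAS backends
@tensor backend = BaseView() C[i,j] := A[i,k] * B[k,j]
```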

Finally, we also provide a `cuTENSORBackend` for use with `cuTENSOR.jl`, NVIDIA's GPU-accelerated tensor contraction library.
This backend is only available through a package extension for `cuTENSOR`.
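
Assuming a functional CUDA setup with both `CUDA.jl` and `cuTENSOR.jl` loaded, a minimal sketch of selecting this backend looks as follows; the `@cutensor` convenience macro (defined in the extension further down in this diff) additionally selects a CUDA-aware allocator.

```julia
using TensorOperations, CUDA, cuTENSOR

A = CUDA.rand(Float64, 8, 8)
B = CUDA.rand(Float64, 8, 8)

# select the cuTENSOR backend explicitly for a single expression
@tensor backend = cuTENSORBackend() C[i,j] := A[i,k] * B[k,j]

# or use the convenience macro, which also selects a CUDA-aware allocator
@cutensor D[i,j] := A[i,k] * B[k,j]
```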

### Custom Backends

Users can also define their own backends to facilitate experimentation with new implementations.
This can be done by defining a new type that is a subtype of `AbstractBackend`, and dispatching on this type in the implementation of the primitive tensor operations.
In particular, the only required methods are [`tensoradd!`](@ref), [`tensortrace!`](@ref) and [`tensorcontract!`](@ref).

For example, [`TensorOperationsTBLIS`](https://github.com/lkdvos/TensorOperationsTBLIS.jl) is a wrapper that provides a backend for tensor contractions using the [TBLIS](https://github.com/devinamatthews/tblis) library.
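
As an illustrative sketch, the following defines a hypothetical `EchoBackend` that logs every pairwise addition before delegating to the `BaseCopy` implementation; `tensortrace!` and `tensorcontract!` would be implemented analogously, and the signature follows the function-based interface used elsewhere in this manual.

```julia
using TensorOperations
import TensorOperations as TO

# hypothetical backend that logs each call before delegating to BaseCopy
struct EchoBackend <: TO.AbstractBackend end

function TO.tensoradd!(C::AbstractArray, A::AbstractArray, pA, conjA, α, β,
                       ::EchoBackend, allocator)
    @info "tensoradd!" size(C) size(A)
    return TO.tensoradd!(C, A, pA, conjA, α, β, TO.BaseCopy(), allocator)
end
```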

## Allocators

Evaluating complex tensor networks is typically done most efficiently by pairwise operations.
As a result, this procedure often requires the allocation of many temporary arrays, which can become a performance bottleneck for certain operations.
To mitigate this, `TensorOperations` exposes an allocator system, which allows users to more finely control the allocation of both output tensors and temporary tensors.

In particular, the allocator system is used in two ways:
as mentioned before, it allocates and frees the intermediate tensors that are required to evaluate a tensor network in a pairwise fashion.
Additionally, it allocates and frees the temporary objects that arise when reshaping and permuting input tensors, for example to make them compatible with BLAS instructions.

### Allocator Selection

The allocator system can only be accessed *locally*, by passing an allocator to the `@tensor` macro, the `ncon` function, or the function-based interface.

```julia
using TensorOperations
myallocator = ManualAllocator()

# inserting an allocator into the @tensor macro
@tensor allocator = myallocator A[i,j] := B[i,k] * C[k,j]

# inserting an allocator into the ncon function
D = ncon([A, B, C], [[1, 2], [2, 3], [3, 1]]; allocator=myallocator)

# inserting an allocator into the function-based interface
tensoradd(A, pA, conjA, B, pB, conjB, α, β, DefaultBackend(), myallocator)
```

Important to note here is that the backend system is prioritized over the allocator system:
the backend is selected **first**, and only then is the allocator inserted as the final argument.

### Available Allocators

`TensorOperations` also provides a number of allocators out of the box.

```@docs
DefaultAllocator
ManualAllocator
```

By default, the `DefaultAllocator` is used, which relies on Julia's built-in memory management.
Alternatively, the `ManualAllocator` can be useful, as its manual memory management reduces the pressure on the garbage collector.
In multi-threaded applications in particular, this can sometimes lead to a significant performance improvement.

Finally, users can also opt to use the `Bumper.jl` system, which pre-allocates a slab of memory that can be re-used afterwards.
This is available through a package extension for `Bumper`.
Here, the `allocator` object is simply the provided buffer, which is then used to store the intermediate tensors.

```julia
using TensorOperations, Bumper
buf = Bumper.default_buffer()
@no_escape buf begin
    @tensor allocator = buf A[i,j] := B[i,k] * C[k,j]
end
```

For convenience, the construction above is also provided in a specialized macro form, which is fully equivalent:

```@docs
@butensor
```
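
A minimal usage sketch, equivalent to the explicit `@no_escape` construction above:

```julia
using TensorOperations, Bumper

B = rand(4, 4)
C = rand(4, 4)

# temporaries live on Bumper's default buffer, which is restored afterwards
@butensor A[i,j] := B[i,k] * C[k,j]
```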

### Custom Allocators

Users can also define their own allocators to facilitate experimentation with new implementations.
Here, no restriction is placed on the type of the allocator, and any object can be passed as an allocator.
The required methods to implement are [`tensoralloc`](@ref) and [`tensorfree`](@ref).
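
For instance, a minimal sketch of a counting allocator that tracks how many tensors it hands out and otherwise falls back to the default behaviour could look as follows. `MyAllocator` is a hypothetical name, and we assume the free hook is spelled `tensorfree!` as in the package source; check the interface documentation for the exact signatures.

```julia
using TensorOperations
import TensorOperations as TO

# hypothetical allocator that counts allocations and defers to the default strategy
mutable struct MyAllocator
    count::Int
    MyAllocator() = new(0)
end

function TO.tensoralloc(::Type{A}, structure, ::Val{istemp},
                        allocator::MyAllocator) where {A<:AbstractArray,istemp}
    allocator.count += 1
    return TO.tensoralloc(A, structure, Val(istemp))  # default allocation
end

# nothing to free explicitly: the garbage collector handles deallocation
TO.tensorfree!(t, allocator::MyAllocator) = nothing
```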

47 changes: 47 additions & 0 deletions ext/TensorOperationsBumperExt.jl
@@ -0,0 +1,47 @@
module TensorOperationsBumperExt

using TensorOperations
using Bumper

function TensorOperations.tensoralloc(::Type{A}, structure, ::Val{istemp},
buf::Union{SlabBuffer,AllocBuffer}) where {A<:AbstractArray,
istemp}
# TODO: remove the `ndims` check if this is fixed in Bumper / StrideArraysCore
    if istemp && ndims(A) > 0
return Bumper.alloc!(buf, eltype(A), structure...)
else
return TensorOperations.tensoralloc(A, structure, Val(istemp))
end
end

function TensorOperations.blas_contract!(C, A, pA, conjA, B, pB, conjB, pAB, α, β,
allocator::Union{SlabBuffer,AllocBuffer})
@no_escape allocator begin
C = Base.@invoke TensorOperations.blas_contract!(C::Any, A::Any, pA::Any,
conjA::Any,
B::Any, pB::Any,
conjB::Any, pAB::Any, α::Any,
β::Any,
allocator::Any)
end
return C
end

function TensorOperations._butensor(src, ex...)
buf_sym = gensym("buffer")
cp_sym = gensym("checkpoint")
res_sym = gensym("result")

# TODO: there is no check for doubled tensor kwargs
newex = quote
$buf_sym = $(Expr(:call, GlobalRef(Bumper, :default_buffer)))
$cp_sym = $(Expr(:call, GlobalRef(Bumper, :checkpoint_save), buf_sym))
$res_sym = $(Expr(:macrocall, GlobalRef(TensorOperations, Symbol("@tensor")),
src, :(allocator = $buf_sym), ex...))
$(Expr(:call, GlobalRef(Bumper, :checkpoint_restore!), cp_sym))
$res_sym
end
    return Base.remove_linenums!(newex)
end

end
35 changes: 27 additions & 8 deletions ext/TensorOperationscuTENSORExt.jl
@@ -29,9 +29,27 @@ using CUDA.Adapt: adapt
using Strided
using TupleTools: TupleTools as TT

StridedViewsCUDAExt = Base.get_extension(Strided.StridedViews, :StridedViewsCUDAExt)
const StridedViewsCUDAExt = @static if isdefined(Base, :get_extension)
Base.get_extension(Strided.StridedViews, :StridedViewsCUDAExt)
else
Strided.StridedViews.StridedViewsCUDAExt
end
isnothing(StridedViewsCUDAExt) && error("StridedViewsCUDAExt not found")

#-------------------------------------------------------------------------------------------
# @cutensor macro
#-------------------------------------------------------------------------------------------
function TensorOperations._cutensor(src, ex...)
# TODO: there is no check for doubled tensor kwargs
return Expr(:macrocall, GlobalRef(TensorOperations, Symbol("@tensor")),
src,
Expr(:(=), :backend,
Expr(:call, GlobalRef(TensorOperations, :cuTENSORBackend))),
Expr(:(=), :allocator,
Expr(:call, GlobalRef(TensorOperations, :CUDAAllocator))),
ex...)
end

#-------------------------------------------------------------------------------------------
# Backend selection and passing
#-------------------------------------------------------------------------------------------
@@ -63,7 +81,7 @@ function TO.tensoradd!(C::AbstractArray,
C_cuda, isview = _custrided(C, allocator)
A_cuda, = _custrided(A, allocator)
tensoradd!(C_cuda, A_cuda, pA, conjA, α, β, backend, allocator)
isview || copy!(C, C_cuda)
isview || copy!(C, C_cuda.parent)
return C
end
function TO.tensorcontract!(C::AbstractArray,
@@ -77,7 +95,7 @@ function TO.tensorcontract!(C::AbstractArray,
B_cuda, = _custrided(B, allocator)
tensorcontract!(C_cuda, A_cuda, pA, conjA, B_cuda, pB, conjB, pAB, α, β, backend,
allocator)
isview || copy!(C, C_cuda)
isview || copy!(C, C_cuda.parent)
return C
end
function TO.tensortrace!(C::AbstractArray,
@@ -87,7 +105,7 @@ function TO.tensortrace!(C::AbstractArray,
C_cuda, isview = _custrided(C, allocator)
A_cuda, = _custrided(A, allocator)
tensortrace!(C_cuda, A_cuda, p, q, conjA, α, β, backend, allocator)
isview || copy!(C, C_cuda)
isview || copy!(C, C_cuda.parent)
return C
end

@@ -125,7 +143,7 @@ function CUDAAllocator()
end

function TO.tensoralloc_add(TC, A::AbstractArray, pA::Index2Tuple, conjA::Bool,
istemp::Bool,
istemp::Val,
allocator::CUDAAllocator)
ttype = CuArray{TC,TO.numind(pA)}
structure = TO.tensoradd_structure(A, pA, conjA)
@@ -136,7 +154,7 @@ function TO.tensoralloc_contract(TC,
A::AbstractArray, pA::Index2Tuple, conjA::Bool,
B::AbstractArray, pB::Index2Tuple, conjB::Bool,
pAB::Index2Tuple,
istemp::Bool,
istemp::Val,
allocator::CUDAAllocator)
ttype = CuArray{TC,TO.numind(pAB)}
structure = TO.tensorcontract_structure(A, pA, conjA, B, pB, conjB, pAB)
@@ -150,8 +168,9 @@ end

# NOTE: the general implementation in the `DefaultAllocator` case works just fine, without
# selecting an explicit memory model
function TO.tensoralloc(::Type{CuArray{T,N}}, structure, istemp::Bool,
allocator::CUDAAllocator{Mout,Min,Mtemp}) where {T,N,Mout,Min,Mtemp}
function TO.tensoralloc(::Type{CuArray{T,N}}, structure, ::Val{istemp},
allocator::CUDAAllocator{Mout,Min,Mtemp}) where {T,N,istemp,Mout,
Min,Mtemp}
M = istemp ? Mtemp : Mout
return CuArray{T,N,M}(undef, structure)
end
3 changes: 2 additions & 1 deletion src/TensorOperations.jl
@@ -7,6 +7,7 @@ using LinearAlgebra
using LinearAlgebra: mul!, BlasFloat
using Strided
using StridedViews: isstrided
using PtrArrays
using LRUCache

using Base.Meta: isexpr
@@ -15,7 +16,7 @@ using Base.Meta: isexpr
#---------
# export macro API
export @tensor, @tensoropt, @tensoropt_verbose, @optimalcontractiontree, @notensor, @ncon
export @cutensor
export @cutensor, @butensor

# export function based API
export ncon