Taking allocators seriously (#182)
* Start allocator implementations

* Replace `istemp` with Val in allocators

* Add tensorfree calls in base implementations

* Fix typo

* Stricter typing in allocation functions

* Refactor allocations

* slightly modify BaseCopy implementation

* Fix some missing Vals

* Add @no_escape block in blas_contract

* Move Bumper to package extension

* Specify unexported macro for julia 1.8

* Update docs

* Remove extraneous arguments

* Fix docstring

* Link TBLIS in docs

* add ncon backend support and tests

* improved backend tests

* address first set of comments

* fix cuda issue

* update ncon and improve extra macros; more tests part 1

* extensions on julia v1.8

* Formatter [no ci]

* fix and test cuTENSORExt

* add bumper tests

* fix and test zero-dimensional bumper allocations

---------

Co-authored-by: Jutho <[email protected]>
lkdvos and Jutho authored Jul 12, 2024
1 parent 625e111 commit 754aa96
Showing 19 changed files with 757 additions and 353 deletions.
7 changes: 7 additions & 0 deletions Project.toml
@@ -13,23 +13,27 @@ ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
LRUCache = "8ac3fa9e-de4c-5943-b1dc-09c6b5f20637"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
PackageExtensionCompat = "65ce6f38-6b18-4e1d-a461-8949797d7930"
PtrArrays = "43287f4e-b6f4-7ad1-bb20-aadabca52c3d"
Strided = "5e0ebb24-38b0-5f93-81fe-25c709ecae67"
StridedViews = "4db3bf67-4bd7-4b4e-b153-31dc3fb37143"
TupleTools = "9d95972d-f1c8-5527-a6e0-b4b365fa01f6"
VectorInterface = "409d34a3-91d5-4945-b6ec-7529ddf182d8"
cuTENSOR = "011b41b2-24ef-40a8-b3eb-fa098493e9e1"

[weakdeps]
Bumper = "8ce10254-0962-460f-a3d8-1f77fea1446e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
cuTENSOR = "011b41b2-24ef-40a8-b3eb-fa098493e9e1"

[extensions]
TensorOperationsBumperExt = "Bumper"
TensorOperationsChainRulesCoreExt = "ChainRulesCore"
TensorOperationscuTENSORExt = ["cuTENSOR", "CUDA"]

[compat]
Aqua = "0.6, 0.7, 0.8"
Bumper = "0.6"
CUDA = "5.4.0"
ChainRulesCore = "1"
ChainRulesTestUtils = "1"
@@ -38,6 +42,7 @@ LRUCache = "1"
LinearAlgebra = "1.6"
Logging = "1.6"
PackageExtensionCompat = "1"
PtrArrays = "1.2"
Random = "1"
Strided = "2.0.4"
StridedViews = "0.3"
@@ -49,6 +54,7 @@ julia = "1.8"

[extras]
Aqua = "4c88cf16-eb10-579e-8560-4a9242c79595"
Bumper = "8ce10254-0962-460f-a3d8-1f77fea1446e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
ChainRulesTestUtils = "cdddcdb0-9152-4a09-a978-84456f9df70a"
DynamicPolynomials = "7c1d4256-1411-5781-91ec-d7bc3513ac07"
@@ -67,4 +73,5 @@ test = [
"cuTENSOR",
"Aqua",
"Logging",
"Bumper",
]
1 change: 1 addition & 0 deletions docs/make.jl
@@ -9,6 +9,7 @@ makedocs(; modules=[TensorOperations],
"Manual" => ["man/indexnotation.md",
"man/functions.md",
"man/interface.md",
"man/backends.md",
"man/autodiff.md",
"man/implementation.md"],
"Index" => "index/index.md"])
4 changes: 1 addition & 3 deletions docs/src/index.md
@@ -5,7 +5,7 @@
## Table of contents

```@contents
Pages = ["index.md", "man/indexnotation.md", "man/functions.md", "man/autodiff.md", "man/interface.md", "man/implementation.md"]
Pages = ["index.md", "man/indexnotation.md", "man/functions.md", "man/interface.md", "man/backends.md", "man/autodiff.md", "man/implementation.md"]
Depth = 4
```

@@ -82,7 +82,5 @@ complicated tensor expression is deconstructed.

## To do list

- Add more backends, e.g. using pure Julia Base functionality, or using
[LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl)
- Make it easier to modify the contraction order algorithm or its cost function (e.g. to
optimize based on memory footprint) or to splice in runtime information.
131 changes: 131 additions & 0 deletions docs/src/man/backends.md
@@ -0,0 +1,131 @@
# Backends and Allocators

The `TensorOperations` package is designed to provide powerful tools for performing tensor computations efficiently.
In advanced use cases, it can be desirable to squeeze the last drops of performance out of the library, either by experimenting with different micro-optimized implementations of the same operation, or by altering the memory management system.
Here, we detail how to access these functionalities.

## Backends

### Backend Selection

`TensorOperations` supports multiple backends for tensor contractions, allowing users to choose different implementations based on their specific needs.
While special care is taken to ensure good defaults, we also provide the flexibility to select a backend manually.
This can be achieved in a variety of ways:

1. **Global setting**: The default backend can be set globally, on a per-type basis as well as on a per-function basis. This is achieved by hooking into the implementation of the default backend selection procedure. In particular, this procedure ends up calling [`select_backend`](@ref), which can be overloaded to return a different backend; an illustrative sketch is given below, after the example.

2. **Local setting**: Alternatively, the backend can be set locally for a specific call to either [`@tensor`](@ref), [`ncon`](@ref) or the function-based interface. Both `@tensor` and `ncon` accept a keyword argument `backend`, which will locally override the default backend selection mechanism. The result is that the specified backend will be inserted as a final argument to all calls of the primitive tensor operations. In the function-based interface, the backend is likewise passed as the final argument.

```julia
using TensorOperations
mybackend = StridedNative()

# inserting a backend into the @tensor macro
@tensor backend = mybackend A[i,j] := B[i,k] * C[k,j]

# inserting a backend into the ncon function
D = ncon([A, B, C], [[1, 2], [2, 3], [3, 1]]; backend=mybackend)

# inserting a backend into the function-based interface
tensoradd(A, pA, conjA, B, pB, conjB, α, β, mybackend)
```
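
For the first option (global setting), a minimal sketch of such a hook could look as follows. The exact arguments that `select_backend` receives should be checked against the interface documentation; here we assume it is called with the primitive operation and the tensors involved, and `BigFloat` merely serves as an example of an element type without BLAS support.

```julia
using TensorOperations
import TensorOperations as TO

# hypothetical global override: always use the BaseCopy backend when adding
# arrays of BigFloat, dispatching on both the operation and the argument types
function TO.select_backend(::typeof(TO.tensoradd!), C::AbstractArray{BigFloat},
                           A::AbstractArray{BigFloat})
    return TO.BaseCopy()
end
```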

### Available Backends

`TensorOperations` provides a number of backends out of the box.
In particular, the following backends are available:

```@docs
DefaultBackend
BaseCopy
BaseView
StridedNative
StridedBLAS
cuTENSORBackend
```

Here, arrays that are strided are typically handled most efficiently by the `Strided.jl`-based backends.
By default, the `StridedBLAS` backend is used for element types that support BLAS operations, as the performance gains from using BLAS typically outweigh the overhead of sometimes having to allocate intermediate permuted arrays.

On the other hand, the `BaseCopy` and `BaseView` backends are used for arrays that are not strided.
These are designed to be as general as possible, and as a result are not as performant as specific implementations.
Nevertheless, they can be useful for debugging purposes or for working with custom tensor types that have limited support for methods outside of `Base`.
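
For example, forcing one of these generic backends for a specific expression is simply a matter of passing it explicitly, as in the following minimal sketch with ordinary dense arrays:

```julia
using TensorOperations

A = rand(2, 3)
B = rand(3, 2)

# use the generic view-based implementation instead of the Strided/BLAS backends
@tensor backend = BaseView() C[i,j] := A[i,k] * B[k,j]
```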

Finally, we also provide a `cuTENSORBackend` for use with `cuTENSOR.jl`, NVIDIA's GPU-accelerated tensor contraction library.
This backend is only available through a package extension for `cuTENSOR`.
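
Assuming a functional CUDA setup with both `CUDA.jl` and `cuTENSOR.jl` loaded, a minimal sketch of selecting this backend looks as follows; the `@cutensor` convenience macro (defined in the extension further down in this diff) additionally selects a CUDA-aware allocator.

```julia
using TensorOperations, CUDA, cuTENSOR

A = CUDA.rand(Float64, 8, 8)
B = CUDA.rand(Float64, 8, 8)

# select the cuTENSOR backend explicitly for a single expression
@tensor backend = cuTENSORBackend() C[i,j] := A[i,k] * B[k,j]

# or use the convenience macro, which also selects a CUDA-aware allocator
@cutensor D[i,j] := A[i,k] * B[k,j]
```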

### Custom Backends

Users can also define their own backends to facilitate experimentation with new implementations.
This can be done by defining a new type that is a subtype of `AbstractBackend`, and dispatching on this type in the implementation of the primitive tensor operations.
In particular, the only required methods are [`tensoradd!`](@ref), [`tensortrace!`](@ref) and [`tensorcontract!`](@ref).

For example, [`TensorOperationsTBLIS`](https://github.com/lkdvos/TensorOperationsTBLIS.jl) is a wrapper that provides a backend for tensor contractions using the [TBLIS](https://github.com/devinamatthews/tblis) library.
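
As an illustrative sketch, the following defines a hypothetical `EchoBackend` that logs every pairwise addition before delegating to the `BaseCopy` implementation; `tensortrace!` and `tensorcontract!` would be implemented analogously, and the signature follows the function-based interface used elsewhere in this manual.

```julia
using TensorOperations
import TensorOperations as TO

# hypothetical backend that logs each call before delegating to BaseCopy
struct EchoBackend <: TO.AbstractBackend end

function TO.tensoradd!(C::AbstractArray, A::AbstractArray, pA, conjA, α, β,
                       ::EchoBackend, allocator)
    @info "tensoradd!" size(C) size(A)
    return TO.tensoradd!(C, A, pA, conjA, α, β, TO.BaseCopy(), allocator)
end
```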

## Allocators

Evaluating complex tensor networks is typically done most efficiently by pairwise operations.
As a result, this procedure often requires the allocation of many temporary arrays, which can become a performance bottleneck for certain operations.
To mitigate this, `TensorOperations` exposes an allocator system, which allows users to more finely control the allocation of both output tensors and temporary tensors.

In particular, the allocator system is used in two ways:
as mentioned before, it allocates and frees the intermediate tensors that are required to evaluate a tensor network in a pairwise fashion.
Additionally, it allocates and frees the temporary objects that arise when reshaping and permuting input tensors, for example to make them compatible with BLAS instructions.

### Allocator Selection

The allocator system can only be accessed *locally*, by passing an allocator to the `@tensor` macro, the `ncon` function, or the function-based interface.

```julia
using TensorOperations
myallocator = ManualAllocator()

# inserting an allocator into the @tensor macro
@tensor allocator = myallocator A[i,j] := B[i,k] * C[k,j]

# inserting an allocator into the ncon function
D = ncon([A, B, C], [[1, 2], [2, 3], [3, 1]]; allocator=myallocator)

# inserting an allocator into the function-based interface
tensoradd(A, pA, conjA, B, pB, conjB, α, β, DefaultBackend(), myallocator)
```

Important to note here is that the backend system is prioritized over the allocator system:
the backend is selected **first**, and only then is the allocator inserted as the final argument.

### Available Allocators

`TensorOperations` also provides a number of allocators out of the box.

```@docs
DefaultAllocator
ManualAllocator
```

By default, the `DefaultAllocator` is used, which relies on Julia's built-in memory management.
Alternatively, the `ManualAllocator` can be useful, as its manual memory management reduces the pressure on the garbage collector.
In multi-threaded applications in particular, this can sometimes lead to a significant performance improvement.

Finally, users can also opt to use the `Bumper.jl` system, which pre-allocates a slab of memory that can be re-used afterwards.
This is available through a package extension for `Bumper`.
Here, the `allocator` object is simply the provided buffer, which is then used to store the intermediate tensors.

```julia
using TensorOperations, Bumper
buf = Bumper.default_buffer()
@no_escape buf begin
    @tensor allocator = buf A[i,j] := B[i,k] * C[k,j]
end
```

For convenience, the construction above is also provided in a specialized macro form, which is fully equivalent:

```@docs
@butensor
```
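
A minimal usage sketch, equivalent to the explicit `@no_escape` construction above:

```julia
using TensorOperations, Bumper

B = rand(4, 4)
C = rand(4, 4)

# temporaries live on Bumper's default buffer, which is restored afterwards
@butensor A[i,j] := B[i,k] * C[k,j]
```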

### Custom Allocators

Users can also define their own allocators to facilitate experimentation with new implementations.
Here, no restriction is placed on the type of the allocator, and any object can be passed as an allocator.
The required methods to implement are [`tensoralloc`](@ref) and [`tensorfree`](@ref).
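
For instance, a minimal sketch of a counting allocator that tracks how many tensors it hands out and otherwise falls back to the default behaviour could look as follows. `MyAllocator` is a hypothetical name, and we assume the free hook is spelled `tensorfree!` as in the package source; check the interface documentation for the exact signatures.

```julia
using TensorOperations
import TensorOperations as TO

# hypothetical allocator that counts allocations and defers to the default strategy
mutable struct MyAllocator
    count::Int
    MyAllocator() = new(0)
end

function TO.tensoralloc(::Type{A}, structure, ::Val{istemp},
                        allocator::MyAllocator) where {A<:AbstractArray,istemp}
    allocator.count += 1
    return TO.tensoralloc(A, structure, Val(istemp))  # default allocation
end

# nothing to free explicitly: the garbage collector handles deallocation
TO.tensorfree!(t, allocator::MyAllocator) = nothing
```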

47 changes: 47 additions & 0 deletions ext/TensorOperationsBumperExt.jl
@@ -0,0 +1,47 @@
module TensorOperationsBumperExt

using TensorOperations
using Bumper

function TensorOperations.tensoralloc(::Type{A}, structure, ::Val{istemp},
buf::Union{SlabBuffer,AllocBuffer}) where {A<:AbstractArray,
istemp}
# TODO: remove the `ndims` check if this is fixed in Bumper / StrideArraysCore
    if istemp && ndims(A) > 0
return Bumper.alloc!(buf, eltype(A), structure...)
else
return TensorOperations.tensoralloc(A, structure, Val(istemp))
end
end

function TensorOperations.blas_contract!(C, A, pA, conjA, B, pB, conjB, pAB, α, β,
allocator::Union{SlabBuffer,AllocBuffer})
@no_escape allocator begin
C = Base.@invoke TensorOperations.blas_contract!(C::Any, A::Any, pA::Any,
conjA::Any,
B::Any, pB::Any,
conjB::Any, pAB::Any, α::Any,
β::Any,
allocator::Any)
end
return C
end

function TensorOperations._butensor(src, ex...)
buf_sym = gensym("buffer")
cp_sym = gensym("checkpoint")
res_sym = gensym("result")

# TODO: there is no check for doubled tensor kwargs
newex = quote
$buf_sym = $(Expr(:call, GlobalRef(Bumper, :default_buffer)))
$cp_sym = $(Expr(:call, GlobalRef(Bumper, :checkpoint_save), buf_sym))
$res_sym = $(Expr(:macrocall, GlobalRef(TensorOperations, Symbol("@tensor")),
src, :(allocator = $buf_sym), ex...))
$(Expr(:call, GlobalRef(Bumper, :checkpoint_restore!), cp_sym))
$res_sym
end
    return Base.remove_linenums!(newex)
end

end
35 changes: 27 additions & 8 deletions ext/TensorOperationscuTENSORExt.jl
@@ -29,9 +29,27 @@ using CUDA.Adapt: adapt
using Strided
using TupleTools: TupleTools as TT

StridedViewsCUDAExt = Base.get_extension(Strided.StridedViews, :StridedViewsCUDAExt)
const StridedViewsCUDAExt = @static if isdefined(Base, :get_extension)
Base.get_extension(Strided.StridedViews, :StridedViewsCUDAExt)
else
Strided.StridedViews.StridedViewsCUDAExt
end
isnothing(StridedViewsCUDAExt) && error("StridedViewsCUDAExt not found")

#-------------------------------------------------------------------------------------------
# @cutensor macro
#-------------------------------------------------------------------------------------------
function TensorOperations._cutensor(src, ex...)
# TODO: there is no check for doubled tensor kwargs
return Expr(:macrocall, GlobalRef(TensorOperations, Symbol("@tensor")),
src,
Expr(:(=), :backend,
Expr(:call, GlobalRef(TensorOperations, :cuTENSORBackend))),
Expr(:(=), :allocator,
Expr(:call, GlobalRef(TensorOperations, :CUDAAllocator))),
ex...)
end

#-------------------------------------------------------------------------------------------
# Backend selection and passing
#-------------------------------------------------------------------------------------------
@@ -63,7 +81,7 @@ function TO.tensoradd!(C::AbstractArray,
C_cuda, isview = _custrided(C, allocator)
A_cuda, = _custrided(A, allocator)
tensoradd!(C_cuda, A_cuda, pA, conjA, α, β, backend, allocator)
isview || copy!(C, C_cuda)
isview || copy!(C, C_cuda.parent)
return C
end
function TO.tensorcontract!(C::AbstractArray,
@@ -77,7 +95,7 @@ function TO.tensorcontract!(C::AbstractArray,
B_cuda, = _custrided(B, allocator)
tensorcontract!(C_cuda, A_cuda, pA, conjA, B_cuda, pB, conjB, pAB, α, β, backend,
allocator)
isview || copy!(C, C_cuda)
isview || copy!(C, C_cuda.parent)
return C
end
function TO.tensortrace!(C::AbstractArray,
@@ -87,7 +105,7 @@ function TO.tensortrace!(C::AbstractArray,
C_cuda, isview = _custrided(C, allocator)
A_cuda, = _custrided(A, allocator)
tensortrace!(C_cuda, A_cuda, p, q, conjA, α, β, backend, allocator)
isview || copy!(C, C_cuda)
isview || copy!(C, C_cuda.parent)
return C
end

@@ -125,7 +143,7 @@ function CUDAAllocator()
end

function TO.tensoralloc_add(TC, A::AbstractArray, pA::Index2Tuple, conjA::Bool,
istemp::Bool,
istemp::Val,
allocator::CUDAAllocator)
ttype = CuArray{TC,TO.numind(pA)}
structure = TO.tensoradd_structure(A, pA, conjA)
@@ -136,7 +154,7 @@ function TO.tensoralloc_contract(TC,
A::AbstractArray, pA::Index2Tuple, conjA::Bool,
B::AbstractArray, pB::Index2Tuple, conjB::Bool,
pAB::Index2Tuple,
istemp::Bool,
istemp::Val,
allocator::CUDAAllocator)
ttype = CuArray{TC,TO.numind(pAB)}
structure = TO.tensorcontract_structure(A, pA, conjA, B, pB, conjB, pAB)
@@ -150,8 +168,9 @@ end

# NOTE: the general implementation in the `DefaultAllocator` case works just fine, without
# selecting an explicit memory model
function TO.tensoralloc(::Type{CuArray{T,N}}, structure, istemp::Bool,
allocator::CUDAAllocator{Mout,Min,Mtemp}) where {T,N,Mout,Min,Mtemp}
function TO.tensoralloc(::Type{CuArray{T,N}}, structure, ::Val{istemp},
allocator::CUDAAllocator{Mout,Min,Mtemp}) where {T,N,istemp,Mout,
Min,Mtemp}
M = istemp ? Mtemp : Mout
return CuArray{T,N,M}(undef, structure)
end
3 changes: 2 additions & 1 deletion src/TensorOperations.jl
@@ -7,6 +7,7 @@ using LinearAlgebra
using LinearAlgebra: mul!, BlasFloat
using Strided
using StridedViews: isstrided
using PtrArrays
using LRUCache

using Base.Meta: isexpr
@@ -15,7 +16,7 @@ using Base.Meta: isexpr
#---------
# export macro API
export @tensor, @tensoropt, @tensoropt_verbose, @optimalcontractiontree, @notensor, @ncon
export @cutensor
export @cutensor, @butensor

# export function based API
export ncon