
Benchmark OMEinsum against Finch #259

Open
mofeing opened this issue Nov 21, 2024 · 7 comments
Labels: good first issue (Good for newcomers), performance (Makes the code go "brrrr")

Comments

@mofeing
Member

mofeing commented Nov 21, 2024

There is this nice package by @willow-ahrens, https://github.com/finch-tensor/Finch.jl, which is a compiler for tensor algebra. In principle it is optimized for sparse algebra, but I know that it also supports dense tensor algebra, and I would like to benchmark OMEinsum.einsum vs Finch.tensordot.

Although we have Reactant.jl for squeezing out every drop of performance, I would like to have a better default contraction backend, since OMEinsum is incredibly slow for dynamic einsum and TensorOperations.jl doesn't support all the einsum cases we use.
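For concreteness, the comparison I have in mind would look roughly like the sketch below (a single pairwise dense contraction; the sizes are arbitrary, and Finch's @einsum macro stands in for Finch.tensordot, whose exact signature I have not checked):

```julia
# Sketch only: sizes are arbitrary and Finch's @einsum macro is used here
# in place of Finch.tensordot, whose exact signature is not checked.
using OMEinsum, Finch, BenchmarkTools

A = rand(200, 200); B = rand(200, 200);

omeinsum_mm(A, B) = ein"ij,jk->ik"(A, B)                       # OMEinsum string-macro path
finch_mm(A, B)    = Finch.@einsum C[i, k] += A[i, j] * B[j, k] # Finch einsum path

@btime omeinsum_mm($A, $B);
@btime finch_mm($A, $B);
```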

@mofeing added the good first issue and performance labels Nov 21, 2024
@willow-ahrens

willow-ahrens commented Nov 27, 2024

Finch also supports a macro @einsum. Do you have any kernels in mind?

It might be interesting to compare the calling overhead of the two libraries, as Finch also handles dynamic tensor expressions and compiles kernels for them (i.e. how long does it take to compile a new einsum, and how long does it take to run an already-compiled one?).
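Something like the following separates the two costs (just a sketch; the kernel and sizes are placeholders, and it assumes the compiled kernel is cached after the first call):

```julia
# Rough sketch: separate the first-call cost (kernel generation + compilation)
# from the steady-state cost of an already-compiled einsum.
using Finch, BenchmarkTools

A = rand(100, 100); B = rand(100, 100);

@time  @einsum C[i, k] += A[i, j] * B[j, k];   # first call: includes compilation
@btime @einsum C[i, k] += A[i, j] * B[j, k];   # subsequent calls: cached kernel only
```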

Finch is really only optimized for sparse tensors. In the dense case, Finch is only as good as writing "for i; for j; for k; ...".

@mofeing
Member Author

mofeing commented Nov 27, 2024

> Finch also supports a macro @einsum. Do you have any kernels in mind?

In our case, einsum expressions are chosen dynamically at runtime and can be big (involving around 30 indices or more).

> It might be interesting to compare the calling overhead of the two libraries, as Finch also handles dynamic tensor expressions and compiles kernels for them (i.e. how long does it take to compile a new einsum, and how long does it take to run an already-compiled one?).

So this is the main reason: we found that OMEinsum has a huge overhead in this dynamic case. We're not using the @einsum macro but the methods below it, and the overhead is still about 5 orders of magnitude larger than calling a Reactant.jl-compiled function.
We thought about using TensorOperations.jl, but it doesn't support all the einsum rules that we use.
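To make "dynamic" concrete: the index labels only exist as runtime data, so the contraction is built through EinCode rather than the ein"..." string macro. A minimal sketch (labels and sizes are made up, and the constructor/call are written as I understand OMEinsum's API, not checked against a specific version):

```julia
# Sketch of the dynamic path: index labels are runtime values, not a
# compile-time string. Not checked against a specific OMEinsum version.
using OMEinsum

A = rand(10, 10); B = rand(10, 10)

ixs = [['i', 'j'], ['j', 'k']]   # input index labels, decided at runtime
iy  = ['i', 'k']                 # output index labels

code = EinCode(ixs, iy)          # dynamic einsum code, no ein"..." macro
C = einsum(code, (A, B))         # this call path is where the overhead shows up
```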

> Finch is really only optimized for sparse tensors. In the dense case, Finch is only as good as writing "for i; for j; for k; ...".

Ah, that's a pity for the dense case. My understanding was that it wasn't fully optimized but could still do something more. How about a sparse-dense pairwise contraction? And block-diagonal tensors?

@willow-ahrens

willow-ahrens commented Nov 27, 2024

Sparse-dense runs well in Finch. We're currently considering approaches to offload the dense Finch code to an optimized dense framework for an added performance improvement.
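For reference, a minimal sparse-dense contraction might look like the sketch below (the storage format and sizes are arbitrary choices; the @einsum macro is the one mentioned above):

```julia
# Illustrative sketch: the storage format and sizes are arbitrary choices.
using Finch

A = Tensor(Dense(SparseList(Element(0.0))), fsprand(1000, 1000, 0.01))  # sparse operand
B = rand(1000, 50)                                                      # dense operand

# Contract the shared index j; Finch specializes the generated loop nest
# to iterate only over the stored entries of A.
@einsum C[i, k] += A[i, j] * B[j, k]
```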

@willow-ahrens

Finch can do block matrices if you represent them as a 4-tensor. We're currently working on more streamlined approaches for block matrices, but the current format would be:

`Tensor(Dense(SparsePinpoint(Dense(Dense(Element(0.0))))))`
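For anyone trying this, the index convention behind "block matrix as a 4-tensor" is just a reshape: within-block coordinates become the inner pair of indices and block coordinates the outer pair, which is the ordering the format above sits on top of. A plain-Julia illustration (names and sizes are made up):

```julia
# Plain-Julia illustration of the block-matrix-as-4-tensor layout.
nblocks, bsize = 4, 8
M = rand(nblocks * bsize, nblocks * bsize)   # a block-partitioned matrix

# Column-major reshape gives indices (i, I, j, J); reorder so the within-block
# indices come first: T[i, j, I, J] == M[(I-1)*bsize + i, (J-1)*bsize + j]
T = permutedims(reshape(M, bsize, nblocks, bsize, nblocks), (1, 3, 2, 4))
size(T)  # (8, 8, 4, 4)
```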

@willow-ahrens

> In our case, einsum expressions are chosen dynamically at runtime and can be big (involving around 30 indices or more). We're not using the @einsum macro but the methods below it, and the overhead is still about 5 orders of magnitude larger than calling a Reactant.jl-compiled function.

If big kernels are the goal, I would try using the Galley scheduler. It was designed to break big einsums up into manageable pieces. I'll mention @kylebd99 as the lead author of that scheduler.

@mofeing
Member Author

mofeing commented Nov 28, 2024

> Finch can do block matrices if you represent them as a 4-tensor.

And how about general order-n tensors?

> If big kernels are the goal, I would try using the Galley scheduler. It was designed to break big einsums up into manageable pieces. I'll mention @kylebd99 as the lead author of that scheduler.

Do you mean this paper? https://arxiv.org/pdf/2408.14706v2

@willow-ahrens

Yes, Finch supports general order-n tensors. For example,

```julia
julia> using Finch

julia> N = 100
100

julia> A = Tensor(CSFFormat(3), fsprand(N, N, N, 0.001)); B = rand(N, N); C = rand(N, N);

julia> ndims(A)
3

julia> @einsum D[i, j] += A[i, k, l] * B[j, k] * C[j, l]
```

Galley is included with Finch, and can be invoked as:

```julia
julia> using Finch, BenchmarkTools

julia> A = fsprand(1000, 1000, 0.1); B = fsprand(1000, 1000, 0.1); C = fsprand(1000, 1000, 0.0001);

julia> A = lazy(A); B = lazy(B); C = lazy(C);

julia> sum(A * B * C)

julia> @btime compute(sum(A * B * C));
  263.612 ms (1012 allocations: 185.08 MiB)

julia> @btime compute(sum(A * B * C), ctx=galley_scheduler());
  153.708 μs (667 allocations: 29.02 KiB)
```
