
Benchmark OMEinsum against Finch #259

Open
mofeing opened this issue Nov 21, 2024 · 7 comments
Labels: good first issue (Good for newcomers), performance (Makes the code go "brrrr")

Comments

@mofeing
Member

mofeing commented Nov 21, 2024

There is this nice package by @willow-ahrens, https://github.com/finch-tensor/Finch.jl, which is a compiler for tensor algebra. In principle it is optimized for sparse algebra, but I know that it also supports dense tensor algebra, and I would like to benchmark OMEinsum.einsum vs Finch.tensordot.

Although we have Reactant.jl for squeezing out every drop of performance, I would like to have a better default contraction backend, since OMEinsum is incredibly slow for dynamic einsum and TensorOperations.jl doesn't support all the einsum cases we use.
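For concreteness, the comparison I have in mind would look roughly like the sketch below (a single pairwise dense contraction; the sizes are arbitrary, and Finch's @einsum macro stands in for Finch.tensordot, whose exact signature I have not checked):

```julia
# Sketch only: sizes are arbitrary and Finch's @einsum macro is used here
# in place of Finch.tensordot, whose exact signature is not checked.
using OMEinsum, Finch, BenchmarkTools

A = rand(200, 200); B = rand(200, 200);

omeinsum_mm(A, B) = ein"ij,jk->ik"(A, B)                       # OMEinsum string-macro path
finch_mm(A, B)    = Finch.@einsum C[i, k] += A[i, j] * B[j, k] # Finch einsum path

@btime omeinsum_mm($A, $B);
@btime finch_mm($A, $B);
```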

@mofeing added the good first issue and performance labels Nov 21, 2024
@willow-ahrens

willow-ahrens commented Nov 27, 2024

Finch also supports a macro @einsum. Do you have any kernels in mind?

It might be interesting to compare the calling overhead of the two libraries, as Finch also handles dynamic tensor expressions and compiles kernels for them (i.e. how long does it take to compile a new einsum, and how long does it take to run an already-compiled one?).
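Something like the following separates the two costs (just a sketch; the kernel and sizes are placeholders, and it assumes the compiled kernel is cached after the first call):

```julia
# Rough sketch: separate the first-call cost (kernel generation + compilation)
# from the steady-state cost of an already-compiled einsum.
using Finch, BenchmarkTools

A = rand(100, 100); B = rand(100, 100);

@time  @einsum C[i, k] += A[i, j] * B[j, k];   # first call: includes compilation
@btime @einsum C[i, k] += A[i, j] * B[j, k];   # subsequent calls: cached kernel only
```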

Finch is really only optimized for sparse tensors. In the dense case, Finch is only as good as writing "for i; for j; for k; ...".

@mofeing
Member Author

mofeing commented Nov 27, 2024

> Finch also supports a macro @einsum. Do you have any kernels in mind?

In our case, einsum expressions are chosen dynamically at runtime and can be big (involving around 30 indices or more).

> It might be interesting to compare the calling overhead of the two libraries, as Finch also handles dynamic tensor expressions and compiles kernels for them (i.e. how long does it take to compile a new einsum, and how long does it take to run an already-compiled one?).

So this is the main reason: we found that OMEinsum has a huge overhead in this dynamic case. We're not using the @einsum macro but the methods below it, and the overhead is still about 5 orders of magnitude larger than calling a Reactant.jl-compiled function.
We thought about using TensorOperations.jl, but it doesn't support all the einsum rules that we use.
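To make "dynamic" concrete: the index labels only exist as runtime data, so the contraction is built through EinCode rather than the ein"..." string macro. A minimal sketch (labels and sizes are made up, and the constructor/call are written as I understand OMEinsum's API, not checked against a specific version):

```julia
# Sketch of the dynamic path: index labels are runtime values, not a
# compile-time string. Not checked against a specific OMEinsum version.
using OMEinsum

A = rand(10, 10); B = rand(10, 10)

ixs = [['i', 'j'], ['j', 'k']]   # input index labels, decided at runtime
iy  = ['i', 'k']                 # output index labels

code = EinCode(ixs, iy)          # dynamic einsum code, no ein"..." macro
C = einsum(code, (A, B))         # this call path is where the overhead shows up
```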

> Finch is really only optimized for sparse tensors. In the dense case, Finch is only as good as writing "for i; for j; for k; ...".

Ah, that's a pity for the dense case. My understanding was that it wasn't fully optimized but could still do something more. How about a sparse-dense pairwise contraction? And block-diagonal tensors?

@willow-ahrens

willow-ahrens commented Nov 27, 2024

Sparse-dense runs well in Finch. We're currently considering approaches to offload the dense Finch code to an optimized dense framework for an added performance improvement.
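For reference, a minimal sparse-dense contraction might look like the sketch below (the storage format and sizes are arbitrary choices; the @einsum macro is the one mentioned above):

```julia
# Illustrative sketch: the storage format and sizes are arbitrary choices.
using Finch

A = Tensor(Dense(SparseList(Element(0.0))), fsprand(1000, 1000, 0.01))  # sparse operand
B = rand(1000, 50)                                                      # dense operand

# Contract the shared index j; Finch specializes the generated loop nest
# to iterate only over the stored entries of A.
@einsum C[i, k] += A[i, j] * B[j, k]
```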

@willow-ahrens

Finch can do block matrices if you represent them as a 4-tensor. We're currently working on more streamlined approaches for block matrices, but the current format would be:

`Tensor(Dense(SparsePinpoint(Dense(Dense(Element(0.0))))))`
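For anyone trying this, the index convention behind "block matrix as a 4-tensor" is just a reshape: within-block coordinates become the inner pair of indices and block coordinates the outer pair, which is the ordering the format above sits on top of. A plain-Julia illustration (names and sizes are made up):

```julia
# Plain-Julia illustration of the block-matrix-as-4-tensor layout.
nblocks, bsize = 4, 8
M = rand(nblocks * bsize, nblocks * bsize)   # a block-partitioned matrix

# Column-major reshape gives indices (i, I, j, J); reorder so the within-block
# indices come first: T[i, j, I, J] == M[(I-1)*bsize + i, (J-1)*bsize + j]
T = permutedims(reshape(M, bsize, nblocks, bsize, nblocks), (1, 3, 2, 4))
size(T)  # (8, 8, 4, 4)
```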

@willow-ahrens

> In our case, einsum expressions are chosen dynamically at runtime and can be big (involving around 30 indices or more). We're not using the @einsum macro but the methods below it, and the overhead is still about 5 orders of magnitude larger than calling a Reactant.jl-compiled function.

If big kernels are the goal, I would try using the Galley scheduler. It was designed to break big einsums up into manageable pieces. I'll mention @kylebd99 as the lead author of that scheduler.

@mofeing
Member Author

mofeing commented Nov 28, 2024

> Finch can do block matrices if you represent them as a 4-tensor.

And how about general order-n tensors?

> If big kernels are the goal, I would try using the Galley scheduler. It was designed to break big einsums up into manageable pieces. I'll mention @kylebd99 as the lead author of that scheduler.

Do you mean this paper? https://arxiv.org/pdf/2408.14706v2

@willow-ahrens

Yes, Finch supports general order-n tensors. For example,

```julia
julia> using Finch

julia> N = 100
100

julia> A = Tensor(CSFFormat(3), fsprand(N, N, N, 0.001)); B = rand(N, N); C = rand(N, N);

julia> ndims(A)
3

julia> @einsum D[i, j] += A[i, k, l] * B[j, k] * C[j, l]
```

Galley is included with Finch, and can be invoked as:

```julia
julia> using Finch, BenchmarkTools

julia> A = fsprand(1000, 1000, 0.1); B = fsprand(1000, 1000, 0.1); C = fsprand(1000, 1000, 0.0001);

julia> A = lazy(A); B = lazy(B); C = lazy(C);

julia> sum(A * B * C)

julia> @btime compute(sum(A * B * C));
  263.612 ms (1012 allocations: 185.08 MiB)

julia> @btime compute(sum(A * B * C), ctx=galley_scheduler());
  153.708 μs (667 allocations: 29.02 KiB)
```
