Efficiency of matrix assembly #561

edljk · 2022-12-15T18:37:15Z


Is there a particular reason for the quadratic number of shape_gradient calls in the assembly routines of several documentation examples? It seems to me significantly more efficient to replace (for instance)

dΩ = getdetJdV(cellvalues, q_point)
for i in 1:n_basefuncs
    ∇v = shape_gradient(cellvalues, q_point, i)
    for j in 1:n_basefuncs
        ∇u = shape_gradient(cellvalues, q_point, j)
        Ke[i, j] += (∇v ⋅ ∇u) * dΩ
    end
end

by something like

dΩ = getdetJdV(cellvalues, q_point)
∇uv = [shape_gradient(cellvalues, q_point, k) for k in 1:n_basefuncs]
for i in 1:n_basefuncs
    ∇v = ∇uv[i]
    for j in 1:n_basefuncs
        ∇u = ∇uv[j]                        
        Ke[i, j] += (∇v ⋅ ∇u) * dΩ
    end
end

No? I am probably missing something.

termi-official (Collaborator) · 2022-12-15T18:39:11Z

If you use the same space for your test and trial functions, then you could do something along these lines. However, it will likely be slower, because the call to shape_gradient is already just a lookup.
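To make the comparison concrete, here is a minimal sketch of the two loop structures in plain Julia. It does not use Ferrite: `ToyValues`, `lookup_gradient` and `dot2` are hypothetical stand-ins, assuming only that the gradients sit in a matrix indexed as `[base_func, q_point]`. It shows that both variants build the same element matrix and differ only in where the lookup happens.

```julia
# Hypothetical stand-in for CellValues: gradients are stored in a matrix
# indexed as dNdx[base_func, q_point]; each entry is a 2-tuple.
struct ToyValues
    dNdx::Matrix{NTuple{2,Float64}}
end

# Assumed accessor: nothing more than an array lookup.
lookup_gradient(tv::ToyValues, q_point::Int, i::Int) = tv.dNdx[i, q_point]

# Dot product of two 2-tuples.
dot2(a, b) = a[1] * b[1] + a[2] * b[2]

# Variant A: look up both gradients inside the double loop.
function assemble_direct!(Ke, tv, q_point, dΩ)
    n = size(tv.dNdx, 1)
    for i in 1:n
        ∇v = lookup_gradient(tv, q_point, i)
        for j in 1:n
            ∇u = lookup_gradient(tv, q_point, j)
            Ke[i, j] += dot2(∇v, ∇u) * dΩ
        end
    end
    return Ke
end

# Variant B: copy the gradients into a preallocated buffer first,
# then read from the buffer in the double loop.
function assemble_cached!(Ke, buf, tv, q_point, dΩ)
    n = size(tv.dNdx, 1)
    for k in 1:n
        buf[k] = lookup_gradient(tv, q_point, k)
    end
    for i in 1:n
        ∇v = buf[i]
        for j in 1:n
            Ke[i, j] += dot2(∇v, buf[j]) * dΩ
        end
    end
    return Ke
end

n, nq = 10, 4
tv = ToyValues([(sin(i + q), cos(i * q)) for i in 1:n, q in 1:nq])
Ka = assemble_direct!(zeros(n, n), tv, 1, 0.5)
Kb = assemble_cached!(zeros(n, n), Vector{NTuple{2,Float64}}(undef, n), tv, 1, 0.5)
```

In this toy setting variant B replaces one array read (`tv.dNdx[j, q_point]`) by another (`buf[j]`) and adds a copy loop, so any speedup would have to come from memory layout rather than from saved computation.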


edljk (Author) · 2022-12-15T18:48:41Z

In my very naive tests the gain is sometimes significant.

fredrikekre (Maintainer) · 2022-12-15T18:51:00Z

You can check https://ferrite-fem.github.io/Ferrite.jl/stable/examples/stokes-flow/ for an example where such caching is done. The buffers are allocated in

Ferrite.jl/docs/src/literate/stokes-flow.jl

Lines 386 to 389 in 570b3b5

    ϕᵤ = Vector{Vec{2,Float64}}(undef, ndofs_u)
    ∇ϕᵤ = Vector{Tensor{2,2,Float64,4}}(undef, ndofs_u)
    divϕᵤ = Vector{Float64}(undef, ndofs_u)
    ϕₚ = Vector{Float64}(undef, ndofs_p)

and filled in

Ferrite.jl/docs/src/literate/stokes-flow.jl

Lines 397 to 404 in 570b3b5

    for i in 1:ndofs_u
        ϕᵤ[i] = shape_value(cvu, qp, i)
        ∇ϕᵤ[i] = shape_gradient(cvu, qp, i)
        divϕᵤ[i] = shape_divergence(cvu, qp, i)
    end
    for i in 1:ndofs_p
        ϕₚ[i] = shape_value(cvp, qp, i)
    end

The important thing is that you allocate ∇uv outside of the element routine, otherwise you pay the allocation cost for every element.

What did you measure? Just the assembly routine or the full global assembly?
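The allocation point can be illustrated with a small sketch in plain Julia, independent of Ferrite. The function names `element_alloc` and `element_prealloc!` are hypothetical, and the "element routine" is reduced to summing the first gradient component. The first version builds a fresh vector on every call, the second reuses a buffer that the caller allocated once.

```julia
# Allocating variant: a new vector is created on every call,
# as the comprehension in the original suggestion would do.
function element_alloc(grads)
    buf = [g for g in grads]      # fresh allocation per call
    return sum(first, buf)
end

# Preallocated variant: the caller owns the buffer and reuses it.
function element_prealloc!(buf, grads)
    copyto!(buf, grads)           # overwrite the existing storage
    return sum(first, buf)
end

grads = [(Float64(k), 2.0 * k) for k in 1:10]
buf = similar(grads)              # allocated once, outside the element routine

# Warm up so that compilation is not counted below.
element_alloc(grads)
element_prealloc!(buf, grads)

bytes_alloc = @allocated element_alloc(grads)
bytes_prealloc = @allocated element_prealloc!(buf, grads)
println("allocating: ", bytes_alloc, " bytes, preallocated: ", bytes_prealloc, " bytes")
```

The exact byte counts depend on the Julia version and on how the measurement is called, so only the relative difference is meaningful here.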


edljk (Author) · 2022-12-15T18:58:45Z

Oops, I missed that example. Yes, I know my suggestion can be improved regarding allocation. It was just an illustration, which already has an impact when timing the whole assembly:

Using precomputation:

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 assemb K             1    10.1s   86.7%   10.1s   3.94GiB   80.4%  3.94GiB
   ∇uv             402k    229ms    2.0%   570ns    178MiB    3.5%     464B
   comp dΩ         402k   60.7ms    0.5%   151ns   6.13MiB    0.1%    16.0B
   assemble!      5.50k   9.60ms    0.1%  1.74μs    344KiB    0.0%    64.0B

Initial implementation:

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 assemb K             1    21.4s   93.4%   21.4s   3.77GiB   79.6%  3.77GiB
   ∇u             40.2M    5.22s   22.8%   130ns   1.20GiB   25.3%    32.0B
   ∇v             4.02M    561ms    2.5%   140ns    123MiB    2.5%    32.0B
   comp dΩ         402k   51.4ms    0.2%   128ns   6.13MiB    0.1%    16.0B
   assemble!      5.50k   9.44ms    0.0%  1.71μs    344KiB    0.0%    64.0B

fredrikekre (Maintainer) · 2022-12-15T19:01:37Z

That looks very suspicious. What are you using to measure that? In particular, ∇u should be completely allocation free.

4 replies

koehlerson Dec 15, 2022
Collaborator

looks like TimerOutputs.jl

edljk Dec 15, 2022
Author

Yes it is

fredrikekre Dec 15, 2022
Maintainer

Yeah. Then I think putting @timeit calls in such a hot loop is what causes the huge slowdown.

fredrikekre Dec 15, 2022
Maintainer

See e.g. https://github.com/KristofferC/TimerOutputs.jl#overhead
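The effect of timing inside a hot loop can be sketched without TimerOutputs. The example below is only an illustration of the principle: `ToyTimer` is a hypothetical hand-rolled timer built on `time_ns()`, not the TimerOutputs implementation, and the loop body is a single multiply-add chosen to be much cheaper than the timing calls around it.

```julia
# Hypothetical hand-rolled timer, standing in for a per-call timing macro.
mutable struct ToyTimer
    ncalls::Int
    total_ns::UInt64
end

# Every iteration is wrapped in two clock reads plus bookkeeping.
function sum_instrumented(xs, timer)
    s = 0.0
    for x in xs
        t0 = time_ns()
        s += x * x
        timer.total_ns += time_ns() - t0
        timer.ncalls += 1
    end
    return s
end

# Same work without any instrumentation.
function sum_plain(xs)
    s = 0.0
    for x in xs
        s += x * x
    end
    return s
end

xs = rand(10^6)
timer = ToyTimer(0, 0)

# Warm up so that compilation is not part of the measurement.
sum_plain(xs)
sum_instrumented(xs, ToyTimer(0, 0))

t_plain = @elapsed sum_plain(xs)
t_instr = @elapsed sum_instrumented(xs, timer)
println("plain: ", t_plain, " s, instrumented: ", t_instr, " s")
```

When the timed section takes on the order of a nanosecond, the clock reads themselves dominate, so the per-section numbers mostly measure the instrumentation. Timing the whole loop from outside avoids this.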

edljk (Author) · 2022-12-15T19:07:50Z

Great, thanks, and sorry for the noise.
I do not know whether it could also help explain the observations, but I am using P3 elements with a second-order boundary.

1 reply

koehlerson Dec 15, 2022
Collaborator

Could you maybe @btime it with BenchmarkTools? I'm curious to see the comparison, but if it's too much to change, then never mind :)

fredrikekre (Maintainer) · 2022-12-15T19:32:32Z

As a point of reference, for the heat equation with 200x200 elements and third order interpolation, assembly takes slightly longer when caching the values (140ms without caching, 155ms with caching). This is likely because the caching only shuffles data around, and you then have to do almost the equivalent lookup later anyway (i.e. accessing ∇v = ∇uv[i]).

As pointed out above, shape_value and shape_gradient are just lookups (see

Ferrite.jl/src/FEValues/common_values.jl

Line 84 in 570b3b5

    @propagate_inbounds shape_value(cv::CellValues, q_point::Int, base_func::Int) = cv.N[base_func, q_point]

and

Ferrite.jl/src/FEValues/common_values.jl

Line 96 in 570b3b5

    @propagate_inbounds shape_gradient(cv::CellValues, q_point::Int, base_func::Int) = cv.dNdx[base_func, q_point]

). This might change if you use e.g. shape_divergence or shape_symmetric_gradient, which also do some very trivial computation (see

Ferrite.jl/src/FEValues/common_values.jl

Line 114 in 570b3b5

    @propagate_inbounds shape_divergence(cv::CellScalarValues, q_point::Int, base_func::Int) = sum(cv.dNdx[base_func, q_point])

and

Ferrite.jl/src/FEValues/common_values.jl

Line 105 in 570b3b5

    @propagate_inbounds shape_symmetric_gradient(cv::CellVectorValues, q_point::Int, base_func::Int) = symmetric(shape_gradient(cv, q_point, base_func))

). However, compared to assembling into the stiffness matrix, or even solving the linear system, I don't expect the caching to be very beneficial in many cases. But as always, benchmarking is your friend :D

1 reply

termi-official Dec 15, 2022
Collaborator

I want to leave a reference from Prof. Wells here, whose group has done some benchmarking on this over the past decade: https://youtu.be/D-YcVd4-_2E?t=2550 . In their analyses they concluded that for higher order elements too much caching becomes a problem, because the cache spills earlier.

edljk (Author) · 2022-12-15T19:37:53Z

Thank you so much for the quick responses.
Yes, it seems that my misplaced use of @timeit is the explanation.
Using @btime with the following pieces of code (I am not completely sure this is the right way to do it), I can no longer observe a significant difference.

Using precomputation:

@btime begin
    # fill the preallocated buffer with the gradients at this quadrature point
    for k in 1:$n_basefuncs
        $∇uv[k] = shape_gradient($cellvalues, $q_point, k)
    end
    for i in 1:$n_basefuncs
        ∇v = $∇uv[i]
        for j in 1:$n_basefuncs
            ∇u = $∇uv[j]
            $Ke[i, j] += (∇v ⋅ ∇u) * $dΩ
        end
    end
end

  139.512 ns (0 allocations: 0 bytes)
  111.476 ns (0 allocations: 0 bytes)
  142.031 ns (0 allocations: 0 bytes)
  111.493 ns (0 allocations: 0 bytes)
  139.515 ns (0 allocations: 0 bytes)
  111.469 ns (0 allocations: 0 bytes)
  142.025 ns (0 allocations: 0 bytes)
  111.829 ns (0 allocations: 0 bytes)
...

Initial implementation:

@btime begin
    for i in 1:$n_basefuncs
        ∇v = shape_gradient($cellvalues, $q_point, i)
        for j in 1:$n_basefuncs
            ∇u = shape_gradient($cellvalues, $q_point, j)
            $Ke[i, j] += (∇v ⋅ ∇u) * $dΩ
        end
    end
end

  116.681 ns (0 allocations: 0 bytes)
  150.695 ns (0 allocations: 0 bytes)
  118.124 ns (0 allocations: 0 bytes)
  149.268 ns (0 allocations: 0 bytes)
  117.144 ns (0 allocations: 0 bytes)
  150.820 ns (0 allocations: 0 bytes)
  115.594 ns (0 allocations: 0 bytes)
  149.268 ns (0 allocations: 0 bytes)
...
