Cuda heat example w quaditer #913

Draft
wants to merge 142 commits into base: master
Changes from 139 commits
Commits (142)
a979fb2
Initial ideas
KnutAM Jan 11, 2024
298158c
Working implementation
KnutAM Jan 11, 2024
1794db3
Merge branch 'master' into kam/QuadraturePointIterator
KnutAM Feb 24, 2024
51ab4f2
Add static values version and improve interface
KnutAM Feb 24, 2024
22a7377
Add dev example and test
KnutAM Feb 24, 2024
18377f3
Merge branch 'master' into kam/QuadraturePointIterator
KnutAM Feb 28, 2024
27a3a96
Add StaticCellValues without stored cell coordinates
KnutAM Feb 28, 2024
95b5729
initial ideas
Abdelrahman912 May 7, 2024
d4e881d
minor changes
Abdelrahman912 May 14, 2024
f55b878
Merge branch 'Ferrite-FEM:master' into cuda-heat-example-w-quaditer
Abdelrahman912 May 14, 2024
c1ef6ad
add some abstractions
Abdelrahman912 May 23, 2024
394ac6a
add minor comment
Abdelrahman912 May 23, 2024
1f0df67
add z dierction for numerical integration
Abdelrahman912 May 30, 2024
3152042
add Float32
Abdelrahman912 Jun 4, 2024
aac5994
minor fix
Abdelrahman912 Jun 4, 2024
142f89a
init coloring implementation
Abdelrahman912 Jun 18, 2024
eaff534
init working on the assembler
Abdelrahman912 Jun 19, 2024
ffdc341
init gpu_assembler
Abdelrahman912 Jun 20, 2024
59595e8
implement naive gpu_assembler
Abdelrahman912 Jun 20, 2024
0e3cb21
minor fix
Abdelrahman912 Jun 20, 2024
687141d
use CuSparseMatrixCSC in assembler
Abdelrahman912 Jun 26, 2024
11d5a01
minor fix
Abdelrahman912 Jun 26, 2024
d5c951c
minor fix
Abdelrahman912 Jun 26, 2024
f4272a6
hoist dh, cellvalues, assembler outside the cuda loop
Abdelrahman912 Jun 26, 2024
d5cf949
add run_gpu macro
Abdelrahman912 Jun 26, 2024
2e52de1
init using int32 instead of int64 to reduce number of registers
Abdelrahman912 Jul 3, 2024
2cd0168
finish use int32
Abdelrahman912 Jul 3, 2024
54922ab
stupid way to circumvent rubbish values
Abdelrahman912 Jul 4, 2024
9406ff9
add discorse ref
Abdelrahman912 Jul 4, 2024
8fedba5
add ncu benchmark
Abdelrahman912 Jul 4, 2024
8bd417a
fix error in benchmark and add ref.
Abdelrahman912 Jul 4, 2024
abf11b6
set the code for debugging
Abdelrahman912 Jul 8, 2024
4f85cf5
init test
Abdelrahman912 Jul 8, 2024
4935b70
fix adapt issue
Abdelrahman912 Jul 8, 2024
188cceb
remove unnecessary cushow
Abdelrahman912 Jul 8, 2024
9c904e4
add heat equation main test set
Abdelrahman912 Jul 8, 2024
06432db
remove unncessary comments
Abdelrahman912 Jul 8, 2024
a67caaa
add nsys benchmark
Abdelrahman912 Jul 8, 2024
ecee17f
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Jul 8, 2024
60edda9
fix some issues regarding the merge
Abdelrahman912 Jul 8, 2024
063ff7a
minor fix
Abdelrahman912 Jul 8, 2024
9206be3
remove nsight files
Abdelrahman912 Jul 8, 2024
1eeb568
minor fix
Abdelrahman912 Jul 8, 2024
5e339a0
add comments
Abdelrahman912 Jul 8, 2024
204f3be
minor fix
Abdelrahman912 Jul 8, 2024
0f2e6b7
add comments
Abdelrahman912 Jul 8, 2024
7100e0a
fix for CI
Abdelrahman912 Jul 8, 2024
f129449
fix for CI
Abdelrahman912 Jul 8, 2024
618adb5
CI fix
Abdelrahman912 Jul 8, 2024
78f120c
ci
Abdelrahman912 Jul 8, 2024
4971cba
minor fix
Abdelrahman912 Jul 8, 2024
ea8451c
fix ci
Abdelrahman912 Jul 8, 2024
986c5db
remove file
Abdelrahman912 Jul 8, 2024
f93fdfb
add CUDA to docs project
Abdelrahman912 Jul 8, 2024
f442ae2
add v2 for gpu_heat_equation
Abdelrahman912 Jul 15, 2024
81274d5
add adapt to docs
Abdelrahman912 Jul 15, 2024
fbc05ed
minor fix
Abdelrahman912 Jul 22, 2024
506328c
init assemble per dof
Abdelrahman912 Jul 22, 2024
b505189
assemble global v3
Abdelrahman912 Jul 22, 2024
b0a94aa
minor fix
Abdelrahman912 Jul 22, 2024
aa3d1ae
add comment + start in v4
Abdelrahman912 Jul 31, 2024
c8cf6fe
add map dof to elements
Abdelrahman912 Jul 31, 2024
8a4523d
add 3d array for local matrices
Abdelrahman912 Aug 1, 2024
9617a4f
init code for v4
Abdelrahman912 Aug 1, 2024
427a6b0
fix bug w assemble global in v4
Abdelrahman912 Aug 5, 2024
bbed047
precommit fix
Abdelrahman912 Aug 5, 2024
85c055c
add preserve ref
Abdelrahman912 Aug 5, 2024
2b77613
fix precommit
Abdelrahman912 Aug 5, 2024
f9c70ab
fix logic error in v4
Abdelrahman912 Sep 7, 2024
0519016
init shared array usage
Abdelrahman912 Sep 9, 2024
5752676
optimize threads for dynamic shared memory threshold
Abdelrahman912 Sep 10, 2024
0fe023c
fix bug in dynamic shared mem
Abdelrahman912 Sep 11, 2024
a352612
minor fix
Abdelrahman912 Sep 11, 2024
2a6120a
init kernel abstractions
Abdelrahman912 Sep 16, 2024
67face7
add local matrix kernel
Abdelrahman912 Sep 16, 2024
aca8a6f
add global matrix kernel with CUDA dependency
Abdelrahman912 Sep 16, 2024
9e4d592
minor change
Abdelrahman912 Sep 16, 2024
6114495
init working KS implementation (still CUDA dependent )
Abdelrahman912 Sep 17, 2024
2a8abeb
remove cuda dependency
Abdelrahman912 Sep 18, 2024
630017c
add refrence to
Abdelrahman912 Sep 18, 2024
fc26670
use Atomix.jl
Abdelrahman912 Sep 20, 2024
ae7bc93
init v4 ks
Abdelrahman912 Sep 20, 2024
0e28f14
init cell cache prototype
Abdelrahman912 Sep 23, 2024
0eb376d
working gpu cell cache
Abdelrahman912 Sep 23, 2024
8f7a182
fix types
Abdelrahman912 Sep 23, 2024
9b1567d
init gpu cell iterator
Abdelrahman912 Sep 23, 2024
a08ab97
add iterator
Abdelrahman912 Sep 25, 2024
b34c43b
add stride kernel
Abdelrahman912 Sep 26, 2024
b289b69
minor fix
Abdelrahman912 Sep 26, 2024
b2c0347
fix blocks, threads for kernel launch
Abdelrahman912 Sep 27, 2024
b87d78b
minor fix for thread, blocks
Abdelrahman912 Sep 27, 2024
e10e2f6
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 3, 2024
42a28e1
add gpu as extension
Abdelrahman912 Oct 4, 2024
e59b8b8
add some documentaion and remove unnecessary implementations.
Abdelrahman912 Oct 7, 2024
e7157e4
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 10, 2024
e4b194d
init unit test
Abdelrahman912 Oct 10, 2024
a613107
init test for iterators
Abdelrahman912 Oct 11, 2024
113a7a2
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 11, 2024
d1e831e
add tests in GPU/
Abdelrahman912 Oct 11, 2024
7f8fa3c
add test local ke and fe
Abdelrahman912 Oct 11, 2024
c38419c
minor fix
Abdelrahman912 Oct 11, 2024
190e43e
fix ci - 1
Abdelrahman912 Oct 11, 2024
763c6b5
fix ci-2
Abdelrahman912 Oct 11, 2024
1b6060d
minor edit
Abdelrahman912 Oct 11, 2024
d767668
fix ci
Abdelrahman912 Oct 12, 2024
726ea9e
ci
Abdelrahman912 Oct 12, 2024
8590aa4
fix ci
Abdelrahman912 Oct 12, 2024
39e1f0c
minor edit
Abdelrahman912 Oct 12, 2024
f0cd305
add validation for cuda, minor fix, seperate unit tests into multiple…
Abdelrahman912 Oct 14, 2024
9d4e8b9
fix precommit shit
Abdelrahman912 Oct 14, 2024
12f64bb
try documentation test fix
Abdelrahman912 Oct 14, 2024
361333b
documentation test fix
Abdelrahman912 Oct 14, 2024
e31c6e3
make ci happy
Abdelrahman912 Oct 14, 2024
626dec2
change kernel launch, init adapt test
Abdelrahman912 Oct 15, 2024
fbc1b4b
minor fix
Abdelrahman912 Oct 15, 2024
ea83925
add test_adapt, some comments
Abdelrahman912 Oct 15, 2024
a356d8d
fix precommit
Abdelrahman912 Oct 15, 2024
ee1f77c
init cpu multi threading
Abdelrahman912 Nov 4, 2024
fb7e1fc
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Nov 4, 2024
b38ab72
hot fix for buggy assembly logic
Abdelrahman912 Nov 5, 2024
adb166a
minor fix
Abdelrahman912 Nov 6, 2024
6300a4a
test sth
Abdelrahman912 Nov 6, 2024
b7301c2
precommit fix
Abdelrahman912 Nov 6, 2024
18f47b8
fix explicit imports
Abdelrahman912 Nov 6, 2024
f6e9cc6
add fillzero
Abdelrahman912 Nov 6, 2024
8a796de
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Nov 6, 2024
75e89ed
minor fix for gpu assembly
Abdelrahman912 Nov 6, 2024
a77c347
minor minor fix
Abdelrahman912 Nov 6, 2024
7338788
make cache mutable
Abdelrahman912 Nov 12, 2024
cbab665
put the coloring stuff in the init
Abdelrahman912 Nov 12, 2024
1c81281
minor fix
Abdelrahman912 Nov 12, 2024
d42bcab
code for benchmarking (to be removed)
Abdelrahman912 Nov 13, 2024
1ab1650
rm cpu multithreading benchmark code
Abdelrahman912 Nov 13, 2024
bc8ec95
init fix for higher order approximations in gpu
Abdelrahman912 Nov 18, 2024
c7f4b0f
add working imp for global gpu mem
Abdelrahman912 Nov 18, 2024
d4d5967
add some comments
Abdelrahman912 Nov 18, 2024
3b2196b
trying to make the ci happy
Abdelrahman912 Nov 19, 2024
825d257
minor fix
Abdelrahman912 Nov 19, 2024
6109bd1
comment gpu related stuff in eg to pass ci
Abdelrahman912 Nov 19, 2024
9caa60b
some review fixes
Abdelrahman912 Nov 25, 2024
868d559
some review fixes
Abdelrahman912 Nov 25, 2024
a4637b6
add allocate_matrix for CuSparseMatrix
Abdelrahman912 Nov 26, 2024
4 changes: 4 additions & 0 deletions Project.toml
@@ -16,11 +16,14 @@ Tensors = "48a634ad-e948-5137-8d70-aa71f2a747f4"
WriteVTK = "64499a7a-5c06-52f2-abe2-ccb03c286192"

[weakdeps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Metis = "2679e427-3c69-5b7f-982b-ece356f1e94b"

[extensions]
FerriteBlockArrays = "BlockArrays"
FerriteGPU = ["CUDA", "Adapt"]
FerriteMetis = "Metis"

[compat]
@@ -34,6 +37,7 @@ Preferences = "1"
Reexport = "1"
StaticArrays = "1"
Tensors = "1.14"
TimerOutputs = "0.5.25"
WriteVTK = "1.13"
julia = "1.10"

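Since CUDA and Adapt are only weak dependencies, the FerriteGPU extension activates once both are loaded next to Ferrite. A minimal sketch of how one might verify this in a Julia ≥ 1.10 session, as targeted by the compat entry above (the check itself is not part of this PR):

using Ferrite
using CUDA, Adapt  # loading both weak deps triggers the FerriteGPU extension
@assert Base.get_extension(Ferrite, :FerriteGPU) !== nothing  # extension module is now available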
295 changes: 269 additions & 26 deletions docs/Manifest.toml

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions docs/Project.toml
@@ -1,5 +1,7 @@
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Changelog = "5217a498-cd5d-4ec6-b8c2-9b85a09b6e3e"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterCitations = "daee34ce-89f3-4625-b898-19384cb65244"
182 changes: 182 additions & 0 deletions docs/src/literate-tutorials/gpu_qp_heat_equation.jl
@@ -0,0 +1,182 @@
using Ferrite
using StaticArrays
using SparseArrays
using CUDA


left = Tensor{1, 2, Float32}((0, -0)) # define the left bottom corner of the grid.

right = Tensor{1, 2, Float32}((1000.0, 1000.0)) # define the right top corner of the grid.


grid = generate_grid(Quadrilateral, (1000, 1000), left, right)


ip = Lagrange{RefQuadrilateral, 2}() # define the interpolation function (second-order Lagrange on the reference quadrilateral)


qr = QuadratureRule{RefQuadrilateral}(Float32, 3)


cellvalues = CellValues(Float32, qr, ip)


dh = DofHandler(grid)

add!(dh, :u, ip)

close!(dh);


dh |> get_grid

# Standard assembly of the element.
function assemble_element_std!(Ke::Matrix, fe::Vector, cellvalues::CellValues)
n_basefuncs = getnbasefunctions(cellvalues)
## Loop over quadrature points
for q_point in 1:getnquadpoints(cellvalues)
## Get the quadrature weight
dΩ = getdetJdV(cellvalues, q_point)
## Loop over test shape functions
for i in 1:n_basefuncs
δu = shape_value(cellvalues, q_point, i)
∇δu = shape_gradient(cellvalues, q_point, i)
## Add contribution to fe
fe[i] += δu * dΩ
## Loop over trial shape functions
for j in 1:n_basefuncs
∇u = shape_gradient(cellvalues, q_point, j)
## Add contribution to Ke
Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
end
end
end
return Ke, fe
end


function create_buffers(cellvalues, dh)
f = zeros(ndofs(dh))
K = allocate_matrix(dh)
assembler = start_assemble(K, f)
## Local quantities
n_basefuncs = getnbasefunctions(cellvalues)
Ke = zeros(n_basefuncs, n_basefuncs)
fe = zeros(n_basefuncs)
return (; f, K, assembler, Ke, fe)
end


# Standard global assembly


function assemble_global!(cellvalues, dh::DofHandler, qp_iter::Val{QPiter}) where {QPiter}
(; f, K, assembler, Ke, fe) = create_buffers(cellvalues, dh)
## Loop over all cells
for cell in CellIterator(dh)
fill!(Ke, 0)
fill!(fe, 0)
if QPiter
## reinit!(cellvalues, getcoordinates(cell))
assemble_element_qpiter!(Ke, fe, cellvalues, getcoordinates(cell))
else
## Reinitialize cellvalues for this cell
reinit!(cellvalues, cell)
## Compute element contribution
assemble_element_std!(Ke, fe, cellvalues)
end
## Assemble Ke and fe into K and f
assemble!(assembler, celldofs(cell), Ke, fe)
end
return K, f
end


## gpu version of element assembly


function assemble_element!(Ke, fe, cv, cell)
n_basefuncs = getnbasefunctions(cv)
for qv in Ferrite.QuadratureValuesIterator(cv, getcoordinates(cell))
## Get the quadrature weight
dΩ = getdetJdV(qv)
## Loop over test shape functions
for i in 1:n_basefuncs
δu = shape_value(qv, i)
∇u = shape_gradient(qv, i)
## Add contribution to fe
fe[i] += δu * dΩ
## fe_shared[tx,i] += δu * dΩ
## Loop over trial shape functions
for j in 1:n_basefuncs
∇δu = shape_gradient(qv, j)
## Add contribution to Ke
Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
end
end
end
return
end


# gpu version of global assembly
function assemble_gpu!(Kgpu, fgpu, cv, dh)
n_basefuncs = getnbasefunctions(cv)
assembler = start_assemble(Kgpu, fgpu; fillzero = false) ## has to always be false here
for cell in CellIterator(dh, convert(Int32, n_basefuncs))
Ke = cellke(cell)
fe = cellfe(cell)
assemble_element!(Ke, fe, cv, cell)
assemble!(assembler, celldofs(cell), Ke, fe)
end
return nothing
end


n_basefuncs = getnbasefunctions(cellvalues)

## Allocate CPU matrix
K = allocate_matrix(SparseMatrixCSC{Float32, Int32}, dh);

#K = allocate_matrix(SparseMatrixCSC{Float64, Int64}, dh);
f = zeros(ndofs(dh));

# Allocate GPU matrix
## commented to pass the test
## Kgpu = CUSPARSE.CuSparseMatrixCSC(K);
## fgpu = CUDA.zeros(ndofs(dh));

n_cells = dh |> get_grid |> getncells

# Kernel configuration
## GPU kernel ##
## commented to pass the test
## First init the kernel with the required config.
## gpu_kernel = init_kernel(BackendCUDA, n_cells, n_basefuncs, assemble_gpu!, (Kgpu, fgpu, cellvalues, dh))
## Then launch the kernel
## gpu_kernel |> launch! or gpu_kernel()
## gpu_kernel()

## CPU kernel ##
## cpu_kernel = init_kernel(BackendCPU, n_cells, n_basefuncs, assemble_gpu!, (K, f, cellvalues, dh));
## cpu_kernel()

stassy(cv, dh) = assemble_global!(cv, dh, Val(false))

norm(K)
## commented to pass the test
## norm(Kgpu)
Kstd, Fstd = stassy(cellvalues, dh);
norm(Kstd)


## GPU Benchmarking, remove when not needed ##
## function bench_gpu(n_cells, n_basefuncs, cellvalues, dh)
## Kgpu = CUSPARSE.CuSparseMatrixCSC(K);
## fgpu = CUDA.zeros(ndofs(dh));
## gpu_kernel = init_kernel(BackendCUDA, n_cells, n_basefuncs, assemble_gpu!, (Kgpu, fgpu, cellvalues, dh))
## gpu_kernel()
## end

Review comment (Member): I think we are missing the analogue benchmark using QuadraturePointIterator.
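
A hedged sketch of what such an analogue benchmark could look like, reusing the Val(true) path of assemble_global! from the tutorial above. The assemble_element_qpiter! routine below is an assumption that simply mirrors the GPU-style assemble_element! on the CPU via Ferrite.QuadratureValuesIterator (matching the call assemble_element_qpiter!(Ke, fe, cellvalues, getcoordinates(cell)) inside assemble_global!); qpassy and the @elapsed timing are purely illustrative:

function assemble_element_qpiter!(Ke, fe, cellvalues, coords)
    n_basefuncs = getnbasefunctions(cellvalues)
    for qv in Ferrite.QuadratureValuesIterator(cellvalues, coords)
        ## Quadrature weight times Jacobian determinant
        dΩ = getdetJdV(qv)
        for i in 1:n_basefuncs
            δu = shape_value(qv, i)
            ∇δu = shape_gradient(qv, i)
            fe[i] += δu * dΩ
            for j in 1:n_basefuncs
                ∇u = shape_gradient(qv, j)
                Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
    return Ke, fe
end

qpassy(cv, dh) = assemble_global!(cv, dh, Val(true))  ## quadrature-point-iterator analogue of stassy
@elapsed qpassy(cellvalues, dh)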

## CUDA.@time bench_gpu(n_cells, n_basefuncs, cellvalues, dh)
## CUDA.@profile trace = true bench_gpu(n_cells, n_basefuncs, cellvalues, dh)
Review comment (Member):
For benchmarking purposes we should separate setup (and hence setup cost) from the actual assembly performance. E.g.

function setup_bench_gpu(n_cells, n_basefuncs, cellvalues, dh)
    Kgpu = CUSPARSE.CuSparseMatrixCSC(K);
    fgpu = CUDA.zeros(ndofs(dh));
    gpu_kernel = init_kernel(BackendCUDA, n_cells, n_basefuncs, assemble_gpu!, (Kgpu, fgpu, cellvalues, dh))
end

gpu_kernel = setup_bench_gpu(n_cells, n_basefuncs, cellvalues, dh);
CUDA.@time gpu_kernel()
CUDA.@profile trace = true gpu_kernel()

This uncovers on my machine that there is still memory transfer going on in the actual assembly code:

julia> CUDA.@profile trace = true gpu_kernel()
Profiler ran for 251.96 ms, capturing 1522 events.

Host-side activity: calling CUDA APIs took 110.01 ms (43.66% of the trace)
┌──────┬───────────┬───────────┬────────┬─────────────────────────┬────────────────────────────┐
│   ID │     Start │      Time │ Thread │ Name                    │                    Details │
├──────┼───────────┼───────────┼────────┼─────────────────────────┼────────────────────────────┤
│   21 │  21.55 ms │  27.66 µs │      1 │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│   35 │   21.6 ms │   5.01 µs │      1 │ cuStreamSynchronize     │                          - │
│   44 │  21.62 ms │   6.08 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│   53 │  27.73 ms │   21.7 µs │      1 │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│   67 │  27.76 ms │   4.77 µs │      1 │ cuStreamSynchronize     │                          - │
│   76 │  27.77 ms │   3.72 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│   85 │  32.42 ms │  24.08 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│   99 │  32.46 ms │   4.77 µs │      1 │ cuStreamSynchronize     │                          - │
│  108 │  32.47 ms │ 458.72 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  117 │  32.94 ms │   5.01 µs │      1 │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│  230 │  33.01 ms │   1.43 µs │      1 │ cuStreamSynchronize     │                          - │
│  239 │  33.02 ms │ 934.36 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  248 │  35.83 ms │  23.84 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  262 │  35.87 ms │   5.01 µs │      1 │ cuStreamSynchronize     │                          - │
│  271 │  35.88 ms │ 458.96 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  288 │  36.39 ms │   7.15 µs │      1 │ cuMemAllocFromPoolAsync │   7.277 MiB, device memory │
│  300 │  36.41 ms │  36.48 µs │      1 │ cuMemsetD32Async        │                          - │
│  305 │  36.45 ms │   3.34 µs │      1 │ cuMemAllocFromPoolAsync │ 828.000 KiB, device memory │
│  317 │  36.46 ms │   7.63 µs │      1 │ cuMemsetD32Async        │                          - │
│  338 │ 127.38 ms │  29.33 µs │      1 │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│  352 │ 127.44 ms │   5.96 µs │      1 │ cuStreamSynchronize     │                          - │
│  361 │ 127.46 ms │   4.66 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  370 │ 132.15 ms │  21.93 µs │      1 │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│  384 │ 132.18 ms │   4.77 µs │      1 │ cuStreamSynchronize     │                          - │
│  393 │ 132.19 ms │   3.69 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  402 │ 136.72 ms │  22.41 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  416 │ 136.76 ms │   4.29 µs │      1 │ cuStreamSynchronize     │                          - │
│  425 │ 136.78 ms │ 614.17 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  434 │  137.4 ms │   5.96 µs │      1 │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│  547 │ 137.48 ms │   1.43 µs │      1 │ cuStreamSynchronize     │                          - │
│  556 │ 137.49 ms │  922.2 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  565 │ 140.29 ms │  28.13 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  579 │ 140.34 ms │   4.05 µs │      1 │ cuStreamSynchronize     │                          - │
│  588 │ 140.35 ms │ 586.51 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  605 │ 140.99 ms │   36.0 µs │      1 │ cuLaunchKernel          │                          - │
│ 1383 │ 141.69 ms │   36.0 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 1388 │ 141.73 ms │   6.68 µs │      2 │ cuMemFreeAsync          │   7.645 MiB, device memory │
│ 1393 │ 141.74 ms │   7.15 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 1398 │ 141.76 ms │   7.63 µs │      2 │ cuMemFreeAsync          │  30.518 MiB, device memory │
│ 1403 │ 141.77 ms │   7.63 µs │      2 │ cuMemFreeAsync          │  34.332 MiB, device memory │
│ 1408 │ 141.83 ms │   5.96 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 1413 │ 141.84 ms │   5.96 µs │      2 │ cuMemFreeAsync          │   7.645 MiB, device memory │
│ 1418 │ 141.85 ms │   5.48 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 1423 │ 141.86 ms │   6.44 µs │      2 │ cuMemFreeAsync          │  30.518 MiB, device memory │
│ 1428 │ 141.87 ms │    6.2 µs │      2 │ cuMemFreeAsync          │  34.332 MiB, device memory │
│ 1433 │ 141.89 ms │   5.72 µs │      2 │ cuMemFreeAsync          │ 828.000 KiB, device memory │
│ 1438 │  141.9 ms │   5.72 µs │      2 │ cuMemFreeAsync          │   7.277 MiB, device memory │
│ 1443 │ 141.91 ms │   5.25 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 1448 │ 141.92 ms │    6.2 µs │      2 │ cuMemFreeAsync          │   7.645 MiB, device memory │
│ 1453 │ 141.93 ms │   5.01 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 1458 │ 141.94 ms │   7.63 µs │      2 │ cuMemFreeAsync          │  30.518 MiB, device memory │
│ 1463 │ 141.95 ms │   6.44 µs │      2 │ cuMemFreeAsync          │  34.332 MiB, device memory │
│ 1468 │ 141.96 ms │   6.91 µs │      2 │ cuMemFreeAsync          │     8 bytes, device memory │
│ 1473 │ 141.98 ms │   5.96 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 1478 │ 141.99 ms │   5.72 µs │      2 │ cuMemFreeAsync          │   7.645 MiB, device memory │
│ 1483 │  142.0 ms │   4.53 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 1488 │ 142.01 ms │   5.96 µs │      2 │ cuMemFreeAsync          │  30.518 MiB, device memory │
│ 1493 │ 142.02 ms │   5.48 µs │      2 │ cuMemFreeAsync          │  34.332 MiB, device memory │
│ 1498 │ 142.03 ms │   7.15 µs │      2 │ cuMemFreeAsync          │ 828.000 KiB, device memory │
│ 1503 │ 142.04 ms │   5.48 µs │      2 │ cuMemFreeAsync          │   7.277 MiB, device memory │
│ 1506 │ 142.05 ms │  109.8 ms │      2 │ cuStreamSynchronize     │                          - │
└──────┴───────────┴───────────┴────────┴─────────────────────────┴────────────────────────────┘

Device-side activity: GPU was busy for 132.12 ms (52.44% of the trace)
┌─────┬───────────┬───────────┬─────────┬────────┬──────┬─────────────┬───────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  ID │     Start │      Time │ Threads │ Blocks │ Regs │        Size │    Throughput │ Name                                                                                                                                                                                             ⋯
├─────┼───────────┼───────────┼─────────┼────────┼──────┼─────────────┼───────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│  44 │  21.79 ms │   5.93 ms │       - │      - │    - │  34.332 MiB │   5.654 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│  76 │  27.88 ms │   3.67 ms │       - │      - │    - │  30.518 MiB │   8.131 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 108 │  32.58 ms │ 411.99 µs │       - │      - │    - │   3.815 MiB │   9.042 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 239 │  33.11 ms │ 895.74 µs │       - │      - │    - │   7.645 MiB │   8.334 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 271 │   36.0 ms │  410.8 µs │       - │      - │    - │   3.815 MiB │   9.068 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 300 │  36.45 ms │   17.4 µs │       - │      - │    - │   7.277 MiB │ 408.329 GiB/s │ [set device memory]                                                                                                                                                                              ⋯
│ 317 │  36.47 ms │   2.86 µs │       - │      - │    - │ 828.000 KiB │ 276.000 GiB/s │ [set device memory]                                                                                                                                                                              ⋯
│ 361 │ 127.59 ms │   4.58 ms │       - │      - │    - │  34.332 MiB │   7.327 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 393 │ 132.29 ms │   3.64 ms │       - │      - │    - │  30.518 MiB │   8.197 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 425 │ 136.96 ms │  490.9 µs │       - │      - │    - │   3.815 MiB │   7.589 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 556 │ 137.57 ms │ 884.53 µs │       - │      - │    - │   7.645 MiB │   8.440 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 588 │ 140.51 ms │ 493.29 µs │       - │      - │    - │   3.815 MiB │   7.552 GiB/s │ [copy pageable to device memory]                                                                                                                                                                 ⋯
│ 605 │ 141.14 ms │  110.7 ms │     256 │     92 │  129 │           - │             - │ assemble_gpu_(CuSparseDeviceMatrixCSC<Float32, Int32, 1l>, CuDeviceArray<Float32, 1l, 1l>, StaticCellValues<StaticInterpolationValues<Lagrange<RefHypercube<2l>, 2l>, 9l, 9l, Float32, SArray<Tu ⋯
└─────┴───────────┴───────────┴─────────┴────────┴──────┴─────────────┴───────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Hence we should adapt all objects before the kernel launch. This has the advantage that we do not need to transfer everything from CPU to GPU memory space when assembling multiple times.

Furthermore, the actual allocation should be

Kgpu = allocate_matrix(CUSPARSE.CuSparseMatrixCSC{Float32, Int32}, dh)

instead of

Kgpu = CUSPARSE.CuSparseMatrixCSC(K);
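
A hedged sketch combining both suggestions (allocate the sparse matrix directly on the device and keep all setup out of the timed region); it assumes the allocate_matrix method for CuSparseMatrixCSC added later in this PR (commit a4637b6), while everything else follows the setup_bench_gpu snippet above:

function setup_bench_gpu(n_cells, n_basefuncs, cellvalues, dh)
    Kgpu = allocate_matrix(CUSPARSE.CuSparseMatrixCSC{Float32, Int32}, dh)
    fgpu = CUDA.zeros(Float32, ndofs(dh))
    return init_kernel(BackendCUDA, n_cells, n_basefuncs, assemble_gpu!, (Kgpu, fgpu, cellvalues, dh))
end

gpu_kernel = setup_bench_gpu(n_cells, n_basefuncs, cellvalues, dh);
CUDA.@time gpu_kernel()                 ## assembly only, setup cost excluded
CUDA.@profile trace = true gpu_kernel()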

20 changes: 20 additions & 0 deletions ext/FerriteGPU.jl
@@ -0,0 +1,20 @@
module FerriteGPU
# This module represents an extension of Ferrite.jl that uses a GPU backend (CUDA.jl) for assembly.

using Ferrite
using CUDA
using Adapt
using Base: @propagate_inbounds
using SparseArrays: AbstractSparseArray
using StaticArrays: SVector, MVector


include("GPU/gpu_assembler.jl")
include("GPU/CUDAKernelLauncher.jl")
include("GPU/cuda_iterator.jl")
include("GPU/adapt.jl")

end
139 changes: 139 additions & 0 deletions ext/GPU/CUDAKernelLauncher.jl
Review comment (Member): This file is essentially user code. Please keep it for now here until we decide where exactly to put it (e.g. just some how-to or its own package).
@@ -0,0 +1,139 @@
## This file handles the launch of GPU kernels on the CUDA backend ##

"""
Ferrite.init_kernel(::Type{BackendCUDA}, n_cells::Ti, n_basefuncs::Ti, kernel::Function, args::Tuple) where {Ti <: Integer}

Initialize a CUDA kernel for the Ferrite framework.

# Arguments
- `::Type{BackendCUDA}`: Specifies the CUDA backend.
- `n_cells::Ti`: Number of cells in the problem.
- `n_basefuncs::Ti`: Number of shape functions per cell.
- `kernel::Function`: The CUDA kernel function to execute.
- `args::Tuple`: Tuple of arguments for the kernel.

# Returns
- A `LazyKernel` object encapsulating the kernel and its execution configuration.

# Errors
Throws an `ArgumentError` if CUDA is not functional (e.g., due to missing drivers or improper installation).
"""
function Ferrite.init_kernel(::Type{BackendCUDA}, n_cells::Ti, n_basefuncs::Ti, kernel::Function, args::Tuple) where {Ti <: Integer}
if CUDA.functional()
return LazyKernel(n_cells, n_basefuncs, kernel, args, BackendCUDA)

else
throw(ArgumentError("CUDA is not functional, please check your GPU driver and CUDA installation"))

end
end

"""
Ferrite.launch!(kernel::LazyKernel{Ti, BackendCUDA}) where {Ti}

Launch a CUDA kernel encapsulated in a `LazyKernel` object.

# Arguments
- `kernel::LazyKernel`: The kernel to be launched, along with its configuration.

# Returns
- `nothing`: Indicates that the kernel was launched and synchronized successfully.
"""
function Ferrite.launch!(kernel::LazyKernel{Ti, BackendCUDA}) where {Ti}
n_cells = kernel.n_cells
n_basefuncs = kernel.n_basefuncs
ker = kernel.kernel
args = kernel.args
kernel = @cuda launch = false ker(args...)
config = launch_configuration(kernel.fun)
threads = convert(Ti, min(n_cells, config.threads, 256))
shared_mem = _calculate_shared_memory(threads, n_basefuncs)
blocks = _calculate_nblocks(threads, n_cells)

## use dynamic shared memory if possible
_can_use_dynshmem(shared_mem) && return CUDA.@sync kernel(args...; threads, blocks, shmem = shared_mem)

## otherwise use global memory
nes = blocks * threads
kes = CUDA.zeros(Float32, nes, n_basefuncs, n_basefuncs)
fes = CUDA.zeros(Float32, nes, n_basefuncs)
Review thread on lines +56 to +57:

Member: I have some trouble understanding how

  1. the assembly code can access these buffers, and
  2. these buffers are used to assemble K and f.

Can you quickly point me to the relevant code?

Contributor (author): Okay, this is actually a tough design issue that I faced, and to be honest I improvised quite a bit here, but essentially the local matrices are transferred in two steps:

  1. Encapsulate them in a new dof handler that acts only as a wrapper, and replace the standard dof handler passed in the arguments with the new one -> _to_localdh

    kes = CUDA.zeros(Float32, nes, n_basefuncs, n_basefuncs)
    fes = CUDA.zeros(Float32, nes, n_basefuncs)
    args = _to_localdh(args, kes, fes) ## add the kes and fes to the new dof handler
    CUDA.@sync @cuda blocks = blocks threads = threads ker(args...)

    struct LocalsGPUDofHandler{DH <: AbstractDofHandler, LOCAL_MATRICES, LOCAL_VECTORS} <: AbstractGPUDofHandler
        dh::DH
        Kes::LOCAL_MATRICES
        fes::LOCAL_VECTORS
    end

  2. Since the dof handler is already replaced before executing the kernel, the cell iterator that gets invoked is the following one, which accesses the right partition of the local stiffness matrices and load vectors and passes it to the cell cache.

    function Ferrite.CellIterator(dh_::Ferrite.LocalsGPUDofHandler, ::Int32)
        dh = dh_ |> dofhandler
        grid = get_grid(dh)
        n_cells = grid |> getncells |> Int32
        bd = blockDim().x
        local_thread_id = threadIdx().x
        global_thread_id = (blockIdx().x - Int32(1)) * bd + local_thread_id
        ke = cellke(dh_, global_thread_id)
        fe = cellfe(dh_, global_thread_id)
        return CUDAGlobalCellIterator(dh, grid, n_cells, ke, fe, local_thread_id)
    end

args = _to_localdh(args, kes, fes)
CUDA.@sync @cuda blocks = blocks threads = threads ker(args...)
return nothing

end

"""
_to_localdh(args::Tuple, kes::AbstractArray, fes::AbstractArray)

Replace the degree-of-freedom handler in `args` with a `LocalsGPUDofHandler` wrapper that carries the per-thread element matrices and vectors for use on the GPU.

# Arguments
- `args::Tuple`: Kernel arguments.
- `kes::AbstractArray`: GPU storage for element stiffness matrices.
- `fes::AbstractArray`: GPU storage for element force vectors.

# Returns
- `Tuple`: Updated arguments tuple with the degree-of-freedom handler replaced by a local GPU handler.

# Errors
Throws an `ErrorException` if no `AbstractDofHandler` is found in `args`.
"""
function _to_localdh(args::Tuple, kes::AbstractArray, fes::AbstractArray)
dh_index = findfirst(x -> x isa Ferrite.AbstractDofHandler, args)
dh_index !== nothing || throw(ErrorException("No subtype of AbstractDofHandler found in the arguments"))
arr = args |> collect
local_dh = LocalsGPUDofHandler(arr[dh_index], kes, fes)
arr[dh_index] = local_dh
return Tuple(arr)

end

"""
_calculate_shared_memory(threads::Integer, n_basefuncs::Integer)

Calculate the shared memory required for kernel execution.

# Arguments
- `threads::Integer`: Number of threads per block.
- `n_basefuncs::Integer`: Number of basis functions per cell.

# Returns
- `Integer`: Amount of shared memory in bytes.
"""
function _calculate_shared_memory(threads::Integer, n_basefuncs::Integer)
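    ## one Ke (n_basefuncs × n_basefuncs) plus one fe (n_basefuncs) per thread, stored as Float32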
return sizeof(Float32) * (threads) * (n_basefuncs) * n_basefuncs + sizeof(Float32) * (threads) * n_basefuncs

end

"""
_can_use_dynshmem(required_shmem::Integer)

Check if the GPU supports the required amount of dynamic shared memory.

# Arguments
- `required_shmem::Integer`: Required shared memory size in bytes.

# Returns
- `Bool`: `true` if the GPU can provide the required shared memory; `false` otherwise.
"""
function _can_use_dynshmem(required_shmem::Integer)
dev = device()
MAX_DYN_SHMEM = CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK)
return required_shmem < MAX_DYN_SHMEM

end

"""
_calculate_nblocks(threads::Integer, n_cells::Integer)

Calculate the number of blocks required for kernel execution.

# Arguments
- `threads::Integer`: Number of threads per block.
- `n_cells::Integer`: Total number of cells to process.

# Returns
- `Integer`: Number of blocks to launch.
"""
function _calculate_nblocks(threads::Integer, n_cells::Integer)
dev = device()
no_sms = CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)
required_blocks = cld(n_cells, threads)
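    ## cap the grid at two blocks per SM; cells beyond blocks * threads are expected to be picked up by the kernel's stride loop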
required_blocks < 2 * no_sms || return 2 * no_sms
return required_blocks

end