Cuda heat example w quaditer #913

Draft
wants to merge 139 commits into
base: master
Changes from 62 commits
Commits
a979fb2
Initial ideas
KnutAM Jan 11, 2024
298158c
Working implementation
KnutAM Jan 11, 2024
1794db3
Merge branch 'master' into kam/QuadraturePointIterator
KnutAM Feb 24, 2024
51ab4f2
Add static values version and improve interface
KnutAM Feb 24, 2024
22a7377
Add dev example and test
KnutAM Feb 24, 2024
18377f3
Merge branch 'master' into kam/QuadraturePointIterator
KnutAM Feb 28, 2024
27a3a96
Add StaticCellValues without stored cell coordinates
KnutAM Feb 28, 2024
95b5729
initial ideas
Abdelrahman912 May 7, 2024
d4e881d
minor changes
Abdelrahman912 May 14, 2024
f55b878
Merge branch 'Ferrite-FEM:master' into cuda-heat-example-w-quaditer
Abdelrahman912 May 14, 2024
c1ef6ad
add some abstractions
Abdelrahman912 May 23, 2024
394ac6a
add minor comment
Abdelrahman912 May 23, 2024
1f0df67
add z dierction for numerical integration
Abdelrahman912 May 30, 2024
3152042
add Float32
Abdelrahman912 Jun 4, 2024
aac5994
minor fix
Abdelrahman912 Jun 4, 2024
142f89a
init coloring implementation
Abdelrahman912 Jun 18, 2024
eaff534
init working on the assembler
Abdelrahman912 Jun 19, 2024
ffdc341
init gpu_assembler
Abdelrahman912 Jun 20, 2024
59595e8
implement naive gpu_assembler
Abdelrahman912 Jun 20, 2024
0e3cb21
minor fix
Abdelrahman912 Jun 20, 2024
687141d
use CuSparseMatrixCSC in assembler
Abdelrahman912 Jun 26, 2024
11d5a01
minor fix
Abdelrahman912 Jun 26, 2024
d5c951c
minor fix
Abdelrahman912 Jun 26, 2024
f4272a6
hoist dh, cellvalues, assembler outside the cuda loop
Abdelrahman912 Jun 26, 2024
d5cf949
add run_gpu macro
Abdelrahman912 Jun 26, 2024
2e52de1
init using int32 instead of int64 to reduce number of registers
Abdelrahman912 Jul 3, 2024
2cd0168
finish use int32
Abdelrahman912 Jul 3, 2024
54922ab
stupid way to circumvent rubbish values
Abdelrahman912 Jul 4, 2024
9406ff9
add discorse ref
Abdelrahman912 Jul 4, 2024
8fedba5
add ncu benchmark
Abdelrahman912 Jul 4, 2024
8bd417a
fix error in benchmark and add ref.
Abdelrahman912 Jul 4, 2024
abf11b6
set the code for debugging
Abdelrahman912 Jul 8, 2024
4f85cf5
init test
Abdelrahman912 Jul 8, 2024
4935b70
fix adapt issue
Abdelrahman912 Jul 8, 2024
188cceb
remove unnecessary cushow
Abdelrahman912 Jul 8, 2024
9c904e4
add heat equation main test set
Abdelrahman912 Jul 8, 2024
06432db
remove unncessary comments
Abdelrahman912 Jul 8, 2024
a67caaa
add nsys benchmark
Abdelrahman912 Jul 8, 2024
ecee17f
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Jul 8, 2024
60edda9
fix some issues regarding the merge
Abdelrahman912 Jul 8, 2024
063ff7a
minor fix
Abdelrahman912 Jul 8, 2024
9206be3
remove nsight files
Abdelrahman912 Jul 8, 2024
1eeb568
minor fix
Abdelrahman912 Jul 8, 2024
5e339a0
add comments
Abdelrahman912 Jul 8, 2024
204f3be
minor fix
Abdelrahman912 Jul 8, 2024
0f2e6b7
add comments
Abdelrahman912 Jul 8, 2024
7100e0a
fix for CI
Abdelrahman912 Jul 8, 2024
f129449
fix for CI
Abdelrahman912 Jul 8, 2024
618adb5
CI fix
Abdelrahman912 Jul 8, 2024
78f120c
ci
Abdelrahman912 Jul 8, 2024
4971cba
minor fix
Abdelrahman912 Jul 8, 2024
ea8451c
fix ci
Abdelrahman912 Jul 8, 2024
986c5db
remove file
Abdelrahman912 Jul 8, 2024
f93fdfb
add CUDA to docs project
Abdelrahman912 Jul 8, 2024
f442ae2
add v2 for gpu_heat_equation
Abdelrahman912 Jul 15, 2024
81274d5
add adapt to docs
Abdelrahman912 Jul 15, 2024
fbc05ed
minor fix
Abdelrahman912 Jul 22, 2024
506328c
init assemble per dof
Abdelrahman912 Jul 22, 2024
b505189
assemble global v3
Abdelrahman912 Jul 22, 2024
b0a94aa
minor fix
Abdelrahman912 Jul 22, 2024
aa3d1ae
add comment + start in v4
Abdelrahman912 Jul 31, 2024
c8cf6fe
add map dof to elements
Abdelrahman912 Jul 31, 2024
8a4523d
add 3d array for local matrices
Abdelrahman912 Aug 1, 2024
9617a4f
init code for v4
Abdelrahman912 Aug 1, 2024
427a6b0
fix bug w assemble global in v4
Abdelrahman912 Aug 5, 2024
bbed047
precommit fix
Abdelrahman912 Aug 5, 2024
85c055c
add preserve ref
Abdelrahman912 Aug 5, 2024
2b77613
fix precommit
Abdelrahman912 Aug 5, 2024
f9c70ab
fix logic error in v4
Abdelrahman912 Sep 7, 2024
0519016
init shared array usage
Abdelrahman912 Sep 9, 2024
5752676
optimize threads for dynamic shared memory threshold
Abdelrahman912 Sep 10, 2024
0fe023c
fix bug in dynamic shared mem
Abdelrahman912 Sep 11, 2024
a352612
minor fix
Abdelrahman912 Sep 11, 2024
2a6120a
init kernel abstractions
Abdelrahman912 Sep 16, 2024
67face7
add local matrix kernel
Abdelrahman912 Sep 16, 2024
aca8a6f
add global matrix kernel with CUDA dependency
Abdelrahman912 Sep 16, 2024
9e4d592
minor change
Abdelrahman912 Sep 16, 2024
6114495
init working KS implementation (still CUDA dependent )
Abdelrahman912 Sep 17, 2024
2a8abeb
remove cuda dependency
Abdelrahman912 Sep 18, 2024
630017c
add refrence to
Abdelrahman912 Sep 18, 2024
fc26670
use Atomix.jl
Abdelrahman912 Sep 20, 2024
ae7bc93
init v4 ks
Abdelrahman912 Sep 20, 2024
0e28f14
init cell cache prototype
Abdelrahman912 Sep 23, 2024
0eb376d
working gpu cell cache
Abdelrahman912 Sep 23, 2024
8f7a182
fix types
Abdelrahman912 Sep 23, 2024
9b1567d
init gpu cell iterator
Abdelrahman912 Sep 23, 2024
a08ab97
add iterator
Abdelrahman912 Sep 25, 2024
b34c43b
add stride kernel
Abdelrahman912 Sep 26, 2024
b289b69
minor fix
Abdelrahman912 Sep 26, 2024
b2c0347
fix blocks, threads for kernel launch
Abdelrahman912 Sep 27, 2024
b87d78b
minor fix for thread, blocks
Abdelrahman912 Sep 27, 2024
e10e2f6
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 3, 2024
42a28e1
add gpu as extension
Abdelrahman912 Oct 4, 2024
e59b8b8
add some documentaion and remove unnecessary implementations.
Abdelrahman912 Oct 7, 2024
e7157e4
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 10, 2024
e4b194d
init unit test
Abdelrahman912 Oct 10, 2024
a613107
init test for iterators
Abdelrahman912 Oct 11, 2024
113a7a2
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Oct 11, 2024
d1e831e
add tests in GPU/
Abdelrahman912 Oct 11, 2024
7f8fa3c
add test local ke and fe
Abdelrahman912 Oct 11, 2024
c38419c
minor fix
Abdelrahman912 Oct 11, 2024
190e43e
fix ci - 1
Abdelrahman912 Oct 11, 2024
763c6b5
fix ci-2
Abdelrahman912 Oct 11, 2024
1b6060d
minor edit
Abdelrahman912 Oct 11, 2024
d767668
fix ci
Abdelrahman912 Oct 12, 2024
726ea9e
ci
Abdelrahman912 Oct 12, 2024
8590aa4
fix ci
Abdelrahman912 Oct 12, 2024
39e1f0c
minor edit
Abdelrahman912 Oct 12, 2024
f0cd305
add validation for cuda, minor fix, seperate unit tests into multiple…
Abdelrahman912 Oct 14, 2024
9d4e8b9
fix precommit shit
Abdelrahman912 Oct 14, 2024
12f64bb
try documentation test fix
Abdelrahman912 Oct 14, 2024
361333b
documentation test fix
Abdelrahman912 Oct 14, 2024
e31c6e3
make ci happy
Abdelrahman912 Oct 14, 2024
626dec2
change kernel launch, init adapt test
Abdelrahman912 Oct 15, 2024
fbc1b4b
minor fix
Abdelrahman912 Oct 15, 2024
ea83925
add test_adapt, some comments
Abdelrahman912 Oct 15, 2024
a356d8d
fix precommit
Abdelrahman912 Oct 15, 2024
ee1f77c
init cpu multi threading
Abdelrahman912 Nov 4, 2024
fb7e1fc
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Nov 4, 2024
b38ab72
hot fix for buggy assembly logic
Abdelrahman912 Nov 5, 2024
adb166a
minor fix
Abdelrahman912 Nov 6, 2024
6300a4a
test sth
Abdelrahman912 Nov 6, 2024
b7301c2
precommit fix
Abdelrahman912 Nov 6, 2024
18f47b8
fix explicit imports
Abdelrahman912 Nov 6, 2024
f6e9cc6
add fillzero
Abdelrahman912 Nov 6, 2024
8a796de
Merge branch 'master' into cuda-heat-example-w-quaditer
Abdelrahman912 Nov 6, 2024
75e89ed
minor fix for gpu assembly
Abdelrahman912 Nov 6, 2024
a77c347
minor minor fix
Abdelrahman912 Nov 6, 2024
7338788
make cache mutable
Abdelrahman912 Nov 12, 2024
cbab665
put the coloring stuff in the init
Abdelrahman912 Nov 12, 2024
1c81281
minor fix
Abdelrahman912 Nov 12, 2024
d42bcab
code for benchmarking (to be removed)
Abdelrahman912 Nov 13, 2024
1ab1650
rm cpu multithreading benchmark code
Abdelrahman912 Nov 13, 2024
bc8ec95
init fix for higher order approximations in gpu
Abdelrahman912 Nov 18, 2024
c7f4b0f
add working imp for global gpu mem
Abdelrahman912 Nov 18, 2024
d4d5967
add some comments
Abdelrahman912 Nov 18, 2024
3b2196b
trying to make the ci happy
Abdelrahman912 Nov 19, 2024
825d257
minor fix
Abdelrahman912 Nov 19, 2024
6109bd1
comment gpu related stuff in eg to pass ci
Abdelrahman912 Nov 19, 2024
4 changes: 4 additions & 0 deletions Project.toml
@@ -3,6 +3,10 @@ uuid = "c061ca5d-56c9-439f-9c0e-210fe06d3992"
version = "0.3.14"

[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Cthulhu = "f68482b8-f384-11e8-15f7-abe071a5a75f"
EnumX = "4e289a0a-7415-4d19-859d-a7e5c4648b56"
ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
721 changes: 427 additions & 294 deletions docs/Manifest.toml

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions docs/Project.toml
@@ -1,5 +1,7 @@
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterCitations = "daee34ce-89f3-4625-b898-19384cb65244"
Ferrite = "c061ca5d-56c9-439f-9c0e-210fe06d3992"
260 changes: 260 additions & 0 deletions docs/src/literate-tutorials/gpu_qp_heat_equation.jl
@@ -0,0 +1,260 @@
using Ferrite, CUDA
using StaticArrays
using SparseArrays
using Adapt
using Test
using NVTX
using BenchmarkTools # provides @btime, which is used for the timing below



left = Tensor{1,2,Float32}((0.0,0.0)) # bottom-left corner of the grid
right = Tensor{1,2,Float32}((100.0,100.0)) # top-right corner of the grid


grid = generate_grid(Quadrilateral, (100, 100),left,right)


colors = create_coloring(grid) .|> (x -> Int32.(x)) # convert to Int32 to reduce number of registers


ip = Lagrange{RefQuadrilateral, 1}() # bilinear Lagrange interpolation

qr = QuadratureRule{RefQuadrilateral}(Float32,2)


cellvalues = CellValues(Float32,qr, ip)


dh = DofHandler(grid)



add!(dh, :u, ip)

close!(dh);


# Standard assembly of the element.
function assemble_element_std!(Ke::Matrix, fe::Vector, cellvalues::CellValues)
    n_basefuncs = getnbasefunctions(cellvalues)

    # Loop over quadrature points
    for q_point in 1:getnquadpoints(cellvalues)
        # Get the quadrature weight
        dΩ = getdetJdV(cellvalues, q_point)
        # Loop over test shape functions
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            ∇δu = shape_gradient(cellvalues, q_point, i)
            # Add contribution to fe
            fe[i] += δu * dΩ
            # Loop over trial shape functions
            for j in 1:n_basefuncs
                ∇u = shape_gradient(cellvalues, q_point, j)
                # Add contribution to Ke
                Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
    return Ke, fe
end


function create_buffers(cellvalues, dh)
    f = zeros(ndofs(dh))
    K = allocate_matrix(dh)
    assembler = start_assemble(K, f)
    ## Local quantities
    n_basefuncs = getnbasefunctions(cellvalues)
    Ke = zeros(n_basefuncs, n_basefuncs)
    fe = zeros(n_basefuncs)
    return (;f, K, assembler, Ke, fe)
end


# Standard global assembly

function assemble_global!(cellvalues, dh::DofHandler,qp_iter::Val{QPiter}) where {QPiter}
    (;f, K, assembler, Ke, fe) = create_buffers(cellvalues,dh)
    # Loop over all cells
    for cell in CellIterator(dh)
        fill!(Ke, 0)
        fill!(fe, 0)
        if QPiter
            #reinit!(cellvalues, getcoordinates(cell))
            assemble_element_qpiter!(Ke, fe, cellvalues,getcoordinates(cell))
        else
            # Reinitialize cellvalues for this cell
            reinit!(cellvalues, cell)
            # Compute element contribution
            assemble_element_std!(Ke, fe, cellvalues)
        end
        # Assemble Ke and fe into K and f
        assemble!(assembler, celldofs(cell), Ke, fe)
    end
    return K, f
end



#=NVTX.@annotate=# function assemble_element_gpu!(assembler,cv,dh,n_cells_colored, eles_colored)
    tx = threadIdx().x
    bx = blockIdx().x
    bd = blockDim().x
    e_color = tx + (bx-Int32(1))*bd # index of this thread's element within the current color

    e_color ≤ n_cells_colored || return nothing # threads beyond the last cell of this color do nothing
    n_basefuncs = getnbasefunctions(cv)
    e = eles_colored[e_color] # global element index
    cell_coords = getcoordinates(dh.grid, e)

    ke = MMatrix{4,4,Float32}(undef) # Note: using n_basefuncs instead of 4 throws an error, because sizes that are only known at run time are not supported on the GPU.
    fill!(ke, 0.0f0)
    fe = MVector{4,Float32}(undef)
    fill!(fe, 0.0f0)
    # Loop over quadrature points
    for qv in Ferrite.QuadratureValuesIterator(cv,cell_coords)
        ## Get the quadrature weight
        dΩ = getdetJdV(qv)
        ## Loop over test shape functions
        for i in 1:n_basefuncs
            δu = shape_value(qv, i)
            ∇δu = shape_gradient(qv, i)
            ## Add contribution to fe
            fe[i] += δu * dΩ
            ## Loop over trial shape functions
            for j in 1:n_basefuncs
                ∇u = shape_gradient(qv, j)
                ## Add contribution to ke
                ke[i,j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end

    ## Assemble ke and fe into Kgpu and fgpu ##
    assemble!(assembler, celldofs(dh,e),SMatrix(ke),SVector(fe)) # passing the mutable objects directly throws an error

    return nothing
end



Adapt.@adapt_structure Ferrite.GPUGrid
Adapt.@adapt_structure Ferrite.GPUDofHandler
Adapt.@adapt_structure Ferrite.GPUAssemblerSparsityPattern

#=NVTX.@annotate=# function assemble_global_gpu_color(cellvalues,dh,colors)
    K = allocate_matrix(SparseMatrixCSC{Float32, Int32},dh)
    Kgpu = CUSPARSE.CuSparseMatrixCSC(K)
    fgpu = CUDA.zeros(ndofs(dh))
    assembler = start_assemble(Kgpu, fgpu)
    n_colors = length(colors)
    # Adapt the data structures for the GPU, then launch one kernel per color
    dh_gpu = Adapt.adapt_structure(CuArray, dh)
    assembler_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), assembler)
    cellvalues_gpu = Adapt.adapt_structure(CuArray, cellvalues)
    for i in 1:n_colors
        kernel = @cuda launch=false assemble_element_gpu!(assembler_gpu,cellvalues_gpu,dh_gpu,Int32(length(colors[i])),cu(colors[i]))
        #@show CUDA.registers(kernel)
        config = launch_configuration(kernel.fun)
        threads = min(length(colors[i]), config.threads)
        blocks = cld(length(colors[i]), threads)
        # Launch with the same adapted arguments the kernel was compiled for
        kernel(assembler_gpu,cellvalues_gpu,dh_gpu,Int32(length(colors[i])),cu(colors[i]); threads, blocks)
    end
    return Kgpu,fgpu
end


# An alternative way to call the kernel using a macro
function assemble_global_gpu_color_macro(cellvalues,dh,colors)
    K = allocate_matrix(SparseMatrixCSC{Float32, Int32},dh)
    Kgpu = CUSPARSE.CuSparseMatrixCSC(K)
    fgpu = CUDA.zeros(ndofs(dh))
    assembler = start_assemble(Kgpu, fgpu)

    # Set up kernel adaption and launch the kernel
    @run_gpu(assemble_element_gpu!, assembler, cellvalues, dh, colors)
    return Kgpu,fgpu
end





stassy(cv,dh) = assemble_global!(cv,dh,Val(false))




# qpassy(cv,dh) = assemble_global!(cv,dh,Val(true))

Kgpu, fgpu = @btime CUDA.@sync assemble_global_gpu_color($cellvalues,$dh,$colors);
#Kgpu, fgpu = CUDA.@profile assemble_global_gpu_color(cellvalues,dh,colors)
# To profile the code with Nsight Compute, start Julia with: ncu --mode=launch julia
# Then open Nsight Compute and attach the profiler to the Julia instance.
# ref: https://cuda.juliagpu.org/v2.2/development/profiling/#NVIDIA-Nsight-Compute
# To profile with Nsight Systems, use: nsys profile --trace=nvtx julia rmse_kernel_v1.jl


#mKgpu, mfgpu = assemble_global_gpu_color_macro(cellvalues,dh,colors)



norm(Kgpu)


#Kstd , Fstd = @btime stassy($cellvalues,$dh);
Kstd , Fstd = stassy(cellvalues,dh);
norm(Kstd)

@testset "GPU Heat Equation" begin

    for i in 1:10
        # Bottom-left point of the grid in the physical coordinate system.
        # Generate random Float32 between -100 and -1
        bl_x = rand(Float32) * (-99) - 1
        bl_y = rand(Float32) * (-99) - 1

        # Top-right point of the grid in the physical coordinate system.
        # Generate random Float32 between 0 and 100
        tr_x = rand(Float32) * 100
        tr_y = rand(Float32) * 100

        n_x = rand(1:100) # number of cells in x direction
        n_y = rand(1:100) # number of cells in y direction

        left = Tensor{1,2,Float32}((bl_x,bl_y)) # bottom-left corner of the grid
        right = Tensor{1,2,Float32}((tr_x,tr_y)) # top-right corner of the grid

        grid = generate_grid(Quadrilateral, (n_x, n_y),left,right)

        colors = create_coloring(grid) .|> (x -> Int32.(x)) # convert to Int32 to reduce number of registers

        ip = Lagrange{RefQuadrilateral, 1}() # bilinear Lagrange interpolation

        qr = QuadratureRule{RefQuadrilateral,Float32}(2)

        cellvalues = CellValues(Float32,qr, ip)

        dh = DofHandler(grid)

        add!(dh, :u, ip)

        close!(dh);
        # The CPU version:
        Kstd , Fstd = stassy(cellvalues,dh);

        # The GPU version
        Kgpu, fgpu = assemble_global_gpu_color(cellvalues,dh,colors)

        @test norm(Kstd) ≈ norm(Kgpu) atol=1e-4
    end
end
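
Note on the launch arithmetic in `assemble_global_gpu_color`: the snippet below is a minimal CPU-only sketch, not part of the PR, that mirrors the `threads`/`blocks` computation and the 1-based index `tx + (bx - 1) * bd` used in `assemble_element_gpu!`. The helper names `launch_dims`, `global_index` and `covered_cells` are hypothetical, and the value 256 merely stands in for whatever `config.threads` returns on a given device. It checks that every cell of a color is visited exactly once and that the bounds guard discards the surplus threads of the last block.

```julia
# CPU-only sketch of the per-color launch arithmetic (hypothetical helper names).

# Mirrors `threads = min(n, config.threads)` and `blocks = cld(n, threads)`.
# `max_threads` is an assumed stand-in for `config.threads`.
function launch_dims(n_cells::Integer, max_threads::Integer)
    threads = min(n_cells, max_threads)
    blocks = cld(n_cells, threads)
    return threads, blocks
end

# 1-based global index of thread `tx` in block `bx` with block size `bd`,
# as computed at the top of the kernel.
global_index(tx, bx, bd) = tx + (bx - 1) * bd

# Emulate the launch on the CPU and count how often each cell index is hit.
function covered_cells(n_cells::Integer, max_threads::Integer)
    threads, blocks = launch_dims(n_cells, max_threads)
    hits = zeros(Int, n_cells)
    for bx in 1:blocks, tx in 1:threads
        e = global_index(tx, bx, threads)
        e <= n_cells || continue # same bounds guard as in the kernel
        hits[e] += 1
    end
    return hits
end

threads, blocks = launch_dims(2500, 256)
println("threads = ", threads, ", blocks = ", blocks)
println("every cell visited once: ", all(==(1), covered_cells(2500, 256)))
```

Because `blocks * threads` can exceed the number of cells in a color, the early `return nothing` in the kernel is what keeps the extra threads from reading past the end of `eles_colored`.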