Master to be -> master (#16)

* clean up types
* clean up types
* rm AbstracDitcs
* rm Abstract from Matrix
* rm AbstracDitcs and add zeros
* clean up tensors more
* clean up sparse
* clean up virtual
* clean up linear alg ext
* simplify lin alg ext
* comments
* transpose
* corewct types
* simplify more
* transpose
* clean up #1
* clean up #2
* clean up #3
* clean up #4
* add canonise.jl
* formatt code
* clean up #5
* remove trans
* add new func
* clean up
* start cleaning gauges
* clean up
* add comments
* output env variational
* MpoTensor
* structs
* MpoTensor
* fix issues with types
* fix nothing type
* length for MpoTensor to be verified
* clean up
* fix subtyping in tensors
* clean up types
* add N explicitte in some places
* rm lenght
* MpoTensor
* transpose
* clea up
* ideas
* mv ideas to attic
* make MpoTensor external const
* renaming
* add tsvd example
* zipper
* improve bench
* clean up exports
* add comments
* add basic tests for QMps and Canonise
* clean up stuff
* measure_memory
* add measure_memory
* add format_bytes
* zipper
* sparse zipper
* add eltype for mps / mps
* fix some issues
* zipper psvd
* zipper psvd
* zipper sparse svd
* tsvd
* change svd_corner_matrix
* correct Adjoint for CornerTenso
* to cuda
* device
* device
* edvice
* diag
* fix Diagona
* fix device
* fix array 2
* fix diag 3
* works
* add comments
* toward cuda
* fix kwargs
* cuda contractons
* cuda central
* cleanup cuda
* patch array centraltensor
* patching dense central tensor
* vector memory
* start cleaning
* add GPU flags, clean up
* clean up
* zipper QMps on GPU
* start working on qr
* qr on GPU works
* add basic tests for QMps
* more basic tests
* aux
* change site
* clean up
* clean up CPU <-> GPU transfer
* gauges on gpu
* change types in gauges
* reordering
* clean up
* fix bug
* clean up
* fix
* rm unnecassry permusts
* dense site
* clean up
* clean up
* zipper central
* virtual
* test
* central
* cuproj
* aux_saprse
* add @view in site
* clean up
* add bench for multi gpu
* add examples
* add commenst
* add @inbounds in some places
* towards cpu contractions
* to cpu
* benchmarks for memoizations of cusparse
* added comment explaining convention
* SparseCSC
* move stuff to attic
* clean up a bit
* clean types a bit
* added measure_memory for memoization caches
* added handling of sparse matrices
* hotfix in sparse matix CSR
* clean up
* clean up 2
* fix types
* sitetensor
* SiteTensor Sparse
* VirtualTensor
* poolofprojectors
* memory
* copied from "leg-ordering" branch
* start working on SCR problem
* project_ket_on_bra virtual
* add SparseArrays to test cuSparse
* virtual update_env_left
* add CUDA.:* (commented)
* virtual update_env_right
* cirtual update_reduced_env_right
* add bench for CSR
* add bench
* site
* info
* reduced_env_right sitetensor
* measure_spectrum
* input mps is in the correct canonical form
* test measure_spectrum
* virtual
* test measure spectrum
* move_to_CPU
* schmidt
* view virtual
* rand qmpo
* toward allocation
* allocate site
* site working
* speed up virtual (update_env_right)
* virtual alloc
* virtual no cusparse
* virtual no cusparse
* proj
* site
* test
* working site
* cleanup
* new zipper
* new zipper
* stable zipper
* building blocks of new zipper-variational
* fix type
* fix typo
* start new zipper
* new env
* new zipper
* new_zipper clear env
* kill attic
* fix new zipper
* args in zipper
* clean up
* repeat psvd
* sparse
* corner matrix for virtual
* virtualtest
* restart virtual
* env_levt_v1
* new virtual
* my_batched_mul
* virtual with 2 step projectors
* WIP
* cases without central tensor
* fix update_reduced_env_right virtual
* central batched_mul
* measure_memory(EnvironmentMixed)
* change fg to cl_h
* PoolOfProjectors in clustered_hamiltonian
* clean up tests
* split long lines
* add docs
* docs
* clean up toml
* clean up
* add the docs
* add flag for RMF in zipper
* add docs
* add depth of sweep in zipper
* Rename aux.jl to utils.jl, becouse aux.* is restricted filename in windows
* Update SpinGlassTensors.jl, changed aux to utils
* moved projectors.jl from SpinGlassNetworks
* remove networks from Project
* moved test
* add projectors to tests
* update julia
* added projectors.jl to runtest
* moved rank_reveal from SpinGlassNetworks
* bugfix
* update to CUDA 4.4.1, TensorOperations 4 and TensorCast 0.4. SEE README
* reformat
* fix ci
* onlcy set self-hosted
* fix nothing comparsion (#18)
* Fixes creation of sparse matrices (#19)
* change CUDA sparse to CSR, rename function createing sparse matrices
* up
* update ci runner
* fix runs-on ci
* fix ci
* set proper rev
* add flags
* fix project in TransmuteDims
* fix typo

---------

Co-authored-by: bartekGardas <[email protected]>
Co-authored-by: marekrams <[email protected]>
Co-authored-by: annamariadziubyna <[email protected]>
Co-authored-by: annamariadziubyna <[email protected]>
Co-authored-by: Łukasz Pawela <[email protected]>
Co-authored-by: Łukasz Pawela <[email protected]>
euro-hpc-pl · Apr 3, 2024 · 2d761c2
1 parent 7372b46
commit 2d761c2
Showing 64 changed files with 3,840 additions and 1,227 deletions.
diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
@@ -4,25 +4,21 @@ on:
   - pull_request
 jobs:
   test:
-    name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }}
-    runs-on: ${{ matrix.os }}
+    name: Julia ${{ matrix.version }}
+    runs-on: [self-hosted,titan,gpu]
     strategy:
       fail-fast: false
       matrix:
         version:
-          - '1.7'
-          - '1.8'
-        os:
-          - ubuntu-latest
-          - macOS-latest
-        arch:
-          - x64
+          - '1.9'
+          - '1.10'
     steps:
       - uses: actions/checkout@v2
       - uses: julia-actions/setup-julia@v1
         with:
           version: ${{ matrix.version }}
-          arch: ${{ matrix.arch }}
+      - name: Fix TransmuteDims
+        run: julia --project=@. --color=yes -e 'using Pkg; Pkg.add(name="TransmuteDims", rev="strided2")'
       - uses: julia-actions/julia-buildpkg@latest
       - uses: julia-actions/julia-runtest@latest
         env:

diff --git a/Project.toml b/Project.toml
@@ -1,25 +1,34 @@
 name = "SpinGlassTensors"
 uuid = "7584fc6a-5a23-4eeb-8277-827aab0146ea"
-authors = [
-    "Łukasz Pawela <[email protected]>",
-    "Konrad Jałowiecki <[email protected]>",
-    "Bartłomiej Gardas <[email protected]>"
-    ]
-version = "0.3.0"
+authors = ["Anna Maria Dziubyna <[email protected]>", "Tomasz Śmierzchalski <[email protected]>", "Bartłomiej Gardas <[email protected]>", "Konrad Jałowiecki <[email protected]>", "Łukasz Pawela <[email protected]>", "Marek M. Rams <[email protected]>"]
+version = "1.0.0"
 
 [deps]
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
 DocStringExtensions = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
-Memoize = "c03570c3-d221-55d1-a50c-7939bbd78826"
+LowRankApprox = "898213cb-b102-5a47-900c-97e73b919f73"
+MKL = "33e6dc65-8f57-5167-99aa-e5a354878fb2"
+Memoization = "6fafb56a-5788-4b4e-91ca-c0cea6611c73"
+NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
+SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
+TSVD = "9449cd9e-2762-5aa3-a617-5413e99d722e"
 TensorCast = "02d47bb6-7ce6-556a-be16-bb1710789e2b"
 TensorOperations = "6aa20fa7-93e2-5fca-9bc0-fbd0db3c71a2"
+TransmuteDims = "24ddb15e-299a-5cc3-8414-dbddc482d9ca"
+cuTENSOR = "011b41b2-24ef-40a8-b3eb-fa098493e9e1"
 
 [compat]
-DocStringExtensions = "0.8"
-Memoize = "0.4"
+CUDA = "4.4.1"
+DocStringExtensions = "0.9.3"
+LowRankApprox = "0.5.5"
+MKL = "0.4.2"
+Memoization = "0.2.1"
+SparseArrays = "1.9"
 TensorCast = "0.4"
-TensorOperations = "3.0.1"
-julia = "1.7, 1.8"
+TensorOperations = "4"
+cuTENSOR = "1.1.0"
+julia = "1.9, 1.10"
 
 [extras]
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"

diff --git a/README.md b/README.md
@@ -1,2 +1,3 @@
 [![Coverage Status](https://coveralls.io/repos/github/iitis/SpinGlassTensors.jl/badge.svg?branch=master)](https://coveralls.io/github/iitis/SpinGlassTensors.jl?branch=master)
 # SpinGlassTensors.jl
+This works with CUDA v4.4.1. You need to manually `]add TransmuteDims#strided2`
diff --git a/bench_mm.jl b/bench_mm.jl
@@ -0,0 +1,29 @@
+using TensorCast, TensorOperations
+function time_mm()
+    M = rand(100, 100, 100)
+    L = rand(100, 100)
+    R = rand(100, 100)
+    @time begin
+        @matmul M1[x, σ, α] := sum(β) L[x, β] * M[β, σ, α]
+        @matmul MM[x, σ, y] := sum(α) M1[x, σ, α] * R[α, y]
+    end
+end
+
+function time_tensor()
+    M = rand(100, 100, 100)
+    L = rand(100, 100)
+    R = rand(100, 100)
+
+    @time begin
+        @tensor M̃[x, σ, y] := L[x, β] * M[β, σ, α] * R[α, y] order = (α, β)
+        # @cast B[(x, σ), y] |= M̃[x, σ, y]
+    end
+end
+
+println("matmul")
+time_mm()
+time_mm()
+
+println("\n tensor")
+time_tensor()
+time_tensor()
diff --git a/benchmark/args.jl b/benchmark/args.jl
@@ -0,0 +1,12 @@
+using LinearAlgebra
+
+function my_svd(A; kwargs...)
+    svd(A; kwargs...)
+end
+
+
+T = Float64
+n = 2
+A = rand(T, 2, 2)
+
+my_svd(A, full = true)
diff --git a/benchmark/cuda_matrix_mul.jl b/benchmark/cuda_matrix_mul.jl
@@ -0,0 +1,29 @@
+using CUDA
+using LinearAlgebra
+
+CUDA.allowscalar(false)
+
+nnz = 100
+Val = CUDA.rand(Float64, nnz)
+Ptr = CuArray(1:nnz+1)
+Ind = CuArray(rand(1:100, nnz))
+
+A = CUDA.CUSPARSE.CuSparseMatrixCSR(Ptr, Ind, Val, (100, 100))
+B = CUDA.rand(Float64, 100, 100)
+C = CUDA.CUSPARSE.CuSparseMatrixCSC(Ptr, Ind, Val, (100, 100))
+
+A * B # no scalar indexing
+CUDA.@allowscalar B * A # scalar indexing
+
+C * B # no scalar indexing
+CUDA.@allowscalar B * C # scalar indexing
+
+A' * B # no scalar indexing
+CUDA.@allowscalar B * A' # scalar indexing
+
+transpose(A) * B # no scalar indexing
+CUDA.@allowscalar B * transpose(A) # scalar indexing
+# problem is when we multiply dense x sparse
+
+D = rand(Float64, (100, 100))
+CUDA.@allowscalar D * A # scalar indexing
diff --git a/benchmark/gpu_slicing.jl b/benchmark/gpu_slicing.jl
@@ -0,0 +1,15 @@
+using CUDA
+
+T = Float64
+n = 10000
+k = 500
+
+a = CUDA.rand(T, n, n)
+p = reverse(collect(1:k))
+p_d = CuArray(p)
+
+@time A = a[:, p]
+@time @inbounds A = a[:, p]
+@time A = a[:, p_d]
+@time @inbounds A = a[:, p_d]
+nothing
diff --git a/benchmark/memoization_cusparse.jl b/benchmark/memoization_cusparse.jl
@@ -0,0 +1,46 @@
+using Memoization
+using LinearAlgebra
+using CUDA
+using BenchmarkTools
+
+# Functions from constactions_cuda/sparse.jl which are not exported
+
+@memoize Dict function aux_cusparse(::Type{R}, n::Int64) where {R<:Real}
+    println("entering aux_cusparse function")
+    CuArray(1:n+1), CUDA.ones(R, n)
+end
+
+@memoize Dict function CUDA.CUSPARSE.CuSparseMatrixCSC(
+    ::Type{R},
+    p::Vector{Int},
+) where {R<:Real}
+    println("entering cusparse")
+    n = length(p)
+    cn, co = aux_cusparse(R, n)
+    CUDA.CUSPARSE.CuSparseMatrixCSC(cn, CuArray(p), co, (maximum(p), n))
+end
+
+
+function CuSparseMatrixCSC_no_memo(::Type{R}, p::Vector{Int}) where {R<:Real}
+    println("entering no memo")
+    n = length(p)
+    cn, co = aux_cusparse(R, n)
+    CUDA.CUSPARSE.CuSparseMatrixCSC(cn, CuArray(p), co, (maximum(p), n))
+end
+
+# test of their memoization
+
+p = sort(rand(1:5000, 10000000))
+p2 = sort(rand(1:5000, 10000000))
+@time A = CuSparseMatrixCSC_no_memo(Float64, p)
+@time B = CuSparseMatrixCSC_no_memo(Float64, p)
+
+@time C = CUDA.CUSPARSE.CuSparseMatrixCSC(Float64, p) # compilation time?
+
+@time D = CUDA.CUSPARSE.CuSparseMatrixCSC(Float64, p)
+@time E = CUDA.CUSPARSE.CuSparseMatrixCSC(Float64, p2)
+@time F = CUDA.CUSPARSE.CuSparseMatrixCSC(Float64, p2)
+CUDA.memory_status()
+Memoization.empty_all_caches!()
+CUDA.memory_status()
+# clearing memoization caches does not free GPU memory
diff --git a/benchmark/memoization_test.jl b/benchmark/memoization_test.jl
@@ -0,0 +1,38 @@
+using SpinGlassTensors
+using Memoization
+using CUDA
+
+
+@memoize Dict function example_cuda_array(::Type{R}, size::Int64) where {R<:Real}
+    CUDA.rand(R, (size, size))
+end
+
+
+@memoize Dict function example_array(::Type{R}, size::Int64) where {R<:Real}
+    rand(R, size, size)
+end
+
+
+@memoize Dict function aux_cusparse(::Type{R}, n::Int64) where {R<:Real}
+    CuArray(1:n+1), CUDA.ones(R, n)
+end
+
+
+@memoize Dict function CUDA.CUSPARSE.CuSparseMatrixCSC(
+    ::Type{R},
+    p::Vector{Int},
+) where {R<:Real}
+    n = length(p)
+    cn, co = aux_cusparse(R, n)
+    CUDA.CUSPARSE.CuSparseMatrixCSC(cn, CuArray(p), co, (maximum(p), n))
+end
+
+
+A = example_cuda_array(Float64, 10000)
+B = example_cuda_array(Float64, 1100)
+C = example_array(Float64, 1000)
+p = rand(1:5000, 100000000)
+D = CUDA.CUSPARSE.CuSparseMatrixCSC(Float64, p)
+CUDA.memory_status()
+println("\n")
+measure_memory(Memoization.caches)
diff --git a/benchmark/mulit_gpu.jl b/benchmark/mulit_gpu.jl
@@ -0,0 +1,25 @@
+using CUDA
+
+function move_to_CUDA(a::Array{T,N}) where {T,N}
+    buf_a = Mem.alloc(Mem.Unified, sizeof(a))
+    d_a = unsafe_wrap(CuArray{T,N}, convert(CuPtr{T}, buf_a), size(a))
+    finalizer(d_a) do _
+        Mem.free(buf_a)
+    end
+    copyto!(d_a, a)
+    d_a
+end
+
+T = Float64
+n = 100
+gpus = Int(length(devices()))
+
+a = rand(T, n, n, gpus)
+a_d = move_to_CUDA(a)
+
+for (gpu, dev) ∈ enumerate(devices())
+    device!(dev)
+    @views a_d[:, :, gpu] .= 2 .* a_d[:, :, gpu]
+end
+
+a_d
diff --git a/benchmark/mulit_gpu2.jl b/benchmark/mulit_gpu2.jl
@@ -0,0 +1,10 @@
+using CUDA
+
+T = Float64
+n = 100
+gpus = Int(length(devices()))
+
+a = rand(T, n, n, gpus)
+a_d = cu(a, unified = true)
+
+a_d
diff --git a/benchmark/psvd.jl b/benchmark/psvd.jl
@@ -0,0 +1,86 @@
+using LinearAlgebra, MKL
+using TensorOperations
+using TensorCast
+using TSVD
+using LowRankApprox
+using RandomizedLinAlg
+using FameSVD
+
+N = 100
+cut = 8
+
+mat = rand(100, 100);
+U, S, V = svd(mat);
+S = exp.(collect(0:N-1) * log(4 / 5));
+
+mat = U * Diagonal(S) * V';
+U, S, V = svd(mat);
+
+U, S, V = U[:, 1:cut], S[1:cut], V[:, 1:cut]
+mat1 = U * Diagonal(S) * V'
+println(S[1:cut])
+println(norm(mat - mat1))
+
+Up, Sp, Vp = psvd(mat, rank = 2 * cut)
+
+mat2 = Up[:, 1:cut] * Diagonal(Sp[1:cut]) * Vp[:, 1:cut]'
+
+println(Sp[1:cut])
+println(Sp[1:cut] - S[1:cut])
+println(norm(mat - mat2))
+
+# Vp = V
+
+C = mat * Vp
+println(size(C))
+Ut, _ = qr(C)
+Ut = Ut[:, 1:cut]
+println(size(Ut))
+C = Ut' * mat
+Vp, _ = qr(C')
+Vp = Vp[:, 1:cut]
+
+
+
+C = mat * Vp
+Uf, Sf, Vf = svd(C);
+Uf, Sf, Vf = Uf[:, 1:cut], Sf[1:cut], Vf[:, 1:cut]
+mat3 = Uf * Diagonal(Sf) * Vf' * Vp'
+println(Sf - S[1:cut])
+println(norm(mat - mat3))
+
+nothing
+
+
+iter = 5
+Up, Sp, Vp = [], [], []
+for i = 1:iter
+    Utemp, Stemp, Vtemp = psvd(mat, rank = 2 * cut)
+    push!(Up, Utemp)
+    push!(Sp, Stemp)
+    push!(Vp, Vtemp)
+end
+
+Ups = hcat(Up...)
+Vps = hcat(Vp...)
+Sps = vcat(Sp...) / iter
+println(size(Ups), " ", size(Vps), " ", size(Sps))
+println(size(Up[1]), " ", size(Vp[1]), " ", size(Sp[1]))
+
+Uq, Ur = qr(Ups)
+Vq, Vr = qr(Vps)
+
+Ut, St, Vt = svd(Ur * Diagonal(Sps) * Vr')
+
+U2 = Uq * Ut[:, 1:cut]
+V2 = Vq * Vt[:, 1:cut]
+S2 = St[1:cut]
+println(St)
+println(S2)
+
+mat4 = U2 * Diagonal(S2) * V2'
+
+
+println(norm(mat1 - mat2))
+println(norm(mat1 - mat3))
+println(norm(mat1 - mat4))