Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try CuPy version of SuperLU. #34

Closed
loganbvh opened this issue Sep 8, 2023 · 12 comments
Closed

Try CuPy version of SuperLU. #34

loganbvh opened this issue Sep 8, 2023 · 12 comments

Comments

@loganbvh
Copy link
Owner

loganbvh commented Sep 8, 2023

https://docs.cupy.dev/en/stable/reference/generated/cupyx.scipy.sparse.linalg.factorized.html

The LU factorization is done on the CPU, but the linear solve is done on the GPU.

@loganbvh
Copy link
Owner Author

loganbvh commented Sep 19, 2023

@Kyroba I have opened a Pull Request (#36) in which I have implemented GPU acceleration via the CuPy library.

If you are interested, it would be great if you could try out this new version of tdgl to see if using the GPU improves performance. This would involve creating a new conda environment for testing, installing the appropriate version of tdgl from GitHub, and installing the appropriate version of cupy for your version of CUDA (see instructions here: https://docs.cupy.dev/en/stable/install.html#installing-cupy). You can find your version of CUDA using nvcc --version.

pip install git+https://github.com/loganbvh/py-tdgl.git@gpu
pip install cupy-cuda11x # for example, for CUDA v11.2 ~ 11.8

For more details, see https://py-tdgl--36.org.readthedocs.build/en/36/installation.html#gpu-acceleration.

@Kyroba
Copy link

Kyroba commented Sep 20, 2023

@loganbvh I was meaning to look into this but got distracted by experiments! I will do this tomorrow, thank you for following up on it!

@Kyroba
Copy link

Kyroba commented Sep 21, 2023

I could not get snakeviz to work, so the only comparison I can provide is total simulation time. I will post the simulation runtimes tomorrow morning. Judging from how it looks in the first half of the simulation there doesn't seem to be much improvement.

The GPU is definitely being used for the simulation:
image
image

One issue I have encountered whilst using the sparse_solver="cupy" option is that the addition of parameters isn't handled the same when using the default sparse_solver. Combining a current loop and a uniform field doesn't work as easily and results in this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 18
      4 options = tdgl.SolverOptions(
      5     solve_time=600,
      6     sparse_solver="cupy",
   (...)
      9     current_units="uA",
     10 )
     12 uni_parameter = ConstantField(
     13     0.6,  # Set your current value here
     14     field_units="mT",
     15     length_units="um",   # Set your length unit, default is "um"
     16 )
---> 18 zero_current_solution = tdgl.solve(
     19     device,
     20     options,
     21     # If applied_vector_potential is given as a single number,
     22     # it is interpreted to mean the vector potential associated with a
     23     # uniform out-of-plane magnetic field with the specified strength.
     24     applied_vector_potential = uni_parameter + loop_parameter,
     25 )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tdgl\solver\solve.py:139, in solve(device, options, applied_vector_potential, terminal_currents, disorder_epsilon, seed_solution)
    133 vector_potential_scale = (
    134     (ureg(field_units) * length_units / (Bc2 * xi * length_units))
    135     .to_base_units()
    136     .magnitude
    137 )
    138 A_kwargs = dict(t=0) if time_dependent_vector_potential else dict()
--> 139 vector_potential = applied_vector_potential_(x, y, z, **A_kwargs)
    140 vector_potential = np.asarray(vector_potential)[:, :2]
    141 if vector_potential.shape != x.shape + (2,):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tdgl\parameter.py:309, in CompositeParameter.__call__(self, x, y, z, t)
    307         value = operand(x, y, z, **kwargs)
    308     else:
--> 309         value = operand(x, y, z)
    310 else:
    311     value = operand

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tdgl\parameter.py:309, in CompositeParameter.__call__(self, x, y, z, t)
    307         value = operand(x, y, z, **kwargs)
    308     else:
--> 309         value = operand(x, y, z)
    310 else:
    311     value = operand

    [... skipping similar frames: CompositeParameter.__call__ at line 309 (52 times)]

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tdgl\parameter.py:309, in CompositeParameter.__call__(self, x, y, z, t)
    307         value = operand(x, y, z, **kwargs)
    308     else:
--> 309         value = operand(x, y, z)
    310 else:
    311     value = operand

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tdgl\parameter.py:146, in Parameter.__call__(self, x, y, z, t)
    139 def __call__(
    140     self,
    141     x: Union[int, float, np.ndarray],
   (...)
    144     t: Optional[float] = None,
    145 ) -> Union[int, float, np.ndarray]:
--> 146     cache_key = self._hash_args(x, y, z, t)
    147     if cache_key not in self._cache:
    148         kwargs = self.kwargs.copy()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\tdgl\parameter.py:132, in Parameter._hash_args(self, x, y, z, t)
    130 def _hash_args(self, x, y, z, t) -> str:
    131     return (
--> 132         hex(hash(tuple(self.kwargs.items())))
    133         + hashlib.sha1(np.ascontiguousarray(x)).hexdigest()
    134         + hashlib.sha1(np.ascontiguousarray(y)).hexdigest()
    135         + hashlib.sha1(np.ascontiguousarray(z)).hexdigest()
    136         + hex(hash(t))
    137     )

TypeError: unhashable type: 'numpy.ndarray'

Some extra information:
image, image

image
image

@Kyroba
Copy link

Kyroba commented Sep 21, 2023

Loop_parameter is derived from the following:

def generate_dots(n):
    # Calculate approximate number of dots along x and y based on square root of n
    nx = int(np.round(np.sqrt(n * (60/36))))
    ny = int(np.round(n/nx))
    
    # Generate evenly spaced coordinates for x and y
    x = np.linspace(-30, 30, nx)
    y = np.linspace(-18, 18, ny)
    
    # Generate a 2D grid of x and y coordinates
    xx, yy = np.meshgrid(x, y)
    
    # Flatten and stack these coordinates, and add z=0 plane
    positions = np.vstack((xx.ravel(), yy.ravel(), np.zeros_like(xx).ravel())).T
    
    return positions[:n]  # Trim to the desired number of points

n = 54  # Adjust this value as needed
positions = generate_dots(n)

from tdgl.sources import CurrentLoop

# Define constants
current = 10000
radius = 0.5
length_units = "um"

loop_parameter = 0

# Sum the potentials for the centers
for center in positions:
    loop_parameter += CurrentLoop(
        current=current,
        radius=radius,
        center=center,
        length_units=length_units
    )

Essentially using an array of positions for each loop center. This composite parameter works when not using the CuPy sparse solver.

@loganbvh
Copy link
Owner Author

Thanks for trying this, @Kyroba!

As a workaround for the Parameter issue, I think you can explicitly convert the loop center to a tuple to make it hashable:

loop_parameter = 0
# Sum the potentials for the centers
for center in positions:
    loop_parameter += CurrentLoop(
        current=current,
        radius=radius,
        center=tuple(center),
        length_units=length_units
    )

To be honest, I am not sure why that Parameter was working for sparse_solver != "cupy". Is the uni_parameter time-dependent?

It's disappointing that the performance is not much better. I will continue testing to see if I can improve things. If you can't get snakeviz to work, you can try the %%prun magic command in Jupyter for profiling (https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-prun). Profiling code that uses the GPU can be tricky since GPU operations are not synchronous with the CPU.

One thing you could try is to set options.save_every to a very large value. Every time the results are saved, quite a lot of data has to be transferred from the GPU to CPU, which can take a long time especially for a big mesh.

@loganbvh
Copy link
Owner Author

Also as a general rule, if you want to sum together many field sources, it will be more efficient to write a Python function that does the sum and then wrap that function in a single Parameter, rather than summing together many Parameters. For example:

from tdgl.em import current_loop_vector_potential

def many_loops_vector_potential(
    x, y, z, *,
    loop_centers,
    loop_currents,
    loop_radii,
    current_units="uA",
    field_units="mT",
    length_units="um",
):
    if z.ndim == 0:
        z = z * np.ones_like(x)
    positions = np.array([x.squeeze(), y.squeeze(), z.squeeze()]).T

    if isinstance(loop_currents, (int, float)):
        loop_currents = loop_currents * np.ones(len(loop_centers))
    if isinstance(loop_radii, (int, float)):
        loop_radii = loop_radii * np.ones(len(loop_centers))

    assert len(loop_currents) == len(loop_centers)
    assert len(loop_radii) == len(loop_centers)
    
    A_total = np.zeros((len(x), 3), dtype=float)
    for current, center, radius in zip(loop_currents, loop_centers, loop_radii):
        A_loop = current_loop_vector_potential(
            positions,
            loop_center=center,
            loop_radius=radius,
            current=current,
            current_units=current_units,
            length_units=length_units,
        )
        A_total += A_loop.to(f"{field_units} * {length_units}").magnitude
        
    return A_total

# Define constants
current = 10000
radius = 0.5
length_units = "um"
n = 54  # Adjust this value as needed
positions = generate_dots(n)

loop_parameter = tdgl.Parameter(
    many_loops_vector_potential,
    loop_centers=positions,
    loop_currents=current,
    loop_radii=radius,
    length_units=length_units,
)

A_applied = uni_parameter + loop_parameter

@Kyroba
Copy link

Kyroba commented Sep 22, 2023

Here are the total times from the same simulation:

GPU
image

CPU
image

So, the GPU performance seems slower but perhaps it is better for an extremely large geometry? I can try the suggestions you have provided and test it again.

Thank you for the solutions to the composite parameter!

@Kyroba
Copy link

Kyroba commented Sep 22, 2023

Increasing options.save_every did not seem to improve simulation time but did save on file size by a lot which is expected (from 15gb to 250mb).
image

%%prun

from tdgl.sources import CurrentLoop
from tdgl.sources import ConstantField

options = tdgl.SolverOptions(
    save_every = 100000,
    solve_time=600,
    sparse_solver="cupy",
    output_file=os.path.join("Z:/Archive/TDGL Sims/GPU Testing", "GPU.h5"),
    field_units = "mT",
    current_units="uA",
)

uni_parameter = ConstantField(
    0.6,  # Set your current value here
    field_units="mT",
    length_units="um",   # Set your length unit, default is "um"
)

zero_current_solution = tdgl.solve(
    device,
    options,
    # If applied_vector_potential is given as a single number,
    # it is interpreted to mean the vector potential associated with a
    # uniform out-of-plane magnetic field with the specified strength.
    applied_vector_potential = uni_parameter,
)

And the output from %prun, with most of it cut-off since total time for those is 0.0:

        100471571 function calls (100363831 primitive calls) in 8217.414 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   107614 7822.532    0.073 7837.104    0.073 cusparse.py:1513(csrsm2)
    53807  132.088    0.002  144.306    0.003 operators.py:337(get_supercurrent)
    53807   63.331    0.001   82.351    0.002 euler.py:15(solve_for_psi_squared)
    53807   30.391    0.001 8150.176    0.151 solve.py:243(update)
   159761   21.486    0.000   21.486    0.000 socket.py:621(send)
   425995   12.625    0.000   12.625    0.000 {method 'acquire' of '_thread.lock' objects}
    53807    9.106    0.000 7851.132    0.146 _solve.py:548(solve)
   269035    6.586    0.000    6.586    0.000 {built-in method cupy_backends.cuda.libs.cusparse.spMV}
   269035    5.537    0.000   32.703    0.000 cusparse.py:1370(spmv)
   161421    5.301    0.000    5.451    0.000 runner.py:183(append)
   107614    4.936    0.000    4.936    0.000 {built-in method cupy_backends.cuda.libs.cusparse.dcsrsm2_analysis}
   269041    4.403    0.000    4.403    0.000 {method 'memset_async' of 'cupy.cuda.memory.MemoryPointer' objects}
    53807    3.740    0.000    3.740    0.000 {method 'any' of 'cupy._core.core._ndarray_base' objects}
   376649    3.518    0.000    3.518    0.000 basic.py:7(empty)
   215228    3.457    0.000   33.677    0.000 _csr.py:148(__mul__)
 15761835    3.236    0.000    4.567    0.000 utils.py:374(<genexpr>)
   107614    3.151    0.000    3.151    0.000 {built-in method cupy_backends.cuda.libs.cusparse.dcsrsm2_solve}
    53807    3.013    0.000    3.013    0.000 {method 'max' of 'cupy._core.core._ndarray_base' objects}
1237795/1130146    2.875    0.000    6.755    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}

@loganbvh
Copy link
Owner Author

Thanks @Kyroba. cusparse csrsm2 is the CUDA routine for solving a linear system given an LU factorization, so it makes sense that this takes the most time. However I'm still surprised it is slower than the CPU on a problem of this size. I'll continue testing on my end and let you know if I figure anything out. Thanks again for your help!

@loganbvh
Copy link
Owner Author

(Writing some notes here for myself.)

Testing on Google Colab at 0d45f3f using https://github.com/pyutils/line_profiler.

Test cases

  1. gpu = False, sparse_solver = "superlu". This is the standard case where everything goes through numpy/scipy on the CPU.
  2. gpu = True, sparse_solver = "superlu". Here, everything except the actual linear solve is performed on the GPU using CuPy. For the linear solve, the right-hand-side of the equation (the b in A @ x = b) is prepared on the GPU, then moved to CPU memory, solved using SuperLU on the CPU, then moved back to the GPU.
  3. gpu = True, sparse_solver = "cupy". Here, everything including the linear solve is performed on the GPU using CuPy. The CuPy version of scipy.sparse.linalg.factorized() calls cusparse.csrsm2 to do the linear solve given an LU factorization. The LU factorization (which is done only once at the beginning of the simulation) is done using SuperLU on the CPU.

Test setup

"Quickstart" model with max_edge_length = xi / 5 = 0.5 / 5 (27502 mesh sites)
length_units = "um"
# Material parameters
xi = 0.5
london_lambda = 2
d = 0.1
layer = tdgl.Layer(coherence_length=xi, london_lambda=london_lambda, thickness=d, gamma=1)

# Device geometry
total_width = 5
total_length = 3.5 * total_width
link_width = total_width / 3
# Outer geometry of the film
right_notch = (
    tdgl.Polygon(points=box(total_width))
    .rotate(45)
    .translate(dx=(np.sqrt(2) * total_width + link_width) / 2)
)
left_notch = right_notch.scale(xfact=-1)
film = (
    tdgl.Polygon("film", points=box(total_width, total_length))
    .difference(right_notch, left_notch)
    .resample(801)
    .buffer(0)
)
# Holes in the film
round_hole = (
    tdgl.Polygon("round_hole", points=circle(link_width / 2))
    .translate(dy=total_length / 5)
    .resample(201)
)
square_hole = (
    tdgl.Polygon("square_hole", points=box(link_width))
    .rotate(45)
    .translate(dy=-total_length / 5)
    .resample(201)
)
# Current terminals
source = (
    tdgl.Polygon("source", points=box(1.1 * total_width, total_length / 100))
    .translate(dy=total_length / 2)
)
drain = source.scale(yfact=-1).set_name("drain")
#  Voltage measurement points
probe_points = [(0, total_length / 2.5), (0, -total_length / 2.5)]

device = tdgl.Device(
    "weak_link",
    layer=layer,
    film=film,
    holes=[round_hole, square_hole],
    terminals=[source, drain],
    probe_points=probe_points,
    length_units=length_units,
)
device.make_mesh(max_edge_length=xi / 5)

device.mesh_stats_dict() == {'num_sites': 27502,
 'num_elements': 53799,
 'min_edge_length': 0.023546873824451527,
 'max_edge_length': 0.09911161817920922,
 'mean_edge_length': 0.05888391564333439,
 'min_area': 0.0003082034156361785,
 'max_area': 0.005147875450603828,
 'mean_area': 0.0027992447651071306,
 'coherence_length': 0.5,
 'length_units': 'um'}

options = tdgl.SolverOptions(
solve_time=10,
field_units = "mT",
current_units="uA",
gpu=False,
sparse_solver="superlu",
)
kwargs = dict(
device=device,
options=options,
terminal_currents=dict(source=12, drain=-12)
)

%lprun -f tdgl.TDGLSolver.update -f tdgl.TDGLSolver.solve_for_psi_squared -f tdgl.TDGLSolver.solve_for_observables -s -u 1 tdgl.solve(**kwargs)

Results

1. gpu = False, sparse_solver = "superlu" (53 seconds total)

Profile of tdgl.TDGLSolver.update()
Total time: 51.7492 s
File: /usr/local/lib/python3.10/dist-packages/tdgl/solver/solver.py
Function: update at line 545

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   545                                               def update(
   546                                                   self,
   547                                                   state: Dict[str, Union[int, float]],
   548                                                   running_state: RunningState,
   549                                                   dt: float,
   550                                                   *,
   551                                                   psi: np.ndarray,
   552                                                   mu: np.ndarray,
   553                                                   supercurrent: np.ndarray,
   554                                                   normal_current: np.ndarray,
   555                                                   induced_vector_potential: np.ndarray,
   556                                                   applied_vector_potential: Optional[np.ndarray] = None,
   557                                               ) -> SolverResult:
   558                                                   """This method is called at each time step to update the state of the system.
   559                                           
   560                                                   Args:
   561                                                       state: The solver state, i.e., the solve step, time, and time step
   562                                                       running_state: A container for scalar data that is saved at each time step
   563                                                       dt: The time step for the previous solve step
   564                                                       psi: The order parameter
   565                                                       mu: The scalar potential
   566                                                       supercurrent: The supercurrent density
   567                                                       normal_current: The normal current density
   568                                                       induced_vector_potential: The induced vector potential
   569                                                       applied_vector_potential: The applied vector potential. This will be ``None``
   570                                                           in the case of a time-independent vector potential.
   571                                           
   572                                                   Returns:
   573                                                       A :class:`tdgl.SolverResult` instance for the solve step.
   574                                                   """
   575      3407          0.0      0.0      0.0          xp = self.xp
   576      3407          0.0      0.0      0.0          options = self.options
   577      3407          0.0      0.0      0.0          operators = self.operators
   578                                           
   579      3407          0.0      0.0      0.0          A_induced = induced_vector_potential
   580      3407          0.0      0.0      0.0          A_applied = applied_vector_potential
   581      3407          0.0      0.0      0.0          step = state["step"]
   582      3407          0.0      0.0      0.0          time = state["time"]
   583                                           
   584      3407          0.1      0.0      0.2          self.update_mu_boundary(time)
   585      3407          0.0      0.0      0.0          vector_potential = self.vector_potential
   586      3407          0.0      0.0      0.0          dA_dt = 0.0
   587      3407          0.0      0.0      0.0          if self.time_dependent_vector_potential:
   588                                                       vector_potential, dA_dt = self.update_applied_vector_potential(time, dt)
   589      3407          0.0      0.0      0.0          self.vector_potential = vector_potential
   590                                           
   591      3407          1.7      0.0      3.3          old_sq_psi = xp.absolute(psi) ** 2
   592      3407          0.0      0.0      0.0          screening_error = np.inf
   593      3407          0.0      0.0      0.0          A_induced_vals = []
   594      3407          0.0      0.0      0.0          velocity = [0]  # Velocity for Polyak's method
   595                                                   # This loop runs only once if options.include_screening is False
   596      3407          0.0      0.0      0.0          for screening_iteration in itertools.count():
   597      3407          0.0      0.0      0.0              if screening_error < options.screening_tolerance:
   598                                                           break  # Screening calculation converged.
   599      3407          0.0      0.0      0.0              if screening_iteration > options.max_iterations_per_step:
   600                                                           raise RuntimeError(
   601                                                               f"Screening calculation failed to converge at step {step} after"
   602                                                               f" {options.max_iterations_per_step} iterations. Relative error in"
   603                                                               f" induced vector potential: {screening_error:.2e}"
   604                                                               f" (tolerance: {options.screening_tolerance:.2e})."
   605                                                           )
   606      3407          0.0      0.0      0.0              if options.include_screening:
   607                                                           # Update the link variables in the covariant Laplacian and gradient
   608                                                           # for psi based on the induced vector potential from the previous iteration.
   609                                                           operators.set_link_exponents(vector_potential + A_induced)
   610                                                           if self.time_dependent_vector_potential:
   611                                                               dA_dt = xp.einsum(
   612                                                                   "ij, ij -> i",
   613                                                                   (vector_potential + A_induced - A_applied) / dt,
   614                                                                   self.normalized_directions,
   615                                                               )
   616                                           
   617                                                       # Adjust the time step and calculate the new the order parameter
   618      3407          0.0      0.0      0.0              if screening_iteration == 0:
   619                                                           # Find a new time step only for the first screening iteration.
   620      3407          0.0      0.0      0.0                  dt = self.tentative_dt
   621                                           
   622      6814         15.6      0.0     30.2              psi, abs_sq_psi, dt = self.adaptive_euler_step(
   623      3407          0.0      0.0      0.0                  step, psi, old_sq_psi, mu, dt
   624                                                       )
   625      3407         33.3      0.0     64.4              mu, supercurrent, normal_current = self.solve_for_observables(psi, dA_dt)
   626                                           
   627      3407          0.0      0.0      0.0              if options.include_screening:
   628                                                           A_induced, screening_error = self.get_induced_vector_potential(
   629                                                               supercurrent + normal_current, A_induced_vals, velocity
   630                                                           )
   631                                                       else:
   632      3407          0.0      0.0      0.0                  break
   633                                           
   634      3407          0.0      0.0      0.1          running_state.append("dt", dt)
   635      3407          0.0      0.0      0.0          if self.probe_points is not None:
   636                                                       # Update the voltage and phase difference
   637      3407          0.0      0.0      0.1              running_state.append("mu", mu[self.probe_points])
   638      3407          0.1      0.0      0.2              running_state.append("theta", xp.angle(psi[self.probe_points]))
   639      3407          0.0      0.0      0.0          if options.include_screening:
   640                                                       running_state.append("screening_iterations", screening_iteration)
   641                                           
   642      3407          0.0      0.0      0.0          if options.adaptive:
   643                                                       # Compute the max abs change in |psi|^2, averaged over the adaptive window,
   644                                                       # and use it to select a new time step.
   645      3407          0.3      0.0      0.5              self.d_psi_sq_vals.append(float(xp.absolute(abs_sq_psi - old_sq_psi).max()))
   646      3407          0.0      0.0      0.0              window = options.adaptive_window
   647      3407          0.0      0.0      0.0              if step > window:
   648      6792          0.0      0.0      0.0                  new_dt = options.dt_init / max(
   649      3396          0.2      0.0      0.4                      1e-10, np.mean(self.d_psi_sq_vals[-window:])
   650                                                           )
   651      3396          0.2      0.0      0.4                  self.tentative_dt = np.clip(0.5 * (new_dt + dt), 0, self.dt_max)
   652                                           
   653      3407          0.0      0.0      0.0          results = [dt, psi, mu, supercurrent, normal_current, A_induced]
   654      3407          0.0      0.0      0.0          if self.time_dependent_vector_potential:
   655                                                       results.append(vector_potential)
   656                                                   else:
   657      3407          0.0      0.0      0.0              results.append(None)
   658      3407          0.0      0.0      0.0          return SolverResult(*results)

2. gpu = True, sparse_solver = "superlu" (36 seconds total)

Profile of tdgl.TDGLSolver.update()
Total time: 35.7374 s
File: /usr/local/lib/python3.10/dist-packages/tdgl/solver/solver.py
Function: update at line 545

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   545                                               def update(
   546                                                   self,
   547                                                   state: Dict[str, Union[int, float]],
   548                                                   running_state: RunningState,
   549                                                   dt: float,
   550                                                   *,
   551                                                   psi: np.ndarray,
   552                                                   mu: np.ndarray,
   553                                                   supercurrent: np.ndarray,
   554                                                   normal_current: np.ndarray,
   555                                                   induced_vector_potential: np.ndarray,
   556                                                   applied_vector_potential: Optional[np.ndarray] = None,
   557                                               ) -> SolverResult:
   558                                                   """This method is called at each time step to update the state of the system.
   559                                           
   560                                                   Args:
   561                                                       state: The solver state, i.e., the solve step, time, and time step
   562                                                       running_state: A container for scalar data that is saved at each time step
   563                                                       dt: The time step for the previous solve step
   564                                                       psi: The order parameter
   565                                                       mu: The scalar potential
   566                                                       supercurrent: The supercurrent density
   567                                                       normal_current: The normal current density
   568                                                       induced_vector_potential: The induced vector potential
   569                                                       applied_vector_potential: The applied vector potential. This will be ``None``
   570                                                           in the case of a time-independent vector potential.
   571                                           
   572                                                   Returns:
   573                                                       A :class:`tdgl.SolverResult` instance for the solve step.
   574                                                   """
   575      3416          0.0      0.0      0.0          xp = self.xp
   576      3416          0.0      0.0      0.0          options = self.options
   577      3416          0.0      0.0      0.0          operators = self.operators
   578                                           
   579      3416          0.0      0.0      0.0          A_induced = induced_vector_potential
   580      3416          0.0      0.0      0.0          A_applied = applied_vector_potential
   581      3416          0.0      0.0      0.0          step = state["step"]
   582      3416          0.0      0.0      0.0          time = state["time"]
   583                                           
   584      3416          0.1      0.0      0.2          self.update_mu_boundary(time)
   585      3416          0.0      0.0      0.0          vector_potential = self.vector_potential
   586      3416          0.0      0.0      0.0          dA_dt = 0.0
   587      3416          0.0      0.0      0.0          if self.time_dependent_vector_potential:
   588                                                       vector_potential, dA_dt = self.update_applied_vector_potential(time, dt)
   589      3416          0.0      0.0      0.0          self.vector_potential = vector_potential
   590                                           
   591      3416          0.4      0.0      1.1          old_sq_psi = xp.absolute(psi) ** 2
   592      3416          0.0      0.0      0.0          screening_error = np.inf
   593      3416          0.0      0.0      0.0          A_induced_vals = []
   594      3416          0.0      0.0      0.0          velocity = [0]  # Velocity for Polyak's method
   595                                                   # This loop runs only once if options.include_screening is False
   596      3416          0.0      0.0      0.0          for screening_iteration in itertools.count():
   597      3416          0.0      0.0      0.0              if screening_error < options.screening_tolerance:
   598                                                           break  # Screening calculation converged.
   599      3416          0.0      0.0      0.0              if screening_iteration > options.max_iterations_per_step:
   600                                                           raise RuntimeError(
   601                                                               f"Screening calculation failed to converge at step {step} after"
   602                                                               f" {options.max_iterations_per_step} iterations. Relative error in"
   603                                                               f" induced vector potential: {screening_error:.2e}"
   604                                                               f" (tolerance: {options.screening_tolerance:.2e})."
   605                                                           )
   606      3416          0.0      0.0      0.0              if options.include_screening:
   607                                                           # Update the link variables in the covariant Laplacian and gradient
   608                                                           # for psi based on the induced vector potential from the previous iteration.
   609                                                           operators.set_link_exponents(vector_potential + A_induced)
   610                                                           if self.time_dependent_vector_potential:
   611                                                               dA_dt = xp.einsum(
   612                                                                   "ij, ij -> i",
   613                                                                   (vector_potential + A_induced - A_applied) / dt,
   614                                                                   self.normalized_directions,
   615                                                               )
   616                                           
   617                                                       # Adjust the time step and calculate the new the order parameter
   618      3416          0.0      0.0      0.0              if screening_iteration == 0:
   619                                                           # Find a new time step only for the first screening iteration.
   620      3416          0.0      0.0      0.0                  dt = self.tentative_dt
   621                                           
   622      6832          5.1      0.0     14.3              psi, abs_sq_psi, dt = self.adaptive_euler_step(
   623      3416          0.0      0.0      0.0                  step, psi, old_sq_psi, mu, dt
   624                                                       )
   625      3416         27.7      0.0     77.6              mu, supercurrent, normal_current = self.solve_for_observables(psi, dA_dt)
   626                                           
   627      3416          0.0      0.0      0.0              if options.include_screening:
   628                                                           A_induced, screening_error = self.get_induced_vector_potential(
   629                                                               supercurrent + normal_current, A_induced_vals, velocity
   630                                                           )
   631                                                       else:
   632      3416          0.0      0.0      0.0                  break
   633                                           
   634      3416          0.2      0.0      0.4          running_state.append("dt", dt)
   635      3416          0.0      0.0      0.0          if self.probe_points is not None:
   636                                                       # Update the voltage and phase difference
   637      3416          0.7      0.0      1.8              running_state.append("mu", mu[self.probe_points])
   638      3416          0.6      0.0      1.7              running_state.append("theta", xp.angle(psi[self.probe_points]))
   639      3416          0.0      0.0      0.0          if options.include_screening:
   640                                                       running_state.append("screening_iterations", screening_iteration)
   641                                           
   642      3416          0.0      0.0      0.0          if options.adaptive:
   643                                                       # Compute the max abs change in |psi|^2, averaged over the adaptive window,
   644                                                       # and use it to select a new time step.
   645      3416          0.5      0.0      1.4              self.d_psi_sq_vals.append(float(xp.absolute(abs_sq_psi - old_sq_psi).max()))
   646      3416          0.0      0.0      0.0              window = options.adaptive_window
   647      3416          0.0      0.0      0.0              if step > window:
   648      6810          0.0      0.0      0.0                  new_dt = options.dt_init / max(
   649      3405          0.2      0.0      0.5                      1e-10, np.mean(self.d_psi_sq_vals[-window:])
   650                                                           )
   651      3405          0.2      0.0      0.5                  self.tentative_dt = np.clip(0.5 * (new_dt + dt), 0, self.dt_max)
   652                                           
   653      3416          0.0      0.0      0.0          results = [dt, psi, mu, supercurrent, normal_current, A_induced]
   654      3416          0.0      0.0      0.0          if self.time_dependent_vector_potential:
   655                                                       results.append(vector_potential)
   656                                                   else:
   657      3416          0.0      0.0      0.0              results.append(None)
   658      3416          0.0      0.0      0.0          return SolverResult(*results)

3. gpu = True, sparse_solver = "cupy" (83 seconds total)

Profile of tdgl.TDGLSolver.update()
Total time: 81.71 s
File: /usr/local/lib/python3.10/dist-packages/tdgl/solver/solver.py
Function: update at line 545

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   545                                               def update(
   546                                                   self,
   547                                                   state: Dict[str, Union[int, float]],
   548                                                   running_state: RunningState,
   549                                                   dt: float,
   550                                                   *,
   551                                                   psi: np.ndarray,
   552                                                   mu: np.ndarray,
   553                                                   supercurrent: np.ndarray,
   554                                                   normal_current: np.ndarray,
   555                                                   induced_vector_potential: np.ndarray,
   556                                                   applied_vector_potential: Optional[np.ndarray] = None,
   557                                               ) -> SolverResult:
   558                                                   """This method is called at each time step to update the state of the system.
   559                                           
   560                                                   Args:
   561                                                       state: The solver state, i.e., the solve step, time, and time step
   562                                                       running_state: A container for scalar data that is saved at each time step
   563                                                       dt: The time step for the previous solve step
   564                                                       psi: The order parameter
   565                                                       mu: The scalar potential
   566                                                       supercurrent: The supercurrent density
   567                                                       normal_current: The normal current density
   568                                                       induced_vector_potential: The induced vector potential
   569                                                       applied_vector_potential: The applied vector potential. This will be ``None``
   570                                                           in the case of a time-independent vector potential.
   571                                           
   572                                                   Returns:
   573                                                       A :class:`tdgl.SolverResult` instance for the solve step.
   574                                                   """
   575      3414          0.0      0.0      0.0          xp = self.xp
   576      3414          0.0      0.0      0.0          options = self.options
   577      3414          0.0      0.0      0.0          operators = self.operators
   578                                           
   579      3414          0.0      0.0      0.0          A_induced = induced_vector_potential
   580      3414          0.0      0.0      0.0          A_applied = applied_vector_potential
   581      3414          0.0      0.0      0.0          step = state["step"]
   582      3414          0.0      0.0      0.0          time = state["time"]
   583                                           
   584      3414          0.1      0.0      0.1          self.update_mu_boundary(time)
   585      3414          0.0      0.0      0.0          vector_potential = self.vector_potential
   586      3414          0.0      0.0      0.0          dA_dt = 0.0
   587      3414          0.0      0.0      0.0          if self.time_dependent_vector_potential:
   588                                                       vector_potential, dA_dt = self.update_applied_vector_potential(time, dt)
   589      3414          0.0      0.0      0.0          self.vector_potential = vector_potential
   590                                           
   591      3414          0.6      0.0      0.7          old_sq_psi = xp.absolute(psi) ** 2
   592      3414          0.0      0.0      0.0          screening_error = np.inf
   593      3414          0.0      0.0      0.0          A_induced_vals = []
   594      3414          0.0      0.0      0.0          velocity = [0]  # Velocity for Polyak's method
   595                                                   # This loop runs only once if options.include_screening is False
   596      3414          0.0      0.0      0.0          for screening_iteration in itertools.count():
   597      3414          0.0      0.0      0.0              if screening_error < options.screening_tolerance:
   598                                                           break  # Screening calculation converged.
   599      3414          0.0      0.0      0.0              if screening_iteration > options.max_iterations_per_step:
   600                                                           raise RuntimeError(
   601                                                               f"Screening calculation failed to converge at step {step} after"
   602                                                               f" {options.max_iterations_per_step} iterations. Relative error in"
   603                                                               f" induced vector potential: {screening_error:.2e}"
   604                                                               f" (tolerance: {options.screening_tolerance:.2e})."
   605                                                           )
   606      3414          0.0      0.0      0.0              if options.include_screening:
   607                                                           # Update the link variables in the covariant Laplacian and gradient
   608                                                           # for psi based on the induced vector potential from the previous iteration.
   609                                                           operators.set_link_exponents(vector_potential + A_induced)
   610                                                           if self.time_dependent_vector_potential:
   611                                                               dA_dt = xp.einsum(
   612                                                                   "ij, ij -> i",
   613                                                                   (vector_potential + A_induced - A_applied) / dt,
   614                                                                   self.normalized_directions,
   615                                                               )
   616                                           
   617                                                       # Adjust the time step and calculate the new the order parameter
   618      3414          0.0      0.0      0.0              if screening_iteration == 0:
   619                                                           # Find a new time step only for the first screening iteration.
   620      3414          0.0      0.0      0.0                  dt = self.tentative_dt
   621                                           
   622      6828          5.6      0.0      6.8              psi, abs_sq_psi, dt = self.adaptive_euler_step(
   623      3414          0.0      0.0      0.0                  step, psi, old_sq_psi, mu, dt
   624                                                       )
   625      3414          6.9      0.0      8.4              mu, supercurrent, normal_current = self.solve_for_observables(psi, dA_dt)
   626                                           
   627      3414          0.0      0.0      0.0              if options.include_screening:
   628                                                           A_induced, screening_error = self.get_induced_vector_potential(
   629                                                               supercurrent + normal_current, A_induced_vals, velocity
   630                                                           )
   631                                                       else:
   632      3414          0.0      0.0      0.0                  break
   633                                           
   634      3414          0.1      0.0      0.2          running_state.append("dt", dt)
   635      3414          0.0      0.0      0.0          if self.probe_points is not None:
   636                                                       # Update the voltage and phase difference
   637      3414          0.7      0.0      0.8              running_state.append("mu", mu[self.probe_points])
   638      3414          0.6      0.0      0.8              running_state.append("theta", xp.angle(psi[self.probe_points]))
   639      3414          0.0      0.0      0.0          if options.include_screening:
   640                                                       running_state.append("screening_iterations", screening_iteration)
   641                                           
   642      3414          0.0      0.0      0.0          if options.adaptive:
   643                                                       # Compute the max abs change in |psi|^2, averaged over the adaptive window,
   644                                                       # and use it to select a new time step.
   645      3414         66.7      0.0     81.7              self.d_psi_sq_vals.append(float(xp.absolute(abs_sq_psi - old_sq_psi).max()))
   646      3414          0.0      0.0      0.0              window = options.adaptive_window
   647      3414          0.0      0.0      0.0              if step > window:
   648      6806          0.0      0.0      0.0                  new_dt = options.dt_init / max(
   649      3403          0.2      0.0      0.3                      1e-10, np.mean(self.d_psi_sq_vals[-window:])
   650                                                           )
   651      3403          0.2      0.0      0.2                  self.tentative_dt = np.clip(0.5 * (new_dt + dt), 0, self.dt_max)
   652                                           
   653      3414          0.0      0.0      0.0          results = [dt, psi, mu, supercurrent, normal_current, A_induced]
   654      3414          0.0      0.0      0.0          if self.time_dependent_vector_potential:
   655                                                       results.append(vector_potential)
   656                                                   else:
   657      3414          0.0      0.0      0.0              results.append(None)
   658      3414          0.0      0.0      0.0          return SolverResult(*results)

Results with a bigger mesh

device.mesh_stats_dict() == {'num_sites': 77956,
 'num_elements': 153098,
 'min_edge_length': 0.011923119297168312,
 'max_edge_length': 0.05982660218605496,
 'mean_edge_length': 0.034855759324415085,
 'min_area': 7.306020090966249e-05,
 'max_area': 0.0018346826153318497,
 'mean_area': 0.0009875635412817574,
 'coherence_length': 0.5,
 'length_units': 'um'}
  1. gpu = False, sparse_solver = "superlu": 405 seconds total, 279 seconds spent in solve_for_observables, 220 seconds spent in SuperLU.solve
  2. gpu = True, sparse_solver = "superlu": 276 seconds total, 246 seconds spent in solve_for_observables, 222 seconds spent in SuperLU.solve
  3. gpu = True, sparse_solver = "cupy": 584 seconds total, approximately 584 - (276 - 222) = 530 seconds spent in cusparse.csrsm2

@loganbvh
Copy link
Owner Author

@Kyroba after some more testing (see above if you are interested), I have found that the GPU can provide a significant speedup (over 30% for the model I tested) but only if the main linear solve portion of the calculation is still done on the CPU. In other words, cusparse.csrsm2 on the GPU is much slower than SuperLU on the CPU for some reason, but using the GPU helps for the rest of the calculations. This is a bit surprising to me, but it is encouraging.

I expect that when using the GPU (options.gpu = True, options.sparse_solver = "superlu", "pardiso", or "umfpack"), the fraction of time spent inside SuperLU/PARDISO/UMFPACK) will increase toward 100% with increasing mesh size. Simulations including screening should also be much faster with options.gpu = True, options.sparse_solver = "superlu".

I still need to update the documentation, etc. over the coming days before merging all of these changes and making a new release. However if you would like, you are welcome to try it on the gpu branch

# In the conda env you used for the testing above
pip uninstall tdgl --yes
pip install git+https://github.com/loganbvh/py-tdgl.git@gpu

@loganbvh
Copy link
Owner Author

Closing this issue because I have merged #36 and released https://pypi.org/project/tdgl/0.6.0/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants