Trilinos offers a broad range of linear solver capabilities: dense and sparse direct solvers, iterative solvers, shared-memory preconditioners local to a compute node, and scalable distributed-memory domain decomposition and multigrid methods. The capabilities described in this section focus on the Tpetra software stack. All of these solvers are built on top of Kokkos and are GPU capable to varying degrees; any exceptions are noted in the detailed descriptions below.
\subsection{Iterative Linear Solvers and Krylov Methods: Belos}
Belos~\cite{Bavier2012a} provides next-generation iterative linear solvers and
a powerful developer framework for solving linear systems and least-squares problems.
This framework provides several abstractions that facilitate code reuse and extensibility.
First, Belos decouples the algorithms from the implementation of the underlying linear
algebra objects using traits mechanisms. While concrete linear algebra adapters are provided
for Tpetra and Thyra, users can also implement their own interfaces to leverage any
existing investment in their description of matrices and vectors. Second, orthogonalization abstractions ease the integration of application- or architecture-specific orthogonalization methods; implementations of iterated classical Gram-Schmidt, DGKS-corrected classical Gram-Schmidt, and iterated modified Gram-Schmidt are provided. Finally, powerful solver managers encapsulate the strategy for solving a linear system or least-squares problem using abstract interfaces to iteration kernels. As a result, algorithms developed with the Belos abstractions are largely agnostic to the data layout in memory, the distribution across processors, and the implementation of parallel matrix/vector operations.
Belos supersedes the AztecOO package~\cite{Heroux2004a}, providing solver managers not only for single-vector iterative solvers but also for block extensions of these methods that solve linear systems with multiple right-hand sides. Single-vector solver managers include conjugate gradient (CG),
minimum residual (MINRES), generalized minimal residual (GMRES), stabilized biconjugate gradient (BiCGStab),
and transpose-free QMR (TFQMR) as well as ``seed'' solvers (hybrid GMRES, PCPG), subspace recycling solvers
(GCRO-DR, RCG), and least-squares solvers (LSQR). There are versions of CG, GMRES, and GCRO-DR that construct
a block Krylov subspace to solve for one or more right-hand sides. There are also ``pseudo-block'' versions
of CG, GMRES, and TFQMR that apply the single-vector iteration simultaneously on multiple right-hand sides,
where matrix and preconditioner applications are aggregated to achieve better performance. Any single-vector iteration can also be used to solve for multiple right-hand sides, but without a block or pseudo-block variant the right-hand sides are solved sequentially.
In recent years, application- or architecture-specific variants of the classic iterative methods have been
integrated into Belos. Some of these variations are so minor that they are enabled through input
parameters on the classic method. For instance, flexible GMRES~\cite{Saad1993a} is an option for the block
GMRES solver, and pipelined CG~\cite{GHYSELS2014224} is an option for both the block CG and pseudo-block CG
solver. Additionally, there are stand-alone solvers for a pseudo-block stochastic CG method~\cite{Parker2012SamplingGD}
and a fixed-point iteration method.
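To illustrate the typical usage pattern shared by these solver managers, the following minimal sketch solves a linear system with the pseudo-block CG solver manager through the generic Belos solver factory; it assumes that the Tpetra matrix \texttt{A}, initial guess \texttt{X}, and right-hand side \texttt{B} have already been constructed.
\begin{verbatim}
#include <BelosSolverFactory.hpp>
#include <BelosTpetraAdapter.hpp>
#include <Teuchos_ParameterList.hpp>

using SC = double;
using MV = Tpetra::MultiVector<SC>;
using OP = Tpetra::Operator<SC>;

// Define the linear problem A*X = B from preexisting Tpetra objects.
auto problem = Teuchos::rcp(new Belos::LinearProblem<SC,MV,OP>(A, X, B));
problem->setProblem();  // signal that the problem is fully specified

auto params = Teuchos::parameterList();
params->set("Convergence Tolerance", 1.0e-8);
params->set("Maximum Iterations", 500);

// The factory maps a solver name to the corresponding solver manager.
Belos::SolverFactory<SC,MV,OP> factory;
auto solver = factory.create("Pseudoblock CG", params);
solver->setProblem(problem);
Belos::ReturnType result = solver->solve();  // Converged or Unconverged
\end{verbatim}
A preconditioner (e.g., from Ifpack2 or MueLu, described below) can be attached to the \texttt{LinearProblem} via \texttt{setLeftPrec} or \texttt{setRightPrec} before \texttt{setProblem} is called.
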
\subsection{One-Level Domain Decomposition and Basic Iterative Methods: Ifpack2}
Trilinos provides domain decomposition approaches in two different
packages: Ifpack2 and ShyLU (specifically the ShyLU\_DD
subpackage). Ifpack2 implements overlapping additive Schwarz
approaches with several options for the local subdomain solves. The
local subdomain solvers may either be CPU-only incomplete factorization preconditioners implemented in Ifpack2 itself, such as ILU(k) and ILUT (thresholded ILU), or architecture-portable incomplete factorization and triangular solve algorithms implemented in Kokkos Kernels. Direct solvers can also be used as subdomain solvers, and there are options to call the shared-memory inexact incomplete factorization preconditioners in ShyLU. One-level preconditioners such as these are used as smoothers within multigrid methods, for solving ``simpler'' problems where the setup cost of a more robust multilevel method is prohibitive compared to the reduction in the number of iterations, or when the underlying problem is simply not amenable to multigrid methods.
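As a minimal sketch of this usage, the following builds a one-level additive Schwarz preconditioner with ILUT subdomain solves through the Ifpack2 factory; the matrix \texttt{A} (a Tpetra row matrix) is assumed to exist already, and the parameter values shown are illustrative.
\begin{verbatim}
#include <Ifpack2_Factory.hpp>

// A: Teuchos::RCP<const Tpetra::RowMatrix<double>> already assembled
Ifpack2::Factory factory;
auto prec = factory.create("SCHWARZ", A);

Teuchos::ParameterList params;
params.set("schwarz: overlap level", 1);                  // one layer of overlap
params.set("schwarz: inner preconditioner name", "ILUT"); // subdomain solver
prec->setParameters(params);
prec->initialize();  // structural (symbolic) setup
prec->compute();     // numeric setup
// prec implements Tpetra::Operator and can be handed to a Belos solver.
\end{verbatim}
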
Ifpack2 also supplies classic iterative methods based on matrix-splitting techniques, such as Jacobi
iteration, Gauss-Seidel, and an MPI-oriented hybrid of the two (Jacobi between ranks,
Gauss-Seidel within them), as well as preconditioners based on Chebyshev iteration.
These preconditioners are available in both point and block
forms and can operate on CSR or BSR matrices. In the block case,
line relaxation is also supported, while in the point case, techniques
like Vanka relaxation~\cite{Vanka1986} are possible. Auxiliary-space
smoothing for $H(\mathrm{curl})$ and $H(\mathrm{div})$ discretizations, in the style of
Hiptmair~\cite{Hiptmair1997}, is also supported.
While local kernels are implemented in Ifpack2 proper, it is also possible to call through
to Kokkos Kernels for portable shared-memory variants of the Gauss-Seidel and Jacobi iterations.
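For example, a Gauss-Seidel relaxation preconditioner, usable standalone or as a multigrid smoother, is obtained through the same factory mechanism; this short sketch reuses the preexisting matrix \texttt{A} from above, and the parameter values are again illustrative.
\begin{verbatim}
Ifpack2::Factory factory;
auto relax = factory.create("RELAXATION", A);

Teuchos::ParameterList params;
params.set("relaxation: type", "Gauss-Seidel");
params.set("relaxation: sweeps", 2);  // number of sweeps per application
relax->setParameters(params);
relax->initialize();
relax->compute();
\end{verbatim}
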
\subsection{Multilevel Domain Decomposition Methods: FROSch}
\label{ssec:frosch}
FROSch (Fast and Robust Overlapping Schwarz) is a framework for the construction of multilevel Schwarz domain decomposition solvers. Besides parallel scalability, FROSch focuses on broad applicability and robustness for challenging problems while allowing for an algebraic construction of the Schwarz operators, that is, a construction based only on the fully assembled system matrix. This is facilitated by an algebraic construction of an overlapping domain decomposition on the first level, as in Ifpack2, as well as the use of extension-based coarse spaces, such as in the classical two-level generalized Dryja--Smith--Widlund (GDSW) preconditioner~\cite{dohrmann_domain_2008} and related variants. While the first version was still based on the Epetra linear algebra framework~\cite{heinlein_parallel_2016}, the current implementation of FROSch is based on Xpetra~\cite{heinlein_frosch_2020}, which allows the use of both the Epetra and Tpetra linear algebra stacks. Algorithmic variants of Schwarz methods implemented in FROSch include:
\begin{itemize}
\item \emph{Extension-based coarse spaces based on a partition of unity on the interface}, such as classical GDSW, reduced dimension GDSW (RGDSW) coarse spaces, and multiscale finite element method (MsFEM) coarse spaces; cf.~\cite{heinlein_parallel_2016,heinlein_improving_2018};
\item \emph{Monolithic Schwarz preconditioners} for block systems; cf.~\cite{heinlein_monolithic_2019};
\item \emph{Multilevel Schwarz preconditioners}, which are obtained from two-level Schwarz preconditioners by recursively applying Schwarz preconditioners as inexact solvers for the coarse problems; cf.~\cite{heinlein_parallel_2022}.
\end{itemize}
FROSch has been applied to various challenging application problems, including scalar elliptic and elasticity problems~\cite{heinlein_parallel_2016}, computational fluid dynamics problems~\cite{heinlein_monolithic_2019}, pharmaco-mechanical interaction in arterial walls~\cite{balzani_computational_nodate}, and coupled multiphysics problems for land ice simulations~\cite{heinlein_frosch_2022}; the latter two have been solved using monolithic preconditioning techniques. To obtain robust convergence for heterogeneous model problems, spectral coarse spaces~\cite{heinlein_adaptive_2019} will be added in the near future. FROSch preconditioners have scaled to more than 200\,k processor cores on the Theta Cray XC40 supercomputer at the Argonne Leadership Computing Facility (ALCF); cf.~\cite{heinlein_parallel_2022}.
In the current implementation, FROSch assumes a one-to-one correspondence between subdomains and MPI ranks; however, through its interfaces to the other solver packages in Trilinos, inexact solvers can be employed on the subdomains. An extension to multiple subdomains per MPI rank is currently being implemented. Using Kokkos and Kokkos Kernels, which are available through the Tpetra linear algebra framework, FROSch has recently also been ported to GPUs~\cite{Yamazaki:2022:EST}, with performance gains for the triangular solves or inexact ILU solves on the GPUs.
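As a rough sketch of this usage pattern, a two-level Schwarz preconditioner is constructed algebraically from the assembled matrix. The class and method names below follow the FROSch examples, but the exact \texttt{initialize} overloads vary between versions, so this should be read as an illustration rather than a verbatim recipe; \texttt{K} denotes the assembled (Xpetra-wrapped) system matrix, and \texttt{dimension} and \texttt{dofsPerNode} are problem-specific inputs.
\begin{verbatim}
#include <FROSch_TwoLevelPreconditioner_def.hpp>

// K: assembled system matrix; parameterList: Schwarz options
auto prec = Teuchos::rcp(
    new FROSch::TwoLevelPreconditioner<SC,LO,GO,NO>(K, parameterList));
prec->initialize(dimension, dofsPerNode);  // algebraic setup of both levels
prec->compute();                           // numerical setup
// prec can then be applied as a preconditioner, e.g., within Belos.
\end{verbatim}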
A demo/tutorial for FROSch can be found at the GitHub repository~\cite{frosch_demo}.
\subsection{Multigrid Methods: MueLu}
MueLu is a flexible and scalable high-performance multigrid solver library.
It provides a variety of multigrid algorithms for problems ranging from Poisson-like operators, elasticity, convection-diffusion, Navier-Stokes, and Maxwell's equations
all the way to coupled multiphysics systems.
Besides its strong focus on aggregation-based algebraic multigrid (AMG) methods,
MueLu comes with specialized capabilities for (semi-)structured grids to perform semi-coarsening along grid lines,
while still forming the coarse operator via a Galerkin product (in contrast to classical geometric multigrid).
MueLu is extensible and allows for the research and development of new multigrid preconditioning methods.
Its weak and strong scalability, even for vector-valued PDEs on unstructured meshes,
has been demonstrated on up to 131,000 cores of a Cray XC40 and one million cores of a Blue Gene/Q system~\cite{Lin2017a,Thomas2019a}.
MueLu has a required dependency on Kokkos,
and the main code paths in MueLu (e.g., smoothed aggregation, semi-coarsening, p-coarsening, structured aggregation)
utilize Kokkos and have been shown to scale on device \cite{BettencourtBrownEtAl2021_EmpirePic}.
Some older algorithms such as energy minimization and the variable degree-of-freedom Laplacian have not been adapted to fully utilize Kokkos yet;
however, ongoing efforts aim to use Kokkos throughout MueLu and to merge serial and device-capable algorithms.
MueLu provides several approaches to constructing and solving the multilevel problem:
\begin{itemize}
\item \emph{Algebraic smoothed aggregation approach}~\cite{Vanek1996a}:
The matrix graph is colored to create aggregates (groups) of nodes.
These aggregates define a tentative projection operator.
A final projection operator is created by applying a smoother to the tentative operator.
There are a variety of deterministic and non-deterministic coloring algorithms implemented directly
in MueLu or in Kokkos Kernels. For a full description, see~\cite{BergerVergiat2023a}.
\item \emph{Algebraic multigrid for Maxwell’s equations}:
MueLu implements specialized solvers for the solution of curl-curl problems~\cite{BochevHuEtAl2008_AlgebraicMultigridApproachBased}.
Scaling results on Haswell, KNL, ARM and NVIDIA V100 GPUs at full machine scale can be found in~\cite{BettencourtBrownEtAl2021_EmpirePic}.
\item \emph{Multigrid for multiphysics}:
MueLu implements a toolbox to compose multilevel block preconditioners for block matrices arising from coupled multiphysics problems.
Applications range from the Navier-Stokes equations
and surface-coupled problems (as in fluid/structure interaction or contact mechanics~\cite{Wiesner2021a})
to volume-coupled problems (e.g., in magnetohydrodynamics~\cite{Ohm2022a}),
but also cover non-traditional approaches (e.g., mixed-dimensional modeling of fibers embedded into solid continua~\cite{Firmbach2024a}).
\item \emph{Semi-coarsening algebraic multigrid approach}~\cite{Tuminaro2016a}:
Specialized aggregation procedures for three-dimensional meshes generated by extrusion of a two-dimensional unstructured grid
first coarsen in the direction of extrusion, reducing the system to a two-dimensional representation, and then perform classical aggregation-based AMG
for the remaining coarsening steps.
\item \emph{AMG for (semi-)structured grids}:
MueLu has a structured aggregation capability in which the user may specify the coarsening rate used to compute interpolation operators.
The coarse operators are then formed via a Galerkin product to avoid remeshing on the coarse levels.
This work has been extended to semi-structured grids to leverage structured-grid computational performance also for globally unstructured grids~\cite{Mayr2022a}.
\item \emph{Geometric multigrid / matrix-free capabilities}:
MueLu treats many of the objects in its hierarchy as operators instead of matrices when possible.
This enables one to use MueLu in a matrix-free fashion provided the user can supply their own grid transfers and coarse operators.
% through factories such as \texttt{MatrixFreeTentativePFactory},
% which only requires aggregates in order to perform a grid transfer using a tentative prolongator.
% Other grid transfers such as structured grid transfers discussed above may also be performed in a matrix-free fashion,
% and simple smoothers such as Jacobi and Chebyshev may also be performed on matrix-free operators.
\end{itemize}
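In practice, a MueLu hierarchy is most easily created from a parameter list. The following minimal sketch builds a smoothed-aggregation AMG preconditioner for an existing Tpetra operator \texttt{A}; the parameter values shown are illustrative.
\begin{verbatim}
#include <MueLu_CreateTpetraPreconditioner.hpp>

// A: Teuchos::RCP<Tpetra::Operator<double>> wrapping the assembled matrix
Teuchos::ParameterList params;
params.set("verbosity", "low");
params.set("max levels", 4);
params.set("multigrid algorithm", "sa");   // smoothed aggregation
params.set("smoother: type", "CHEBYSHEV"); // level smoother

auto prec = MueLu::CreateTpetraPreconditioner(A, params);
// prec implements Tpetra::Operator and can be used, e.g., as the
// preconditioner of a Belos Krylov solver.
\end{verbatim}
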
Several resources provide insight into MueLu:
An overview is given on the MueLu website\footnote{\url{https://trilinos.github.io/muelu.html}}.
The MueLu User's Guide~\cite{BergerVergiat2023a} provides installation instructions and a reference for most of MueLu's configuration parameters.
The MueLu Tutorial~\cite{Mayr2023b} introduces beginners and experts to various topics in MueLu and shows how to solve or precondition different linear systems using MueLu.
Details on the compatibility of MueLu and its predecessor ML~\cite{Heroux2005a,Gee2006a} can be found in the MueLu User's Guide~\cite{BergerVergiat2023a}.
Besides its C++ API, MueLu offers a MATLAB interface, MueMex, which provides access to MueLu's aggregation and solver routines from MATLAB.
MueMex allows users to set up and solve arbitrarily many problems, with MueLu as a preconditioner, Belos as a solver, and Epetra or Tpetra for data structures.
MueLu requires Teuchos, Tpetra, Xpetra, Kokkos, and Kokkos Kernels; in addition, it depends strongly on the Amesos2, Ifpack2, and Zoltan2 libraries.
MueLu also supports interfaces to abstraction-layer packages such as Stratimikos and Thyra through the MueLu adapters library.
These interfaces are required to use MueLu with the Teko block preconditioning package.
For more details on using MueLu with Teko, see the MueLu examples directory in the Trilinos source code repository.
\subsection{Direct Linear Solvers: Amesos2, ShyLU}
Amesos2~\cite{Bavier2012a} provides a uniform interface to direct linear solvers.
These include third-party direct solvers such as CHOLMOD, MUMPS, Pardiso\_MKL, SuperLU, SuperLU\_MT, SuperLU\_Dist, and STRUMPACK.
These solvers may be used as the local solver for the domain decomposition preconditioner (Ifpack2 or FROSch)
or as the coarse solver for multigrid preconditioners (MueLu).
Amesos2 also provides the interface to two native Trilinos on-node sparse direct solvers, Basker and Tacho, which are implemented in ShyLU.
Basker~\cite{Basker2017} is a sparse direct solver based on LU factorization for problems that have the block triangular form (BTF) typically seen in circuit simulation applications. Basker uses this structure to factor and solve the diagonal blocks in parallel. The larger diagonal blocks can themselves be factored in parallel by discovering the available parallelism through a nested-dissection reordering. Basker focuses on exploiting thread parallelism on multi-core CPU architectures.
Amesos2 also has a templated implementation of the sequential KLU solver called KLU2, which also exploits the BTF structure.
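The Amesos2 interface is uniform across all of these solvers. As a minimal sketch, the following selects KLU2 by name and performs the three solution phases, assuming the Tpetra matrix \texttt{A}, solution \texttt{X}, and right-hand side \texttt{B} already exist; any other enabled solver can be selected the same way, and the availability of a third-party solver can be checked with \texttt{Amesos2::query}.
\begin{verbatim}
#include <Amesos2.hpp>

using MAT = Tpetra::CrsMatrix<double>;
using MV  = Tpetra::MultiVector<double>;

// Create a solver instance; the string selects the backend.
auto solver = Amesos2::create<MAT,MV>("KLU2", A, X, B);
solver->symbolicFactorization();  // ordering and symbolic analysis
solver->numericFactorization();   // numeric factorization
solver->solve();                  // triangular solves; the result lands in X
\end{verbatim}
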
Tacho is a sparse direct solver that exploits the supernodal block structures commonly found in sparse direct factorizations of matrices from mechanics applications. Tacho exploits this supernodal structure in both the factorization and triangular solve phases. It is based on Kokkos and hence portable to different node architectures (including NVIDIA and AMD GPUs). Originally, Tacho implemented a task-parallel Cholesky factorization of sparse symmetric positive definite (SPD) matrices~\cite{Tacho2018}. To improve portability, it has since been extended to compute the sparse factorization based on level-set scheduling. Moreover, its functionality has been extended to LDL$^T$ factorization of symmetric indefinite matrices and LU factorization of general matrices with a symmetric sparsity structure.
ShyLU also implements a distributed-memory linear solver based on the Schur complement method~\cite{ShyLUCore2014}. This is a hybrid direct/iterative solver in which each subdomain problem is solved with a direct solver in parallel, while the Schur complement is solved using an iterative approach. The preconditioner for the Schur complement solve is computed using a probing approach or a threshold-based dropping strategy. This solver was developed to address the requirements of circuit simulation applications, for which it has proven useful. It is also hybrid in the parallel-computing sense, as it uses MPI+threads. The algorithm is not aimed at GPU architectures and has therefore not been implemented with Kokkos.
\subsection{Physics Block Operators and Preconditioners: Teko}
\label{sec:teko}
The Teko library~\cite{Cyr2016a} provides interfaces for operators and preconditioners that are constructed from large physics sub-blocks.
Sub-blocks themselves can be Tpetra or Epetra matrices.
Generic block preconditioning strategies, such as block Jacobi and block Gauss-Seidel, as well as commonly used approximate inverse strategies for the Navier-Stokes equations, such as SIMPLEC, LSC, and PCD, are provided~\cite{CyrShadidEtAl2012_StabilizationScalableBlockPreconditioning}.
Block preconditioners for first order formulations of Maxwell's equations and Darcy flow are distributed with the Panzer package.
\subsection{Eigensolvers: Anasazi}
Anasazi~\cite{Baker2009a} is an extensible and interoperable framework for large-scale eigenvalue algorithms.
This framework provides a generic interface to a collection of algorithms that are built upon abstract interfaces
that facilitate code reuse and extensibility. Similar to Belos, Anasazi decouples the algorithms from the
implementation of the underlying linear algebra objects using traits mechanisms. Concrete linear algebra adapters
are provided for Tpetra and Thyra, while users can also implement their own interfaces to leverage any existing
investment in their description of matrices and vectors. Any libraries that understand Tpetra and Thyra matrices
and vectors, like Belos and Ifpack2, may also be used in conjunction with Anasazi. The suite of eigensolvers provided
by Anasazi includes locally-optimal block preconditioned conjugate gradient (LOBPCG), block Davidson, Riemannian Trust-Region
(RTR), and block Krylov-Schur. Recently, a family of trace minimization (TraceMin) methods and a
generalized Davidson method have been added to the suite of eigensolvers in Anasazi.
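As a minimal sketch of the usage pattern, the following computes a few eigenpairs of a Tpetra operator \texttt{A} with the block Krylov-Schur solver manager; the initial-vector block \texttt{ivec} is assumed to be a preallocated (e.g., randomized) Tpetra multivector, and the parameter values are illustrative.
\begin{verbatim}
#include <AnasaziBasicEigenproblem.hpp>
#include <AnasaziBlockKrylovSchurSolMgr.hpp>
#include <AnasaziTpetraAdapter.hpp>

using SC = double;
using MV = Tpetra::MultiVector<SC>;
using OP = Tpetra::Operator<SC>;

auto problem =
    Teuchos::rcp(new Anasazi::BasicEigenproblem<SC,MV,OP>(A, ivec));
problem->setNEV(5);           // request five eigenpairs
problem->setHermitian(true);  // declare symmetry if A is symmetric
problem->setProblem();

Teuchos::ParameterList params;
params.set("Which", "LM");    // target largest-magnitude eigenvalues
Anasazi::BlockKrylovSchurSolMgr<SC,MV,OP> solMgr(problem, params);
Anasazi::ReturnType ret = solMgr.solve();
auto sol = problem->getSolution();  // eigenvalues and eigenvectors
\end{verbatim}
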
\subsection{Unified Solver Interface: Stratimikos}
The Stratimikos package provides a unified interface to linear solvers and preconditioners in Trilinos (e.g., from Belos, Amesos2, Ifpack2, MueLu, and FROSch).
The system matrix as well as the right-hand-side and solution vectors are required to support the Thyra interface.
Wrappers for Tpetra linear algebra are provided by Thyra.
Solver and preconditioner parameters are specified via a Teuchos parameter list.
Additional solver and preconditioner factories can easily be added via adapters (see \texttt{packages/stratimikos/adapters} for examples).
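As a minimal sketch, the following selects a Belos solver with an Ifpack2 preconditioner entirely through a parameter list and solves a system given as Thyra objects \texttt{A}, \texttt{b}, and \texttt{x}; it assumes the corresponding adapters are enabled in the Trilinos build.
\begin{verbatim}
#include <Stratimikos_DefaultLinearSolverBuilder.hpp>

Stratimikos::DefaultLinearSolverBuilder builder;
auto params = Teuchos::parameterList();
params->set("Linear Solver Type", "Belos");
params->set("Preconditioner Type", "Ifpack2");
builder.setParameterList(params);

// Build a solve strategy and bind it to the (Thyra-wrapped) matrix A.
auto lowsFactory = builder.createLinearSolveStrategy("");
auto lows = Thyra::linearOpWithSolve(*lowsFactory, A);

// Solve A x = b; the returned status reports convergence information.
auto status = Thyra::solve<double>(*lows, Thyra::NOTRANS, *b, x.ptr());
\end{verbatim}
Because the entire solver stack is specified in the parameter list, an application can switch, for example, between a direct Amesos2 solve and a preconditioned Belos solve without any code changes.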