\section{Multigrid Background\label{sec:background}} \markboth{\textsc{MLD2P4 User's and Reference Guide}} {\textsc{\ref{sec:background} Multigrid Background}} Multigrid preconditioners, coupled with Krylov iterative solvers, are widely used in the parallel solution of large and sparse linear systems, because of their optimality in the solution of linear systems arising from the discretization of scalar elliptic Partial Differential Equations (PDEs) on regular grids. Optimality, also known as algorithmic scalability, is the property of having a computational cost per iteration that depends linearly on the problem size, and a convergence rate that is independent of the problem size. Multigrid preconditioners are based on a recursive application of a two-grid process consisting of smoother iterations and a coarse-space (or coarse-level) correction. The smoothers may be either basic iterative methods, such as the Jacobi and Gauss-Seidel ones, or more complex subspace-correction methods, such as the Schwarz ones. The coarse-space correction consists of solving, in an appropriately chosen coarse space, the residual equation associated with the approximate solution computed by the smoother, and of using the solution of this equation to correct the previous approximation. The transfer of information between the original (fine) space and the coarse one is performed by using suitable restriction and prolongation operators. The construction of the coarse space and the corresponding transfer operators is carried out by applying a so-called coarsening algorithm to the system matrix. Two main approaches can be used to perform coarsening: the geometric approach, which exploits the knowledge of some physical grid associated with the matrix and requires the user to define transfer operators from the fine to the coarse level and vice versa, and the algebraic approach, which builds the coarse-space correction and the associate transfer operators using only matrix information. The first approach may be difficult when the system comes from discretizations on complex geometries; furthermore, ad hoc one-level smoothers may be required to get an efficient interplay between fine and coarse levels, e.g., when matrices with highly varying coefficients are considered. The second approach performs a fully automatic coarsening and enforces the interplay between fine and coarse level by suitably choosing the coarse space and the coarse-to-fine interpolation (see, e.g., \cite{Briggs2000,Stuben_01,dd2_96} for details.) MLD2P4 uses a pure algebraic approach, based on the smoothed aggregation algorithm \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}, for building the sequence of coarse matrices and transfer operators, starting from the original one. A decoupled version of this algorithm is implemented, where the smoothed aggregation is applied locally to each submatrix \cite{TUMINARO_TONG}. A brief description of the AMG preconditioners implemented in MLD2P4 is given in Sections~\ref{sec:multilevel}-\ref{sec:smoothers}. For further details the reader is referred to \cite{para_04,aaecc_07,apnum_07,MLD2P4_TOMS}. We note that optimal multigrid preconditioners do not necessarily correspond to minimum execution times in a parallel setting. Indeed, to obtain effective parallel multigrid preconditioners, a tradeoff between the optimality and the cost of building and applying the smoothers and the coarse-space corrections must be achieved. Effective parallel preconditioners require algorithmic scalability to be coupled with implementation scalability, i.e., a computational cost per iteration which remains (almost) constant as the number of parallel processors increases. \subsection{AMG preconditioners\label{sec:multilevel}} In order to describe the AMG preconditioners available in MLD2P4, we consider a linear system \begin{equation} Ax=b, \label{eq:system} \end{equation} where $A=(a_{ij}) \in \mathbb{R}^{n \times n}$ is a nonsingular sparse matrix; for ease of presentation we assume $A$ has a symmetric sparsity pattern. Let us consider as finest index space the set of row (column) indices of $A$, i.e., $\Omega = \{1, 2, \ldots, n\}$. Any algebraic multilevel preconditioners implemented in MLD2P4 generates a hierarchy of index spaces and a corresponding hierarchy of matrices, \[ \Omega^1 \equiv \Omega \supset \Omega^2 \supset \ldots \supset \Omega^{nlev}, \quad A^1 \equiv A, A^2, \ldots, A^{nlev}, \] by using the information contained in $A$, without assuming any knowledge of the geometry of the problem from which $A$ originates. A vector space $\mathbb{R}^{n_{k}}$ is associated with $\Omega^k$, where $n_k$ is the size of $\Omega^k$. For all $k < nlev$, a restriction operator and a prolongation one are built, which connect two levels $k$ and $k+1$: \[ P^k \in \mathbb{R}^{n_k \times n_{k+1}}, \quad R^k \in \mathbb{R}^{n_{k+1}\times n_k}; \] the matrix $A^{k+1}$ is computed by using the previous operators according to the Galerkin approach, i.e., \[ A^{k+1}=R^kA^kP^k. \] In the current implementation of MLD2P4 we have $R^k=(P^k)^T$ A smoother with iteration matrix $M^k$ is set up at each level $k < nlev$, and a solver is set up at the coarsest level, so that they are ready for application (for example, setting up a solver based on the $LU$ factorization means computing and storing the $L$ and $U$ factors). The construction of the hierarchy of AMG components described so far corresponds to the so-called build phase of the preconditioner. \begin{figure}[t] \begin{center} \framebox{ \begin{minipage}{.85\textwidth} \begin{tabbing} \quad \=\quad \=\quad \=\quad \\[-3mm] procedure V-cycle$\left(k,A^k,b^k,u^k\right)$ \\[2mm] \>if $\left(k \ne nlev \right)$ then \\[1mm] \>\> $u^k = u^k + M^k \left(b^k - A^k u^k\right)$ \\[1mm] \>\> $b^{k+1} = R^{k+1}\left(b^k - A^k u^k\right)$ \\[1mm] \>\> $u^{k+1} =$ V-cycle$\left(k+1,A^{k+1},b^{k+1},0\right)$ \\[1mm] \>\> $u^k = u^k + P^{k+1} u^{k+1}$ \\[1mm] \>\> $u^k = u^k + M^k \left(b^k - A^k u^k\right)$ \\[1mm] \>else \\[1mm] \>\> $u^k = \left(A^k\right)^{-1} b^k$\\[1mm] \>endif \\[1mm] \>return $u^k$ \\[1mm] end \end{tabbing} \end{minipage} } \caption{Application phase of a V-cycle preconditioner.\label{fig:application_alg}} \end{center} \end{figure} The components produced in the build phase may be combined in several ways to obtain different multilevel preconditioners; this is done in the application phase, i.e., in the computation of a vector of type $w=B^{-1}v$, where $B$ denotes the preconditioner, usually within an iteration of a Krylov solver \cite{Saad_book}. An example of such a combination, known as V-cycle, is given in Figure~\ref{fig:application_alg}. In this case, a single iteration of the same smoother is used before and after the the recursive call to the V-cycle (i.e., in the pre-smoothing and post-smoothing phases); however, different choices can be performed. Other cycles can be defined; in MLD2P4, we implemented the standard V-cycle and W-cycle~\cite{Briggs2000}, and a version of the K-cycle described in~\cite{Notay2008}. \subsection{Smoothed Aggregation\label{sec:aggregation}} In order to define the prolongator $P^k$, used to compute the coarse-level matrix $A^{k+1}$, MLD2P4 uses the smoothed aggregation algorithm described in \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}. The basic idea of this algorithm is to build a coarse set of indices $\Omega^{k+1}$ by suitably grouping the indices of $\Omega^k$ into disjoint subsets (aggregates), and to define the coarse-to-fine space transfer operator $P^k$ by applying a suitable smoother to a simple piecewise constant prolongation operator, with the aim of improving the quality of the coarse-space correction. Three main steps can be identified in the smoothed aggregation procedure: \begin{enumerate} \item aggregation of the indices of $\Omega^k$ to obtain $\Omega^{k+1}$; \item construction of the prolongator $P^k$; \item application of $P^k$ and $R^k=(P^k)^T$ to build $A^{k+1}$. \end{enumerate} In order to perform the coarsening step, the smoothed aggregation algorithm described in~\cite{VANEK_MANDEL_BREZINA} is used. In this algorithm, each index $j \in \Omega^{k+1}$ corresponds to an aggregate $\Omega^k_j$ of $\Omega^k$, consisting of a suitably chosen index $i \in \Omega^k$ and indices that are (usually) contained in a strongly-coupled neighborood of $i$, i.e., \begin{equation} \label{eq:strongly_coup} \Omega^k_j \subset \mathcal{N}_i^k(\theta) = \left\{ r \in \Omega^k: |a_{ir}^k| > \theta \sqrt{|a_{ii}^ka_{rr}^k|} \right \} \cup \left\{ i \right\}, \end{equation} for a given threshold $\theta \in [0,1]$ (see~\cite{VANEK_MANDEL_BREZINA} for the details). Since this algorithm has a sequential nature, a decoupled version of it is applied, where each processor independently executes the algorithm on the set of indices assigned to it in the initial data distribution. This version is embarrassingly parallel, since it does not require any data communication. On the other hand, it may produce some nonuniform aggregates and is strongly dependent on the number of processors and on the initial partitioning of the matrix $A$. Nevertheless, this parallel algorithm has been chosen for MLD2P4, since it has been shown to produce good results in practice \cite{aaecc_07,apnum_07,TUMINARO_TONG}. The prolongator $P^k$ is built starting from a tentative prolongator $\bar{P}^k \in \mathbb{R}^{n_k \times n_{k+1}}$, defined as \begin{equation} \bar{P}^k =(\bar{p}_{ij}^k), \quad \bar{p}_{ij}^k = \left\{ \begin{array}{ll} 1 & \quad \mbox{if} \; i \in \Omega^k_j, \\ 0 & \quad \mbox{otherwise}, \end{array} \right. \label{eq:tent_prol} \end{equation} where $\Omega^k_j$ is the aggregate of $\Omega^k$ corresponding to the index $j \in \Omega^{k+1}$. $P^k$ is obtained by applying to $\bar{P}^k$ a smoother $S^k \in \mathbb{R}^{n_k \times n_k}$: $$ P^k = S^k \bar{P}^k, $$ in order to remove nonsmooth components from the range of the prolongator, and hence to improve the convergence properties of the multilevel method~\cite{BREZINA_VANEK,Stuben_01}. A simple choice for $S^k$ is the damped Jacobi smoother: \[ S^k = I - \omega^k (D^k)^{-1} A^k_F , \] where $D^k$ is the diagonal matrix with the same diagonal entries as $A^k$, $A^k_F = (\bar{a}_{ij}^k)$ is the filtered matrix defined as \begin{equation} \label{eq:filtered} \bar{a}_{ij}^k = \left \{ \begin{array}{ll} a_{ij}^k & \mbox{if } j \in \mathcal{N}_i^k(\theta), \\ 0 & \mbox{otherwise}, \end{array} \right. \; (j \ne i), \qquad \bar{a}_{ii}^k = a_{ii}^k - \sum_{j \ne i} (a_{ij}^k - \bar{a}_{ij}^k), \end{equation} and $\omega^k$ is an approximation of $4/(3\rho^k)$, where $\rho^k$ is the spectral radius of $(D^k)^{-1}A^k_F$ \cite{BREZINA_VANEK}. In MLD2P4 this approximation is obtained by using $\| A^k_F \|_\infty$ as an estimate of $\rho^k$. Note that for systems coming from uniformly elliptic problems, filtering the matrix $A^k$ has little or no effect, and $A^k$ can be used instead of $A^k_F$. The latter choice is the default in MLD2P4. \subsection{Smoothers and coarsest-level solvers\label{sec:smoothers}} The smoothers implemented in MLD2P4 include the Jacobi and block-Jacobi methods, a hybrid version of the forward and backward Gauss-Seidel methods, and the additive Schwarz (AS) ones (see, e.g., \cite{Saad_book,dd2_96}). The hybrid Gauss-Seidel version is considered because the original Gauss-Seidel method is inherently sequential. At each iteration of the hybrid version, each parallel process uses the most recent values of its own local variables and the values of the non-local variables computed at the previous iteration, obtained by exchanging data with other processes before the beginning of the current iteration. In the AS methods, the index space $\Omega^k$ is divided into $m_k$ subsets $\Omega^k_i$ of size $n_{k,i}$, possibly overlapping. For each $i$ we consider the restriction operator $R_i^k \in \mathbb{R}^{n_{k,i} \times n_k}$ that maps a vector $x^k$ to the vector $x_i^k$ made of the components of $x^k$ with indices in $\Omega^k_i$, and the prolongation operator $P^k_i = (R_i^k)^T$. These operators are then used to build $A_i^k=R_i^kA^kP_i^k$, which is the restriction of $A^k$ to the index space $\Omega^k_i$. The classical AS preconditioner $M^k_{AS}$ is defined as \[ ( M^k_{AS} )^{-1} = \sum_{i=1}^{m_k} P_i^k (A_i^k)^{-1} R_i^{k}, \] where $A_i^k$ is supposed to be nonsingular. We observe that an approximate inverse of $A_i^k$ is usually considered instead of $(A_i^k)^{-1}$. The setup of $M^k_{AS}$ during the multilevel build phase involves \begin{itemize} \item the definition of the index subspaces $\Omega_i^k$ and of the corresponding operators $R_i^k$ (and $P_i^k$); \item the computation of the submatrices $A_i^k$; \item the computation of their inverses (usually approximated through some form of incomplete factorization). \end{itemize} The computation of $z^k=M^k_{AS}w^k$, with $w^k \in \mathbb{R}^{n_k}$, during the multilevel application phase, requires \begin{itemize} \item the restriction of $w^k$ to the subspaces $\mathbb{R}^{n_{k,i}}$, i.e.\ $w_i^k = R_i^{k} w^k$; \item the computation of the vectors $z_i^k=(A_i^k)^{-1} w_i^k$; \item the prolongation and the sum of the previous vectors, i.e.\ $z^k = \sum_{i=1}^{m_k} P_i^k z_i^k$. \end{itemize} Variants of the classical AS method, which use modifications of the restriction and prolongation operators, are also implemented in MLD2P4. Among them, the Restricted AS (RAS) preconditioner usually outperforms the classical AS preconditioner in terms of convergence rate and of computation and communication time on parallel distributed-memory computers, and is therefore the most widely used among the AS preconditioners~\cite{CAI_SARKIS}. Direct solvers based on sparse LU factorizations, implemented in the third-party libraries reported in Section~\ref{sec:third-party}, can be applied as coarsest-level solvers by MLD2P4. Native inexact solvers based on incomplete LU factorizations, as well as Jacobi, hybrid (forward) Gauss-Seidel, and block Jacobi preconditioners are also available. Direct solvers usually lead to more effective preconditioners in terms of algorithmic scalability; however, this does not guarantee parallel efficiency. %%% Local Variables: %%% mode: latex %%% TeX-master: "userguide" %%% End: