\section{Multi-level Domain Decomposition Background\label{sec:background}} \markboth{\textsc{MLD2P4 User's and Reference Guide}} {\textsc{\ref{sec:background} Multi-level Domain Decomposition Background}} \emph{Domain Decomposition} (DD) preconditioners, coupled with Krylov iterative solvers, are widely used in the parallel solution of large and sparse linear systems. These preconditioners are based on the divide and conquer technique: the matrix to be preconditioned is divided into submatrices, a ``local'' linear system involving each submatrix is (approximately) solved, and the local solutions are used to build a preconditioner for the whole original matrix. This process often corresponds to dividing a physical domain associated to the original matrix into subdomains, e.g. in a PDE discretization, to (approximately) solving the subproblems corresponding to the subdomains and to building an approximate solution of the original problem from the local solutions \cite{Cai_Widlund_92,dd1_94,dd2_96}. \emph{Additive Schwarz} preconditioners are DD preconditioners using overlapping submatrices, i.e.\ with some common rows, to couple the local information related to the submatrices (see, e.g., \cite{dd2_96}). The main motivation for choosing Additive Schwarz preconditioners is their intrinsic parallelism. A drawback of these preconditioners is that the number of iterations of the preconditioned solvers generally grows with the number of submatrices. This may be a serious limitation on parallel computers, since the number of submatrices usually matches the number of available processors. Optimal convergence rates, i.e.\ iteration numbers independent of the number of submatrices, can be obtained by correcting the preconditioner through a suitable approximation of the original linear system in a coarse space, which globally couples the information related to the single submatrices. \emph{Two-level Schwarz} preconditioners are obtained by combining basic (one-level) Sch\-warz preconditioners with a coarse-level correction. In this context, the one-level preconditioner is often called `smoother'. Different two-level preconditioners are obtained by varying the choice of the smoother and of the coarse-level correction, and the way they are combined \cite{dd2_96}. The same reasoning can be applied starting from the coarse-level system, i.e.\ a coarse-space correction can be built from this system, thus obtaining \emph{multi-level} preconditioners. It is worth noting that optimal preconditioners do not necessarily correspond to minimum execution times. Indeed, to obtain effective multi-level preconditioners a tradeoff between optimality of convergence and the cost of building and applying the coarse-space corrections must be achieved. The choice of the number of levels, i.e.\ of the coarse-space corrections, also affects the effectiveness of the preconditioners. One more goal is to get convergence rates as less sensitive as possible to variations in the matrix coefficients. Two main approaches can be used to build coarse-space corrections. The geometric approach applies coarsening strategies based on the knowledge of some physical grid associated to the matrix and requires the user to define grid transfer operators from the fine to the coarse levels and vice versa. This may result difficult for complex geometries; furthermore, suitable one-level preconditioners may be required to get efficient interplay between fine and coarse levels, e.g.\ when matrices with highly varying coefficients are considered. The algebraic approach builds coarse-space corrections using only matrix information. It performs a fully automatic coarsening and enforces the interplay between the fine and coarse levels by suitably choosing the coarse space and the coarse-to-fine interpolation \cite{StubenGMD69_99}. MLD2P4 uses a pure algebraic approach for building the sequence of coarse matrices starting from the original matrix. The algebraic approach is based on the \emph{smoothed aggregation} algorithm \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}. A decoupled version of this algorithm is implemented, where the smoothed aggregation is applied locally to each submatrix \cite{TUMINARO_TONG}. In the next two subsections we provide a brief description of the multi-level Schwarz preconditioners and of the smoothed aggregation technique as implemented in MLD2P4. For further details the user is referred to \cite{para_04,aaecc_07,apnum_07,dd2_96}. \subsection{Multi-level Schwarz Preconditioners\label{sec:multilevel}} The Multilevel preconditioners implemented in MLD2P4 are obtained by combining AS preconditioners with coarse-space corrections; therefore we first provide a sketch of the AS preconditioners. Given the linear system \Ref{system1}, where $A=(a_{ij}) \in \Re^{n \times n}$ is a nonsingular sparse matrix with a symmetric nonzero pattern, let $G=(W,E)$ be the adjacency graph of $A$, where $W=\{1, 2, \ldots, n\}$ and $E=\{(i,j) : a_{ij} \neq 0\}$ are the vertex set and the edge set of $G$, respectively. Two vertices are called adjacent if there is an edge connecting them. For any integer $\delta > 0$, a $\delta$-overlap partition of $W$ can be defined recursively as follows. Given a 0-overlap (or non-overlapping) partition of $W$, i.e.\ a set of $m$ disjoint nonempty sets $W_i^0 \subset W$ such that $\cup_{i=1}^m W_i^0 = W$, a $\delta$-overlap partition of $W$ is obtained by considering the sets $W_i^\delta \supset W_i^{\delta-1}$ obtained by including the vertices that are adjacent to any vertex in $W_i^{\delta-1}$. Let $n_i^\delta$ be the size of $W_i^\delta$ and $R_i^{\delta} \in \Re^{n_i^\delta \times n}$ the restriction operator that maps a vector $v \in \Re^n$ onto the vector $v_i^{\delta} \in \Re^{n_i^\delta}$ containing the components of $v$ corresponding to the vertices in $W_i^\delta$. The transpose of $R_i^{\delta}$ is a prolongation operator from $\Re^{n_i^\delta}$ to $\Re^n$. The matrix $A_i^\delta=R_i^\delta A (R_i^\delta)^T \in \Re^{n_i^\delta \times n_i^\delta}$ can be considered as a restriction of $A$ corresponding to the set $W_i^{\delta}$. The \emph{classical one-level AS} preconditioner is defined by \[ M_{AS}^{-1}= \sum_{i=1}^m (R_i^{\delta})^T (A_i^\delta)^{-1} R_i^{\delta}, \] where $A_i^\delta$ is assumed to be nonsingular. Its application to a vector $v \in \Re^n$ within a Krylov solver requires the following three steps: \begin{enumerate} \item restriction of $v$ as $v_i = R_i^{\delta} v$, $i=1,\ldots,m$; \item solution of the linear systems $A_i^\delta w_i = v_i$, $i=1,\ldots,m$; \item prolongation and sum of the $w_i$'s, i.e. $w = \sum_{i=1}^m (R_i^{\delta})^T w_i$. \end{enumerate} Note that the linear systems at step 2 are usually solved approximately, e.g.\ using incomplete LU factorizations such as ILU($p$), MILU($p$) and ILU($p,t$) \cite[Chapter 10]{Saad_book}. A variant of the classical AS preconditioner that outperforms it in terms of convergence rate and of computation and communication time on parallel distributed-memory computers is the so-called \emph{Restricted AS (RAS)} preconditioner~\cite{CAI_SARKIS,EFSTATHIOU}. It is obtained by zeroing the components of $w_i$ corresponding to the overlapping vertices when applying the prolongation. Therefore, RAS differs from classical AS by the prolongation operators, which are substituted by $(\tilde{R}_i^0)^T \in \Re^{n_i^\delta \times n}$, where $\tilde{R}_i^0$ is obtained by zeroing the rows of $R_i^\delta$ corresponding to the vertices in $W_i^\delta \backslash W_i^0$: \[ M_{RAS}^{-1}= \sum_{i=1}^m (\tilde{R}_i^0)^T (A_i^\delta)^{-1} R_i^{\delta}. \] Analogously, the AS variant called \emph{AS with Harmonic extension (ASH)} is defined by \[ M_{ASH}^{-1}= \sum_{i=1}^m (R_i^{\delta})^T (A_i^\delta)^{-1} \tilde{R}_i^0. \] We note that for $\delta=0$ the three variants of the AS preconditioner are all equal to the block-Jacobi preconditioner. As already observed, the convergence rate of the one-level Schwarz preconditioned iterative solvers deteriorates as the number $m$ of partitions of $W$ increases \cite{dd1_94,dd2_96}. To reduce the dependency of the number of iterations on the degree of parallelism we may introduce a global coupling among the overlapping partitions by defining a coarse-space approximation $A_C$ of the matrix $A$. In a pure algebraic setting, $A_C$ is usually built with a Galerkin approach. Given a set $W_C$ of \emph{coarse vertices}, with size $n_C$, and a suitable restriction operator $R_C \in \Re^{n_C \times n}$, $A_C$ is defined as \[ A_C=R_C A R_C^T \] and the coarse-level correction matrix to be combined with a generic one-level AS preconditioner $M_{1L}$ is obtained as \[ M_{C}^{-1}= R_C^T A_C^{-1} R_C, \] where $A_C$ is assumed to be nonsingular. The application of $M_{C}^{-1}$ to a vector $v$ corresponds to a restriction, a solution and a prolongation step; the solution step, involving the matrix $A_C$, may be carried out also approximately. The combination of $M_{C}$ and $M_{1L}$ may be performed in either an additive or a multiplicative framework. In the former case, the \emph{two-level additive} Schwarz preconditioner is obtained: \[ M_{2LA}^{-1} = M_{C}^{-1} + M_{1L}^{-1}. \] Applying $M_{2L-A}^{-1}$ to a vector $v$ within a Krylov solver corresponds to applying $M_{C}^{-1}$ and $M_{1L}^{-1}$ to $v$ independently and then summing up the results. In the multiplicative case, the combination can be performed by first applying the smoother $M_{1L}^{-1}$ and then the coarse-level correction operator $M_{C}^{-1}$: \[ \begin{array}{l} w = M_{1L}^{-1} v, \\ z = w + M_{C}^{-1} (v-Aw); \end{array} \] this corresponds to the following \emph{two-level hybrid pre-smoothed} Schwarz preconditioner: \[ M_{2LH-PRE}^{-1} = M_{C}^{-1} + \left( I - M_{C}^{-1}A \right) M_{1L}^{-1}. \] On the other hand, by applying the smoother after the coarse-level correction, i.e.\ by computing \[ \begin{array}{l} w = M_{C}^{-1} v , \\ z = w + M_{1L}^{-1} (v-Aw) , \end{array} \] the \emph{two-level hybrid post-smoothed} Schwarz preconditioner is obtained: \[ M_{2LH-POST}^{-1} = M_{1L}^{-1} + \left( I - M_{1L}^{-1}A \right) M_{C}^{-1}. \] One more variant of two-level hybrid preconditioner is obtained by applying the smoother before and after the coarse-level correction. In this case, the preconditioner is symmetric if $A$, $M_{1L}$ and $M_{C}$ are symmetric. As previously noted, on parallel computers the number of submatrices usually matches the number of available processors. When the size of the system to be preconditioned is very large, the use of many processors, i.e.\ of many small submatrices, often leads to a large coarse-level system, whose solution may be computationally expensive. On the other hand, the use of few processors often leads to local sumatrices that are too expensive to be processed on single processors, because of memory and/or computing requirements. Therefore, it seems natural to use a recursive approach, in which the coarse-level correction is re-applied starting from the current coarse-level system. The corresponding preconditioners, called \emph{multi-level} preconditioners, can significantly reduce the computational cost of preconditioning with respect to the two-level case (see \cite[Chapter 3]{dd2_96}). Additive and hybrid multilevel preconditioners are obtained as direct extensions of the two-level counterparts. For a detailed descrition of them, the reader is referred to \cite[Chapter 3]{dd2_96}. The algorithm for the application of a multi-level hybrid post-smoothed preconditioner $M$ to a vector $v$, i.e.\ for the computation of $w=M^{-1}v$, is reported, for example, in Figure~\ref{fig:mlhpost_alg}. Here the number of levels is denoted by $nlev$ and the levels are numbered in increasing order starting from the finest one, i.e.\ the finest level is level 1; the coarse matrix and the corresponding basic preconditioner at each level $l$ are denoted by $A_l$ and $M_l$, respectively, with $A_1=A$. % \begin{figure}[t] \begin{center} \framebox{ \begin{minipage}{.85\textwidth} {\small \begin{tabbing} \quad \=\quad \=\quad \=\quad \\[-1mm] % %! assign the finest matrix\\ %$A_1 \leftarrow A$;\\[1mm] %! define the number of levels $nlev$ \\[1mm] %! define $nlev-1$ prolongators\\ %$R_l^T, l=2, \ldots, nlev$;\\[1mm] %! define $nlev-1$ coarser matrices\\ %$A_l \leftarrow R_lA_{l-1}R_l^T, \; l=2, \ldots, nlev$;\\[1mm] %! define the $nlev-1$ basic Schwarz preconditioners\\ %$M_l$, basic preconditioner for $A_l \; l=1, \ldots, nlev-1$;\\[1mm] %$! assign a vector $v$\\ % $v_1 = v$; \\[2mm] \textbf{for $l=2, nlev$ do}\\[1mm] \> ! transfer $v_{l-1}$ to the next coarser level\\ \> $v_l = R_lv_{l-1}$ \\[1mm] \textbf{endfor} \\[2mm] ! apply the coarsest-level correction\\[1mm] $y_{nlev} = A_{nlev}^{-1} v_{nlev}$\\[2mm] \textbf{for $l=nlev -1 , 1, -1$ do}\\[1mm] \> ! transfer $y_{l+1}$ to the next finer level\\ \> $y_l = R_{l+1}^T y_{l+1}$;\\[1mm] \> ! compute the residual at the current level\\ \> $r_l = v_l-A_l^{-1} y_l$;\\[1mm] \> ! apply the basic Schwarz preconditioner to the residual\\ \> $r_l = M_l^{-1} r_l$\\[1mm] \> ! update $y_l$\\ \> $y_l = y_l+r_l$\\ \textbf{endfor} \\[1mm] $w = y_1$; \end{tabbing} } \end{minipage} } \caption{Application of the multi-level hybrid post-smoothed preconditioner.\label{fig:mlhpost_alg}} \end{center} \end{figure} % \subsection{Smoothed Aggregation\label{sec:aggregation}} In order to define the restriction operator $R_C$, which is used to compute the coarse-level matrix $A_C$, MLD2P4 uses the \emph{smoothed aggregation} algorithm described in \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}. The basic idea of this algorithm is to build a coarse set of vertices $W_C$ by suitably grouping the vertices of $W$ into disjoint subsets (aggregates), and to define the coarse-to-fine space transfer operator $R_C^T$ by applying a suitable smoother to a simple piecewise constant prolongation operator, to improve the quality of the coarse-space correction. Three main steps can be identified in the smoothed aggregation procedure: \begin{enumerate} \item coarsening of the vertex set $W$, to obtain $W_C$; \item construction of the prolongator $R_C^T$; \item application of $R_C$ and $R_C^T$ to build $A_C$. \end{enumerate} %\textbf{NOTA: Controllare cosa fa trilinos dopo il primo passo.} To perform the coarsening step, we have implemented the aggregation algorithm sketched in \cite{apnum_07}. According to \cite{VANEK_MANDEL_BREZINA}, a modification of this algorithm has been actually considered, in which each aggregate $N_r$ is made of vertices of $W$ that are \emph{strongly coupled} to a certain root vertex $r \in W$, i.e.\ \[ N_r = \left\{s \in W: |a_{rs}| > \theta \sqrt{|a_{rr}a_{ss}|} \right\} \cup \left\{ r \right\} , \] for a given $\theta \in [0,1]$. Since this algorithm has a sequential nature, a \emph{decoupled} version of it has been chosen, where each processor $i$ independently applies the algorithm to the set of vertices $W_i^0$ assigned to it in the initial data distribution. This version is embarrassingly parallel, since it does not require any data communication. On the other hand, it may produce non-uniform aggregates near boundary vertices, i.e.\ near vertices adjacent to vertices in other processors, and is strongly dependent on the number of processors and on the initial partitioning of the matrix $A$. Nevertheless, this algorithm has been chosen for the implementation in MLD2P4, since it has been shown to produce good results in practice \cite{aaecc_07,apnum_07,TUMINARO_TONG}. The prolongator $P_C=R_C^T$ is built starting from a \emph{tentative prolongator} $P \in \Re^{n \times n_C}$, defined as \begin{equation} P=(p_{ij}), \quad p_{ij}= \left\{ \begin{array}{ll} 1 & \quad \mbox{if} \; i \in V^j_C \\ 0 & \quad \mbox{otherwise} \end{array} \right. . \label{eq:tent_prol} \end{equation} $P_C$ is obtained by applying to $P$ a smoother $S \in \Re^{n \times n}$: \begin{equation} P_C = S P, \label{eq:smoothed_prol} \end{equation} in order to remove oscillatory components from the range of the prolongator and hence to improve the convergence properties of the multi-level Schwarz method \cite{BREZINA_VANEK,StubenGMD69_99}. A simple choice for $S$ is the damped Jacobi smoother: \begin{equation} S = I - \omega D^{-1} A , \label{eq:jac_smoother} \end{equation} where the value of $\omega$ can be chosen using some estimate of the spectral radius of $D^{-1}A$ \cite{BREZINA_VANEK}. % %\textbf{NOTA: filtering di $A$ nello smoothing, da implementare?} % %%% Local Variables: %%% mode: latex %%% TeX-master: "userguide" %%% End: