amg4psblas/docs/pdf/background.tex

\section{Multi-level Domain Decomposition Background\label{sec:background}}

\emph{Domain Decomposition} (DD) preconditioners, coupled with Krylov iterative
solvers, are widely used in the parallel solution of large and sparse linear systems.
These preconditioners are based on the divide and conquer technique: the matrix
to be preconditioned is divided into submatrices, a ``local linear system''
involving each submatrix is (approximately) solved, and the local solutions are used
to build a preconditioner for the whole original matrix. This process
often corresponds to dividing a physical domain associated to the original matrix
into subdomains, e.g. in a PDE discretization, to (approximately) solving the
subproblems corresponding to the subdomains and to building an approximate
solution of the original problem from the local solutions 
\cite{Cai_Widlund_92,dd1_94,dd2_96}. 

\emph{Additive Schwarz} preconditioners are DD preconditioners using overlapping
submatrices, i.e.\ with some common rows, to couple the local information
related to the submatrices (see, e.g., \cite{dd2_96}).
The main motivations for choosing Additive Schwarz preconditioners are their
intrinsic parallelism. A drawback of these
preconditioners is that the number of iterations of the preconditioned solvers
generally grows with the number of submatrices. This may be a serious limitation
on parallel computers, since the number of submatrices usually matches the number
of available processors. Optimal convergence rates, i.e.\ iteration numbers
independent of the number of submatrices, can be obtained by correcting the
preconditioner through a suitable approximation of the original linear system
in a coarse space, which globally couples the information related to the single
submatrices. 

\emph{Two-level Schwarz} preconditioners are obtained
by combining basic (one-level) Schwarz preconditioners with coarse-level
corrections. In this context, the one-level preconditioner is often
called smoother. Different two-level preconditioners are obtained by varying the
choice of the smoother, of the coarse-level correction and the
way they are combined \cite{dd2_96}. The same reasoning can be applied starting
from the coarse-level system, i.e.\ a coarse-space correction can be built
from this system, thus obtaining \emph{multi-level} preconditioners.

It is worth noting that optimal preconditioners do not necessarily correspond
to minimum execution times. Indeed, to obtain effective multilevel preconditioners
a tradeoff between optimality of convergence and the cost of building and applying
the coarse-space corrections must be achieved. The choice of the number of levels,
i.e.\ of the coarse-space corrections, also affects the effectiveness of the
preconditioners. One more goal is to get convergence rates as less sensitive
as possible to variations in the matrix coefficients.

Two main approaches can be used to build coarse-space corrections. The geometric approach
applies coarsening strategies based on the knowledge of some physical grid associated
to the matrix and requires the user to define grid transfer operators from the fine
to the coarse levels and vice versa. This may result difficult for complex geometries;
furthermore, suitable one-level preconditioners may be required to get efficient
interplay between fine and coarse levels, e.g.\ when matrices with highly varying coefficients
are considered. The algebraic approach builds coarse-space corrections using only matrix
information. It performs a fully automatic coarsening and enforces the interplay between
the fine and coarse levels by suitably choosing the coarse space and the coarse-to-fine
interpolation \cite{StubenGMD69_99}.

MLD2P4 uses a pure algebraic approach for building the sequence of coarse matrices
starting from the original matrix. The algebraic approach is based on the \emph{smoothed 
aggregation} algorithm \cite{Brezina_Vanek_,Vanek_Mandel_Brezina_}. A decoupled version
of this algorithm is implemented, where the smoothed aggregation is applied locally
to each submatrix \cite{Tuminaro_Tong_00}. In the next two subsections we provide
a brief description of the multi-level Schwarz preconditioners and on the smoothed
aggregation technique as implemented in MLD2P4. For further details the user
is referred to \cite{para_04,apnum_07,aaecc_07,dd2_96}.


\subsection{Multi-level Schwarz Preconditioners\label{sec:multilevel}}

The Multilevel preconditioners implemented in MLD2P4 are obtained by combining
AS preconditioners with coarse-space corrections; therefore
we first provide a sketch of the AS preconditioners.

Given the linear system \Ref{system1},
where $A=(a_{ij}) \in \Re^{n \times n}$ is a
nonsingular sparse matrix with a symmetric non-zero pattern,
let $G=(W,E)$ be the adjacency graph of $A$, where $W=\{1, 2, \ldots, n\}$
and $E=\{(i,j) : a_{ij} \neq 0\}$ are the vertex set and the edge set of $G$,
respectively. Two vertices are called adjacent if there is an edge connecting
them. For any integer $\delta > 0$, a $\delta$-overlap
partition of $W$ can be defined recursively as follows.
Given a 0-overlap (or non-overlapping) partition of $W$,
i.e.\ a set of $m$ disjoint nonempty sets $W_i^0 \subset W$ such that
$\cup_{i=1}^m W_i^0 = W$, a $\delta$-overlap
partition of $W$ is obtained by considering the sets
$W_i^\delta \supset W_i^{\delta-1}$, obtained by including the vertices that
are adjacent to any vertex in $W_i^{\delta-1}$.

Let $n_i^\delta$ be the size of $W_i^\delta$ and $R_i^{\delta} \in 
\Re^{n_i^\delta \times n}$ the restriction operator that maps
a vector $v \in \Re^n$ onto the vector $v_i^{\delta} \in \Re^{n_i^\delta}$
containing the components of $v$ corresponding to the vertices in
$W_i^\delta$. The transpose of $R_i^{\delta}$ is a
prolongation operator from $\Re^{n_i^\delta}$ to $\Re^n$.
The matrix $A_i^\delta=R_i^\delta A (R_i^\delta)^T \in
\Re^{n_i^\delta \times n_i^\delta}$ can be considered
as a restriction of $A$ corresponding to the set $W_i^{\delta}$.

The \emph{classical one-level AS} preconditioner is defined by
\[
M_{AS}^{-1}= \sum_{i=1}^m (R_i^{\delta})^T 
(A_i^\delta)^{-1} R_i^{\delta},
\]
where $A_i^\delta$ is assumed to be nonsingular. Its application
to a vector $v \in \Re^n$ within a Krylov solver requires the following
three steps:
\begin{enumerate}
	\item restriction of $v$ as $v_i = R_i^{\delta} v$, $i=1,\ldots,m$;
	\item (approximate) solution of the linear systems $A_i^\delta w_i = v_i$,
	      $i=1,\ldots,m$;
	\item prolongation and sum of the $w_i$'s, i.e. $w = \sum_{i=1}^m (R_i^{\delta})^T w_i$.
\end{enumerate}
A variant of the classical AS preconditioner that outperforms it
in terms of both convergence rate and of computation and communication
time on parallel distributed-memory computers is the so-called \emph{Restricted AS
(RAS)} preconditioner~\cite{Cai_Sarkis,Efstathiou_Gander}. It
is obtained by zeroing the components of $w_i$ corresponding to the
overlapping vertices when applying the prolongation. Therefore,
RAS differs from classical AS by the prolongation operators,
which are substituted by $(\tilde{R}_i^0)^T \in \Re^{n_i^\delta \times n}$,
where $\tilde{R}_i^0$ is obtained by zeroing the rows of $R_i^\delta$
corresponding to the vertices in $W_i^\delta \backslash W_i^0$:
\[
M_{RAS}^{-1}= \sum_{i=1}^m (\tilde{R}_i^0)^T 
(A_i^\delta)^{-1} R_i^{\delta}.
\]
Analogously, the AS variant called \emph{AS with Harmonic extension (ASH)}
is defined by
\[ M_{ASH}^{-1}= \sum_{i=1}^m (R_i^{\delta})^T 
(A_i^\delta)^{-1} \tilde{R}_i^0.
\]
We note that for $\delta=0$ the three variants of the AS preconditioner are
all equal to the block-Jacobi preconditioner.

As already observed, the convergence rate of the one-level Schwarz
preconditioned iterative solvers deteriorates as the number $m$ of partitions
of $W$ increases \cite{dd1_94,dd2_96}. To reduce the dependency
of the number of iterations on the degree of parallelism we may
introduce a global coupling among the overlapping partitions by defining 
a coarse-space approximation $A_C$ of the matrix $A$. 
In a pure algebraic setting, $A_C$ is usually built with
a Galerkin approach. Given a set $W_C$ of \emph{coarse vertices},
with size $n_C$, and a suitable restriction operator
$R_C \in \Re^{n_C \times n}$, $A_C$ is defined as
\[
A_C=R_C A R_C^T
\]
and the coarse-level correction matrix to be combined with a generic
one-level AS preconditioner $M_{1L}$ is obtained as
\[
M_{C}^{-1}= R_C^T A_C^{-1} R_C,
\]
where $A_C$ is assumed to be nonsingular. The application of $M_{C}^{-1}$
to a vector $v$ corresponds to a restriction, a solution and
a prolongation step; the solution step, involving the matrix $A_C$,
may be carried out also approximately.

The combination of $M_{C}$ and $M_{1L}$ may be
performed in either an additive or a multiplicative framework.
In the former case, the \emph{two-level additive} Schwarz preconditioner
is obtained:
\[
M_{2LA}^{-1} = M_{C}^{-1} + M_{1L}^{-1}. 
\]
Applying $M_{2L-A}^{-1}$ to a vector $v$ within a Krylov solver
corresponds to applying $M_{C}^{-1}$
and $M_{1L}^{-1}$ to $v$ independently and then summing up
the results.

In the multiplicative case, the combination can be
performed by first applying the smoother $M_{1L}^{-1}$ and then
the coarse-level correction operator $M_{C}^{-1}$:
\[
\begin{array}{l}
w = M_{1L}^{-1} v, \\
z = w + M_{C}^{-1} (v-Aw);
\end{array}
\]
this corresponds to the following \emph{two-level hybrid pre-smoothed}
Schwarz preconditioner:
\[
M_{2LH-PRE}^{-1} = M_{C}^{-1} + \left( I - M_{C}^{-1}A \right) M_{1L}^{-1}. 
\]
On the other hand, by applying the smoother after the coarse-level correction,
i.e.\ by computing
\[
\begin{array}{l}
w = M_{C}^{-1} v , \\
z = w + M_{1L}^{-1} (v-Aw) , 
\end{array}
\]
the \emph{two-level hybrid post-smoothed}
Schwarz preconditioner is obtained:
\[
M_{2LH-POST}^{-1} = M_{1L}^{-1} + \left( I - M_{1L}^{-1}A \right) M_{C}^{-1}. 
\]
One more variant of two-level hybrid preconditioner is obtained by applying
the smoother before and after the coarse-level correction. In this case, the
preconditioner is symmetric if $A$, $M_{1L}$ and $M_{C}$ are symmetric.

As previously noted, on parallel computers the number of sumatrices usually matches
the number of available processors. When the size of the system to be preconditioned
is very large, the use of many processors, i.e.\ of many small submatrices, often
leads to a large coarse-level system, whose solution may be computationally expensive.
On the other hand, the use of few processors often leads to local sumatrices that
are too expensive to be processed on single processors, because of memory and/or
computing requirements. Therefore, it seems natural to use a recursive approach,
in which the coarse-level correction is re-applied starting from the current
coarse-level system. The corresponding preconditioners are called \emph{multi-level}.
One more reason for the multi-level approach is that it may significantly
reduce the computational cost of preconditioning with respect to the two-level case
(see \cite[Chapter 3]{dd2_96}). Additive and hybrid multilevel preconditioners
are obtained as direct extensions of the two-level counterparts. Other combinations
of the smoothers and coarse-level corrections are possible, leading to variants
of the previous algorithms. For a detailed descrition of them, the reader is
referred to \cite[Chapter 3]{dd2_96}.
\textbf{DESCRIZIONE ALGORITMICA, a titolo di esempio,
di un precondizionatore multilevel, ad esempio quello ibrido con pre- o post-smoothing,
sul tipo della descrizione in figura 1 della guida di Trilinos ML 4.0.}


\subsection{Smoothed Aggregation\label{sec:aggregation}}

To define the restriction operator $R_C$, which is used to compute
the coarse-level matrix $A_C$, MLD2P4 uses the \emph{smoothed aggregation}
algorithm described in \cite{Brezina_Vanek_,Vanek_Mandel_Brezina_}.
The basic idea of this algorithm is to build a coarse set of vertices
$W_C$ by suitably grouping the vertices of $W$ into disjoint subsets
(aggregates), and to define the coarse-to-fine space transfer operator $R_C^T$ by
applying a suitable smoother to a simple piecewise constant
prolongation operator, to improve the quality of the coarse-space correction.

Three main steps can be identified in the smoothed aggregation procedure:
\begin{enumerate}
	\item coarsening of the vertex set $W$, to obtain $W_C$;
	\item construction of the prolongator $R_C^T$;
	\item application of $R_C$ and $R_C^T$ to build $A_C$.
\end{enumerate}
%\textbf{NOTA: Controllare cosa fa trilinos dopo il primo passo.}
 
To perform the coarsening step, we have implemented the aggregation algorithm sketched
in \cite{apnum_07}. According to \cite{brezina_vanek}, a modification of this algorithm
has been actually considered,
in which each aggregate $N_r$ is made of vertices of $W$ that are \emph{strongly coupled}
to a certain root vertex $r \in W$, i.e.\
\[  N_r = \left\{s \in W: |a_{rs}| > \theta \sqrt{|a_{rr}a_{ss}|} \right\}
    \cup \left\{ r \right\} ,
\]
for a given $\theta \in [0,1]$. \textbf{L'ALGORITMO USA IL MAGGIORE STRETTO? ALTRIMENTI
CON THETA=0 AGGREGHIAMO ANCHE I VERTICI CON COEFFICIENTE CORRISPONDENTE NULLO}
Since the previous algorithm has a sequential nature, a \emph{decoupled} version of
it has been chosen, where each processor $i$ independently applies the algorithm to
the set of vertices $W_i^0$ assigned to it in the initial data distribution. This
version is embarrassingly parallel, since it does not require any data communication.
On the other hand, it may produce non-uniform aggregates near boundary vertices,
i.e.\ near vertices adjacent to vertices in other processors, and is strongly
dependent on the number of processors and on the initial partitioning of the matrix $A$.
Nevertheless, this algorithm has been chosen for the implementation in MLD2P4,
since it has been shown to produce good results in practice
\cite{Tuminaro_Tong_00,apnum_07,aaecc_07}.

The prolongator $P_C=R_C^T$ is built starting from a \emph{tentative prolongator}
$P \in \Re^{n \times n_C}$, defined as
\begin{equation} 
P=(p_{ij}), \quad  p_{ij}= 
\left\{ \begin{array}{ll}
1 & \quad \mbox{if} \; i \in V^j_C \\
0 & \quad \mbox{otherwise}
\end{array} \right. .
\label{eq:tent_prol}
\end{equation}
$P_C$ is obtained by
applying to $P$ a smoother $S \in \Re^{n \times n}$:
\begin{equation}
P_C = S P,
\label{eq:smoothed_prol}
\end{equation}
in order to remove oscillatory components from the range of the prolongator
and hence to improve the convergence properties of the multi-level
Schwarz method \cite{Brezina_Vanek_,StubenGMD69_99}.
A simple choice for $S$ is the damped Jacobi smoother:
\begin{equation}
S = I - \omega D^{-1} A , 
\label{eq:jac_smoother}
\end{equation}
where the value of $\omega$ can be chosen
using some estimate of the spectral radius of $D^{-1}A$ \cite{Brezina_Vanek}.
%
%\textbf{NOTA: filtering di $A$ nello smoothing, da implementare?}
%

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "userguide"
%%% End: