From 6097a8fbb61439e0310cbcd6e36c9c7693fe8bd2 Mon Sep 17 00:00:00 2001 From: Salvatore Filippone Date: Mon, 17 Jul 2017 15:24:32 +0100 Subject: [PATCH] New docs for release: merged Daniela's comments. --- docs/src/background.tex | 2 +- docs/src/bibliography.tex | 86 ++++---- docs/src/building.tex | 50 +++-- docs/src/distribution.tex | 1 - docs/src/gettingstarted.tex | 36 ++-- docs/src/intro.tex | 34 ---- docs/src/newobjects.tex | 85 ++++++++ docs/src/overview.tex | 6 +- docs/src/precs.tex | 384 ------------------------------------ docs/src/userguide.tex | 1 + docs/src/userhtml.tex | 3 +- docs/src/userinterface.tex | 188 ++++++------------ 12 files changed, 246 insertions(+), 630 deletions(-) delete mode 100644 docs/src/intro.tex create mode 100644 docs/src/newobjects.tex delete mode 100644 docs/src/precs.tex diff --git a/docs/src/background.tex b/docs/src/background.tex index 8b600155..d496242f 100644 --- a/docs/src/background.tex +++ b/docs/src/background.tex @@ -1 +1 @@ -\section{Multigrid Background\label{sec:background}} \markboth{\textsc{MLD2P4 User's and Reference Guide}} {\textsc{\ref{sec:background} Multigrid Background}} Multigrid preconditioners, coupled with Krylov iterative solvers, are widely used in the parallel solution of large and sparse linear systems, because of their optimality in the solution of linear systems arising from the discretization of scalar elliptic Partial Differential Equations (PDEs) on regular grids. Optimality, also known as algorithmic scalability, is the property of having a computational cost per iteration that depends linearly on the problem size, and a convergence rate that is independent of the problem size. Multigrid preconditioners are based on a recursive application of a two-grid process consisting of smoother iterations and a coarse-space (or coarse-level) correction. 
The smoothers may be either basic iterative methods, such as the Jacobi and Gauss-Seidel ones, or more complex subspace-correction methods, such as the Schwarz ones. The coarse-space correction consists of solving, in an appropriately chosen coarse space, the residual equation associated with the approximate solution computed by the smoother, and of using the solution of this equation to correct the previous approximation. The transfer of information between the original (fine) space and the coarse one is performed by using suitable restriction and prolongation operators. The construction of the coarse space and the corresponding transfer operators is carried out by applying a so-called coarsening algorithm to the system matrix. Two main approaches can be used to perform coarsening: the geometric approach, which exploits the knowledge of some physical grid associated with the matrix and requires the user to define transfer operators from the fine to the coarse level and vice versa, and the algebraic approach, which builds the coarse-space correction and the associate transfer operators using only matrix information. The first approach may be difficult when the system comes from discretizations on complex geometries; furthermore, ad hoc one-level smoothers may be required to get an efficient interplay between fine and coarse levels, e.g., when matrices with highly varying coefficients are considered. The second approach performs a fully automatic coarsening and enforces the interplay between fine and coarse level by suitably choosing the coarse space and the coarse-to-fine interpolation (see, e.g., \cite{Briggs2000,Stuben_01,dd2_96} for details.) MLD2P4 uses a pure algebraic approach, based on the smoothed aggregation algorithm \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}, for building the sequence of coarse matrices and transfer operators, starting from the original one. 
A decoupled version of this algorithm is implemented, where the smoothed aggregation is applied locally to each submatrix \cite{TUMINARO_TONG}. A brief description of the AMG preconditioners implemented in MLD2P4 is given in Sections~\ref{sec:multilevel}-\ref{sec:smoothers}. For further details the reader is referred to \cite{para_04,aaecc_07,apnum_07,MLD2P4_TOMS}. We note that optimal multigrid preconditioners do not necessarily correspond to minimum execution times in a parallel setting. Indeed, to obtain effective parallel multigrid preconditioners, a tradeoff between the optimality and the cost of building and applying the smoothers and the coarse-space corrections must be achieved. Effective parallel preconditioners require algorithmic scalability to be coupled with implementation scalability, i.e., a computational cost per iteration which remains (almost) constant as the number of parallel processors increases. \subsection{AMG preconditioners\label{sec:multilevel}} In order to describe the AMG preconditioners available in MLD2P4, we consider a linear system \begin{equation} Ax=b, \label{eq:system} \end{equation} where $A=(a_{ij}) \in \mathbb{R}^{n \times n}$ is a nonsingular sparse matrix; for ease of presentation we assume $A$ is real, but the results are valid for the complex case as well. Let us assume as finest index space the set of row (column) indices of $A$, i.e., $\Omega = \{1, 2, \ldots, n\}$. Any algebraic multilevel preconditioners implemented in MLD2P4 generates a hierarchy of index spaces and a corresponding hierarchy of matrices, \[ \Omega^1 \equiv \Omega \supset \Omega^2 \supset \ldots \supset \Omega^{nlev}, \quad A^1 \equiv A, A^2, \ldots, A^{nlev}, \] by using the information contained in $A$, without assuming any knowledge of the geometry of the problem from which $A$ originates. A vector space $\mathbb{R}^{n_{k}}$ is associated with $\Omega^k$, where $n_k$ is the size of $\Omega^k$. 
For all $k < nlev$, a restriction operator and a prolongation one are built, which connect two levels $k$ and $k+1$: $$ P^k \in \mathbb{R}^{n_k \times n_{k+1}}, \quad R^k \in \mathbb{R}^{n_{k+1}\times n_k}; $$ %\[ % P^k: \mathbb{R}^{n_{k+1}} \longrightarrow \mathbb{R}^{n_k}, \quad % R^k: \mathbb{R}^{n_k} \longrightarrow \mathbb{R}^{n_{k+1}}; %\] the matrix $A^{k+1}$ is computed by using the previous operators according to the Galerkin approach, i.e., $$ A^{k+1}=R^kA^kP^k. $$ $R^k=(P^k)^T$ in the current implementation of MLD2P4. A smoother with iteration matrix $M^k$ is set up at each level $k < nlev$, and a solver is set up at the coarsest level, so that they are ready for application (for example, setting up a solver based on the $LU$ factorization means computing and storing the $L$ and $U$ factors). The construction of the hierachy of AMG components described so far corresponds to the so-called build phase of the preconditioner. \begin{figure}[t] \begin{center} \framebox{ \begin{minipage}{.85\textwidth} \begin{tabbing} \quad \=\quad \=\quad \=\quad \\[-3mm] procedure V-cycle$\left(k,A^k,b^k,u^k\right)$ \\[2mm] \>if $\left(k \ne nlev \right)$ then \\[1mm] \>\> $u^k = u^k + M^k \left(b^k - A^k u^k\right)$ \\[1mm] \>\> $b^{k+1} = R^{k+1}\left(b^k - A^k u^k\right)$ \\[1mm] \>\> $u^{k+1} =$ V-cycle$\left(k+1,A^{k+1},b^{k+1},0\right)$ \\[1mm] \>\> $u^k = u^k + P^{k+1} u^{k+1}$ \\[1mm] \>\> $u^k = u^k + M^k \left(b^k - A^k u^k\right)$ \\[1mm] \>else \\[1mm] \>\> $u^k = \left(A^k\right)^{-1} b^k$\\[1mm] \>endif \\[1mm] \>return $u^k$ \\[1mm] end \end{tabbing} \end{minipage} } \caption{Application phase of a V-cycle preconditioner.\label{fig:application_alg}} \end{center} \end{figure} The components produced in the build phase may be combined in several ways to obtain different multilevel preconditioners; this is done in the application phase, i.e., in the computation of a vector of type $w=B^{-1}v$, where $B$ denotes the preconditioner, usually within an iteration of a 
Krylov solver \cite{Saad_book}. An example of such a combination, known as V-cycle, is given in Figure~\ref{fig:application_alg}. In this case, a single iteration of the same smoother is used before and after the the recursive call to the V-cycle (i.e., in the pre-smoothing and post-smoothing phases); however, different choices can be performed. Other cycles can be defined; in MLD2P4, we implemented the standard V-cycle and W-cycle~\cite{Briggs2000}, and a version of the K-cycle described in~\cite{Notay2008}. \subsection{Smoothed Aggregation\label{sec:aggregation}} In order to define the prolongator $P^k$, used to compute the coarse-level matrix $A^{k+1}$, MLD2P4 uses the smoothed aggregation algorithm described in \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}. The basic idea of this algorithm is to build a coarse set of indices $\Omega^{k+1}$ by suitably grouping the indices of $\Omega^k$ into disjoint subsets (aggregates), and to define the coarse-to-fine space transfer operator $P^k$ by applying a suitable smoother to a simple piecewise constant prolongation operator, with the aim of improving the quality of the coarse-space correction. Three main steps can be identified in the smoothed aggregation procedure: \begin{enumerate} \item aggregation of the indices set $\Omega^k$, to obtain $\Omega^{k+1}$; \item construction of the prolongator $P^k$; \item application of $P^k$ and $R^k=(P^k)^T$ to build $A^{k+1}$. \end{enumerate} In order to perform the coarsening step, the smoothed aggregation algorithm described in~\cite{VANEK_MANDEL_BREZINA} is used. In this algorithm, each index in $\Omega^{k+1}$ corresponds to an aggregate of $\Omega^k$, consisting of a suitably chosen index $j$ and of the indices $i$ that are strongly coupled to $j$, i.e., $$ |a_{ij}^k| > \theta \sqrt{|a_{ii}^ka_{jj}^k|}, $$ for a given $\theta \in [0,1]$. 
Since this algorithm has a sequential nature, a decoupled version of it is applied, where each processor $i$ independently executes the algorithm on the set of indices assigned to it in the initial data distribution. This version is embarrassingly parallel, since it does not require any data communication. On the other hand, it may produce some non-uniform aggregates and is strongly dependent on the number of processors and on the initial partitioning of the matrix $A$. Nevertheless, this parall algorithm has been chosen for MLD2P4, since it has been shown to produce good results in practice \cite{aaecc_07,apnum_07,TUMINARO_TONG}. The prolongator $P^k$ is built starting from a tentative prolongator $\bar{P}^k \in \mathbb{R}^{n_k \times n_{k+1}}$, defined as $$ \bar{P}^k =(\bar{p}_{ij}^k), \quad \bar{p}_{ij}^k = \left\{ \begin{array}{ll} 1 & \quad \mbox{if} \; i \in \Omega^k_j, \\ 0 & \quad \mbox{otherwise}, \end{array} \right. \label{eq:tent_prol} $$ where $\Omega^k_j$ is the aggregate of $\Omega^k$ corresponding to the index $j \in \Omega^{k+1}$. $P^k$ is obtained by applying to $\bar{P}^k$ a smoother $S^k \in \mathbb{R}^{n_k \times n_k}$: $$ P^k = S^k \bar{P}^k, $$ in order to remove nonsmooth components from the range of the prolongator, and hence to improve the convergence properties of the multi-level method~\cite{BREZINA_VANEK,Stuben_01}. A simple choice for $S^k$ is the damped Jacobi smoother: $$ S^k = I - \omega^k (D^k)^{-1} A^k , $$ where $D^k$ is the diagonal matrix with the same diagonal entries as $A^k$, and $\omega^k$ is an approximation of $4/(3\rho^k)$, where $\rho^k$ is the spectral radius of $(D^k)^{-1}A^k$. computed by using some estimate of the spectral radius of $(D^k)^{-1}A^k$ \cite{BREZINA_VANEK}. 
\subsection{Smoothers and coarsest-level solvers\label{sec:smoothers}} The smoothers implemented in MLD2P4 include the Jacobi and block-Jacobi methods, a hybrid version of the forward and backward Gauss-Seidel methods, and the additive Schwarz (AS) ones (see, e.g., \cite{Saad_book,dd2_96}). The hybrid Gauss-Seidel version is considered because the original Gauss-Seidel method is inherently sequential. At each iteration of the hybrid version, each parallel process uses the most recent values of its own local variables and the values of the non-local variables computed at the previous iteration, obtained by exchanging data with other processes before the beginning of the current iteration. In the AS methods, the index space $\Omega^k$ is divided into $m_k$ subsets $\Omega^k_i$ of size $n_{k,i}$, possibly overlapping. For each $i$ we consider the restriction operator $R_i^k \in \mathbb{R}^{n_{k,i} \times n_k}$ % $R_i^k: \mathbb{R}^{n_k} \longrightarrow \mathbb{R}^{n_{k,i}}$ that maps a vector $x^k$ to the vector $x_i^k$ made of the components of $x^k$ with indices in $\Omega^k_i$, and the prolongation operator $P^k_i = (R_i^k)^T$. These operators are then used to build $A_i^k=R_i^kA^kP_i^k$, which is the restriction of $A^k$ to the index space $\Omega^k_i$. The classical AS preconditioner $M^k_{AS}$ is defined as \[ ( M^k_{AS} )^{-1} = \sum_{i=1}^{m_k} P_i^k (A_i^k)^{-1} R_i^{k}, \] where $A_i^k$ is supposed to be nonsingular. We observe that an approximate inverse of $A_i^k$ is usually considered instead of $(A_i^k)^{-1}$. The setup of $S^k_{AS}$ during the multilevel build phase involves \begin{itemize} \item the definition of the index subspaces $\Omega_i^k$ and of the corresponding operators $R_i^k$ (and $P_i^k$); \item the computation of the submatrices $A_i^k$; \item the computation of their inverses (usually approximated through some form of incomplete factorization). 
\end{itemize} The computation of $z^k=M^k_{AS}w^k$, with $w^k \in \mathbb{R}^{n_k}$, during the multilevel application phase, requires \begin{itemize} \item the restriction of $w^k$ to the subspaces $\mathbb{R}^{n_{k,i}}$, i.e.\ $w_i^k = R_i^{k} w^k$; \item the computation of the vectors $z_i^k=(A_i^k)^{-1} w_i^k$; \item the prolongation and the sum of the previous vectors, i.e.\ $z^k = \sum_{i=1}^{m_k} P_i^k z_i^k$. \end{itemize} Variants of the classical AS method, which use modifications of the restriction and prolongation operators, are also implemented in MLD2P4. Among them, the Restricted AS (RAS) preconditioner usually outperforms the classical AS preconditioner in terms of convergence rate and of computation and communication time on parallel distributed-memory computers, and is therefore the most widely used among the AS preconditioners~\cite{CAI_SARKIS}. Direct solvers based on sparse LU factorizations, implemented in the third party libraries reported in Section~\ref{sec:third_party}, can be applied as coarsest-level solvers by MLD2P4. Native inexact solvers based on incomplete LU factorizations, as well as Jacobi, hybrid (forward) Gauss-Seidel, and block Jacobi preconditioners are also available. Direct solvers usually lead to more effective preconditioners in terms of algorithmic scalability; however, this does not guarantee parallel efficiency. %%% Local Variables: %%% mode: latex %%% TeX-master: "userguide" %%% End: \ No newline at end of file +\section{Multigrid Background\label{sec:background}} \markboth{\textsc{MLD2P4 User's and Reference Guide}} {\textsc{\ref{sec:background} Multigrid Background}} Multigrid preconditioners, coupled with Krylov iterative solvers, are widely used in the parallel solution of large and sparse linear systems, because of their optimality in the solution of linear systems arising from the discretization of scalar elliptic Partial Differential Equations (PDEs) on regular grids. 
Optimality, also known as algorithmic scalability, is the property of having a computational cost per iteration that depends linearly on the problem size, and a convergence rate that is independent of the problem size. Multigrid preconditioners are based on a recursive application of a two-grid process consisting of smoother iterations and a coarse-space (or coarse-level) correction. The smoothers may be either basic iterative methods, such as the Jacobi and Gauss-Seidel ones, or more complex subspace-correction methods, such as the Schwarz ones. The coarse-space correction consists of solving, in an appropriately chosen coarse space, the residual equation associated with the approximate solution computed by the smoother, and of using the solution of this equation to correct the previous approximation. The transfer of information between the original (fine) space and the coarse one is performed by using suitable restriction and prolongation operators. The construction of the coarse space and the corresponding transfer operators is carried out by applying a so-called coarsening algorithm to the system matrix. Two main approaches can be used to perform coarsening: the geometric approach, which exploits the knowledge of some physical grid associated with the matrix and requires the user to define transfer operators from the fine to the coarse level and vice versa, and the algebraic approach, which builds the coarse-space correction and the associated transfer operators using only matrix information. The first approach may be difficult when the system comes from discretizations on complex geometries; furthermore, ad hoc one-level smoothers may be required to get an efficient interplay between fine and coarse levels, e.g., when matrices with highly varying coefficients are considered. 
The second approach performs a fully automatic coarsening and enforces the interplay between fine and coarse level by suitably choosing the coarse space and the coarse-to-fine interpolation (see, e.g., \cite{Briggs2000,Stuben_01,dd2_96} for details). MLD2P4 uses a pure algebraic approach, based on the smoothed aggregation algorithm \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}, for building the sequence of coarse matrices and transfer operators, starting from the original one. A decoupled version of this algorithm is implemented, where the smoothed aggregation is applied locally to each submatrix \cite{TUMINARO_TONG}. A brief description of the AMG preconditioners implemented in MLD2P4 is given in Sections~\ref{sec:multilevel}--\ref{sec:smoothers}. For further details the reader is referred to \cite{para_04,aaecc_07,apnum_07,MLD2P4_TOMS}. We note that optimal multigrid preconditioners do not necessarily correspond to minimum execution times in a parallel setting. Indeed, to obtain effective parallel multigrid preconditioners, a tradeoff between the optimality and the cost of building and applying the smoothers and the coarse-space corrections must be achieved. Effective parallel preconditioners require algorithmic scalability to be coupled with implementation scalability, i.e., a computational cost per iteration which remains (almost) constant as the number of parallel processors increases. \subsection{AMG preconditioners\label{sec:multilevel}} In order to describe the AMG preconditioners available in MLD2P4, we consider a linear system \begin{equation} Ax=b, \label{eq:system} \end{equation} where $A=(a_{ij}) \in \mathbb{R}^{n \times n}$ is a nonsingular sparse matrix; for ease of presentation we assume $A$ is real, but the results are valid for the complex case as well. Let us assume as finest index space the set of row (column) indices of $A$, i.e., $\Omega = \{1, 2, \ldots, n\}$. 
Each algebraic multilevel preconditioner implemented in MLD2P4 generates a hierarchy of index spaces and a corresponding hierarchy of matrices, \[ \Omega^1 \equiv \Omega \supset \Omega^2 \supset \ldots \supset \Omega^{nlev}, \quad A^1 \equiv A, A^2, \ldots, A^{nlev}, \] by using the information contained in $A$, without assuming any knowledge of the geometry of the problem from which $A$ originates. A vector space $\mathbb{R}^{n_{k}}$ is associated with $\Omega^k$, where $n_k$ is the size of $\Omega^k$. For all $k < nlev$, a restriction operator and a prolongation one are built, which connect two levels $k$ and $k+1$: $$ P^k \in \mathbb{R}^{n_k \times n_{k+1}}, \quad R^k \in \mathbb{R}^{n_{k+1}\times n_k}; $$ %\[ % P^k: \mathbb{R}^{n_{k+1}} \longrightarrow \mathbb{R}^{n_k}, \quad % R^k: \mathbb{R}^{n_k} \longrightarrow \mathbb{R}^{n_{k+1}}; %\] the matrix $A^{k+1}$ is computed by using the previous operators according to the Galerkin approach, i.e., $$ A^{k+1}=R^kA^kP^k. $$ $R^k=(P^k)^T$ in the current implementation of MLD2P4. A smoother with iteration matrix $M^k$ is set up at each level $k < nlev$, and a solver is set up at the coarsest level, so that they are ready for application (for example, setting up a solver based on the $LU$ factorization means computing and storing the $L$ and $U$ factors). The construction of the hierarchy of AMG components described so far corresponds to the so-called build phase of the preconditioner. 
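As a concrete illustration of the Galerkin construction above, the following Python sketch computes $A^{k+1}=R^kA^kP^k$ with $R^k=(P^k)^T$ for a small dense example (a 1D Laplacian with pairwise aggregation). This is an illustrative sketch only, not MLD2P4 code: the library operates on distributed sparse matrices, and the matrix and prolongator below are assumptions chosen for the example.

```python
# Sketch of the Galerkin construction A^{k+1} = R^k A^k P^k with R^k = (P^k)^T,
# on small dense matrices (illustration only; MLD2P4 works on sparse data).

def matmul(X, Y):
    # naive dense matrix product
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def galerkin_coarse(A, P):
    """Compute R A P with R = P^T (the Galerkin approach)."""
    R = transpose(P)
    return matmul(matmul(R, A), P)

# 1D Laplacian on 4 points; indices aggregated pairwise,
# with a piecewise-constant prolongator
A = [[ 2, -1,  0,  0],
     [-1,  2, -1,  0],
     [ 0, -1,  2, -1],
     [ 0,  0, -1,  2]]
P = [[1, 0],
     [1, 0],
     [0, 1],
     [0, 1]]
A_coarse = galerkin_coarse(A, P)   # -> [[2, -1], [-1, 2]]
```

Note how the coarse matrix inherits the structure of the fine one; on this example the 4-point Laplacian is reduced to a 2-point Laplacian.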
\begin{figure}[t] \begin{center} \framebox{ \begin{minipage}{.85\textwidth} \begin{tabbing} \quad \=\quad \=\quad \=\quad \\[-3mm] procedure V-cycle$\left(k,A^k,b^k,u^k\right)$ \\[2mm] \>if $\left(k \ne nlev \right)$ then \\[1mm] \>\> $u^k = u^k + M^k \left(b^k - A^k u^k\right)$ \\[1mm] \>\> $b^{k+1} = R^{k+1}\left(b^k - A^k u^k\right)$ \\[1mm] \>\> $u^{k+1} =$ V-cycle$\left(k+1,A^{k+1},b^{k+1},0\right)$ \\[1mm] \>\> $u^k = u^k + P^{k+1} u^{k+1}$ \\[1mm] \>\> $u^k = u^k + M^k \left(b^k - A^k u^k\right)$ \\[1mm] \>else \\[1mm] \>\> $u^k = \left(A^k\right)^{-1} b^k$\\[1mm] \>endif \\[1mm] \>return $u^k$ \\[1mm] end \end{tabbing} \end{minipage} } \caption{Application phase of a V-cycle preconditioner.\label{fig:application_alg}} \end{center} \end{figure} The components produced in the build phase may be combined in several ways to obtain different multilevel preconditioners; this is done in the application phase, i.e., in the computation of a vector of type $w=B^{-1}v$, where $B$ denotes the preconditioner, usually within an iteration of a Krylov solver \cite{Saad_book}. An example of such a combination, known as V-cycle, is given in Figure~\ref{fig:application_alg}. In this case, a single iteration of the same smoother is used before and after the recursive call to the V-cycle (i.e., in the pre-smoothing and post-smoothing phases); however, different choices can be made. Other cycles can be defined; in MLD2P4, we implemented the standard V-cycle and W-cycle~\cite{Briggs2000}, and a version of the K-cycle described in~\cite{Notay2008}. \subsection{Smoothed Aggregation\label{sec:aggregation}} In order to define the prolongator $P^k$, used to compute the coarse-level matrix $A^{k+1}$, MLD2P4 uses the smoothed aggregation algorithm described in \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}. 
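The recursive V-cycle of Figure~\ref{fig:application_alg} can be sketched in Python as follows. The two-level hierarchy, the damped Jacobi smoother, the exact $2 \times 2$ coarsest-level solve, and the dense data structures are illustrative assumptions for this sketch, not the MLD2P4 implementation.

```python
# Illustrative sketch of the V-cycle application phase of
# Figure "fig:application_alg", on a tiny two-level dense example.

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def jacobi(A, b, u, omega=2.0 / 3.0):
    # one sweep of damped Jacobi: u <- u + omega * D^{-1} (b - A u)
    r = [bi - ri for bi, ri in zip(b, matvec(A, u))]
    return [ui + omega * ri / A[i][i] for i, (ui, ri) in enumerate(zip(u, r))]

def v_cycle(k, levels, b, u):
    # mirrors the pseudocode: pre-smoothing, restriction, recursion,
    # prolongation of the coarse correction, post-smoothing
    lev = levels[k]
    A = lev["A"]
    if k < len(levels) - 1:
        u = jacobi(A, b, u)                                # pre-smoothing
        r = [bi - ri for bi, ri in zip(b, matvec(A, u))]   # b^k - A^k u^k
        bc = matvec(lev["R"], r)                           # restriction
        uc = v_cycle(k + 1, levels, bc, [0.0] * len(bc))
        corr = matvec(lev["P"], uc)                        # prolongation
        u = [ui + ci for ui, ci in zip(u, corr)]
        u = jacobi(A, b, u)                                # post-smoothing
    else:
        u = lev["solve"](b)                                # coarsest-level solve
    return u

# two-level hierarchy for a 1D Laplacian with pairwise aggregation
A1 = [[2.0, -1.0, 0.0, 0.0], [-1.0, 2.0, -1.0, 0.0],
      [0.0, -1.0, 2.0, -1.0], [0.0, 0.0, -1.0, 2.0]]
P1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
R1 = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
A2 = [[2.0, -1.0], [-1.0, 2.0]]          # = R1 A1 P1 (Galerkin)
levels = [
    {"A": A1, "R": R1, "P": P1},
    {"A": A2, "solve": lambda b: [(2.0 * b[0] + b[1]) / 3.0,
                                  (b[0] + 2.0 * b[1]) / 3.0]},  # exact 2x2 inverse
]

b = [1.0, 0.0, 0.0, 1.0]                 # exact solution is [1, 1, 1, 1]
u = [0.0] * 4
for _ in range(30):                      # V-cycle used as a stationary iteration
    u = v_cycle(0, levels, b, u)
```

In practice the cycle would be applied once per Krylov iteration as the preconditioner $B^{-1}$, rather than iterated to convergence as in this standalone loop.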
The basic idea of this algorithm is to build a coarse set of indices $\Omega^{k+1}$ by suitably grouping the indices of $\Omega^k$ into disjoint subsets (aggregates), and to define the coarse-to-fine space transfer operator $P^k$ by applying a suitable smoother to a simple piecewise constant prolongation operator, with the aim of improving the quality of the coarse-space correction. Three main steps can be identified in the smoothed aggregation procedure: \begin{enumerate} \item aggregation of the indices of $\Omega^k$ to obtain $\Omega^{k+1}$; \item construction of the prolongator $P^k$; \item application of $P^k$ and $R^k=(P^k)^T$ to build $A^{k+1}$. \end{enumerate} In order to perform the coarsening step, the smoothed aggregation algorithm described in~\cite{VANEK_MANDEL_BREZINA} is used. In this algorithm, each index $j \in \Omega^{k+1}$ corresponds to an aggregate $\Omega^k_j$ of $\Omega^k$, consisting of a suitably chosen index $i \in \Omega^k$ and indices that are (usually) contained in a strongly-coupled neighborhood of $i$, i.e., \begin{equation} \label{eq:strongly_coup} \Omega^k_j \subset \mathcal{N}_i^k(\theta) = \left\{ r \in \Omega^k: |a_{ir}^k| > \theta \sqrt{|a_{ii}^ka_{rr}^k|} \right \} \cup \left\{ i \right\}, \end{equation} for a given threshold $\theta \in [0,1]$ (see~\cite{VANEK_MANDEL_BREZINA} for the details). Since this algorithm has a sequential nature, a decoupled version of it is applied, where each processor independently executes the algorithm on the set of indices assigned to it in the initial data distribution. This version is embarrassingly parallel, since it does not require any data communication. On the other hand, it may produce some nonuniform aggregates and is strongly dependent on the number of processors and on the initial partitioning of the matrix $A$. Nevertheless, this parallel algorithm has been chosen for MLD2P4, since it has been shown to produce good results in practice \cite{aaecc_07,apnum_07,TUMINARO_TONG}. 
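The strong-coupling test and a greedy aggregation sweep can be sketched in Python as follows. This is a deliberately simplified illustration (greedy seeding in index order, dense matrix), not the exact algorithm of \cite{VANEK_MANDEL_BREZINA} nor MLD2P4 code; the threshold value and the test matrix are assumptions chosen for the example.

```python
# Sketch of the strong-coupling neighborhood N_i(theta) and a greedy
# aggregation pass (simplified illustration of smoothed-aggregation coarsening).
import math

def strong_neighborhood(A, i, theta):
    """Indices r with |a_ir| > theta * sqrt(|a_ii| |a_rr|), plus i itself."""
    n = len(A)
    nbrs = {r for r in range(n)
            if r != i and abs(A[i][r]) > theta * math.sqrt(abs(A[i][i]) * abs(A[r][r]))}
    nbrs.add(i)
    return nbrs

def greedy_aggregate(A, theta):
    """One greedy sweep: each still-unassigned index seeds an aggregate made
    of its unassigned strongly-coupled neighbours (a simplification of the
    scheme in Vanek-Mandel-Brezina)."""
    n = len(A)
    agg = [-1] * n          # agg[i] = coarse index of the aggregate containing i
    n_agg = 0
    for i in range(n):
        if agg[i] == -1:
            for r in strong_neighborhood(A, i, theta):
                if agg[r] == -1:
                    agg[r] = n_agg
            n_agg += 1
    return agg, n_agg

# 1D Laplacian: with theta = 0.25 every off-diagonal neighbour is strongly
# coupled, and the greedy sweep produces the pairwise aggregates {0,1}, {2,3}
A = [[ 2, -1,  0,  0],
     [-1,  2, -1,  0],
     [ 0, -1,  2, -1],
     [ 0,  0, -1,  2]]
agg, n_agg = greedy_aggregate(A, theta=0.25)   # -> [0, 0, 1, 1], 2
```

In the decoupled version described above, each process would run such a sweep independently on its locally owned indices, which is what makes the coarsening embarrassingly parallel.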
The prolongator $P^k$ is built starting from a tentative prolongator $\bar{P}^k \in \mathbb{R}^{n_k \times n_{k+1}}$, defined as $$ \bar{P}^k =(\bar{p}_{ij}^k), \quad \bar{p}_{ij}^k = \left\{ \begin{array}{ll} 1 & \quad \mbox{if} \; i \in \Omega^k_j, \\ 0 & \quad \mbox{otherwise}, \end{array} \right. \label{eq:tent_prol} $$ where $\Omega^k_j$ is the aggregate of $\Omega^k$ corresponding to the index $j \in \Omega^{k+1}$. $P^k$ is obtained by applying to $\bar{P}^k$ a smoother $S^k \in \mathbb{R}^{n_k \times n_k}$: $$ P^k = S^k \bar{P}^k, $$ in order to remove nonsmooth components from the range of the prolongator, and hence to improve the convergence properties of the multi-level method~\cite{BREZINA_VANEK,Stuben_01}. A simple choice for $S^k$ is the damped Jacobi smoother: $$ S^k = I - \omega^k (D^k)^{-1} A^k_F , $$ where $D^k$ is the diagonal matrix with the same diagonal entries as $A^k$, $A^k_F = (\bar{a}_{ij}^k)$ is the filtered matrix defined as \begin{equation} \label{eq:filtered} \bar{a}_{ij}^k = \left \{ \begin{array}{ll} a_{ij}^k & \mbox{if } j \in \mathcal{N}_i^k(\theta), \\ 0 & \mbox{otherwise}, \end{array} \right. \; (j \ne i), \qquad \bar{a}_{ii}^k = a_{ii}^k - \sum_{j \ne i} (a_{ij}^k - \bar{a}_{ij}^k), \end{equation} and $\omega^k$ is an approximation of $4/(3\rho^k)$, where $\rho^k$ is the spectral radius of $(D^k)^{-1}A^k_F$ \cite{BREZINA_VANEK}. In MLD2P4 this approximation is obtained by using $\| A^k_F \|_\infty$ as an estimate of $\rho^k$. Note that for systems coming from uniformly elliptic problems, filtering the matrix $A^k$ has little or no effect, and $A^k$ can be used instead of $A^k_F$. The latter choice is the default in MLD2P4. \subsection{Smoothers and coarsest-level solvers\label{sec:smoothers}} The smoothers implemented in MLD2P4 include the Jacobi and block-Jacobi methods, a hybrid version of the forward and backward Gauss-Seidel methods, and the additive Schwarz (AS) ones (see, e.g., \cite{Saad_book,dd2_96}). 
The hybrid Gauss-Seidel version is considered because the original Gauss-Seidel method is inherently sequential. At each iteration of the hybrid version, each parallel process uses the most recent values of its own local variables and the values of the non-local variables computed at the previous iteration, obtained by exchanging data with other processes before the beginning of the current iteration. In the AS methods, the index space $\Omega^k$ is divided into $m_k$ subsets $\Omega^k_i$ of size $n_{k,i}$, possibly overlapping. For each $i$ we consider the restriction operator $R_i^k \in \mathbb{R}^{n_{k,i} \times n_k}$ % $R_i^k: \mathbb{R}^{n_k} \longrightarrow \mathbb{R}^{n_{k,i}}$ that maps a vector $x^k$ to the vector $x_i^k$ made of the components of $x^k$ with indices in $\Omega^k_i$, and the prolongation operator $P^k_i = (R_i^k)^T$. These operators are then used to build $A_i^k=R_i^kA^kP_i^k$, which is the restriction of $A^k$ to the index space $\Omega^k_i$. The classical AS preconditioner $M^k_{AS}$ is defined as \[ ( M^k_{AS} )^{-1} = \sum_{i=1}^{m_k} P_i^k (A_i^k)^{-1} R_i^{k}, \] where $A_i^k$ is supposed to be nonsingular. We observe that an approximate inverse of $A_i^k$ is usually considered instead of $(A_i^k)^{-1}$. The setup of $M^k_{AS}$ during the multilevel build phase involves \begin{itemize} \item the definition of the index subspaces $\Omega_i^k$ and of the corresponding operators $R_i^k$ (and $P_i^k$); \item the computation of the submatrices $A_i^k$; \item the computation of their inverses (usually approximated through some form of incomplete factorization). 
\end{itemize} The computation of $z^k=M^k_{AS}w^k$, with $w^k \in \mathbb{R}^{n_k}$, during the multilevel application phase, requires \begin{itemize} \item the restriction of $w^k$ to the subspaces $\mathbb{R}^{n_{k,i}}$, i.e.\ $w_i^k = R_i^{k} w^k$; \item the computation of the vectors $z_i^k=(A_i^k)^{-1} w_i^k$; \item the prolongation and the sum of the previous vectors, i.e.\ $z^k = \sum_{i=1}^{m_k} P_i^k z_i^k$. \end{itemize} Variants of the classical AS method, which use modifications of the restriction and prolongation operators, are also implemented in MLD2P4. Among them, the Restricted AS (RAS) preconditioner usually outperforms the classical AS preconditioner in terms of convergence rate and of computation and communication time on parallel distributed-memory computers, and is therefore the most widely used among the AS preconditioners~\cite{CAI_SARKIS}. Direct solvers based on sparse LU factorizations, implemented in the third-party libraries reported in Section~\ref{sec:third-party}, can be applied as coarsest-level solvers by MLD2P4. Native inexact solvers based on incomplete LU factorizations, as well as Jacobi, hybrid (forward) Gauss-Seidel, and block Jacobi preconditioners are also available. Direct solvers usually lead to more effective preconditioners in terms of algorithmic scalability; however, this does not guarantee parallel efficiency. %%% Local Variables: %%% mode: latex %%% TeX-master: "userguide" %%% End: \ No newline at end of file diff --git a/docs/src/bibliography.tex b/docs/src/bibliography.tex index 7d8290fa..98305063 100644 --- a/docs/src/bibliography.tex +++ b/docs/src/bibliography.tex @@ -6,13 +6,19 @@ %\let\refname\relax % +\bibitem{MUMPS} +P.~R.~Amestoy, C.~Ashcraft, O.~Boiteau, A.~Buttari, J.~L'Excellent, C.~Weisbecker, +{\em Improving multifrontal methods by means of block low-rank representations}, +SIAM Journal on Scientific Computing, volume 37 (3), 2015, A1452--A1474. +See also {\tt http://mumps.enseeiht.fr}. 
+% \bibitem{BREZINA_VANEK} -M.~Brezina, P.~Van{\v e}k, +M.~Brezina, P.~Van\v{e}k, {\em A Black-Box Iterative Solver Based on a Two-Level Schwarz Method}, Computing, 63, 1999, 233--263. % \bibitem{Briggs2000} -W.~L.~Briggs, V.~E.~Henson, S.~F.~ McCormick, +W.~L.~Briggs, V.~E.~Henson, S.~F.~McCormick, {\em A Multigrid Tutorial, Second Edition}, SIAM, 2000. % @@ -32,28 +38,29 @@ Applicable Algebra in Engineering, Communications and Computing, 18 (3) 2007, 223--239. %Published online: 13 February 2007, {\tt http://dx.doi.org/10.1007/s00200-007-0035-z} % -\bibitem{apnum_07} P.~D'Ambra, S.~Filippone, D.~di~Serafino, -{\em On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners}, -Applied Numerical Mathematics, Elsevier Science, -57 (11-12), 2007, 1181-1196. -%published online 3 February 2007, {\tt -% http://dx.doi.org/10.1016/j.apnum.2007.01.006} -% \bibitem{CAI_SARKIS} X.~C.~Cai, M.~Sarkis, {\em A Restricted Additive Schwarz Preconditioner for General Sparse Linear Systems}, SIAM Journal on Scientific Computing, 21 (2), 1999, 792--797. % -\bibitem{Cai_Widlund_92} -X.~C.~Cai, O.~B.~Widlund, -{\em Domain Decomposition Algorithms for Indefinite Elliptic Problems}, -SIAM Journal on Scientific and Statistical Computing, 13 (1), 1992, 243--258. +%\bibitem{Cai_Widlund_92} +%X.~C.~Cai, O.~B.~Widlund, +%{\em Domain Decomposition Algorithms for Indefinite Elliptic Problems}, +%SIAM Journal on Scientific and Statistical Computing, 13 (1), 1992, 243--258. +% +%\bibitem{dd1_94} +%T.~Chan and T.~Mathew, +%{\em Domain Decomposition Algorithms}, +%in A.~Iserles, editor, Acta Numerica 1994, 61--143. +%Cambridge University Press. % -\bibitem{dd1_94} -T.~Chan and T.~Mathew, -{\em Domain Decomposition Algorithms}, -in A.~Iserles, editor, Acta Numerica 1994, 61--143. -Cambridge University Press. 
+\bibitem{apnum_07} +P.~D'Ambra, S.~Filippone, D.~di~Serafino, +{\em On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners}, +Applied Numerical Mathematics, Elsevier Science, +57 (11-12), 2007, 1181-1196. +%published online 3 February 2007, {\tt +% http://dx.doi.org/10.1016/j.apnum.2007.01.006} % \bibitem{MLD2P4_TOMS} P.~D'Ambra, D.~di~Serafino, S.~Filippone, @@ -62,20 +69,14 @@ Algebraic Domain Decomposition Preconditioners in Fortran 95}, ACM Trans. Math. Softw., 37(3), 2010, art. 30. % \bibitem{UMFPACK} -T.A.~Davis, +T.~A.~Davis, {\em Algorithm 832: UMFPACK - an Unsymmetric-pattern Multifrontal Method with a Column Pre-ordering Strategy}, ACM Transactions on Mathematical Software, 30, 2004, 196--199. (See also {\tt http://www.cise.ufl.edu/~davis/}) % -\bibitem{MUMPS} -P.R.~Amestoy, C.~Ashcraft, O.~Boiteau, A.~Buttari, J.~L'Excellent, C.~Weisbecker -{\em Improving multifrontal methods by means of block low-rank representations}, -SIAM Journal on Scientific Computing, volume 37 (3), 2015, A1452--A1474. -See also {\tt http://mumps.enseeiht.fr}. -% \bibitem{SUPERLU} -J.W.~Demmel, S.C.~Eisenstat, J.R.~Gilbert, X.S.~Li and J.W.H.~Liu, +J.~W.~Demmel, S.~C.~Eisenstat, J.~R.~Gilbert, X.~S.~Li, J.~W.~H.~Liu, A supernodal approach to sparse partial pivoting, SIAM Journal on Matrix Analysis and Applications, 20 (3), 1999, 720--755. % @@ -89,16 +90,16 @@ J.~J.~Dongarra, J.~Du Croz, S.~Hammarling, R.~J.~Hanson, \emph{An extended set of FORTRAN Basic Linear Algebra Subprograms}, ACM Transactions on Mathematical Software, 14 (1) 1988, 1--17. % -\bibitem{BLACS} -J.~J.~Dongarra and R.~C.~Whaley, -{\em A User's Guide to the BLACS v.~1.1}, -Lapack Working Note 94, Tech.\ Rep.\ UT-CS-95-281, University of -Tennessee, March 1995 (updated May 1997). +%\bibitem{BLACS} +%J.~J.~Dongarra, R.~C.~Whaley, +%{\em A User's Guide to the BLACS v.~1.1}, +%Lapack Working Note 94, Tech.\ Rep.\ UT-CS-95-281, University of +%Tennessee, March 1995 (updated May 1997). 
% -\bibitem{EFSTATHIOU} -E.~Efstathiou, J.~G.~Gander, -{\em Why Restricted Additive Schwarz Converges Faster than Additive Schwarz}, -BIT Numerical Mathematics, 43 (5), 2003, 945--959. +%\bibitem{EFSTATHIOU} +%E.~Efstathiou, J.~G.~Gander, +%{\em Why Restricted Additive Schwarz Converges Faster than Additive Schwarz}, +%BIT Numerical Mathematics, 43 (5), 2003, 945--959. % \bibitem{PSBLASGUIDE} S.~Filippone, A.~Buttari, @@ -106,9 +107,9 @@ S.~Filippone, A.~Buttari, available from \texttt{http://www.ce.uniroma2.it/psblas/}. % \bibitem{PSBLAS3} -Salvatore Filippone and Alfredo Buttari. +S.~Filippone, A.~Buttari, {\em Object-Oriented Techniques for Sparse Matrix Computations in Fortran 2003}. -ACM Transactions on on Mathematical Software, 38 (4), 2012, art. 23. +ACM Transactions on Mathematical Software, 38 (4), 2012, art.~23. % \bibitem{psblas_00} S.~Filippone, M.~Colajanni, @@ -127,12 +128,14 @@ C.~L.~Lawson, R.~J.~Hanson, D.~Kincaid, F.~T.~Krogh, ACM Transactions on Mathematical Software, 5 (3), 1979, 308--323. % \bibitem{SUPERLUDIST} -X.~S.~Li, J.~W.~Demmel, {\em SuperLU\_DIST: A Scalable Distributed-memory +X.~S.~Li, J.~W.~Demmel, +{\em SuperLU\_DIST: A Scalable Distributed-memory Sparse Direct Solver for Unsymmetric Linear Systems}, ACM Transactions on Mathematical Software, 29 (2), 2003, 110--140. % \bibitem{Notay2008} -Y.~Notay, P.~S.~Vassilevski, {\em Recursive Krylov-based multigrid cycles}, +Y.~Notay, P.~S.~Vassilevski, +{\em Recursive Krylov-based multigrid cycles}, Numerical Linear Algebra with Applications, 15 (5), 2008, 473--487. % \bibitem{Saad_book} @@ -149,7 +152,7 @@ Cambridge University Press, 1996. M.~Snir, S.~Otto, S.~Huss-Lederman, D.~Walker, J.~Dongarra, {\em MPI: The Complete Reference. Volume 1 - The MPI Core}, second edition, MIT Press, 1998. 
-%% +% \bibitem{Stuben_01} K.~St\"{u}ben, {\em An Introduction to Algebraic Multigrid}, @@ -161,9 +164,8 @@ R.~S.~Tuminaro, C.~Tong, {\em Parallel Smoothed Aggregation Multigrid: Aggregation Strategies on Massively Parallel Machines}, in J. Donnelley, editor, Proceedings of SuperComputing 2000, Dallas, 2000. % \bibitem{VANEK_MANDEL_BREZINA} -P.~Van{\v e}k, J.~Mandel and M.~Brezina, +P.~Van\v{e}k, J.~Mandel, M.~Brezina, {\em Algebraic Multigrid by Smoothed Aggregation for Second and Fourth Order Elliptic Problems}, Computing, 56 (3) 1996, 179--196. -% \end{thebibliography} diff --git a/docs/src/building.tex b/docs/src/building.tex index bc0e49e6..163d2dfe 100644 --- a/docs/src/building.tex +++ b/docs/src/building.tex @@ -17,7 +17,15 @@ recommend to use at least version 4.8. The software defines data types and interfaces for real and complex data, in both single and double precision. -\subsection{Prerequisites} +Building MLD2P4 requires some base libraries (see Section~\ref{sec:prerequisites}); +interfaces to optional third-party libraries, which extend the functionalities of MLD2P4 +(see Section~\ref{sec:third-party}), are also available. Many Linux distributions +(e.g., Ubuntu, Fedora, CentOS) provide precompiled packages for the prerequisite and +optional software. In many cases these packages are split between a runtime part and a +``developer'' part; in order to build MLD2P4 you need both. A description of the base and +optional software used by MLD2P4 is given in the next sections. + +\subsection{Prerequisites\label{sec:prerequisites}} The following base libraries are needed: \begin{description} @@ -68,18 +76,18 @@ for multi-level preconditioners may change to reflect their presence. A sparse LU factorization package available from \url{mumps.enseeiht.fr}; it provides sequential and parallel factorizations and triangular system solution for single and double precision, real and complex data. - We tested versions 4.10.0 and version 5.0.1. 
+ We tested versions 4.10.0 and 5.0.1. \item[SuperLU] \cite{SUPERLU} A sparse LU factorization package available from \url{crd.lbl.gov/~xiaoye/SuperLU/}; it provides sequential factorization and triangular system solution for single and double precision, - real and complex data. We tested version 4.3 and 5.0. If you installed BLAS from + real and complex data. We tested versions 4.3 and 5.0. If you installed BLAS from ATLAS, remember to define the BLASLIB variable in the make.inc file. \item[SuperLU\_Dist] \cite{SUPERLUDIST} A sparse LU factorization package available from the same site as SuperLU; it provides parallel factorization and triangular system solution for double precision real and complex data. - We tested version 3.3 and 4.2. If you installed BLAS from + We tested versions 3.3 and 4.2. If you installed BLAS from ATLAS, remember to define the BLASLIB variable in the make.inc file and to add the \verb|-std=c99| option to the C compiler options. Note that this library requires the ParMETIS @@ -278,7 +286,7 @@ install directory under the name \verb|Make.inc.MLD2P4|. To use the MUMPS solver package, the user has to add the appropriate options to the configure script; by default we are looking for the libraries -\verb|-ldmumps -lsmumps| \verb|-lzmumps -mumps_common -lpord|. +\verb|-ldmumps -lsmumps| \verb| -lzmumps -lcmumps -lmumps_common -lpord|. MUMPS often uses additional packages such as ScaLAPACK, ParMETIS, SCOTCH, as well as enabling OpenMP; in such cases it is necessary to add linker options with the \verb|--with-extra-libs| configure option. @@ -291,23 +299,23 @@ followed (optionally) by \begin{verbatim} make install \end{verbatim} -Many Linux distributions (e.g. Ubuntu, Fedora, CentOS) provide -precompiled packages for the prerequisite softwares; in many cases the -software packages are split between a runtime part and a ``developer'' -part, to rebuild MLD2P4 you'll need both. 
- \subsection{Bug reporting} -If you find any bugs in our codes, please let us know at -\begin{rawhtml} - -\end{rawhtml} -\texttt{bugreport@mld2p4.it} -\begin{rawhtml} - -\end{rawhtml} -; be aware that -the amount of information needed to reproduce a problem in a parallel -program may vary quite a lot. +If you find any bugs in our codes, please send an email to \\[2mm] +\texttt{pasqua.dambra@cnr.it} \\ +\texttt{daniela.diserafino@unicampania.it} \\ +\texttt{salvatore.filippone@cranfield.ac.uk} \\[2mm] +% please let us know at +%\begin{rawhtml} +% +%\end{rawhtml} +%\texttt{bugreport@mld2p4.it} +%\begin{rawhtml} +% +%\end{rawhtml} +%; +You should be aware that the amount of information needed to reproduce a problem +in a parallel program may vary quite a lot. + \subsection{Example and test programs\label{sec:ex_and_test}} The package contains the \verb|examples| and \verb|tests| directories; both of them are further divided into \verb|fileread| and diff --git a/docs/src/distribution.tex b/docs/src/distribution.tex index 6fd6c7ae..bc16b84b 100644 --- a/docs/src/distribution.tex +++ b/docs/src/distribution.tex @@ -8,7 +8,6 @@ MLD2P4 is available from the web site \texttt{http://www.mld2p4.it} \end{quotation} where contact points for further information can be also found. -{\bf Passiamo subito a GitHub?} The software is available under a modified BSD license, as specified in Appendix~\ref{sec:license}; please note that some of the optional diff --git a/docs/src/gettingstarted.tex b/docs/src/gettingstarted.tex index 8f99a12c..758d28f1 100644 --- a/docs/src/gettingstarted.tex +++ b/docs/src/gettingstarted.tex @@ -70,7 +70,7 @@ No preconditioner &\verb|'NOPREC'|& Considered only to use the PSBLAS Krylov solvers with no preconditioner. \\ \hline Diagonal & \verb|'DIAG'| or \verb|'JACOBI'| & Diagonal preconditioner. 
For any zero diagonal entry of the matrix to be preconditioned, - the corresponding entry of he preconditioner is set to~1.\\ \hline + the corresponding entry of the preconditioner is set to~1.\\ \hline Block Jacobi & \verb|'BJAC'| & Block-Jacobi with ILU(0) on the local blocks.\\ \hline Additive Schwarz & \verb|'AS'| & Restricted Additive Schwarz (RAS), with overlap~1 and ILU(0) on the local blocks. \\ \hline @@ -176,11 +176,11 @@ the corresponding codes are available in \verb|examples/fileread/|. ! with basic smoothed aggregation, 1 hybrid forward/backward ! GS sweep as pre/post-smoother and UMFPACK as coarsest-level ! solver - call P%init(P,'ML',info) + call P%init('ML',info) ! ! build the preconditioner - call P%hierarchy_build(A,desc_A,P,info) - call P%smoothers_build(A,desc_A,P,info) + call P%hierarchy_build(A,desc_A,info) + call P%smoothers_build(A,desc_A,info) ! ! set the solver parameters and the initial guess @@ -191,7 +191,7 @@ the corresponding codes are available in \verb|examples/fileread/|. ... ... ! ! deallocate the preconditioner - call P%free(P,info) + call P%free(info) ! ! deallocate other data structures ... ... @@ -260,12 +260,12 @@ boundary conditions are also available in the directory \verb|examples/pdegen|. ! build a V-cycle preconditioner with 1 block-Jacobi sweep (with ! ILU(0) on the blocks) as pre- and post-smoother, and 8 block-Jacobi ! sweeps (with ILU(0) on the blocks) as coarsest-level solver - call P%init(P,'ML',info) - call_P%set(P,'SMOOTHER_TYPE','BJAC',info) - call P%set(P,'COARSE_SOLVE','BJAC',info) - call P%set(P,'COARSE_SWEEPS',8,info) - call P%hierarchy_build(A,desc_A,P,info) - call P%smoothers_build(A,desc_A,P,info) + call P%init('ML',info) + call P%set('SMOOTHER_TYPE','BJAC',info) + call P%set('COARSE_SOLVE','BJAC',info) + call P%set('COARSE_SWEEPS',8,info) + call P%hierarchy_build(A,desc_A,info) + call P%smoothers_build(A,desc_A,info) ... ... 
\end{verbatim} } @@ -284,17 +284,17 @@ boundary conditions are also available in the directory \verb|examples/pdegen|. ! build a W-cycle preconditioner with 2 Gauss-Seidel sweeps as ! post-smoother (and no pre-smoother), a distributed coarsest ! matrix, and MUMPS as coarsest-level solver - call P%init(P,'ML',info) + call P%init('ML',info) call P%set('ML_TYPE','WCYCLE',info) call P%set('SMOOTHER_TYPE','GS',info) call P%set('SMOOTHER_SWEEPS',0,info,pos='PRE') call P%set('SMOOTHER_SWEEPS',2,info,pos='POST') call P%set('COARSE_SOLVE','MUMPS',info) call P%set('COARSE_MAT','DIST',info) - call P%hierarchy_build(A,desc_A,P,info) - call P%smoothers_build(A,desc_A,P,info) + call P%hierarchy_build(A,desc_A,info) + call P%smoothers_build(A,desc_A,info) ... ... -! solve Ax=b with preconditioned CG +! solve Ax=b with preconditioned BiCGSTAB call psb_krylov('BICGSTAB',A,P,b,x,tol,desc_A,info) \end{verbatim} } @@ -310,9 +310,9 @@ boundary conditions are also available in the directory \verb|examples/pdegen|. \begin{verbatim} ... ... ! set RAS with overlap 2 and ILU(0) on the local blocks - call P%init(P,'AS',info) - call P%set(P,'SUB_OVR',2,info) - call P%bld(A,desc_A,P,info) + call P%init('AS',info) + call P%set('SUB_OVR',2,info) + call P%bld(A,desc_A,info) ... ... \end{verbatim} } diff --git a/docs/src/intro.tex b/docs/src/intro.tex deleted file mode 100644 index 864ab263..00000000 --- a/docs/src/intro.tex +++ /dev/null @@ -1,34 +0,0 @@ -\section{Introduction}\label{sec:intro} -\markboth{\underline{MLD2P4 User's and Reference Guide}} - {\underline{\ref{sec:overview} Introduction}} - -The MLD2P4 library provides .... - - -\subsection{Programming model} - -The MLD2P4 librarary is based on the Single Program Multiple Data -(SPMD) programming model: each process participating in the -computation performs the same actions on a chunk of data. Parallelism -is thus data-driven. 
- -Because of this structure, many subroutines coordinate their action -across the various processes, thus providing an implicit -synchronization point, and therefore \emph{must} be -called simultaneously by all processes participating in the -computation. -However there are many cases where no synchronization, and indeed no -communication among processes, is implied. - -Throughout this user's guide each subroutine will be clearly indicated -as: -\begin{description} -\item[Synchronous:] must be called simultaneously by all the - processes in the relevant communication context; -\item[Asynchronous:] may be called in a totally independent manner. -\end{description} - -%%% Local Variables: -%%% mode: latex -%%% TeX-master: "userguide" -%%% End: diff --git a/docs/src/newobjects.tex b/docs/src/newobjects.tex new file mode 100644 index 00000000..aa7ec330 --- /dev/null +++ b/docs/src/newobjects.tex @@ -0,0 +1,85 @@ + +\clearpage + +\section{Adding new smoother and solver objects to MLD2P4\label{sec:adding}} + +Developers can add completely new smoother and/or solver classes +derived from the base objects in the library (see Remark~2 in Section~\ref{sec:precset}), +without recompiling the library itself. + +To do so, it is necessary first to select the base type to be extended. +In our experience, it is quite likely that the new application needs +only the definition of a ``solver'' object, which is almost +always acting only on the local part of the distributed matrix. +The parallel actions required to connect the various solver objects +are most often already provided by the block-Jacobi or the additive +Schwarz smoothers. To define a new solver, the developer will then +have to define its components and methods, perhaps taking one of the +predefined solvers as a starting point, if possible. 
+ +Once the new smoother/solver class has been developed, to use it in +the context of the multilevel preconditioners it is necessary to: +\begin{itemize} +\item declare in the application program a variable of the new type; +\item pass that variable as the argument to the \verb|set| routine as in the +following: +\begin{center} +\verb|call p%set(smoother,info [,ilev,ilmax,pos])|\\ +\verb|call p%set(solver,info [,ilev,ilmax,pos])| +\end{center} +\item link the code implementing the various methods into the application executable. +\end{itemize} +The new solver object is then dynamically included in the +preconditioner structure, and acts as a \emph{mold} to which the +preconditioner will conform, even though the MLD2P4 library has not +been modified to account for this new development. + +It is possible to define new values for the keyword \verb|WHAT| in the +\verb|set| routine; if the library code does not recognize a keyword, +it passes it down the composition hierarchy (levels containing +smoothers containing in turn solvers), so that it can be eventually caught by +the new solver. + +An example is provided in the source code distribution under the +folder \verb|tests/newslv|. In this example we are implementing a new +incomplete factorization variant (which is simply the ILU(0) +factorization under a new name). Because of the specifics of this case, it is +possible to reuse the basic structure of the ILU solver, with its +L/D/U components and the methods needed to apply the solver; only a +few methods, such as the description and most importantly the build, +need to be overridden (rewritten). 
+ +The interfaces for the calls shown above are defined using +\begin{center} +\begin{tabular}{p{1.4cm}p{12cm}} +\verb|smoother| & \verb|class(mld_x_base_smoother_type)| \\ + & The user-defined new smoother to be employed in the + preconditioner.\\ +\verb|solver| & \verb|class(mld_x_base_solver_type)| \\ + & The user-defined new solver to be employed in the + preconditioner. +\end{tabular} +\end{center} +The other arguments are defined in the way described in +Sec.~\ref{sec:precset}. As an example, in the \verb|tests/newslv| +code we define a new object of type \verb|mld_d_tlu_solver_type|, and +we pass it as follows: +\begin{verbatim} + + ! sparse matrix and preconditioner + type(psb_dspmat_type) :: a + type(mld_dprec_type) :: prec + type(mld_d_tlu_solver_type) :: tlusv + +...... + ! + ! prepare the preconditioner: an ML with defaults, but with TLU solver at + ! intermediate levels. All other parameters are at default values. + ! + call prec%init('ML', info) + call prec%hierarchy_build(a,desc_a,info) + nlv = prec%get_nlevs() + call prec%set(tlusv, info,ilev=1,ilmax=max(1,nlv-1)) + call prec%smoothers_build(a,desc_a,info) + +\end{verbatim} diff --git a/docs/src/overview.tex b/docs/src/overview.tex index 30079b4b..6c88056f 100644 --- a/docs/src/overview.tex +++ b/docs/src/overview.tex @@ -25,7 +25,7 @@ The multi-level preconditioners implemented in MLD2P4 are obtained by combining AMG cycles with smoothers and coarsest-level solvers. The V-, W-, and K-cycles~\cite{Briggs2000,Notay2008} are available, which allow to define almost all the preconditioners in the package, including the multi-level hybrid -Schwarz ones; a specific cycle is implemented to obained multi-level additive +Schwarz ones; a specific cycle is implemented to obtain multi-level additive Schwarz preconditioners. 
The Jacobi, hybrid %\footnote{see Note 2 in Table~\ref{tab:p_coarse}, p.~28.} forward/backward Gauss-Seidel, block-Jacobi, and additive Schwarz methods @@ -62,8 +62,8 @@ portability, modularity ed extensibility in the development of the preconditione package. On the other hand, the implementation of MLD2P4 has led to some revisions and extentions of the original PSBLAS kernels. The inter-process comunication required by MLD2P4 is encapsulated -in the PSBLAS routines;% , except few cases where MPI~\cite{MPI1} is explicitly called. -% \textbf{E' ancora cos\'{i} o adesso \`e tutto incapsulato in PSBLAS?} +in the PSBLAS routines; +% , except few cases where MPI~\cite{MPI1} is explicitly called. therefore, MLD2P4 can be run on any parallel machine where PSBLAS implementations are available. diff --git a/docs/src/precs.tex b/docs/src/precs.tex deleted file mode 100644 index 582b2b12..00000000 --- a/docs/src/precs.tex +++ /dev/null @@ -1,384 +0,0 @@ -\section{Preconditioner routines} -\label{sec:precs} -\markboth{\underline{MLD2P4 User's and Reference Guide}} - {\underline{\ref{sec:precs} Preconditioners}} - -% \section{Preconditioners} -\label{sec:psprecs} -The MLD2P4 library contains the implementation of many preconditioning -techniques. The preconditioners may be applied as normal ``base'' -preconditioners; alternatively multiple ``base'' preconditioners may -be combined in a multilevel framework. 
- -The base (one-level) preconditioners include: -\begin{itemize} -\item Diagonal Scaling -\item Block Jacobi -\item Hybrid Gauss-Seidel; -\item Additive Schwarz, Restricted Additive Schwarz and - Additive Schwarz with Harmonic extensions; -\end{itemize} -The Jacobi and Additive Schwarz preconditioners can make use of the -following solvers: -\begin{itemize} -\item Level-$p$ Incomplete LU factorization ($ILU(p)$); -\item Threshold Incomplete LU factorization ($ILU(\tau,p)$); -\item Complete LU factorization by means of the following optional - external packages: -\begin{itemize} -\item UMFPACK; -\item SuperLU; -\item SuperLU\_Dist. -\end{itemize} -\end{itemize} - -The supporting data type and subroutine interfaces are defined in the -module \verb|mld_prec_mod|; the module also overrides the variables -and tyep definitions of \verb|psb_prec_mod| so as to function as a -drop-in replacement for the PSBLAS methods. Thus if the user does not -wish to employ the additional MLD2P4 capabitlities, it is possible to -migrate an existing PSBLAS program without any source code -modifications, only a recompilation is needed. - -%% We also provide a companion package of multi-level Additive -%% Schwarz preconditioners called MD2P4; this is actually a family of -%% preconditioners since there is the possibility to choose between -%% many variants, and is currently in an experimental stateIts -%% documentation is planned to appear after stabilization of the -%% package, which will characterize release 2.1 of our library. - - - - -\subroutine{mld\_precinit}{Initialize a preconditioner} - -\syntax{call mld\_precinit}{prec, ptype, info} -\syntax*{call mld\_precinit}{prec, ptype, info, nlev} - -\begin{description} -\item[Type:] Asynchronous. -\item[\bf On Entry] -\item[ptype] the type of preconditioner. -Scope: {\bf global} \\ -Type: {\bf required}\\ -Intent: {\bf in}.\\ -Specified as: a character string, see usage notes. -\item[nlev] Number of levels in a multilevel precondtioner. 
-Scope: {\bf global} \\ -Type: {\bf optional}\\ -Specified as: an integer value, see usage notes. -%% \item[rs] -%% Scope: {\bf global} \\ -%% Type: {\bf optional}\\ -%% Specified as: a long precision real number. -\item[\bf On Exit] - -\item[prec] -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: a preconditioner data structure \precdata. -\item[info] -Scope: {\bf global} \\ -Type: {\bf required}\\ -Intent: {\bf out}.\\ -Error code: if no error, 0 is returned. -\end{description} -\subsection*{Usage Notes} -%% The PSBLAS 2.0 contains a number of preconditioners, ranging from a -%% simple diagonal scaling to 2-level domain decomposition. These -%% preconditioners may use the SuperLU or the UMFPACK software, if -%% installed; see~\cite{SUPERLU,UMFPACK}. -Legal inputs to this subroutine are interpreted depending on the -$ptype$ string as follows\footnote{The string is case-insensitive}: -\begin{description} -\item[NONE] No preconditioning, i.e. the preconditioner is just a copy - operator. -\item[DIAG] Diagonal scaling; each entry of the input vector is - multiplied by the reciprocal of the sum of the absolute values of - the coefficients in the corresponding row of matrix $A$; -\item[BJAC] Precondition by a factorization of the - block-diagonal of matrix $A$, where block boundaries are determined - by the data allocation boundaries for each process; requires no - communication. -\item[AS] Additive Schwarz; default is to apply the Restricted - Additive Schwarz variant, with an $ILU(0)$ factorization -\item[ML] Multilevel preconditioner. -\end{description} - - - -\subroutine{mld\_precset}{Set preconditioner features} - -\syntax{call mld\_precset}{prec, what, val, info, ilev, pos} -\syntax{call prec\%set}{what, val, info, ilev, pos} - - -\begin{description} -\item[Type:] Asynchronous. 
-\item[\bf On Entry] -\item[prec] the preconditioner.\\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: an already initialized precondtioner data structure \precdata\\ -\item[what] The feature to be set. \\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}.\\ -Specified as: an integer constant or a string. Symbolic names are -available in the library module, see usage notes for legal values. -\item[val] The value to set the chosen feature to. \\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}.\\ -Specified as: an integer, double precision or character variable. -Symbolic names for some choices are available in the library module, -see usage notes for legal values. -\item[ilev] The level of a multilevel preconditioner to which the - feature choice should apply.\\ -Scope: {\bf global} \\ -Type: {\bf optional}\\ -Specified as: an integer value, see usage notes. -\item[pos] The position of the smoother/solver to which the current - setting applies. - feature choice should apply.\\ -Scope: {\bf global} \\ -Type: {\bf optional}\\ -Specified as: a character variable, with values \verb|pre| or \verb|post|. -\end{description} - -\begin{description} -\item[\bf On Return] -\item[prec] the preconditioner.\\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: a precondtioner data structure \precdata\\ -\item[info] Error code.\\ -Scope: {\bf local} \\ -Type: {\bf required} \\ -Intent: {\bf out}.\\ -An integer value; 0 means no error has been detected. -\end{description} - -\subsection*{Usage Notes} -Legal inputs to this subroutine are interpreted depending on the value -of \verb|what| input as follows -\begin{description} -\item[mld\_coarse\_mat\_] -\end{description} - -\subroutine{mld\_hierarchy\_bld}{Builds a matrix hierarchy} - -\syntax{call mld\_hierarchy\_bld}{a, desc\_a, prec, info} - -\begin{description} -\item[Type:] Synchronous. 
-\item[\bf On Entry] -\item[a] the system sparse matrix. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}, target.\\ -Specified as: a sparse matrix data structure \spdata. -\item[prec] the preconditioner.\\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: an already initialized precondtioner data structure \precdata\\ -\item[desc\_a] the problem communication descriptor. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}, target.\\ -Specified as: a communication descriptor data structure \descdata. -%% \item[upd] -%% Scope: {\bf global} \\ -%% Type: {\bf optional}\\ -%% Intent: {\bf in}.\\ -%% Specified as: a character. -\end{description} - -\begin{description} -\item[\bf On Return] -\item[prec] the preconditioner, with all the matrices and transfer - operators updated.\\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: a precondtioner data structure \precdata\\ -\item[info] Error code.\\ -Scope: {\bf local} \\ -Type: {\bf required} \\ -Intent: {\bf out}.\\ -An integer value; 0 means no error has been detected. -\end{description} - - -\subroutine{mld\_smoothers\_bld}{Builds the smoothers} - -\syntax{call mld\_smoothers\_bld}{a, desc\_a, prec, info} - -\begin{description} -\item[Type:] Synchronous. -\item[\bf On Entry] -\item[a] the system sparse matrix. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}, target.\\ -Specified as: a sparse matrix data structure \spdata. -\item[prec] the preconditioner.\\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: an already initialized precondtioner data structure -\precdata\ with an already built matrix hierarchy \\ -\item[desc\_a] the problem communication descriptor. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}, target.\\ -Specified as: a communication descriptor data structure \descdata. 
-%% \item[upd] -%% Scope: {\bf global} \\ -%% Type: {\bf optional}\\ -%% Intent: {\bf in}.\\ -%% Specified as: a character. -\end{description} - -\begin{description} -\item[\bf On Return] -\item[prec] the preconditioner, with all the smoothers and solvers - updated.\\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: a precondtioner data structure \precdata\\ -\item[info] Error code.\\ -Scope: {\bf local} \\ -Type: {\bf required} \\ -Intent: {\bf out}.\\ -An integer value; 0 means no error has been detected. -\end{description} - - - - -\subroutine{mld\_precbld}{Builds a preconditioner} - -\syntax{call mld\_precbld}{a, desc\_a, prec, info} - -\begin{description} -\item[Type:] Synchronous. -\item[\bf On Entry] -\item[a] the system sparse matrix. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}, target.\\ -Specified as: a sparse matrix data structure \spdata. -\item[prec] the preconditioner.\\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: an already initialized precondtioner data structure \precdata\\ -\item[desc\_a] the problem communication descriptor. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}, target.\\ -Specified as: a communication descriptor data structure \descdata. -%% \item[upd] -%% Scope: {\bf global} \\ -%% Type: {\bf optional}\\ -%% Intent: {\bf in}.\\ -%% Specified as: a character. -\end{description} - -\begin{description} -\item[\bf On Return] -\item[prec] the preconditioner.\\ -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: a precondtioner data structure \precdata\\ -\item[info] Error code.\\ -Scope: {\bf local} \\ -Type: {\bf required} \\ -Intent: {\bf out}.\\ -An integer value; 0 means no error has been detected. -\end{description} - - -\subsection*{Usage Notes} -A call to this routine is equivalent to a call to -\verb|mld_hierarchy_bld| followed by a call to \verb|mld_smnoothers_bld|. 
- - -\subroutine{mld\_precaply}{Preconditioner application routine} - -\syntax{call mld\_precaply}{prec,x,y,desc\_a,info,trans,work} -\syntax*{call mld\_precaply}{prec,x,desc\_a,info,trans} - -\begin{description} -\item[Type:] Synchronous. -\item[\bf On Entry] -\item[prec] the preconditioner. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}.\\ -Specified as: a preconditioner data structure \precdata. -\item[x] the source vector. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: a double precision array. -\item[desc\_a] the problem communication descriptor. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}.\\ -Specified as: a communication data structure \descdata. -\item[trans] -Scope: {\bf } \\ -Type: {\bf optional}\\ -Intent: {\bf in}.\\ -Specified as: a character. -\item[work] an optional work space -Scope: {\bf local} \\ -Type: {\bf optional}\\ -Intent: {\bf inout}.\\ -Specified as: a double precision array. -\end{description} - -\begin{description} -\item[\bf On Return] -\item[y] the destination vector. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf inout}.\\ -Specified as: a double precision array. -\item[info] Error code.\\ -Scope: {\bf local} \\ -Type: {\bf required} \\ -Intent: {\bf out}.\\ -An integer value; 0 means no error has been detected. -\end{description} - - - -\subroutine{mld\_prec\_descr}{Prints a description of current preconditioner} - -\syntax{call mld\_prec\_descr}{prec} - -\begin{description} -\item[Type:] Asynchronous. -\item[\bf On Entry] -\item[prec] the preconditioner. -Scope: {\bf local} \\ -Type: {\bf required}\\ -Intent: {\bf in}.\\ -Specified as: a preconditioner data structure \precdata. 
-\end{description} - - - -%%% Local Variables: -%%% mode: latex -%%% TeX-master: "userguide" -%%% End: diff --git a/docs/src/userguide.tex b/docs/src/userguide.tex index 07cb2688..224a8607 100644 --- a/docs/src/userguide.tex +++ b/docs/src/userguide.tex @@ -157,6 +157,7 @@ based on PSBLAS} \include{background} \include{gettingstarted} \include{userinterface} +\include{newobjects} \include{errors} \clearpage \appendix diff --git a/docs/src/userhtml.tex b/docs/src/userhtml.tex index 6425c1c5..de80420e 100644 --- a/docs/src/userhtml.tex +++ b/docs/src/userhtml.tex @@ -133,9 +133,8 @@ Software version: 2.1\\ \include{background} \include{gettingstarted} \include{userinterface} -%\include{advanced} +\include{newobjects} \include{errors} -%\include{listofroutines} \cleardoublepage \appendix \include{license} diff --git a/docs/src/userinterface.tex b/docs/src/userinterface.tex index ecc6f081..fad93c0a 100644 --- a/docs/src/userinterface.tex +++ b/docs/src/userinterface.tex @@ -27,7 +27,7 @@ i.e., \verb|s|, \verb|d|, \verb|c|, \verb|z|, according to the sparse matrix data structure; \item the arrays containing the vectors $v$ and $w$ involved in - the preconditioner application $w=M^{-1}v$ must be of type + the preconditioner application $w=B^{-1}v$ must be of type \verb|psb_|\emph{x}\verb|vect_type| with \emph{x} = \verb|s|, \verb|d|, \verb|c|, \verb|z|, in a manner completely analogous to the sparse matrix type; @@ -100,9 +100,8 @@ contained in \verb|val|. % be chosen according to the real/complex, single/double precision % version of MLD2P4 under use.\\ \verb|what| & \verb|character(len=*)|. \\ - & The parameter to be set. It can be specified by - a predefined constant, or through its name; the string - is case-insensitive. See also + & The parameter to be set. It can be specified through its name; + the string is case-insensitive. 
See Tables~\ref{tab:p_cycle}-\ref{tab:p_smoother_1}.\\ \verb|val | & \verb|integer| \emph{or} \verb|character(len=*)| \emph{or} \verb|real(psb_spk_)| \emph{or} \verb|real(psb_dpk_)|, @@ -240,7 +239,7 @@ solver is changed to the default sequential solver. \textsc{comments} \\ \hline %\multicolumn{5}{|c|}{\emph{type of the multi-level preconditioner}}\\ \hline %\verb|mld_ml_cycle_| \par -\verb|ML_CYCLE| & \verb|character(len=*)| +\verb|'ML_CYCLE'| & \verb|character(len=*)| & \texttt{'VCYCLE'} \par \texttt{'WCYCLE'} \par \texttt{'KCYCLE'} \par \texttt{'MULT'} \par \texttt{'ADD'} & \texttt{'VCYCLE'} @@ -249,7 +248,7 @@ solver is changed to the default sequential solver. Note that hybrid Multiplicative Schwarz is equivalent to V-cycle and is included for compatibility with previous versions of MLD2P4. \\ \hline %\verb|mld_outer_sweeps_| \par - \verb|OUTER_SWEEPS| & \texttt{integer} & + \verb|'OUTER_SWEEPS'| & \texttt{integer} & Any integer \par number $\ge 1$ & 1 & Number of multi-level cycles. \\ \hline %\verb|mld_smoother_type_| \par \verb|SMOOTHER_TYPE| & \verb|character(len=*)| @@ -279,7 +278,7 @@ be applied. \textsc{comments} \\ \hline %\multicolumn{5}{|c|}{\emph{aggregation algorithm}} \\ \hline %\verb|mld_min_coarse_size_| \par -\verb|MIN_COARSE_SIZE| & \verb|integer| +\verb|'MIN_COARSE_SIZE'| & \verb|integer| & Any number \par $> 0$ & $\lfloor 40 \sqrt[3]{n} \rfloor$, where $n$ is the dimension of the matrix at the finest level @@ -295,7 +294,7 @@ be applied. % (see \verb|mld_n_prec_levs_|). \\ \hline %\verb|mld_min_cr_ratio_| \par -\verb|MIN_CR_RATIO| & \verb|real| +\verb|'MIN_CR_RATIO'| & \verb|real| & Any number \par $> 1$ & 1.5 & Minimum coarsening ratio. The aggregation stops @@ -303,13 +302,13 @@ be applied. at two consecutive levels is lower than or equal to this threshold (see Note).\\ \hline %\verb|mld_max_levs_| \par -\verb|MAX_LEVS| & \verb|integer| +\verb|'MAX_LEVS'| & \verb|integer| & Any integer \par number $> 1$ & 20 & Maximum number of levels. 
The aggregation stops if the number of levels reaches this value (see Note). \\ \hline %\verb|mld_par_aggr_alg_| \par -\verb|PAR_AGGR| & \verb|character(len=*)| \hspace*{-3mm} +\verb|'PAR_AGGR'| & \verb|character(len=*)| \hspace*{-3mm} & \texttt{'DEC'}, \texttt{'SYMDEC'} & \texttt{'DEC'} & Parallel aggregation algorithm. \par Currently, only the @@ -318,13 +317,13 @@ be applied. aggregation to the sparsity pattern of $A+A^T$.\\ \hline %\verb|mld_aggr_type_| \par -\verb|AGGR_TYPE| & \verb|character(len=*)| \hspace*{-3mm} +\verb|'AGGR_TYPE'| & \verb|character(len=*)| \hspace*{-3mm} & \textbf{\texttt{'VMB'}} & \textbf{\texttt{'VMB'}} & Type of aggregation algorithm: currently, the scalar aggregation algorithm by Van\v{e}k, Mandel and Brezina is implemented \cite{VANEK_MANDEL_BREZINA}. \\ \hline %\verb|mld_aggr_prol_| \par -\verb|AGGR_PROL| & \verb|character(len=*)| \hspace*{-3mm} +\verb|'AGGR_PROL'| & \verb|character(len=*)| \hspace*{-3mm} & \texttt{'SMOOTHED'}, \texttt{'UNSMOOTHED'} & \texttt{'SMOOTHED'} & Prolongator used by the aggregation algorithm: smoothed or unsmoothed (i.e., tentative prolongator). \\ @@ -351,7 +350,7 @@ of levels. } \\ \verb|what| & \textsc{data type} & \verb|val| & \textsc{default} & \textsc{comments} \\ \hline %\verb|mld_aggr_ord_| \par -\verb|AGGR_ORD| & \verb|character(len=*)| +\verb|'AGGR_ORD'| & \verb|character(len=*)| & \texttt{'NATURAL'} \par \texttt{'DEGREE'} & \texttt{'NATURAL'} & Initial ordering of indices for the aggregation @@ -361,11 +360,12 @@ of levels. } \\ %Since aggregation is %heuristic, results will be different. %\verb|mld_aggr_thresh_| \par -\verb|AGGR_THRESH| & \verb|real(|\emph{kind\_parameter}\verb|)| +\verb|'AGGR_THRESH'| & \verb|real(|\emph{kind\_parameter}\verb|)| & Any~real \par number~$\in [0, 1]$ & 0.05 - & The threshold $\theta$ in the aggregation algorithm - (see Note). \\ \hline + & The threshold $\theta$ in the aggregation algorithm, + see (\ref{eq:strongly_coup}) in Section~\ref{sec:aggregation}. 
+ See also the note at the bottom of this table. \\ \hline %%\verb|mld_aggr_scale_| \par % \verb|AGGR_SCALE| & \verb|real(|\emph{kind\_parameter}\verb|)| % & Any~real \par number~$\in [0, 1]$ @@ -373,40 +373,40 @@ of levels. } \\ % & Scale factor applied to the threshold in going % from level $ilev$ to level $ilev+1$. \\ \hline %\verb|mld_aggr_omega_alg_| \par -\verb|AGGR_OMEGA_ALG|& \verb|character(len=*)| - & \texttt{'EIG\_EST'} \par \texttt{'USER\_CHOICE'} - & \texttt{'EIG\_EST'} - & How the damping parameter $\omega$ in the - smoothed aggregation is obtained: - either via an estimate of the spectral radius of - $D^{-1}A$, where $A$ is the matrix at the current - level and $D$ is the diagonal matrix with - the same diagonal entires as $A$, or explicily - specified by the user. \\ \hline +%\verb|'AGGR_OMEGA_ALG'|& \verb|character(len=*)| +% & \texttt{'EIG\_EST'} \par \texttt{'USER\_CHOICE'} +% & \texttt{'EIG\_EST'} +% & How the damping parameter $\omega$ in the +% smoothed aggregation is obtained: +% either via an estimate of the spectral radius of +% $D^{-1}A$, where $A$ is the matrix at the current +% level and $D$ is the diagonal matrix with +% the same diagonal entries as $A$, or explicitly +% specified by the user. \\ \hline %\verb|mld_aggr_eig_| \par -\verb|AGGR_EIG| & \verb|character(len=*)| - & \texttt{'A\_NORMI'} - & \texttt{'A\_NORMI'} - & How to estimate the spectral radius of $D^{-1}A$. - Currently only the infinity norm estimate - is available. \\ \hline +%\verb|'AGGR_EIG'| & \verb|character(len=*)| +% & \texttt{'A\_NORMI'} +% & \texttt{'A\_NORMI'} +% & How to estimate the spectral radius of $D^{-1}A$. +% Currently only the infinity norm estimate +% is available. \\ \hline %\verb|mld_aggr_omega_val_| \par -\verb|AGGR_OMEGA_VAL| & \verb|real(|\emph{kind\_parameter}\verb|)| - & Any real \par number $>0$ - & $4/(3\rho(D^{-1}A))$ - & Damping parameter $\omega$ in the smoothed aggregation algorithm.
- It must be set by the user if - \verb|USER_CHOICE| was specified for - \verb|mld_aggr_omega_alg_|, - otherwise it is computed by the library, using the - selected estimate of the spectral radius $\rho(D^{-1}A)$ of - $D^{-1}A$.\\ \hline +%\verb|'AGGR_OMEGA_VAL'| & \verb|real(|\emph{kind\_parameter}\verb|)| +% & Any real \par number $>0$ +% & $4/(3\rho(D^{-1}A))$ +% & Damping parameter $\omega$ in the smoothed aggregation algorithm. +% It must be set by the user if +% \verb|USER_CHOICE| was specified for +% \verb|mld_aggr_omega_alg_|, +% otherwise it is computed by the library, using the +% selected estimate of the spectral radius $\rho(D^{-1}A)$ of +% $D^{-1}A$.\\ \hline %\verb|mld_aggr_filter_| \par -\verb|AGGR_FILTER| +\verb|'AGGR_FILTER'| & \verb|character(len=*)| & \texttt{'FILTER'} \par \texttt{'NOFILTER'} - & \texttt{'NOFILTER'} & Matrix used in computing the smoothed - prolongator: filtered or unfiltered. \\ + & \texttt{'NOFILTER'} & Matrix used in computing the smoothed + prolongator: filtered or unfiltered (see~(\ref{eq:filtered}) in Section~\ref{sec:aggregation}). \\ \hline \multicolumn{5}{|l|}{{\bfseries Note.} Different thresholds at different levels, such as those used in \cite[Section~5.1]{VANEK_MANDEL_BREZINA}, can be easily set by @@ -428,13 +428,13 @@ the parameter \texttt{ilev}.} \\ \textsc{comments} \\ \hline %\multicolumn{5}{|c|}{\emph{coarse-space correction at the coarsest level}}\\ \hline %\verb|mld_coarse_mat_| \par -\verb|COARSE_MAT| & \verb|character(len=*)| +\verb|'COARSE_MAT'| & \verb|character(len=*)| & \texttt{'DIST'} \par \texttt{'REPL'} & \texttt{'REPL'} & Coarsest matrix layout: distributed among the processes or replicated on each of them. 
\\ \hline %\verb|mld_coarse_solve_| \par -\verb|COARSE_SOLVE| & \verb|character(len=*)| +\verb|'COARSE_SOLVE'| & \verb|character(len=*)| & \texttt{'MUMPS'} \par \texttt{'UMF'} \par \texttt{'SLU'} \par \texttt{'SLUDIST'} \par \texttt{'JACOBI'} \par \texttt{'GS'} \par \texttt{'BJAC'} @@ -456,7 +456,7 @@ the parameter \texttt{ilev}.} \\ value UMFPACK and SuperLU\_Dist are available only in double precision. \\ \hline %\verb|mld_coarse_subsolve_| \par -\verb|COARSE_SUBSOLVE| & \verb|character(len=*)| +\verb|'COARSE_SUBSOLVE'| & \verb|character(len=*)| & \texttt{'ILU'} \par \texttt{'ILUT'} \par \texttt{'MILU'} \par \texttt{'MUMPS'} \par \texttt{'SLU'} \par \texttt{'UMF'} & See~Note. @@ -491,18 +491,18 @@ level.\label{tab:p_coarse}} \textsc{comments} \\ \hline %\multicolumn{5}{|c|}{\emph{coarse-space correction at the coarsest level}}\\ \hline %\verb|mld_coarse_sweeps_| \par -\verb|COARSE_SWEEPS| & \verb|integer| +\verb|'COARSE_SWEEPS'| & \verb|integer| & Any integer \par number $> 0$ & 10 & Number of sweeps when \verb|JACOBI|, \verb|GS| or \verb|BJAC| is chosen as coarsest-level solver. \\ \hline %\verb|mld_coarse_fillin_| \par -\verb|COARSE_FILLIN| & \verb|integer| +\verb|'COARSE_FILLIN'| & \verb|integer| & Any integer \par number $\ge 0$ & 0 & Fill-in level $p$ of the ILU factorizations. \\ \hline %\verb|mld_coarse_iluthrs_| \par -\verb|COARSE_ILUTHRS| +\verb|'COARSE_ILUTHRS'| & \verb|real(|\emph{kind\_parameter}\verb|)| & Any real \par number $\ge 0$ & 0 @@ -523,7 +523,7 @@ level (continued).\label{tab:p_coarse_1}} \textsc{comments} \\ \hline %\multicolumn{5}{|c|}{\emph{basic one-level preconditioner (smoother)}} \\ \hline %\verb|mld_smoother_type_| \par -\verb|SMOOTHER_TYPE| & \verb|character(len=*)| +\verb|'SMOOTHER_TYPE'| & \verb|character(len=*)| & \verb|'JACOBI'| \par \verb|'GS'| \par \verb|'BGS'| \par \verb|'BJAC'| \par \verb|'AS'| & \verb|'FBGS'| @@ -533,7 +533,7 @@ level (continued).\label{tab:p_coarse_1}} Additive Schwarz. 
\par It is ignored by one-level preconditioners. \\ \hline %\verb|mld_sub_solve_| \par -\verb|SUB_SOLVE| & \verb|character(len=*)| +\verb|'SUB_SOLVE'| & \verb|character(len=*)| & \texttt{'JACOBI'} \par \texttt{'GS'} \par \texttt{'BGS'} \par \texttt{'ILU'} \par \texttt{'ILUT'} \par \texttt{'MILU'} \par @@ -550,7 +550,7 @@ level (continued).\label{tab:p_coarse_1}} (plus triangular solve). See Note for details on hybrid Gauss-Seidel. \\ \hline %\verb|mld_smoother_sweeps_| \par -\verb|SMOOTHER_SWEEPS| & \verb|integer| +\verb|'SMOOTHER_SWEEPS'| & \verb|integer| & Any integer \par number~$\ge 0$ & 1 & Number of sweeps of the smoother or one-level preconditioner. @@ -559,7 +559,7 @@ level (continued).\label{tab:p_coarse_1}} together with \verb|pos='PRE'| or \verb|pos='POST'|, respectively. \\ \hline %\verb|mld_sub_ovr_| \par -\verb|SUB_OVR| & \verb|integer| +\verb|'SUB_OVR'| & \verb|integer| & Any integer \par number~$\ge 0$ & 1 & Number of overlap layers, for Additive Schwarz only. \\ @@ -578,7 +578,7 @@ level (continued).\label{tab:p_coarse_1}} \verb|what| & \textsc{data type} & \verb|val| & \textsc{default} & \textsc{comments} \\ \hline %\verb|mld_sub_restr_| \par -\verb|SUB_RESTR| & \verb|character(len=*)| +\verb|'SUB_RESTR'| & \verb|character(len=*)| & \texttt{'HALO'} \par \texttt{'NONE'} & \texttt{'HALO'} & Type of restriction operator, for Additive Schwarz only: @@ -587,7 +587,7 @@ level (continued).\label{tab:p_coarse_1}} Note that \texttt{HALO} must be chosen for the classical Additive Schwarz smoother and its RAS variant.\\ \hline %\verb|mld_sub_prol_| \par -\verb|SUB_PROL| & \verb|character(len=*)| +\verb|'SUB_PROL'| & \verb|character(len=*)| & \texttt{'SUM'} \par \texttt{'NONE'} & \texttt{'NONE'} & Type of prolongation operator, for Additive Schwarz only: @@ -596,12 +596,12 @@ level (continued).\label{tab:p_coarse_1}} Note that \texttt{SUM} must be chosen for the classical Additive Schwarz smoother, and \texttt{NONE} for its RAS variant.
\\ \hline %\verb|mld_sub_fillin_| \par -\verb|SUB_FILLIN| & \verb|integer| +\verb|'SUB_FILLIN'| & \verb|integer| & Any integer \par number~$\ge 0$ & 0 & Fill-in level $p$ of the incomplete LU factorizations. \\ \hline %\verb|mld_sub_iluthrs_| \par -\verb|SUB_ILUTHRS| & \verb|real(|\emph{kind\_parameter}\verb|)| +\verb|'SUB_ILUTHRS'| & \verb|real(|\emph{kind\_parameter}\verb|)| & Any real number~$\ge 0$ & 0 & Drop tolerance $t$ in the ILU($p,t$) factorization. \\ %\hline @@ -741,7 +741,7 @@ hierarchy produced by a previous call to \verb|hierarchy_build| \end{center} \noindent -This routine computes $y = op(M^{-1})\, x$, where $M$ is a previously built +This routine computes $y = op(B^{-1})\, x$, where $B$ is a previously built preconditioner, stored into \verb|p|, and $op$ denotes the preconditioner itself or its transpose, according to the value of \verb|trans|. @@ -770,10 +770,10 @@ and hence it is completely transparent to the user. \verb|info| & \verb|integer, intent(out)|.\\ & Error code. If no error, 0 is returned. See Section~\ref{sec:errors} for details.\\ \verb|trans| & \verb|character(len=1), optional, intent(in).|\\ - & If \verb|trans| = \verb|'N','n'| then $op(M^{-1}) = M^{-1}$; - if \verb|trans| = \verb|'T','t'| then $op(M^{-1}) = M^{-T}$ - (transpose of $M^{-1})$; if \verb|trans| = \verb|'C','c'| then $op(M^{-1}) = M^{-C}$ - (conjugate transpose of $M^{-1})$.\\ + & If \verb|trans| = \verb|'N','n'| then $op(B^{-1}) = B^{-1}$; + if \verb|trans| = \verb|'T','t'| then $op(B^{-1}) = B^{-T}$ + (transpose of $B^{-1})$; if \verb|trans| = \verb|'C','c'| then $op(B^{-1}) = B^{-C}$ + (conjugate transpose of $B^{-1})$.\\ \verb|work| & \emph{type}\verb|(|\emph{kind\_parameter}\verb|), dimension(:), optional, target|.\\ & Workspace. Its size should be at least \verb|4 * psb_cd_get_local_| \verb|cols(desc_a)| (see the PSBLAS User's Guide). 
@@ -854,66 +854,6 @@ as follows: \end{center} -\clearpage - -\section{Adding new smoothers and solvers to MLD2P4\label{sec:adding}} - -Developers can add completely new smoother and/or solver classes -derived from the base objects in the library may be used without -recompiling the library itself. - -To do so it is necessary first to select the base type to be extended; -in our experience, it is quite likely that the new application needs -only require the definition of a ``solver'' object, which is almost -always acting only on the local part of the distributed matrix. - -The parallel actions required to connect the various solver objects -are most often already provided by the Block Jacobi or the Additive -Schwarz smoothers. To define a new solver, the developer will then -have to define its components and methods, perhaps taking one of the -predefined solvers as a starting point if possible. - - -Once the new smoother/solver class has been developed, to use it in -the context of the multilevel preconditioners it is necessary to: -\begin{itemize} -\item Declare in the application program a variable of the new type; -\item Pass that variable as the argument to the se routine as in the - following: -\begin{center} -\verb|call p%set(smoother,info [,ilev, ilmax,pos])|\\ -\verb|call p%set(solver,info [,ilev, ilmax,pos])| -\end{center} -\item Link into the application executable the code implementing the - various methods. -\end{itemize} -The new solver object is then dynamically included in the -preconditioner structure, and will act as a \emph{mold} to which the -preconditioner will conform, even though the MLD2P4 library has not -been modified to account for this new development. 
- -It is possible to define new values for the keyword \verb|WHAT| in the -\verb|set| routines; if the library code does not recognize a keyword, -it passes it down the composition hierarchy (levels containing -smoothers containing solvers), so that it can be eventually caught by -the new solver. - -An example is contained in the source code distribution under the -folder \verb|tests/newslv|. This example solver is simply the ILU(0) -solver under a new name, but it should give an idea of what needs to -be done. - -\ \\ - -\begin{tabular}{p{1.2cm}p{12cm}} -\verb|smoother| & \verb|class(mld_x_base_smoother_type)| \\ - & The user-defined new smoother to be employed in the - preconditioner.\\ -\verb|solver| & \verb|class(mld_x_base_solver_type)| \\ - & The user-defined new solver to be employed in the - preconditioner. -\end{tabular} - %%% Local Variables: %%% mode: latex %%% TeX-master: "userguide"
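The quoted keyword strings introduced by this patch (\verb|'ML_CYCLE'|, \verb|'SMOOTHER_TYPE'|, \ldots) are easiest to read next to a concrete call sequence. The following is a minimal sketch only, not part of the patched sources: it assumes the double-precision preconditioner type \verb|mld_dprec_type| and its \verb|init|/\verb|set| methods as described in the Getting Started section, and keyword values taken from the tables above.

\begin{verbatim}
  use psb_base_mod
  use mld_prec_mod
  type(mld_dprec_type) :: P
  integer(psb_ipk_)    :: info

! Multilevel preconditioner: W-cycle, two sweeps of
! forward/backward Gauss-Seidel as smoother, block-Jacobi
! sweeps at the coarsest level. Keyword strings are
! case-insensitive, as noted in the tables.
  call P%init('ML',info)
  call P%set('ML_CYCLE','WCYCLE',info)
  call P%set('SMOOTHER_TYPE','FBGS',info)
  call P%set('SMOOTHER_SWEEPS',2,info)
  call P%set('COARSE_SOLVE','BJAC',info)
  call P%set('COARSE_SWEEPS',8,info)
\end{verbatim}

After these calls, the preconditioner would be built against the system matrix and descriptor (\verb|hierarchy_build|/\verb|smoothers_build|) and applied inside a PSBLAS Krylov solver as $w = B^{-1} v$, per the \verb|apply| interface documented above.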