mld2p4:

docs/pdf/Makefile docs/pdf/abstract.tex docs/pdf/advanced.tex docs/pdf/background.tex docs/pdf/bibliography.tex docs/pdf/building.tex docs/pdf/conventions.tex docs/pdf/distribution.tex docs/pdf/errors.tex docs/pdf/gettingstarted.tex docs/pdf/highlevelview.tex docs/pdf/listofroutines.tex docs/pdf/overview.tex docs/pdf/userguide.tex docs/userguide.pdf New documentation, first set-up.
17 years ago · 9eeef87a3a
parent 7f1858a775
commit 9eeef87a3a
15 changed files with 3925 additions and 3298 deletions
--- a/docs/pdf/Makefile
+++ b/docs/pdf/Makefile
@ -84,7 +84,9 @@
 #
 TOPFILE   = userguide.tex
-SECFILE   = title.tex intro.tex methods.tex precs.tex 
+SECFILE   = title.tex abstract.tex overview.tex conventions.tex distribution.tex \
 	building.tex gettingstarted.tex highlevelview.tex advanced.tex errors.tex \
 	listofroutines.tex bibliography.tex
 FIGDIR    = figures
 XPDFFLAGS = 
--- a/docs/pdf/abstract.tex
+++ b/docs/pdf/abstract.tex
@ -0,0 +1,19 @@
 \begin{abstract}
 \emph{MLD2P4 (Multi-Level Domain Decomposition Parallel Preconditioners Package based on
 PSBLAS)} is a package of parallel algebraic multi-level preconditioners.
 It implements various versions of one-level additive and of multi-level additive
 and hybrid Schwarz algorithms. In the multi-level case, a purely algebraic approach
 is applied to generate coarse-level corrections, so that no geometric background is needed
 concerning the matrix to be preconditioned. The matrix is required to be square, real or complex, with a symmetric sparsity pattern \textbf{Should we not also consider the nonsymmetric case,
 with $(A+A^T)/2$?}.
 MLD2P4 has been designed to provide scalable and easy-to-use preconditioners in the
 context of the PSBLAS (Parallel Sparse Basic Linear Algebra Subprograms)
 computational framework and can be used in conjunction with the Krylov solvers
 available in this framework. MLD2P4 enables the user to easily specify different aspects
 of a generic algebraic multilevel Schwarz preconditioner, thus making it possible to search
 for the ``best'' preconditioner for the problem at hand. The package has been designed 
 employing object-oriented techniques, using Fortran 95 and MPI, with interfaces to
 additional external libraries such as UMFPACK, SuperLU and SuperLU\_Dist, that
 can be exploited in building multi-level preconditioners.
 \end{abstract}
--- a/docs/pdf/advanced.tex
+++ b/docs/pdf/advanced.tex
@ -0,0 +1,12 @@
 \section{Advanced Use}\label{sec:advanced}
    - MLD2P4 software architecture \\
    - preconditioner data structure (``detailed'' description) + possibility of setting
      the individual levels separately (only hinted at in the earlier description of precset) \\
    - description of the medium-level routines (with an introduction to the potential for extension (?)
      offered by this software layer) \\
 %%% Local Variables: 
 %%% mode: latex
 %%% TeX-master: "userguide"
 %%% End: 
--- a/docs/pdf/background.tex
+++ b/docs/pdf/background.tex
@ -0,0 +1,291 @@
 \section{Multi-level Domain Decomposition Background\label{sec:background}}
 \emph{Domain Decomposition} (DD) preconditioners, coupled with Krylov iterative
 solvers, are widely used in the parallel solution of large and sparse linear systems.
 These preconditioners are based on the divide and conquer technique: the matrix
 to be preconditioned is divided into submatrices, a ``local linear system''
 involving each submatrix is (approximately) solved, and the local solutions are used
 to build a preconditioner for the whole original matrix. This process
 often corresponds to dividing a physical domain associated with the original matrix
 into subdomains, e.g. in a PDE discretization, to (approximately) solving the
 subproblems corresponding to the subdomains and to building an approximate
 solution of the original problem from the local solutions 
 \cite{Cai_Widlund_92,dd1_94,dd2_96}. 
 \emph{Additive Schwarz} preconditioners are DD preconditioners using overlapping
 submatrices, i.e.\ with some common rows, to couple the local information
 related to the submatrices (see, e.g., \cite{dd2_96}).
 The main motivations for choosing Additive Schwarz preconditioners are their
 intrinsic parallelism and good \textbf{(saying ``good'' is a bit strong, since
 right afterwards we say that convergence depends on the number of submatrices)}
 convergence properties. A drawback of these
 preconditioners is that the number of iterations of the preconditioned solvers
 generally grows with the number of submatrices. This may be a serious limitation
 on parallel computers, since the number of submatrices usually matches the number
 of available processors. Optimal convergence rates, i.e.\ iteration numbers
 independent of the number of submatrices, can be obtained by correcting the
 preconditioner through a suitable approximation of the original linear system
 in a coarse space, which globally couples the information related to the single
 submatrices. 
 \emph{Two-level Schwarz} preconditioners are obtained
 by combining basic (one-level) Schwarz preconditioners with coarse-level
 corrections. In this context, the one-level preconditioner is often
 called smoother. Different two-level preconditioners are obtained by varying the
 choice of the smoother, of the coarse-level correction and the
 way they are combined \cite{dd2_96}. The same reasoning can be applied starting
 from the coarse-level system, i.e.\ a coarse-space correction can be built
 from this system, thus obtaining \emph{multi-level} preconditioners.
 It is worth noting that optimal preconditioners do not necessarily correspond
 to minimum execution times. Indeed, to obtain effective multilevel preconditioners
 a tradeoff between optimality of convergence and the cost of building and applying
 the coarse-space corrections must be achieved. The choice of the number of levels,
 i.e.\ of the coarse-space corrections, also affects the effectiveness of the
 preconditioners. One more goal is to get convergence rates that are as insensitive
 as possible to variations in the matrix coefficients.
 Two main approaches can be used to build coarse-space corrections. The geometric approach
 applies coarsening strategies based on the knowledge of some physical grid associated
 with the matrix and requires the user to define grid transfer operators from the fine
 to the coarse levels and vice versa. This may be difficult for complex geometries;
 furthermore, suitable one-level preconditioners may be required to get efficient
 interplay between fine and coarse levels, e.g.\ when matrices with highly varying coefficients
 are considered. The algebraic approach builds coarse-space corrections using only matrix
 information. It performs a fully automatic coarsening and enforces the interplay between
 the fine and coarse levels by suitably choosing the coarse space and the coarse-to-fine
 interpolation \cite{StubenGMD69_99}.
 MLD2P4 uses a pure algebraic approach for building the sequence of coarse matrices
 starting from the original matrix. The algebraic approach is based on the \emph{smoothed 
 aggregation} algorithm \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}. A decoupled version
 of this algorithm is implemented, where the smoothed aggregation is applied locally
 to each submatrix \cite{Tuminaro_Tong_00}. In the next two subsections we provide
 a brief description of the multi-level Schwarz preconditioners and of the smoothed
 aggregation technique as implemented in MLD2P4. For further details the user
 is referred to \cite{para_04,apnum_07,aaecc_07,dd2_96}.
 \subsection{Multi-level Schwarz Preconditioners\label{sec:multilevel}}
 The multi-level preconditioners implemented in MLD2P4 are obtained by combining
 Additive Schwarz preconditioners with coarse-space corrections; therefore
 we first provide a sketch of the Additive Schwarz preconditioners.
 Given a linear system
 \[ Ax=b, \]
 where $A=(a_{ij}) \in \Re^{n \times n}$ is a
 nonsingular sparse matrix with a symmetric non-zero pattern,
 let $G=(W,E)$ be the adjacency graph of $A$, where $W=\{1, 2, \ldots, n\}$
 and $E=\{(i,j) : a_{ij} \neq 0\}$ are the vertex set and the edge set of $G$,
 respectively. Two vertices are called adjacent if there is an edge connecting
 them. For any integer $\delta > 0$, a $\delta$-overlap
 partition of $W$ can be defined recursively as follows.
 Given a 0-overlap (or non-overlapping) partition of $W$,
 i.e.\ a set of $m$ disjoint nonempty sets $W_i^0 \subset W$ such that
 $\cup_{i=1}^m W_i^0 = W$, a $\delta$-overlap
 partition of $W$ is obtained by considering the sets
 $W_i^\delta \supset W_i^{\delta-1}$, obtained by including the vertices that
 are adjacent to any vertex in $W_i^{\delta-1}$.
 Let $n_i^\delta$ be the size of $W_i^\delta$ and $R_i^{\delta} \in 
 \Re^{n_i^\delta \times n}$ the restriction operator that maps
 a vector $v \in \Re^n$ onto the vector $v_i^{\delta} \in \Re^{n_i^\delta}$
 containing the components of $v$ corresponding to the vertices in
 $W_i^\delta$. The transpose of $R_i^{\delta}$ is a
 prolongation operator from $\Re^{n_i^\delta}$ to $\Re^n$.
 The matrix $A_i^\delta=R_i^\delta A (R_i^\delta)^T \in
 \Re^{n_i^\delta \times n_i^\delta}$ can be considered
 as a restriction of $A$ corresponding to the set $W_i^{\delta}$.
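 As a small illustration (an example added here, not taken from the package), let $n=6$ and let $A$ be
 tridiagonal with nonzero sub- and super-diagonal entries, so that $E$ contains only the pairs $(i,i)$
 and $(i,i\pm 1)$. Starting from the 0-overlap partition $W_1^0=\{1,2,3\}$, $W_2^0=\{4,5,6\}$,
 one step of the recursion gives
 \[
 W_1^1=\{1,2,3,4\}, \qquad W_2^1=\{3,4,5,6\},
 \]
 so that $n_1^1=n_2^1=4$ and, listing the vertices of each set in increasing order,
 \[
 R_1^1 = \left( \begin{array}{cc} I_4 & 0 \end{array} \right), \qquad
 R_2^1 = \left( \begin{array}{cc} 0 & I_4 \end{array} \right),
 \]
 where $I_4$ is the identity of order 4 and $0$ is the $4 \times 2$ zero matrix. Hence $A_1^1$ and $A_2^1$
 are the leading and trailing $4 \times 4$ principal submatrices of $A$, which share the rows and
 columns of indices 3 and 4.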
 The \emph{classical one-level AS} preconditioner is defined by
 \[
 M_{AS}^{-1}= \sum_{i=1}^m (R_i^{\delta})^T 
 (A_i^\delta)^{-1} R_i^{\delta},
 \]
 where $A_i^\delta$ is assumed to be nonsingular. Its application
 to a vector $v \in \Re^n$ within a Krylov solver requires the following
 three steps:
 \begin{enumerate}
 	\item restriction of $v$ as $v_i = R_i^{\delta} v$, $i=1,\ldots,m$;
 	\item (approximate) solution of the linear systems $A_i^\delta w_i = v_i$,
 	      $i=1,\ldots,m$;
 	\item prolongation and sum of the $w_i$'s, i.e.\ $w = \sum_{i=1}^m (R_i^{\delta})^T w_i$.
 \end{enumerate}
 A variant of the classical AS preconditioner that outperforms it
 in terms of both convergence rate and of computation and communication
 time on parallel distributed-memory computers is the so-called \emph{Restricted AS
 (RAS)} preconditioner~\cite{Cai_Sarkis,Efstathiou_Gander}. It
 is obtained by zeroing the components of $w_i$ corresponding to the
 overlapping vertices when applying the prolongation. Therefore,
 RAS differs from classical AS by the prolongation operator $(R_i^{\delta})^T$,
 which is substituted by $(\tilde{R}_i^0)^T \in \Re^{n \times n_i^\delta}$,
 where $\tilde{R}_i^0$ is obtained by zeroing the rows of $R_i^\delta$
 corresponding to the vertices in $W_i^\delta \backslash W_i^0$:
 \[
 M_{RAS}^{-1}= \sum_{i=1}^m (\tilde{R}_i^0)^T 
 (A_i^\delta)^{-1} R_i^{\delta}.
 \]
 Analogously, the AS variant called \emph{AS with Harmonic extension (ASH)}
 is defined by
 \[ M_{ASH}^{-1}= \sum_{i=1}^m (R_i^{\delta})^T 
 (A_i^\delta)^{-1} \tilde{R}_i^0.
 \]
 We note that for $\delta=0$ the three variants of the AS preconditioner are
 all equal to the block-Jacobi preconditioner.
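 This can be checked directly. For $\delta=0$ the set $W_i^\delta \backslash W_i^0$ is empty, hence
 $\tilde{R}_i^0 = R_i^0$ and the three formulas coincide. Assuming, only to simplify the notation,
 that the vertices are numbered contiguously subset by subset, the common operator reads
 \[
 \sum_{i=1}^m (R_i^{0})^T (A_i^0)^{-1} R_i^{0} =
 \left( \begin{array}{ccc}
 (A_1^0)^{-1} & & \\
  & \ddots & \\
  & & (A_m^0)^{-1}
 \end{array} \right),
 \]
 i.e.\ the inverse of the block-diagonal part of $A$ induced by the partition.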
 As already observed, the convergence rate of the one-level Schwarz
 preconditioned iterative solvers deteriorates as the number $m$ of partitions
 of $W$ increases \cite{dd1_94,dd2_96}. To reduce the dependency
 of the number of iterations on the degree of parallelism we may
 introduce a global coupling among the overlapping partitions by defining 
 a coarse-space approximation $A_C$ of the matrix $A$. 
 In a pure algebraic setting, $A_C$ is usually built with
 a Galerkin approach. Given a set $W_C$ of \emph{coarse vertices},
 with size $n_C$, and a suitable restriction operator
 $R_C \in \Re^{n_C \times n}$, $A_C$ is defined as
 \[
 A_C=R_C A R_C^T
 \]
 and the coarse-level correction matrix to be combined with a generic
 one-level AS preconditioner $M_{1L}$ is obtained as
 \[
 M_{C}^{-1}= R_C^T A_C^{-1} R_C,
 \]
 where $A_C$ is assumed to be nonsingular. The application of $M_{C}^{-1}$
 to a vector $v$ corresponds to a restriction, a solution and
 a prolongation step; the solution step, involving the matrix $A_C$,
 may be carried out also approximately.
 The combination of $M_{C}$ and $M_{1L}$ may be
 performed in either an additive or a multiplicative framework.
 In the former case, the \emph{two-level additive} Schwarz preconditioner
 is obtained:
 \[
 M_{2LA}^{-1} = M_{C}^{-1} + M_{1L}^{-1}. 
 \]
 Applying $M_{2LA}^{-1}$ to a vector $v$ within a Krylov solver
 corresponds to applying $M_{C}^{-1}$
 and $M_{1L}^{-1}$ to $v$ independently and then summing up
 the results.
 In the multiplicative case, the combination can be
 performed by first applying the smoother $M_{1L}^{-1}$ and then
 the coarse-level correction operator $M_{C}^{-1}$:
 \[
 \begin{array}{l}
 w = M_{1L}^{-1} v, \\
 z = w + M_{C}^{-1} (v-Aw);
 \end{array}
 \]
 this corresponds to the following \emph{two-level hybrid pre-smoothed}
 Schwarz preconditioner:
 \[
 M_{2LH-PRE}^{-1} = M_{C}^{-1} + \left( I - M_{C}^{-1}A \right) M_{1L}^{-1}. 
 \]
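 Indeed, substituting $w = M_{1L}^{-1} v$ into the second step gives
 \[
 z = M_{1L}^{-1} v + M_{C}^{-1} \left( v - A M_{1L}^{-1} v \right)
   = \left[ M_{C}^{-1} + \left( I - M_{C}^{-1}A \right) M_{1L}^{-1} \right] v ,
 \]
 and a direct computation shows that
 \[
 I - M_{2LH-PRE}^{-1} A = \left( I - M_{C}^{-1}A \right) \left( I - M_{1L}^{-1}A \right),
 \]
 i.e.\ the smoother acts first and the coarse-level correction second, which is the
 multiplicative structure referred to above.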
 On the other hand, by applying the smoother after the coarse-level correction,
 i.e.\ by computing
 \[
 \begin{array}{l}
 w = M_{C}^{-1} v , \\
 z = w + M_{1L}^{-1} (v-Aw) , 
 \end{array}
 \]
 the \emph{two-level hybrid post-smoothed}
 Schwarz preconditioner is obtained:
 \[
 M_{2LH-POST}^{-1} = M_{1L}^{-1} + \left( I - M_{1L}^{-1}A \right) M_{C}^{-1}. 
 \]
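 As in the pre-smoothed case, this follows by substituting $w = M_{C}^{-1} v$ into the second step:
 \[
 z = M_{C}^{-1} v + M_{1L}^{-1} \left( v - A M_{C}^{-1} v \right)
   = \left[ M_{1L}^{-1} + \left( I - M_{1L}^{-1}A \right) M_{C}^{-1} \right] v ,
 \]
 and the two factors now appear in the opposite order:
 \[
 I - M_{2LH-POST}^{-1} A = \left( I - M_{1L}^{-1}A \right) \left( I - M_{C}^{-1}A \right).
 \]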
 One more variant of two-level hybrid preconditioner is obtained by applying
 the smoother before and after the coarse-level correction. In this case, the
 preconditioner is symmetric if $A$, $M_{1L}$ and $M_{C}$ are symmetric.
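 A possible sketch of this variant, assuming (for illustration only) that the same smoother
 $M_{1L}^{-1}$ is used before and after the coarse-level correction, is
 \[
 \begin{array}{l}
 w = M_{1L}^{-1} v, \\
 y = w + M_{C}^{-1} (v-Aw), \\
 z = y + M_{1L}^{-1} (v-Ay).
 \end{array}
 \]
 Denoting by $M^{-1}$ the operator mapping $v$ to $z$, the same manipulations used above give
 \[
 I - M^{-1} A = \left( I - M_{1L}^{-1}A \right) \left( I - M_{C}^{-1}A \right)
                \left( I - M_{1L}^{-1}A \right)
 \]
 and, for nonsingular $A$,
 \[
 M^{-1} = A^{-1} - \left( I - M_{1L}^{-1}A \right) \left( A^{-1} - M_{C}^{-1} \right)
                   \left( I - A M_{1L}^{-1} \right),
 \]
 which is symmetric when $A$, $M_{1L}$ and $M_{C}$ are symmetric, since in that case
 $\left( I - M_{1L}^{-1}A \right)^T = I - A M_{1L}^{-1}$.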
 As previously noted, on parallel computers the number of submatrices usually matches
 the number of available processors. When the size of the system to be preconditioned
 is very large, the use of many processors, i.e.\ of many small submatrices, often
 leads to a large coarse-level system, whose solution may be computationally expensive.
 On the other hand, the use of few processors often leads to local submatrices that
 are too expensive to be processed on single processors, because of memory and/or
 computing requirements. Therefore, it seems natural to use a recursive approach,
 in which the coarse-level correction is re-applied starting from the current
 coarse-level system. The corresponding preconditioners are called \emph{multi-level}.
 One more reason for the multi-level approach is that it may significantly
 reduce the computational cost of preconditioning with respect to the two-level case
 (see \cite[Chapter 3]{dd2_96}). Additive and hybrid multilevel preconditioners
 are obtained as direct extensions of the two-level counterparts. Other combinations
 of the smoothers and coarse-level corrections are possible, leading to variants
 of the previous algorithms. For a detailed description of them, the reader is
 referred to \cite[Chapter 3]{dd2_96}.
 \textbf{In my opinion we need here an algorithmic description, by way of example,
 of a multilevel preconditioner, e.g.\ the hybrid one with pre-smoothing, along the lines
 of the description in Figure 1 of the Trilinos ML 4.0 guide. WHAT DO YOU THINK?}
 \subsection{Smoothed Aggregation\label{sec:aggregation}}
 To define the restriction operator $R_C$, which is used to compute
 the coarse-level matrix $A_C$, MLD2P4 uses the \emph{smoothed aggregation}
 algorithm described in \cite{BREZINA_VANEK,VANEK_MANDEL_BREZINA}.
 The basic idea of this algorithm is to build a coarse set of vertices
 $W_C$ by suitably grouping the vertices of $W$ into disjoint subsets
 (aggregates), and to define the coarse-to-fine space transfer operator $R_C^T$ by
 applying a suitable smoother to a simple piecewise constant
 prolongation operator, to improve the quality of the coarse-space correction.
 Three main steps can be identified in the smoothed aggregation procedure:
 \begin{itemize}
 	\item coarsening of the vertex set $W$, to obtain $W_C$;
 	\item construction of the prolongator $R_C^T$;
 	\item application of $R_C$ and $R_C^T$ to build $A_C$.
 \end{itemize}
 To perform the coarsening step, we have implemented the aggregation algorithm sketched
 in \cite{apnum_07}. According to \cite{BREZINA_VANEK}, a modification of this algorithm
 has actually been used,
 in which each aggregate $N_r$ is made of vertices of $W$ that are \emph{strongly coupled}
 to a certain root vertex $r \in W$, i.e.\
 \[  N_r = \left\{s \in W: |a_{rs}| \geq \theta \sqrt{|a_{rr}a_{ss}|} \right\} \]
 for a given $\theta \in [0,1]$.
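 For instance, with values chosen here only for illustration, let $a_{rr}=a_{ss}=4$ and
 $a_{rs}=-1$. Then $\sqrt{|a_{rr}a_{ss}|}=4$ and the test reads
 \[
 |a_{rs}| = 1 \geq 4\,\theta ,
 \]
 so that $s \in N_r$ for any $\theta \leq 0.25$ (e.g.\ $\theta=0.1$), while $s \notin N_r$
 for $\theta=0.5$. Larger values of $\theta$ therefore tend to produce smaller aggregates and,
 consequently, a larger coarse-level matrix.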
 Since the previous algorithm has a sequential nature, a \emph{decoupled} version of
 it has been chosen, where each processor $i$ independently applies the algorithm to
 the set of vertices $W_i^0$ assigned to it in the initial data distribution. This
 version is embarrassingly parallel, since it does not require any data communication.
 On the other hand, it may produce non-uniform aggregates near boundary vertices,
 i.e.\ near vertices adjacent to vertices in other processors, and is strongly
 dependent on the number of processors and on the initial partitioning of the matrix $A$.
 Nevertheless, this algorithm has been chosen for the implementation in MLD2P4,
 since it has been shown to produce good results in practice \cite{Tuminaro_Tong_00}.
 The prolongator $P_C=R_C^T$ is built starting from a \emph{tentative prolongator}
 $P \in \Re^{n \times n_C}$ which, denoting by $V^j_C$ the $j$-th aggregate, is defined as
 \begin{equation} 
 P=(p_{ij}), \quad  p_{ij}= 
 \left\{ \begin{array}{ll}
 1 & \quad \mbox{if} \; i \in V^j_C \\
 0 & \quad \mbox{otherwise}
 \end{array} \right. .
 \label{eq:tent_prol}
 \end{equation}
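 As an illustration (a small example added here, not taken from the package), let $n=6$ and consider
 the two aggregates $V^1_C=\{1,2,3\}$ and $V^2_C=\{4,5,6\}$, so that $n_C=2$. Then
 \[
 P = \left( \begin{array}{cc}
 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1
 \end{array} \right),
 \]
 i.e.\ each column of $P$ is the indicator vector of one aggregate. If $P$ were used as
 prolongator without further modification, the Galerkin product $P^TAP$ would simply sum
 the entries of $A$ over pairs of aggregates; for $A=\mathrm{tridiag}(-1,2,-1)$ this gives
 \[
 P^T A P = \left( \begin{array}{rr} 2 & -1 \\ -1 & 2 \end{array} \right).
 \]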
 $P_C$ is obtained by
 applying to $P$ a smoother $S \in \Re^{n \times n}$:
 \begin{equation}
 P_C = S P,
 \label{eq:smoothed_prol}
 \end{equation}
 in order to remove oscillatory components from the range of the prolongator
 and hence to improve the convergence properties of the multi-level
 Schwarz method \cite{BREZINA_VANEK,StubenGMD69_99}.
 A simple choice for $S$ is the damped Jacobi smoother:
 \begin{equation}
 S = I - \omega D^{-1} A , 
 \label{eq:jac_smoother}
 \end{equation}
 where $D$ is the diagonal of $A$ and the value of $\omega$ can be chosen
 using some estimate of the spectral radius of $D^{-1}A$ \cite{BREZINA_VANEK}.
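 A common choice in the smoothed aggregation literature, reported here as an assumption and not
 necessarily the one adopted in MLD2P4, is $\omega = 4/(3\rho)$, where $\rho$ is an upper bound
 on the spectral radius of $D^{-1}A$, e.g.\ $\rho = \|D^{-1}A\|_\infty$. As a worked example, take
 $n=6$, $A=\mathrm{tridiag}(-1,2,-1)$, whose diagonal part is $D=2I$, and a tentative prolongator $P$
 built on the aggregates $\{1,2,3\}$ and $\{4,5,6\}$. Then $\|D^{-1}A\|_\infty = 2$,
 $\omega = 2/3$ and
 \[
 S = I - \frac{1}{3} A, \qquad
 S P = \left( \begin{array}{cc}
 2/3 & 0 \\ 1 & 0 \\ 2/3 & 1/3 \\ 1/3 & 2/3 \\ 0 & 1 \\ 0 & 2/3
 \end{array} \right).
 \]
 The smoothed columns overlap in the rows of indices 3 and 4, i.e.\ at the interface between the
 two aggregates, whereas the columns of $P$ have disjoint supports.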
 \textbf{Mention the filtering of $A$ in the smoothing, while saying that it has not been
 implemented?}
 %%% Local Variables: 
 %%% mode: latex
 %%% TeX-master: "userguide"
 %%% End: 
--- a/docs/pdf/bibliography.tex
+++ b/docs/pdf/bibliography.tex
@ -0,0 +1,152 @@
 \begin{thebibliography}{99}
 %
 \bibitem{PARA04FOREST}
 Bella, G., Filippone, S., De Maio, A., Testa, M.:
 A Simulation Model for Forest Fires.
 In: Dongarra, J., Madsen, K., Wasniewski, J. (eds.):
 Proceedings of PARA~04 Workshop on State of the Art
 in Scientific Computing. Lecture Notes in Computer Science, 3732. Berlin:
 Springer, 2005
 %
 \bibitem{aaecc_07} A. Buttari, D. di Serafino, P. D'Ambra, S. Filippone,\newblock
 2LEV-D2P4: a package of high-performance preconditioners,\newblock
 Applicable Algebra in Engineering, Communications and Computing, 
 Volume 18, Number 3, May, 2007, pp.  223-239
 %Published online: 13 February 2007, {\tt http://dx.doi.org/10.1007/s00200-007-0035-z}
 %
 \bibitem{apnum_07}  P. D'Ambra, S. Filippone,  D. di Serafino,\newblock
 On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners
 \newblock
 Applied Numerical Mathematics, Elsevier Science, 
 Volume 57, Issues 11-12, November-December 2007, Pages 1181-1196.
 %published online 3 February 2007, {\tt
 %  http://dx.doi.org/10.1016/j.apnum.2007.01.006}
 %% \bibitem{DOUGLAS}
 %% R.E.~Bank and C.C.~Douglas,
 %% {\em SMMP: Sparse Matrix Multiplication Package}, 
 %% Advances in Computational Mathematics, 1993, 1, 127-137.
 %% (See also {\tt http://www.mgnet.org/~douglas/ccd-codes.html}) 
 %
 %
 \bibitem{para_04}
 A.~Buttari, P.~D'Ambra, D.~di Serafino and S.~Filippone,
 {\em Extending PSBLAS to Build Parallel Schwarz Preconditioners},
 in J.~Dongarra, K.~Madsen, J.~Wasniewski, editors,
 Proceedings of PARA~04 Workshop on State of the Art
 in Scientific Computing, pp.~593--602, Lecture Notes in Computer Science,
 Springer, 2005.
 %
 %% \bibitem{CAI_SAAD}
 %% X.~C.~Cai and Y.~Saad,
 %% {\em Overlapping Domain Decomposition Algorithms for General Sparse Matrices},
 %% Numerical Linear Algebra with Applications, 3(3), pp.~221--237, 1996.
 %% %
 %% \bibitem{CAI_SARKIS}
 %% X.C.~Cai and M.~Sarkis,
 %% {\em A Restricted Additive Schwarz Preconditioner for General Sparse Linear Systems},
 %% SIAM Journal on Scientific Computing, 21(2), pp.~792--797, 1999.
 %
 \bibitem{Cai_Widlund_92}
 X.C.~Cai and O.~B.~Widlund,
 {\em Domain Decomposition Algorithms for Indefinite Elliptic Problems},
 SIAM Journal on Scientific and Statistical Computing, 13(1), pp.~243--258, 1992.
 %
 \bibitem{dd1_94}
 T.~Chan and T.~Mathew,
 {\em Domain Decomposition Algorithms},
 in A.~Iserles, editor, Acta Numerica 1994, pp.~61--143, 1994.
 Cambridge University Press.
 %% %
 %% \bibitem{UMFPACK}
 %% T.A.~Davis, 
 %% {\em Algorithm 832: UMFPACK - an Unsymmetric-pattern Multifrontal
 %% Method with a Column Pre-ordering Strategy},
 %% ACM Transactions on Mathematical Software, 30, pp.~196--199, 2004.
 %% (See also {\tt http://www.cise.ufl.edu/~davis/})
 %% %
 %% \bibitem{SUPERLU}
 %% J.W.~Demmel, S.C.~Eisenstat, J.R.~Gilbert, X.S.~Li and J.W.H.~Liu,
 %% A supernodal approach to sparse partial pivoting,
 %% SIAM Journal on Matrix Analysis and Applications, 20(3), pp.~720--755, 1999.
 %
 \bibitem{BLACS}
 J.~J.~Dongarra and R.~C.~Whaley,
 {\em A User's Guide to the BLACS v.~1.1},
 Lapack Working Note 94, Tech.\ Rep.\ UT-CS-95-281, University of
 Tennessee, March 1995 (updated May 1997).
 %
 \bibitem{sblas_97}
 I.~Duff, M.~Marrone, G.~Radicati and C.~Vittoli,
 {\em Level 3 Basic Linear Algebra Subprograms for Sparse Matrices: 
 a User Level Interface},
 ACM Transactions on Mathematical Software, 23(3), pp.~379--401, 1997.
 %
 \bibitem{sblas_02}
 I.~Duff, M.~Heroux and R.~Pozo,
 {\em An Overview of the Sparse Basic Linear
 Algebra Subprograms: the New Standard from the BLAS Technical Forum},
 ACM Transactions on Mathematical Software, 28(2), pp.~239--267, 2002.
 %
 \bibitem{psblas_00}
 S.~Filippone and M.~Colajanni, 
 {\em PSBLAS: A Library for Parallel Linear Algebra
 Computation on Sparse Matrices},
 \newblock
 ACM Transactions on Mathematical Software, 26(4), pp.~527--550, 2000.
 %
 \bibitem{KIVA3PSBLAS}
 S.~Filippone, P.~D'Ambra, M.~Colajanni,
 {\em Using a Parallel Library of Sparse Linear Algebra in a Fluid Dynamics 
 Applications Code on Linux Clusters},
 in G.~Joubert, A.~Murli, F.~Peters, M.~Vanneschi, editors,
 Parallel Computing - Advances \& Current Issues,
 pp.~441--448, Imperial College Press, 2002. 
 %
 \bibitem{METIS}
 Karypis, G. and Kumar, V.,
 {\em {METIS}: Unstructured Graph Partitioning and Sparse Matrix
  Ordering System}.
 Minneapolis, MN 55455: University of Minnesota, Department of
  Computer Science, 1995. 
 Internet Address: {\verb|http://www.cs.umn.edu/~karypis|}.
 \bibitem{BLAS1}
 Lawson, C.,  Hanson, R., Kincaid, D. and Krogh, F.,
   Basic {L}inear {A}lgebra {S}ubprograms for {F}ortran usage,
 {ACM Trans. Math. Softw.} vol.~{5}, 308--323, 1979.
 \bibitem{machiels}
 {Machiels, L. and Deville, M.}
 {\em Fortran 90: An entry to object-oriented programming for the solution
  of partial differential equations.}
 {ACM Trans. Math. Softw.} vol.~{23}, 32--49.
 \bibitem{metcalf}
 {Metcalf, M., Reid, J. and Cohen, M.}
 {\em Fortran 95/2003 explained.}
 {Oxford University Press}, 2004.
 \bibitem{dd2_96}
 B.~Smith, P.~Bjorstad and W.~Gropp,
 {\em Domain Decomposition: Parallel Multilevel Methods for Elliptic
 Partial Differential Equations},
 Cambridge University Press, 1996.
 \bibitem{MPI1}
 M.~Snir, S.~Otto, S.~Huss-Lederman, D.~Walker and J.~Dongarra,
 {\em MPI: The Complete Reference. Volume 1 - The MPI Core}, second edition,
 MIT Press, 1998.
 %
 \bibitem{BREZINA_VANEK}
 M.~Brezina and P.~Van{\v e}k,
 {\em A Black-Box Iterative Solver Based on a Two-Level Schwarz Method},
 Computing, 1999, 63, 233-263.
 %
 %
 \bibitem{VANEK_MANDEL_BREZINA}
 P.~Van{\v e}k, J.~Mandel and M.~Brezina,
 {\em Algebraic Multigrid by Smoothed Aggregation for Second and Fourth Order Elliptic Problems},
 Computing, 1996, 56, 179-196.
 %
 \end{thebibliography}
--- a/docs/pdf/building.tex
+++ b/docs/pdf/building.tex
@ -0,0 +1,7 @@
 \section{Configuring and Building MLD2P4\label{sec:configuring}}
    - use of GNU autoconf and automake \\
    - required base software (MPI, BLACS, BLAS, PSBLAS - specify versions) \\
    - optional software (UMFPACK, SuperLU, SuperLUdist - specify versions and configure options) \\
    - operating systems and compilers on which MLD2P4 has been built successfully \\
    - are there configuration options for debugging or profiling? \\
    - directory tree \\
--- a/docs/pdf/conventions.tex
+++ b/docs/pdf/conventions.tex
@ -0,0 +1,6 @@
 \section{Notational Conventions\label{sec:conventions}}
    - typefaces used in the guide (see the recent ML guide and the Aztec guide) \\
    - naming conventions for routines (difference between high-level and medium-level),
      data structures,\\
      modules, constants, etc. (see the PSBLAS guide) \\
    - real and complex versions\\
--- a/docs/pdf/distribution.tex
+++ b/docs/pdf/distribution.tex
@ -0,0 +1,41 @@
 \section{Code Distribution\label{sec:distribution}}
 MLD2P4 is freely distributable under the following copyright
 terms:
 \begin{verbatim} 
                         MLD2P4  version 1.0
 MultiLevel Domain Decomposition Parallel Preconditioners Package
           based on PSBLAS (Parallel Sparse BLAS version 2.3)
 (C) Copyright 2008
                    Salvatore Filippone  University of Rome Tor Vergata       
                    Alfredo Buttari      University of Rome Tor Vergata
                    Pasqua D'Ambra       ICAR-CNR, Naples
                    Daniela di Serafino  Second University of Naples
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions
 are met:
  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions, and the following disclaimer in the
     documentation and/or other materials provided with the distribution.
  3. The name of the MLD2P4 group or the names of its contributors may
     not be used to endorse or promote products derived from this
     software without specific written permission.
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE MLD2P4 GROUP OR ITS CONTRIBUTORS
 BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 POSSIBILITY OF SUCH DAMAGE.
 \end{verbatim}
--- a/docs/pdf/errors.tex
+++ b/docs/pdf/errors.tex
@ -0,0 +1,9 @@
 \section{Error Handling}\label{sec:errors}
 Error handling
    - Brief description with a reference to the PSBLAS guide
 %%% Local Variables: 
 %%% mode: latex
 %%% TeX-master: "userguide"
 %%% End: 
--- a/docs/pdf/gettingstarted.tex
+++ b/docs/pdf/gettingstarted.tex
@ -0,0 +1,224 @@
 \section{Getting Started\label{sec:started}}
 We describe the basics for building and applying MLD2P4 one-level and multi-level
 Schwarz preconditioners with the Krylov solvers included in PSBLAS \cite{}.
 The following five steps are required:
 \begin{enumerate}
 \item \emph{Allocate and initialize the preconditioner data structure, according to
 	a preconditioner type chosen by the user}. This is performed by the routine
 	\verb|mld_precinit|, which also sets a default preconditioner for each preconditioner
 	type selected by the user. The default preconditioner associated with each preconditioner
 	type is listed in Table~\ref{tab:precinit}; the string used by \verb|mld_precinit|
 	to identify each preconditioner type is also given. The preconditioner data structure is
 	the derived data type \verb|mld_prec_type|, which is accessible to the user only
 	through the MLD2P4 routines.
 \item \emph{Choose a specific variant of the selected preconditioner type, by setting
  the preconditioner parameters.} This is performed by the routine \verb|mld_precset|.
  A few examples concerning the use of \verb|mld_precset| are given in 
  Section~\ref{sec:examples}; a complete list of all the
  preconditioner parameters and their allowed values is provided in 
  Section~\ref{sec:highlevel}. 
 \item \emph{Build the preconditioner for a given matrix.} This is performed by
  the routine \verb|mld_precbld|.
 \item \emph{Apply the preconditioner at each iteration of a Krylov solver.}
  This is performed by the routine \verb|mld_precaply|. When using the PSBLAS Krylov solvers,
  this step is completely transparent to the user, since \verb|mld_precaply| is called
  by the PSBLAS routine implementing the Krylov solver (\verb|psb_krylov|).
 \item \emph{Deallocate the preconditioner data structure}. This is performed by
  the routine \verb|mld_precfree|. This step is complementary to step 1 and should
  be performed when the preconditioner is no longer used.
 \end{enumerate}
 A detailed description of the above routines is given in Section~\ref{sec:highlevel}.
 Note that the Fortran 95 module \verb|mld_prec_mod| must be used in the program
 calling the MLD2P4 routines. Furthermore, to apply MLD2P4 with the Krylov solvers
 from PSBLAS, the module \verb|psb_krylov_mod| must be used too.
 Two simple example programs showing the (basic) use of MLD2P4 are reported in
 Section~\ref{sec:examples}.
 \begin{table}[th]
 {
 \begin{center}
 \begin{tabular}{|l|l|p{6.7cm}|}
 \hline
 Type              & String & Default preconditioner \\ \hline
 No preconditioner &'NOPREC'& (Considered only to use the PSBLAS
                             Krylov solvers with no preconditioner.) \\
 Diagonal          & 'DIAG' & --- \\
 Block Jacobi      & 'BJAC' & ILU(0) on the local blocks.\\ 
 Additive Schwarz  & 'AS'   & Restricted Additive Schwarz (RAS),
                             with overlap 1 and ILU(0) on the local blocks. \\ 
 Multilevel        &'ML'    & Multi-level hybrid preconditioner (additive on the
                             same level and multiplicative through the levels),
                             with post-smoothing only. Number of levels: 2;
                             post-smoother: block-Jacobi preconditioner, with ILU(0)
                             on the local blocks; coarsest matrix: distributed among the
                             processors; coarse-level solver: 4 sweeps of the
                             block-Jacobi solver, with ILU(0) on the blocks. \\
 \hline
 \end{tabular}
 \end{center}
 }
 \caption{Preconditioner types and default choices.\label{tab:precinit}}
 \end{table}
 \subsection{Examples\label{sec:examples}}
 The simple code reported below shows how to set and apply the MLD2P4 default multi-level
 preconditioner, i.e.\ the two-level hybrid post-smoothed Schwarz preconditioner, using block-Jacobi with ILU(0) on the blocks as basic preconditioner,
 a coarse matrix distributed among the processors, and four block-Jacobi sweeps with ILU(0) on the blocks as approximate coarse-level solver. The choice of this preconditioner is made
 by simply specifying \verb|'ML'| as second argument of \verb|mld_precinit|
 (a call to \verb|mld_precset| is not needed).
 The preconditioner is applied within the BiCGSTAB solver provided by PSBLAS. 
 The part of the code concerning the
 reading and assembling of the sparse matrix and the right-hand side vector, performed
 through the PSBLAS routines for sparse matrix and vector management, is not reported
 here for brevity. Other statements concerning the use of PSBLAS are omitted too.
 The complete code can be found in the example program file \verb|example_2lev_default.f90|
 in the directory \textbf{XXXXXX (TO BE SPECIFIED).} Note that the modules \verb|psb_base_mod|
 and \verb|psb_util_mod| at the beginning of the code are required by PSBLAS.
 For details on the use of the PSBLAS routines, see the PSBLAS User's Guide \cite{}.
 \begin{verbatim}
  use psb_base_mod
  use psb_util_mod 
  use mld_prec_mod
  use psb_krylov_mod
 ... ...
 !
 ! sparse matrix
  type(psb_dspmat_type) :: A
 ! sparse matrix descriptor
  type(psb_desc_type)   :: DESC_A
 ! preconditioner
  type(mld_dprec_type) :: PRE
 ... ...
 !
 ! initialize the parallel environment
  call psb_init(ictxt)
  call psb_info(ictxt,iam,np)
 ... ...
 !
 ! read and assemble the matrix A and the right-hand
 ! side b using PSBLAS routines for sparse matrix /
 ! vector management
 ... ...
 !
 ! initialize the default multi-level preconditioner
 ! (two-level hybrid post-smoothed Schwarz)
  call mld_precinit(PRE,'ML',info)
 !
 ! build the preconditioner
  call mld_precbld(A,DESC_A,PRE,info)
 !
 ! set the solver parameters and the initial guess
  ... ...
 !
 ! solve Ax=b with preconditioned BiCGSTAB
  call psb_krylov('BICGSTAB',A,PRE,b,x,tol,DESC_A,info)
  ... ...
 !
 ! cleanup the preconditioner
  call mld_precfree(PRE,info)
 !
 ! cleanup other data structures
  ... ...
 !
 ! exit the parallel environment
  call psb_exit(ictxt)
  stop
 \end{verbatim}
 \textbf{MODIFY THE WHOLE PART THAT FOLLOWS:\\
 - only the statements that differ from the previous example (essentially the setting of the preconditioner, possibly with several calls to precset);\\
 - keep the remark on the explicit specification of the number of levels;\\
 - refer to the next section for an accurate description of all the parameters;\\
 - keep the remark on legacy PSBLAS users.}\\
 In the following we describe the general procedure for setting and building one of the MLD2P4 preconditioners.
 The user first has to prepare the preconditioner data structure by calling the routine \verb|mld_precinit|. The input arguments
 of this routine include a string, which defines the preconditioner type, and an optional integer
 specifying the number of levels in the case of a multi-level preconditioner.
 Note that if the optional argument is not present and a multi-level preconditioner has been chosen,
 a two-level preconditioner is set. On the other hand, the integer argument is ignored if the preconditioner is not multi-level.
 Table~\ref{tab:precinit} reports both the possible choices for the preconditioner type
 and the related default preconditioners.
 The user of MLD2P4 may set several parameters for one-level and multi-level Schwarz preconditioners, in order
 to define a preconditioner different from the default ones. The parameters
 are set through the routine \verb|mld_precset|. The APIs of \verb|mld_precinit| and \verb|mld_precset|, as well as the complete
 list of the parameters that can be set, with the corresponding allowed values, are reported in Section~\ref{sec:highlevel}. The code below sets up
 a three-level hybrid post-smoothed Schwarz preconditioner, using RAS with overlap 1 as local preconditioner,
 with ILU(0) on the local blocks, a distributed coarse matrix, and four block-Jacobi sweeps with the UMFPACK LU
 factorization on the blocks as coarse-matrix solver. Note that, for the multi-level preconditioners, the levels are numbered in increasing
 order starting from the finest one, i.e.\ level 1 is the finest level.
 For more details, see the test program \verb|example2.f90| in xxxx (test directory).\\[0.5cm]
 \begin{verbatim}
  use psb_base_mod
  use psb_util_mod 
  use mld_prec_mod
  use psb_krylov_mod
 ... ...
 !
 ! sparse matrix
  type(psb_dspmat_type) :: A
 ! sparse matrix descriptor
  type(psb_desc_type)   :: DESC_A
 ! preconditioner data
  type(mld_dprec_type)  :: PRE
 ... ...
 !
 ! initialization of the parallel environment
  call psb_init(ictxt)
  call psb_info(ictxt,iam,np)
 ... ...
 ! read and assemble the matrix A and the right-hand
 ! side vector b using PSBLAS routines for sparse
 ! matrix/vector management
 ... ...
 ! prepare the three-level hybrid post-smoothed Schwarz
 ! using RAS with overlap 1 as local preconditioner
 !
  call mld_precinit(PRE,'ML',info,nlev=3)
  call mld_precset(PRE,mld_n_ovr_,1,info,ilev=1)
  call mld_precset(PRE,mld_sub_restr_,psb_halo_,info,ilev=1)
 ! NOTE: "psb_halo_" is really ugly here; constants with the MLD
 ! prefix would be needed
 !
 ! build preconditioner
  call mld_precbld(A,DESC_A,PRE,info)
 !
 ! set solver parameters and initial guess
  ... ...
 ! solve Ax=b with preconditioned BiCGSTAB
  call psb_krylov('BICGSTAB',A,PRE,b,x,tol,DESC_A,info)
  ... ...
 !  
 !  cleanup storage and exit
 !
  call mld_precfree(PRE,info)
 !
  call psb_gefree(b,DESC_A,info)
  call psb_gefree(x,DESC_A,info)
  call psb_spfree(A,DESC_A,info)
  call psb_cdfree(DESC_A,info)
 !
  call psb_exit(ictxt)
  stop
 \end{verbatim}
 {\bf Remark for users with PSBLAS-based legacy codes:} when MLD2P4 is installed, a legacy PSBLAS-based code
 calling the base preconditioners included in PSBLAS (NOPREC, DIAG and BJAC) can use the same preconditioners without any other change to the code, provided that
 the module \verb|psb_prec_mod| is included in the program.
 %%% Local Variables: 
 %%% mode: latex
 %%% TeX-master: "userguide"
 %%% End: 
--- a/docs/pdf/highlevelview.tex
+++ b/docs/pdf/highlevelview.tex
@ -0,0 +1,279 @@
 \section{High-Level User Interface\label{sec:highlevel}}
 At the upper layer of MLD2P4, five black-box routines encapsulate all the functionalities for the construction
 and the application of any of the multi-level preconditioners.
 In the following we give the details of these routines. Note that four versions of each routine are available,
 depending on the data types involved: real single/double precision and complex single/double precision.
 \subsection{Preconditioner Setup and Building}\label{sec:setup}
 The setup of an MLD2P4 preconditioner is performed by the \verb|mld_precinit| routine, which
 allocates and initializes the preconditioner data structure.
 The API of this routine, together with the description of its arguments, is reported in Fig.~\ref{fig:prcinit}.
 The allowed values for the \verb|ptype| argument are reported in Table~\ref{tab:precinit} (Sec.~\ref{sec:started}).
 %
 \begin{figure}[h]
 \begin{center}
 {\small
 \begin{verbatim}
 mld_precinit(p,ptype,info,nlev)
 Arguments:
    p       type(mld_dprec_type), input/output. 
            The preconditioner data structure.
    ptype   character, input. The type of preconditioner. 
    info    integer, output. Error code.
    nlev    integer, optional, input. 
            The number of levels of the multilevel preconditioner.
            If nlev is not present and ptype=`ML'/`ml', 
            then nlev=2 is assumed. 
            Otherwise, nlev is ignored.
 \end{verbatim}
 }
 \end{center}
 \caption{API of the routine for preconditioner allocation and initialization.\label{fig:prcinit}}
 \end{figure}
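 As an illustration of the role of \verb|nlev|, the following fragment sketches three possible calls to
 \verb|mld_precinit|, written according to the API in Fig.~\ref{fig:prcinit}. The variable names
 \verb|P1|, \verb|P2|, \verb|P3| and the convention that a zero value of \verb|info| signals success
 are assumptions made here, not statements of this guide.
 \begin{verbatim}
  use mld_prec_mod
 ... ...
  type(mld_dprec_type) :: P1, P2, P3
  integer              :: info
 ... ...
 ! one-level additive Schwarz with the default choices
 ! (nlev, if present, would be ignored)
  call mld_precinit(P1,'AS',info)
 !
 ! multi-level, nlev not present: two levels are assumed
  call mld_precinit(P2,'ML',info)
 !
 ! multi-level with an explicit number of levels
  call mld_precinit(P3,'ML',info,nlev=3)
 !
 ! assumed convention: info = 0 on successful exit
  if (info /= 0) write(*,*) 'mld_precinit: info = ',info
 \end{verbatim}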
 %
 %
 \begin{figure}[h]
 \begin{center}
 {\small
 \begin{verbatim}
 mld_precfree(p,info)
 Arguments:
    p       -  type(mld_dprec_type), input/output.
               The preconditioner data structure to be deallocated.
    info    -  integer, output.
               Error code.
 \end{verbatim}
 }
 \end{center}
 \caption{API of the routine for preconditioner deallocation.\label{fig:prcfree}}
 \end{figure}
 The twin routine for the deallocation of the preconditioner data structure is \verb|mld_precfree|, whose API is reported in
 Fig.~\ref{fig:prcfree}.
 As mentioned in Section~\ref{sec:multilevel}, a multi-level preconditioner is a combination
 of coarse-level corrections and one-level preconditioners (or smoothers).
 Different combinations of these components, together with different types of one-level preconditioners
 and different algorithms to build and apply the coarse-level corrections, allow the user to define different multi-level
 preconditioners.
 The user of MLD2P4 may specify the type of multi-level framework (additive or multiplicative), details of the
 aggregation algorithm, the type of one-level preconditioner and the way it is applied
 (as pre-smoother, post-smoother or both), the coarsest matrix storage
 (distributed or replicated), and the type of solver to be employed at the coarsest level
 with the related details, by setting some parameters through the routine \verb|mld_precset| (see Section~\ref{sec:list}).
 The API of this routine is reported in Fig.~\ref{fig:prcset}.
 %
 \begin{figure}[h]
 \begin{center}
 {\small
 \begin{verbatim}
 mld_precset(p,what,val,info,ilev)
 Arguments:
    p       -  type(mld_dprec_type), input/output.
               The preconditioner data structure.
    what    -  integer, input.
               The number identifying the parameter to be set.
               A mnemonic constant has been associated to each of these
               numbers.
    val     -  integer/character, input.
               The value of the parameter to be set. 
    info    -  integer, output.
               Error code.
    ilev    -  integer, optional, input.
               For the multilevel preconditioner, the level at which the
               preconditioner parameter has to be set. 
               If ilev is not present, the parameter identified by 'what'
               is set at all the appropriate levels.
 \end{verbatim}
 }
 \end{center}
 \caption{API of the routine for preconditioner setup.\label{fig:prcset}}
 \end{figure}
 %
 Finally, to build a preconditioner according to the requirements made through the routines \verb|mld_precinit| and \verb|mld_precset|,
 a user of MLD2P4 has to call the \verb|mld_precbld| routine, whose API is reported in Figure~\ref{fig:prcbld}.
 %
 \begin{figure}[h]
 \begin{center}
 {\small
 \begin{verbatim} 
 mld_precbld(a,desc_a,prec,info)
 Arguments:
    a       -  type(psb_dspmat_type), input.
               The sparse matrix structure containing the local part of the
               matrix to be preconditioned.
    desc_a  -  type(psb_desc_type), input.
               The communication descriptor of a.
    prec    -  type(mld_dprec_type), input/output.
               The preconditioner data structure containing the local part
               of the preconditioner to be built.
    info    -  integer, output.
               Error code.              
 \end{verbatim}
 }
 \end{center}
 \caption{API of the routine for preconditioner building.\label{fig:prcbld}}
 \end{figure}
 \subsubsection{List of the preconditioner parameters\label{sec:list}}
 In the following we report the list of possible parameters to be set through the \verb|mld_precset| routine,
 in order to choose the type of multi-level preconditioner. The parameters are classified depending on their scope.
 Note that for character data both uppercase and lowercase strings are allowed.
 \begin{table}[h]
 {\small
 \begin{tabular}{ll}
 Parameter (\verb|what|)   & Allowed values (\verb|val|)\\
 \verb|mld_ml_type_|       & 'ADD', 'MULT'\\
                          & Define the type of multi-level preconditioner.\\
 \verb|mld_prec_type_|     & 'DIAG', 'BJAC', 'AS' \\
                          & Define the smoother at a certain level.\\
 \verb|mld_smooth_pos_|    & 'PRE', 'POST', 'BOTH'\\
                          & Define the way to apply the smoother.\\ 
 \end{tabular}
 \caption{Parameters for preconditioner type.\label{tab:prec_type}}
 }
 \end{table}
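 A sketch of how the parameters listed above might be set is given below. It assumes that \verb|PRE|
 is declared as in the examples of Section~\ref{sec:examples}, and it uses only the constants and
 strings listed in the table; the specific combination of values is chosen here purely for illustration.
 \begin{verbatim}
 ! two-level preconditioner with the default choices
  call mld_precinit(PRE,'ML',info)
 !
 ! multiplicative framework through the levels
  call mld_precset(PRE,mld_ml_type_,'MULT',info)
 !
 ! additive Schwarz as smoother at the finest level (level 1)
  call mld_precset(PRE,mld_prec_type_,'AS',info,ilev=1)
 !
 ! apply the smoother both before and after the
 ! coarse-level correction
  call mld_precset(PRE,mld_smooth_pos_,'BOTH',info)
 \end{verbatim}
 All these calls must precede the call to the building routine, since the preconditioner is built
 according to the parameters set through \verb|mld_precinit| and \verb|mld_precset|.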
 In order to build a coarse matrix from a fine one, this version of MLD2P4 implements the
 smoothed aggregation algorithm described in Section~\ref{sec:aggregation}. However, since for nonsymmetric problems the
 definition of a correct smoothing procedure is still an open problem~\cite{lin}, the user
 may also choose a nonsmoothed aggregation technique, where the prolongation operator from
 the coarse to the fine-space vertices is the simple piecewise constant interpolation
 operator (the tentative prolongator) defined in Section~\ref{sec:aggregation}.
 The coarsening scheme takes into account possible anisotropic features of the problem by using
 a threshold for dropping matrix coefficients during the process.
 The parallel implementation of the coarsening algorithm is based on a decoupled approach, where each process applies the coarsening scheme
 to its own local data. For matrices with a nonsymmetric sparsity pattern, the decoupled scheme can be applied to the matrix $A+A^T$.
 Table~\ref{tab:aggr_type} lists the parameters that the user can specify for the aggregation algorithm.
 \begin{table}[h]
 {\small
 \begin{tabular}{ll}
 Parameter               & Allowed values \\
 (\verb|what|)           & (\verb|val|)\\
 \verb|mld_aggr_alg_|    & 'DEC', 'SYMDEC'\\
                        & Define the aggregation scheme.\\
                        & Currently, only decoupled aggregation is available\\
                        & (if 'SYMDEC' is set, the symmetric part of the matrix is considered).\\
 \verb|mld_aggr_kind_|   & 'SMOOTH', 'RAW'\\
                        & Define the type of aggregation technique (smoothed or nonsmoothed).\\
 \verb|mld_aggr_thresh_| & Dropping threshold in aggregation.\\
                        & Default 0.0\\
 \verb|mld_aggr_eig_|    & THE STRING CORRESPONDING TO mldmaxnorm IS NOT DEFINED\\
                        & Define the algorithm to evaluate the maximum eigenvalue\\
                        & of $D^{-1}A$ for smoothed aggregation. Currently, only the A-norm of the\\
                        & matrix is available.\\
 \end{tabular}
 \caption{Parameters for aggregation type.\label{tab:aggr_type}}
 }
 \end{table}
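 For reference, the two kinds of aggregation can be summarized as follows. The symbols
 $\tilde{P}$, $P$, $\omega$, $\rho$ and $A_C$ are introduced here only for illustration and may differ
 from the notation of Section~\ref{sec:aggregation}. Let $\tilde{P}$ be the tentative
 (piecewise constant) prolongator and $D$ the diagonal of $A$. Nonsmoothed aggregation uses
 $P=\tilde{P}$, while smoothed aggregation, in its usual form, applies one damped Jacobi step to $\tilde{P}$:
 \[
   P = (I - \omega D^{-1} A)\,\tilde{P}, \qquad A_C = P^T A P ,
 \]
 where $A_C$ is the Galerkin coarse matrix. A common choice in the literature is
 $\omega = 4/(3\rho)$, with $\rho$ an estimate of the spectral radius of $D^{-1}A$. Since
 $\rho(D^{-1}A) \le \|D^{-1}A\|$ for any induced matrix norm, a norm of $D^{-1}A$ provides an
 inexpensive upper bound on the maximum eigenvalue.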
 Some options are available for the system involving the coarsest matrix. 
 Indeed, this matrix can be replicated or distributed among the processors.
 In the former case, various versions of incomplete LU (ILU) factorizations of the 
 coarsest matrix are available in order to solve the coarsest system.
 In the current version of MLD2P4, the following factorizations are available~\cite{saad}:
 \begin{description}
 \item[ILU(k):] ILU factorization with fill-in level $k$;
 \item[MILU(k):] modified ILU factorization with fill-in level $k$;
 \item[ILU(k,t):] ILU with threshold $t$ and $k$ additional entries in each row of the L and U factors with respect to the initial sparsity pattern.
 \end{description}
 Furthermore, interfaces to UMFPACK~\cite{UMFPACK}, version 4.4, and to the SuperLU package~\cite{SUPERLU}, version 3.0, are also available to deal
 with the coarsest system when the coarsest matrix is replicated among the processors.
 On the other hand, to solve the coarsest-level system when the coarsest matrix is distributed,
 a block-Jacobi routine has been developed. It uses the different versions of ILU or the LU
 factorization on the coarse matrix diagonal blocks held by the processors. In the case of a
 distributed coarsest matrix, an interface to SuperLU\_dist~\cite{SUPERLUDIST}, version 2.0, is also available for distributed
 sparse factorization and solve.
 See Table~\ref{tab:coarse_mat} for details.
 \begin{table}[h]
 {\small
 \begin{tabular}{ll}
 Parameter & Allowed values\\
 (\verb|what|) & (\verb|val|)\\
 \verb|mld_coarse_mat_|         & 'DISTR', 'REPL' \\
                                & Coarse matrix: distributed or replicated.\\
 \verb|mld_coarse_solve_|       & 'ILU', 'MILU', 'ILUT', 'SLU', 'UMF', 'SLUDIST', 'BJAC'????\\
                                & Available coarse solvers.\\
                                & Only 'SLUDIST' and 'BJAC' can be used when the coarse matrix is distributed.\\
 \verb|mld_coarse_BJAC_sweeps_| & (mldcoarsesweeps IS NOT OK) Number of block-Jacobi sweeps when BJAC is used as coarsest solver.\\
 \verb|mld_coarse_fill_in_|     & Level of fill-in in the MILU and ILU factorizations.\\
                                & AND THE THRESHOLD FOR ILUT? \\
 \end{tabular}
 \caption{Parameters for coarsest matrix solver.\label{tab:coarse_mat}}
 }
 \end{table}
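 As a minimal sketch, the fragment below selects a replicated coarsest matrix solved with the
 UMFPACK LU factorization. It assumes that \verb|PRE| is declared as in the examples of
 Section~\ref{sec:examples} and that the strings are accepted exactly as listed in the table,
 which still contains open questions.
 \begin{verbatim}
 ! two-level preconditioner with the default choices
  call mld_precinit(PRE,'ML',info)
 !
 ! replicate the coarsest matrix on all the processes ...
  call mld_precset(PRE,mld_coarse_mat_,'REPL',info)
 !
 ! ... and solve the coarsest system with UMFPACK
  call mld_precset(PRE,mld_coarse_solve_,'UMF',info)
 \end{verbatim}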
 When a Schwarz algorithm is used as smoother at a certain level or as one-level preconditioner, the user may set several parameters
 in order to choose the additive Schwarz variant (AS, RAS, ASH), the number of overlap layers and the local solver.
 All the parameters are reported in Table~\ref{tab:schwarz_type}.
 \begin{table}[h]
 {\small
 \begin{tabular}{ll}
 Parameter & Allowed values\\
 (\verb|what|) & (\verb|val|)\\
 \verb|mld_n_ovr_|            & Number of overlaps \\
 \verb|mld_sub_restr_|        & 'HALO', 'NONE'\\
 \verb|mld_sub_prol_|         & 'SUM', 'NONE'\\
 \verb|mld_sub_solve_|        & 'ILU', 'MILU', 'ILUT', 'SLU', 'UMF'\\
 \verb|mld_sub_ren_|          & THE STRINGS ARE MISSING\\
 \verb|mld_sub_fill_in_|      & Level of fill-in in the local diagonal blocks, when ILU-type factorizations are used\\
 \end{tabular}
 \caption{Parameters for Schwarz smoother/preconditioner type.\label{tab:schwarz_type}}
 }
 \end{table}
 It is worth noting that the classical AS method corresponds to the pair of values 'HALO' and 'SUM' of the argument \verb|val|,
 for the values \verb|mld_sub_restr_| and \verb|mld_sub_prol_| of the argument \verb|what|, respectively, while the RAS method corresponds to
 the pair 'HALO' and 'NONE' and the ASH method corresponds to the pair 'NONE' and 'SUM'.
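 In terms of operators, the three variants can be written in the standard way as follows; the
 notation is assumed here for illustration only. Let $R_i$ be the restriction to the $i$-th overlapping
 subdomain, $\tilde{R}_i$ the operator obtained from $R_i$ by zeroing the rows associated with the
 overlap vertices, and $A_i = R_i A R_i^T$ the local matrix, for $i=1,\ldots,m$. Then
 \[
   M_{AS}^{-1} = \sum_{i=1}^{m} R_i^T A_i^{-1} R_i ,
 \]
 \[
   M_{RAS}^{-1} = \sum_{i=1}^{m} \tilde{R}_i^T A_i^{-1} R_i ,
 \]
 \[
   M_{ASH}^{-1} = \sum_{i=1}^{m} R_i^T A_i^{-1} \tilde{R}_i ,
 \]
 so that RAS neglects the overlap in the prolongation and ASH neglects it in the restriction.
 In practice $A_i^{-1}$ is replaced by the chosen local solver, e.g.\ an ILU factorization.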
 \subsection{Preconditioner Application} \label{sec:application}
 Once the preconditioner has been built, it may be applied at each iteration
 of a Krylov solver by calling the routine \verb|mld_precaply| (CHANGE THE ROUTINE NAME IN THE SOFTWARE, AVOIDING THE UNDERSCORE),
 whose API is shown in Figure~\ref{fig:prcaply}.
 This routine computes $y = op(M^{-1})\, x$, where $M$ is the previously built
 preconditioner, stored in the \verb|prec| data structure, and $op$
 denotes the matrix itself or its transpose, according to the value of \verb|trans|.
 Note that this routine is called within the Krylov solvers available in the PSBLAS library (see the PSBLAS User's Guide for details);
 therefore, its use is generally transparent to the MLD2P4 user.
 %
 \begin{figure}[h]
 \begin{center}
 {\small
 \begin{verbatim} 
   mld_precaply(prec,x,y,desc_data,info,trans,work)
 Arguments:
    prec       -  type(mld_dprec_type), input.
                  The preconditioner data structure containing the local part
                  of the preconditioner to be applied.
    x          -  real(psb_dpk_), dimension(:), input.
                  The local part of the vector X in Y := op(M^(-1)) * X.
    y          -  real(psb_dpk_), dimension(:), output.
                  The local part of the vector Y in Y := op(M^(-1)) * X.
    desc_data  -  type(psb_desc_type), input.
                  The communication descriptor associated to the matrix to be
                  preconditioned.
    info       -  integer, output.
                  Error code.
    trans      -  character(len=1), optional.
                  If trans='N','n' then op(M^(-1)) = M^(-1);
                  if trans='T','t' then op(M^(-1)) = M^(-T) (transpose of M^(-1)).
    work       -  real(psb_dpk_), dimension (:), optional, target.
                  Workspace. Its size must be at
                  least 4*psb_cd_get_local_cols(desc_data).
 \end{verbatim}
 }
 \end{center}
 \caption{API of the routine for preconditioner application.\label{fig:prcaply}}
 \end{figure}
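 Although the routine is normally called by the Krylov solvers, a direct call might look like the
 following sketch, written according to the API in Fig.~\ref{fig:prcaply}. The variable names are
 assumptions made here, and the allocation and filling of the local vectors \verb|x| and \verb|y|
 are omitted, as they are done through the PSBLAS routines.
 \begin{verbatim}
  use psb_base_mod
  use mld_prec_mod
 ... ...
  type(psb_desc_type)         :: DESC_A
  type(mld_dprec_type)        :: PRE
  real(psb_dpk_), allocatable :: x(:), y(:)
  integer                     :: info
 ... ...
 ! build PRE, allocate and fill the local parts of x and y
 ... ...
 ! y := M^(-1) x
  call mld_precaply(PRE,x,y,DESC_A,info,trans='N')
 !
 ! y := M^(-T) x
  call mld_precaply(PRE,x,y,DESC_A,info,trans='T')
 \end{verbatim}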
 %%% Local Variables: 
 %%% mode: latex
 %%% TeX-master: "userguide"
 %%% End: 
--- a/docs/pdf/listofroutines.tex
+++ b/docs/pdf/listofroutines.tex
@ -0,0 +1,10 @@
 \section{List of Routines}\label{sec:routines}
   Alphabetical list of all the routines, with a reference (hyperlink and page number) to the description
     of each one in a previous section
     (a sort of index, pointing to the routines described earlier in the respective sections)
 %%% Local Variables: 
 %%% mode: latex
 %%% TeX-master: "userguide"
 %%% End: 
--- a/docs/pdf/overview.tex
+++ b/docs/pdf/overview.tex
@ -0,0 +1,62 @@
 \section{General Overview\label{sec:overview}}
 The \emph{Multi-Level Domain Decomposition Parallel Preconditioners Package based on
 PSBLAS} (MLD2P4) provides various versions of multi-level Schwarz preconditioners~\cite{DD2},
 to be used in the iterative solution of sparse linear systems $Ax=b$, where
 $A$ is a square, real or complex, sparse matrix with a symmetric sparsity pattern.
 \textbf{But didn't we say that, if the sparsity pattern is not symmetric,
 we work on $(A+A^T)/2$? Does this hold only for the aggregation? We should do
 something consistent with 1-lev Schwarz too.}
 Both additive and hybrid preconditioners, i.e.\ multiplicative among the levels
 and additive inside a level, are implemented; the basic additive Schwarz preconditioners
 are obtained by considering only one level. A purely algebraic approach is used to
 generate a sequence of coarse-level corrections to a basic preconditioner, without
 explicitly using any information on the geometry of the original problem (e.g.\ the
 discretization of a PDE). The smoothed aggregation technique is applied
 as algebraic coarsening strategy~\cite{}.
 The package is written in Fortran~95, using object-oriented techniques,
 and is based on a distributed-memory parallel programming paradigm. \textbf{SALVATORE,
 could you add a couple of lines on the choice of Fortran 95 and on the easy interfacing
 with legacy codes, without repeating what is said below about the choice of PSBLAS?}
 Single and double precision implementations of MLD2P4 are available for both the
 real and the complex case, and can be used through a single interface.
 \textbf{SALVATORE, does everything work?}
 MLD2P4 has been designed to implement scalable and easy-to-use multilevel preconditioners
 in the context of the PSBLAS (Parallel Sparse BLAS) computational framework~\cite{}.
 PSBLAS is a library originally developed to address the parallel implementation of
 iterative solvers for sparse linear systems, by providing basic linear algebra
 operators and data management facilities for distributed sparse matrices; it
 also includes parallel Krylov solvers, built on top of the basic PSBLAS kernels.
 The preconditioners available in MLD2P4 can be used with these Krylov solvers.
 The choice of PSBLAS has been mainly motivated by the need for
 a portable and efficient software infrastructure implementing ``de facto'' standard
 parallel sparse linear algebra kernels, to pursue goals such as performance,
 portability, modularity and extensibility in the development of the preconditioner
 package. On the other hand, the implementation of MLD2P4 has led to some
 revisions and extensions of the PSBLAS kernels, leading to the
 recent PSBLAS 2.0 version~\cite{}. The inter-process communication required
 by MLD2P4 is encapsulated into the PSBLAS routines, except for a few cases where
 MPI~\cite{} is explicitly called. Therefore, MLD2P4 can be run on any parallel
 machine where PSBLAS and MPI implementations are available.
 MLD2P4 has a layered and modular software architecture where three main layers can be identified. The lower layer consists of the PSBLAS kernels, the middle one implements
 the construction and application phases of the preconditioners, and the upper one
 provides a uniform and easy-to-use interface to all the preconditioners. 
 This architecture allows for different levels of use of the package:
 a few black-box routines at the upper level allow non-expert users to easily
 build any preconditioner available in MLD2P4 and to apply it within a PSBLAS Krylov solver.
 On the other hand, the routines of the middle and lower layers can be used and extended
 by expert users to build new versions of multi-level Schwarz preconditioners.\\
 \textbf{Organization of the guide:\\
 say that for the moment we do not
 provide the documentation of the middle layer, but we will do so later\\}
 \textbf{Highlight the keywords that characterize our package}
 %%% Local Variables: 
 %%% mode: latex
 %%% TeX-master: "userguide"
 %%% End: 
--- a/docs/pdf/userguide.tex
+++ b/docs/pdf/userguide.tex
@ -1,4 +1,4 @@
-\documentclass[10pt,a4paper,twoside]{article}
+\documentclass[11pt,a4paper,twoside]{article}
 \usepackage{pstricks}
 \usepackage{fancybox}
 \usepackage{amsfonts}
@ -22,17 +22,17 @@
 \pdfoutput=1
 \relax
 \pdfcompresslevel=0             %-- 0 = none, 9 = best
-\pdfinfo{                       %-- Info dictionary of PDF output  /Author (Alfredo Buttari)
+\pdfinfo{                       %-- Info dictionary of PDF output  /Author (PD, DdS, SF)
  /Title    (MultiLevel Domain Decomposition Parallel Preconditioners Package
-             based on PSBLAS V. 1.0)
+             based on PSBLAS, V. 1.0)
-  /Subject ( MultiLevel Domain Decomposition Parallel Preconditioners
+  /Subject  (MultiLevel Domain Decomposition Parallel Preconditioners Package)
-  Package)
+  /Keywords (Parallel Numerical Software, Algebraic Multilevel Preconditioners, Sparse Iterative Solvers, PSBLAS, MPI)
  /Keywords (Computer Science Linear Algebra Fluid Dynamics Parallel Linux MPI PSBLAS Iterative Solvers Preconditioners)
  /Creator  (pdfLaTeX)
-  /Producer ($Id: userguide.tex 1978 2007-10-19 14:51:12Z sfilippo $)
+  /Producer ($Id: userguide.tex 2008-04-08 Pasqua D'Ambra, Daniela di Serafino,
             Salvatore Filippone$)
 }
 \pdfcatalog{ %-- Catalog dictionary of PDF output.
-  /URI (http://ce.uniroma2.it/psblas)
+%  /URI (http://ce.uniroma2.it/psblas)
 } 
 \newcounter{subroutine}[subsection]
@ -78,175 +78,43 @@
 \begin{document}
 \include{title}
 %\cleardoublepage
 \clearpage
 \ \\
 \thispagestyle{empty}
 \clearpage
 \pagenumbering{roman}   % Roman numbering
 \setcounter{page}{1}    % Abstract start on page i
 \include{abstract}
 \cleardoublepage
 \begingroup
  \renewcommand*{\thepage}{toc}
-  \pagenumbering{roman}   % Roman numbering
+  %\pagenumbering{roman}   % Roman numbering
-  \setcounter{page}{1}    % Abstract start on page ii
+  %\setcounter{page}{1}    % Abstract start on page ii
  \tableofcontents
 \endgroup  
 \cleardoublepage
 \pagenumbering{arabic}  % Arabic numbering
 \setcounter{page}{1}    % Chapters start on page 1
-\include{intro}
+\include{overview}
-\include{precs}
+\include{conventions}
-\include{methods}
+\include{distribution}
 \include{building}
 \include{background}
 \include{gettingstarted}
 \include{highlevelview}
 \include{advanced}
 \include{errors}
 \include{listofroutines}
 \cleardoublepage
-\begin{thebibliography}{99}
+\include{bibliography}
 \bibitem{PARA04FOREST}
 G.~Bella, S.~Filippone, A.~De Maio and M.~Testa,
 {\em A Simulation Model for Forest Fires},
 in J.~Dongarra, K.~Madsen, J.~Wasniewski, editors,
 Proceedings of PARA~04 Workshop on State of the Art
 in Scientific Computing, pp.~546--553, Lecture Notes in Computer Science,
 Springer, 2005.
 \bibitem{2007d} A. Buttari, D. di Serafino, P. D'Ambra, S. Filippone,\newblock
 2LEV-D2P4: a package of high-performance preconditioners,\newblock
 Applicable Algebra in Engineering, Communications and Computing, 
 Volume 18, Number 3, May, 2007, pp.  223-239
 %Published online: 13 February 2007, {\tt http://dx.doi.org/10.1007/s00200-007-0035-z}
 %
 \bibitem{2007c}  P. D'Ambra, S. Filippone,  D. Di Serafino\newblock
 On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners
 \newblock
 Applied Numerical Mathematics, Elsevier Science, 
 Volume 57, Issues 11-12, November-December 2007, Pages 1181-1196.
 %published online 3 February 2007, {\tt
 %  http://dx.doi.org/10.1016/j.apnum.2007.01.006}
 \bibitem{BLAS2}
 Dongarra, J. J.,  DuCroz, J.,  Hammarling, S. and Hanson, R.,
 An Extended Set of {F}ortran {B}asic {L}inear {A}lgebra {S}ubprograms,
 {ACM Trans. Math. Softw.} vol.~{14}, 1--17, 1988.
 \bibitem{BLAS3}
  Dongarra, J., DuCroz, J., Hammarling, S. and Duff, I.,
 A  Set of level 3 Basic Linear Algebra Subprograms,
 {ACM Trans. Math. Softw.} vol.~{16}, 1--17, 1990.
 %% \bibitem{DOUGLAS}
 %% R.E.~Bank and C.C.~Douglas,
 %% {\em SMMP: Sparse Matrix Multiplication Package}, 
 %% Advances in Computational Mathematics, 1993, 1, 127-137.
 %% (See also {\tt http://www.mgnet.org/~douglas/ccd-codes.html}) 
 %
 %
 %% \bibitem{PARA04}
 %% A.~Buttari, P.~D'Ambra, D.~di Serafino and S.~Filippone,
 %% {\em Extending PSBLAS to Build Parallel Schwarz Preconditioners},
 %% in , J.~Dongarra, K.~Madsen, J.~Wasniewski, editors,
 %% Proceedings of PARA~04 Workshop on State of the Art
 %% in Scientific Computing, pp.~593--602, Lecture Notes in Computer Science,
 %% Springer, 2005.
 %
 %% \bibitem{CAI_SAAD}
 %% X.~C.~Cai and Y.~Saad,
 %% {\em Overlapping Domain Decomposition Algorithms for General Sparse Matrices},
 %% Numerical Linear Algebra with Applications, 3(3), pp.~221--237, 1996.
 %% %
 %% \bibitem{CAI_SARKIS}
 %% X.C.~Cai and M.~Sarkis,
 %% {\em A Restricted Additive Schwarz Preconditioner for General Sparse Linear Systems},
 %% SIAM Journal on Scientific Computing, 21(2), pp.~792--797, 1999.
 %
 %% \bibitem{CAI_WIDLUND}
 %% X.C.~Cai and O.~B.~Widlund,
 %% {\em Domain Decomposition Algorithms for Indefinite Elliptic Problems},
 %% SIAM Journal on Scientific and Statistical Computing, 13(1), pp.~243--258, 1992.
 %
 %% \bibitem{DD1}
 %% T.~Chan and T.~Mathew,
 %% {\em Domain Decomposition Algorithms},
 %% in A.~Iserles, editor, Acta Numerica 1994, pp.~61--143, 1994.
 %% Cambridge University Press.
 %% %
 %% \bibitem{APNUM06}
 %% P.~D'Ambra, D.~di Serafino and S.~Filippone,
 %% On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners,
 %% Applied Numerical Mathematics, to appear, 2007.
 %
 %% \bibitem{UMFPACK}
 %% T.A.~Davis, 
 %% {\em Algorithm 832: UMFPACK - an Unsymmetric-pattern Multifrontal
 %% Method with a Column Pre-ordering Strategy},
 %% ACM Transactions on Mathematical Software, 30, pp.~196--199, 2004.
 %% (See also {\tt http://www.cise.ufl.edu/~davis/})
 %% %
 %% \bibitem{SUPERLU}
 %% J.W.~Demmel, S.C.~Eisenstat, J.R.~Gilbert, X.S.~Li and J.W.H.~Liu,
 %% A supernodal approach to sparse partial pivoting,
 %% SIAM Journal on Matrix Analysis and Applications, 20(3), pp.~720--755, 1999.
 %
 \bibitem{BLACS}
 J.~J.~Dongarra and R.~C.~Whaley,
 {\em A User's Guide to the BLACS v.~1.1},
 Lapack Working Note 94, Tech.\ Rep.\ UT-CS-95-281, University of
 Tennessee, March 1995 (updated May 1997).
 %
 \bibitem{sblas97}
 I.~Duff, M.~Marrone, G.~Radicati and C.~Vittoli,
 {\em Level 3 Basic Linear Algebra Subprograms for Sparse Matrices: 
 a User Level Interface},
 ACM Transactions on Mathematical Software, 23(3), pp.~379--401, 1997.
 %
 \bibitem{sblas02}
 I.~Duff, M.~Heroux and R.~Pozo,
 {\em An Overview of the Sparse Basic Linear
 Algebra Subprograms: the New Standard from the BLAS Technical Forum},
 ACM Transactions on Mathematical Software, 28(2), pp.~239--267, 2002.
 \bibitem{PSBLAS}
 S.~Filippone and M.~Colajanni, 
 {\em PSBLAS: A Library for Parallel Linear Algebra
 Computation on Sparse Matrices},
 \newblock
 ACM Transactions on Mathematical Software, 26(4), pp.~527--550, 2000.
 %
 \bibitem{KIVA3PSBLAS}
 S.~Filippone, P.~D'Ambra, M.~Colajanni,
 {\em Using a Parallel Library of Sparse Linear Algebra in a Fluid Dynamics 
 Applications Code on Linux Clusters},
 in G.~Joubert, A.~Murli, F.~Peters, M.~Vanneschi, editors,
 Parallel Computing - Advances \& Current Issues,
 pp.~441--448, Imperial College Press, 2002. 
 %
 \bibitem{METIS}
 Karypis, G. and Kumar, V.,
 {\em {METIS}: Unstructured Graph Partitioning and Sparse Matrix
  Ordering System}.
 Minneapolis, MN 55455: University of Minnesota, Department of
  Computer Science, 1995. 
 Internet Address: {\verb|http://www.cs.umn.edu/~karypis|}.
 \bibitem{BLAS1}
 Lawson, C.,  Hanson, R., Kincaid, D. and Krogh, F.,
   Basic {L}inear {A}lgebra {S}ubprograms for {F}ortran usage,
 {ACM Trans. Math. Softw.} vol.~{5}, 38--329, 1979.
 \bibitem{machiels}
 {Machiels, L. and Deville, M.}
 {\em Fortran 90: An entry to object-oriented programming for the solution
  of partial differential equations.}
 {ACM Trans. Math. Softw.} vol.~{23}, 32--49.
 \bibitem{metcalf}
 {Metcalf, M., Reid, J. and Cohen, M.}
 {\em Fortran 95/2003 explained.}
 {Oxford University Press}, 2004.
 %
 %% \bibitem{DD2}
 %% B.~Smith, P.~Bjorstad and W.~Gropp,
 %% {\em Domain Decomposition: Parallel Multilevel Methods for Elliptic
 %% Partial Differential Equations},
 %% Cambridge University Press, 1996.
 %
 \bibitem{MPI1}
 M.~Snir, S.~Otto, S.~Huss-Lederman, D.~Walker and J.~Dongarra,
 {\em MPI: The Complete Reference. Volume 1 - The MPI Core}, second edition,
 MIT Press, 1998.
 %
 \end{thebibliography}
 \end{document}
 %%% Local Variables: 
--- a/docs/userguide.pdf
+++ b/docs/userguide.pdf