psblas3/docs/pdf/intro.tex

\section{Introduction}

The PSBLAS library, developed with the aim to facilitate the
parallelization of computationally intensive scientific applications,
is designed to address parallel implementation of iterative solvers
for sparse linear systems through the distributed memory paradigm.  It
includes routines for multiplying sparse matrices by dense matrices,
solving block diagonal systems with triangular diagonal entries,
preprocessing sparse matrices, and contains additional routines for
dense matrix operations.  The current implementation of PSBLAS
addresses a distributed memory execution model operating with message
passing. 

The PSBLAS library is internally implemented in a mixture of
Fortran~77 and Fortran~95~\cite{metcalf} programming languages. A
similar approach has been advocated by a number of authors,
e.g.~\cite{machiels}.  Moreover, the Fortran~95 facilities for dynamic
memory management and interface overloading greatly enhance the usability of the PSBLAS
subroutines. In this way, the library can take care of runtime memory
requirements that are quite difficult or even impossible to predict at
implementation or compilation time.  The following presentation of the
PSBLAS library follows the general structure of the proposal for
serial Sparse BLAS~\cite{sblas97,sblas02}, which in its turn is based on the
proposal for BLAS on dense matrices~\cite{BLAS1,BLAS2,BLAS3}.

The applicability of sparse iterative solvers to many different areas
causes some terminology problems because the same concept may be
denoted through different names depending on the application area. The
PSBLAS features presented in this section will be discussed mainly in terms of finite
difference discretizations of Partial Differential Equations (PDEs).
However, the scope of the library is wider than that: for example, it
can be applied to finite element discretizations of PDEs, and even to
different classes of problems such as nonlinear optimization, for
example in optimal control problems.

The design of a solver for sparse linear systems is driven by many
conflicting objectives, such as limiting occupation of storage
resources, exploiting regularities in the input data, exploiting
hardware characteristics of the parallel platform.  To achieve an
optimal communication to computation ratio on distributed memory
machines it is essential to keep the {\em data locality} as high as
possible; this can be done through an appropriate data allocation
strategy.  The choice of the preconditioner is another very important
factor that affects efficiency of the implemented application. Optimal
data distribution requirements for a given preconditioner may conflict
with distribution requirements of the rest of the solver. Finding the
optimal trade-off may be very difficult because it is application
dependent.  Possible solution to these problems and other important
inputs to the development of the PSBLAS software package has come from
an established experience in applying the PSBLAS solvers to
computational fluid dynamics applications.

\section{General overview}
\label{sec:overview} 
The PSBLAS library is designed to handle the implementation of
iterative solvers for sparse linear systems on distributed memory
parallel computers.  The system coefficient matrix $A$ must be square;
it may be real or complex, nonsymmetric, and its sparsity pattern
needs not to be symmetric.  The serial computation parts are based on
the serial sparse BLAS, so that any extension made to the data
structures of the serial kernels is available to the parallel
version. The overall design and parallelization strategy have been
influenced by the structure of the ScaLAPACK parallel
library.  The layered structure of the PSBLAS library
is shown in figure~\ref{fig:psblas} ; lower layers of the library
indicate an encapsulation relationship with upper layers. The ongoing
discussion focuses on the Fortran~95 layer immediately below the
application layer.
The serial parts of the computation on each process are executed through
calls to the serial sparse BLAS subroutines. In a similar way, the
inter-process message exchanges are implemented through the Basic
Linear Algebra Communication Subroutines (BLACS) library~\cite{BLACS}
that guarantees a portable and efficient communication layer. The
Message Passing Interface code is encapsulated within the BLACS
layer. However, in some cases, MPI routines are directly used either
to improve efficiency or to implement communication patterns for which
the BLACS package doesn't provide any method. 
%% We assume that the user program has initialized a BLACS process grid
%% with one column and as many rows as there are processes; the PSBLAS
%% initialization routines will take the communication context for this
%% grid and store internally for further use. 

\begin{figure}[h] \begin{center}
\includegraphics[scale=0.45]{figures/psblas}
\end{center}
\caption{PSBLAS library components hierarchy.\label{fig:psblas}}
\end{figure}

The PSBLAS library consists of various classes of subroutines:
\begin{description}
\item[Computational routines] comprising:
\begin{itemize}
\item Sparse matrix by dense matrix product; 
\item Sparse triangular
systems solution for block diagonal matrices;
\item Vector and matrix norms;
\item Dense matrix sums;
\item Dot products.
\end{itemize} 
\item[Communication routines] handling halo and overlap
  communications;
\item[Data management and auxiliary routines] including:
\begin{itemize}
\item Parallel environment management
\item Communication descriptors allocation;
\item Dense and sparse matrix allocation;
\item Dense and sparse matrix build and update;
\item Sparse matrix and data distribution preprocessing.
\end{itemize} 
\item[Preconditioner routines]
\item[Iterative methods] a subset of Krylov subspace iterative
  methods
\end{description}
The following naming scheme has been adopted for all the symbols
internally defined in the PSBLAS software package:
\begin{itemize}
\item all the symbols (i.e. subroutine names, data types...) are
  prefixed by \verb|psb_| 
\item all the data type names are suffixed by \verb|_type|
\item all the constant values are suffixed by \verb|_|
\item all the subroutine names follow the rule \verb|psb_xxname| where
  \verb|xx| can be either:
  \begin{itemize}
  \item \verb|ge|: the routine is related to dense data, 
  \item \verb|sp|: the routine is related to sparse data, 
  \item \verb|cd|: the routine is related to communication descriptor (see~\ref{sec:datastruct}).
  \end{itemize}
  For example the \verb|psb_geins|, \verb|psb_spins| and
  \verb|psb_cdins| perform the same action (see~\ref{sec:toolsrout}) on
  dense matrices, sparse matrices and communication descriptors
  respectively.
  Interface overloading allows the usage of the same subroutine
  interfaces for both real and complex data.
\end{itemize}
In the description of the subroutines, arguments or argument entries
are classified as:
\begin{description}
\item[global] For input arguments, the value must be the same on all processes
  participating in the subroutine call; for output arguments the value
  is guaranteed to be the same.
\item[local] Each process has its own value(s) independently.
\end{description}

\subsection{Application structure}

The main underlying principle of the PSBLAS library is that the
library objects are created and exist with reference to a discretized
space to which there corresponds an index space and a matrix sparsity
pattern. As an example, consider a cell-centered finite-volume
discretization of  the Navier-Stokes equations on a simulation domain;
the index space $1\dots n$ is isomorphic to the set of cell centers,
whereas the pattern of the associated linear system matrix is
isomorphic to the adjacency graph imposed on the discretization mesh
by the discretization stencil. 

Thus the first order of business is to establish an index space, and
this is done with a call to  \verb|psb_cdall| in which we specify the
size of the index space $n$ and the allocation of the elements of the
index space to the various processes making up the MPI (virtual)
parallel machine. 

The index space is partitioned among processes, and this creates a
mapping from the ``global'' numbering $1\dots n$ to a numbering
``local'' to each process; each process $i$ will own a certain subset
$1\dots n_{\hbox{row}_i}$, each element of which corresponds to a certain
element of $1\dots n$. The user does not set explicitly this mapping;
when the application needs to indicate to which element of the index
space a certain item is related, such as the row and column index of a
matrix coefficient, it does so in the ``global'' numbering, and the
library will translate into the appropriate ``local'' numbering. 

For  a given index space $1\dots n$ there are many possible associated
topologies, i.e. many different discretization stencils; thus the
description of the index space is not completed until the user has
defined a sparsity pattern, either explicitly through \verb|psb_cdins|
or implicitly through \verb|psb_spins|. The descriptor is finalized
with a call to \verb|psb_cdasb| and a sparse matrix with a call to
\verb|psb_spasb|. After \verb|psb_cdasb| each process $i$ will have
defined a set of ``halo'' (or ``ghost'') indices
$n_{\hbox{row}_i}+1\dots n_{\hbox{col}_i}$, denoting elements of the index
space that are \emph{not} assigned to process $i$; however the
variables associated with them are needed to complete computations
associated with the sparse matrix $A$, and thus they have to be
fetched from (neighbouring) processes. The descriptor of the index
space is built exactly for the purpose of properly sequencing the
communication steps required to achieve this objective. 

A simple application structure will walk through the index space
allocation, matrix/vector creation and linear system solution as
follows:
\begin{enumerate}
\item Initialize parallel environment with \verb|psb_init|
\item Initialize index space with \verb|psb_cdall|
\item Allocate sparse matrix and dense vectors with \verb|psb_spall|
  and \verb|psb_geall|
\item Loop over all local rows, generate matrix and vector entries,
  and insert them with \verb|psb_spins| and \verb|psb_geins|
\item Assemble the various entities: 
\begin{enumerate}
\item \verb|psb_cdasb|
\item \verb|psb_spasb|
\item \verb|psb_geasb|
\end{enumerate}
\item Choose the preconditioner to be used with \verb|psb_precset| and
  build it with \verb|psb_precbld|
\item Call the iterative method of choice, e.g. \verb|psb_bicgstab|
\end{enumerate}
This is the structure of the sample program
\verb|test/pargen/ppde90.f90|. 

For a simulation in which the same discretization mesh is used over
multiple time steps, the following structure may be more appropriate:
\begin{enumerate}
\item Initialize parallel environment with \verb|psb_init|
\item Initialize index space with \verb|psb_cdall|
\item Loop over the topology of the discretization mesh and build the
  descriptor with \verb|psb_cdins|
\item Assemble the descriptor with \verb|psb_cdasb|
\item Allocate the sparse matrices and dense vectors with
  \verb|psb_spall| and \verb|psb_geall|
\item Loop over the time steps: 
\begin{enumerate}
\item If after first time step, 
  reinitialize the sparse matrix with \verb|psb_sprn|; also zero out
  the dense vectors;
\item Loop over the mesh, generate the coefficients and insert/update
  them with \verb|psb_spins| and \verb|psb_geins|
\item Assemble with \verb|psb_spasb| and \verb|psb_geasb|
\item Choose and build preconditioner with \verb|psb_precset| and
  \verb|psb_precbld|
\item Call the iterative method of choice, e.g. \verb|psb_bicgstab|
 \end{enumerate}
\end{enumerate}
The insertion routines will be called as many times as needed; it is
clear that they only need be called on the data that is actually
allocated to the current process, i.e. each process generates its own
data. 

In principle there is no specific order in the calls to
\verb|psb_spins|, nor is there a requirement to build a matrix row in
its entirety before calling the routine; this allows the application
programmer to walk through the discretization mesh element by element,
generating the main part of a given matrix row but also contributions
to the rows corresponding to neighbouring elements. 

From a functional point of view it is even possible to execute one
call for each nonzero coefficient; however this would have a
substantial computational overhead. It is therefore advisable to pack
a certain amount of data into each call to the insertion routine, say
touching on a few tens of rows; the best performng value would depend
on both the architecture of the computer being used and on the problem
structure. 
At the opposite extreme, it would be possible to generate the entire
part of a coefficient matrix residing on a process and pass it in a
single call to \verb|psb_spins|; this, however, would entail a
doubling of memory occupation, and thus would be almost always far
from optimal. 

\subsection{Programming model}

The PSBLAS librarary is based on the Single Program Multiple Data
(SPMD) programming model: each process participating in the
computation performs the same actions on a chunk of data. Parallelism
is thus data-driven. 

Because of this structure, practically all subroutines \emph{must} be
called simultaneously by all processes participating in the
computation, i.e each subroutine call acts implicitly as a
synchronization point. The exceptions to this rule are:
\begin{itemize}
\item The insertion routines \verb|psb_cdins|, \verb|psb_spins| and
  \verb|psb_geins|;
\item The error handling routines.
\end{itemize}
In particular, as per the discussion in the previous section, the
insertion routines may be called a different number of times on each
process, depending on  the data distribution chosen by the user. 

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "userguide"
%%% End:
(Module): 19 years ago			`\section{Introduction}`

* empty log message * 19 years ago			`The PSBLAS library, developed with the aim to facilitate the`
			`parallelization of computationally intensive scientific applications,`
			`is designed to address parallel implementation of iterative solvers`
			`for sparse linear systems through the distributed memory paradigm. It`
			`includes routines for multiplying sparse matrices by dense matrices,`
			`solving block diagonal systems with triangular diagonal entries,`
			`preprocessing sparse matrices, and contains additional routines for`
			`dense matrix operations. The current implementation of PSBLAS`
			`addresses a distributed memory execution model operating with message`
* empty log message * 19 years ago			`passing.`
* empty log message * 19 years ago
			`The PSBLAS library is internally implemented in a mixture of`
			`Fortran~77 and Fortran~95~\cite{metcalf} programming languages. A`
			`similar approach has been advocated by a number of authors,`
			`e.g.~\cite{machiels}. Moreover, the Fortran~95 facilities for dynamic`
			`memory management and interface overloading greatly enhance the usability of the PSBLAS`
			`subroutines. In this way, the library can take care of runtime memory`
			`requirements that are quite difficult or even impossible to predict at`
			`implementation or compilation time. The following presentation of the`
			`PSBLAS library follows the general structure of the proposal for`
One more iteration on the documentation. Also minor changes to precset. 19 years ago			`serial Sparse BLAS~\cite{sblas97,sblas02}, which in its turn is based on the`
* empty log message * 19 years ago			`proposal for BLAS on dense matrices~\cite{BLAS1,BLAS2,BLAS3}.`

			`The applicability of sparse iterative solvers to many different areas`
			`causes some terminology problems because the same concept may be`
			`denoted through different names depending on the application area. The`
			`PSBLAS features presented in this section will be discussed mainly in terms of finite`
			`difference discretizations of Partial Differential Equations (PDEs).`
			`However, the scope of the library is wider than that: for example, it`
			`can be applied to finite element discretizations of PDEs, and even to`
			`different classes of problems such as nonlinear optimization, for`
			`example in optimal control problems.`

			`The design of a solver for sparse linear systems is driven by many`
			`conflicting objectives, such as limiting occupation of storage`
			`resources, exploiting regularities in the input data, exploiting`
			`hardware characteristics of the parallel platform. To achieve an`
			`optimal communication to computation ratio on distributed memory`
			`machines it is essential to keep the {\em data locality} as high as`
			`possible; this can be done through an appropriate data allocation`
			`strategy. The choice of the preconditioner is another very important`
			`factor that affects efficiency of the implemented application. Optimal`
			`data distribution requirements for a given preconditioner may conflict`
			`with distribution requirements of the rest of the solver. Finding the`
			`optimal trade-off may be very difficult because it is application`
			`dependent. Possible solution to these problems and other important`
			`inputs to the development of the PSBLAS software package has come from`
			`an established experience in applying the PSBLAS solvers to`
			`computational fluid dynamics applications.`

			`\section{General overview}`
			`\label{sec:overview}`
			`The PSBLAS library is designed to handle the implementation of`
			`iterative solvers for sparse linear systems on distributed memory`
			`parallel computers. The system coefficient matrix $A$ must be square;`
			`it may be real or complex, nonsymmetric, and its sparsity pattern`
			`needs not to be symmetric. The serial computation parts are based on`
			`the serial sparse BLAS, so that any extension made to the data`
			`structures of the serial kernels is available to the parallel`
			`version. The overall design and parallelization strategy have been`
			`influenced by the structure of the ScaLAPACK parallel`
One more iteration on the documentation. Also minor changes to precset. 19 years ago			`library. The layered structure of the PSBLAS library`
* empty log message * 19 years ago			`is shown in figure~\ref{fig:psblas} ; lower layers of the library`
			`indicate an encapsulation relationship with upper layers. The ongoing`
			`discussion focuses on the Fortran~95 layer immediately below the`
* empty log message * 19 years ago			`application layer.`
			`The serial parts of the computation on each process are executed through`
* empty log message * 19 years ago			`calls to the serial sparse BLAS subroutines. In a similar way, the`
			`inter-process message exchanges are implemented through the Basic`
			`Linear Algebra Communication Subroutines (BLACS) library~\cite{BLACS}`
			`that guarantees a portable and efficient communication layer. The`
			`Message Passing Interface code is encapsulated within the BLACS`
			`layer. However, in some cases, MPI routines are directly used either`
			`to improve efficiency or to implement communication patterns for which`
One more iteration on the documentation. Also minor changes to precset. 19 years ago			`the BLACS package doesn't provide any method.`
Created new routines for environment handling, modified documentation accordingly. 19 years ago			`%% We assume that the user program has initialized a BLACS process grid`
			`%% with one column and as many rows as there are processes; the PSBLAS`
			`%% initialization routines will take the communication context for this`
			`%% grid and store internally for further use.`
* empty log message * 19 years ago
			`\begin{figure}[h] \begin{center}`
			`\includegraphics[scale=0.45]{figures/psblas}`
			`\end{center}`
			`\caption{PSBLAS library components hierarchy.\label{fig:psblas}}`
			`\end{figure}`

Fixed description of chapters. 19 years ago			`The PSBLAS library consists of various classes of subroutines:`
			`\begin{description}`
			`\item[Computational routines] comprising:`
* empty log message * 19 years ago			`\begin{itemize}`
Fixed description of chapters. 19 years ago			`\item Sparse matrix by dense matrix product;`
			`\item Sparse triangular`
* empty log message * 19 years ago			`systems solution for block diagonal matrices;`
			`\item Vector and matrix norms;`
			`\item Dense matrix sums;`
			`\item Dot products.`
			`\end{itemize}`
Fixed description of chapters. 19 years ago			`\item[Communication routines] handling halo and overlap`
			`communications;`
			`\item[Data management and auxiliary routines] including:`
* empty log message * 19 years ago			`\begin{itemize}`
Created new routines for environment handling, modified documentation accordingly. 19 years ago			`\item Parallel environment management`
* empty log message * 19 years ago			`\item Communication descriptors allocation;`
			`\item Dense and sparse matrix allocation;`
			`\item Dense and sparse matrix build and update;`
			`\item Sparse matrix and data distribution preprocessing.`
			`\end{itemize}`
Fixed description of chapters. 19 years ago			`\item[Preconditioner routines]`
			`\item[Iterative methods] a subset of Krylov subspace iterative`
			`methods`
			`\end{description}`
* empty log message * 19 years ago			`The following naming scheme has been adopted for all the symbols`
			`internally defined in the PSBLAS software package:`
			`\begin{itemize}`
			`\item all the symbols (i.e. subroutine names, data types...) are`
			`prefixed by \verb\|psb_\|`
			`\item all the data type names are suffixed by \verb\|_type\|`
			`\item all the constant values are suffixed by \verb\|_\|`
			`\item all the subroutine names follow the rule \verb\|psb_xxname\| where`
			`\verb\|xx\| can be either:`
			`\begin{itemize}`
* empty log message * 19 years ago			`\item \verb\|ge\|: the routine is related to dense data,`
* empty log message * 19 years ago			`\item \verb\|sp\|: the routine is related to sparse data,`
			`\item \verb\|cd\|: the routine is related to communication descriptor (see~\ref{sec:datastruct}).`
			`\end{itemize}`
* empty log message * 19 years ago			`For example the \verb\|psb_geins\|, \verb\|psb_spins\| and`
* empty log message * 19 years ago			`\verb\|psb_cdins\| perform the same action (see~\ref{sec:toolsrout}) on`
			`dense matrices, sparse matrices and communication descriptors`
			`respectively.`
			`Interface overloading allows the usage of the same subroutine`
			`interfaces for both real and complex data.`
			`\end{itemize}`
Modified interface to GEALL, GEINS et al. Changed ITRACE behaviour in solver routines. Updated documentation. 19 years ago			`In the description of the subroutines, arguments or argument entries`
			`are classified as:`
			`\begin{description}`
			`\item[global] For input arguments, the value must be the same on all processes`
			`participating in the subroutine call; for output arguments the value`
			`is guaranteed to be the same.`
			`\item[local] Each process has its own value(s) independently.`
			`\end{description}`

Added description of application structure. 19 years ago			`\subsection{Application structure}`

			`The main underlying principle of the PSBLAS library is that the`
			`library objects are created and exist with reference to a discretized`
* empty log message * 19 years ago			`space to which there corresponds an index space and a matrix sparsity`
Added description of application structure. 19 years ago			`pattern. As an example, consider a cell-centered finite-volume`
			`discretization of the Navier-Stokes equations on a simulation domain;`
			`the index space $1\dots n$ is isomorphic to the set of cell centers,`
			`whereas the pattern of the associated linear system matrix is`
			`isomorphic to the adjacency graph imposed on the discretization mesh`
			`by the discretization stencil.`

			`Thus the first order of business is to establish an index space, and`
			`this is done with a call to \verb\|psb_cdall\| in which we specify the`
			`size of the index space $n$ and the allocation of the elements of the`
			`index space to the various processes making up the MPI (virtual)`
			`parallel machine.`

			`The index space is partitioned among processes, and this creates a`
			mapping from the ``global'' numbering $1\dots n$ to a numbering
			``local'' to each process; each process $i$ will own a certain subset
			`$1\dots n_{\hbox{row}_i}$, each element of which corresponds to a certain`
			`element of $1\dots n$. The user does not set explicitly this mapping;`
			`when the application needs to indicate to which element of the index`
			`space a certain item is related, such as the row and column index of a`
			matrix coefficient, it does so in the ``global'' numbering, and the
			library will translate into the appropriate ``local'' numbering.

			`For a given index space $1\dots n$ there are many possible associated`
			`topologies, i.e. many different discretization stencils; thus the`
			`description of the index space is not completed until the user has`
			`defined a sparsity pattern, either explicitly through \verb\|psb_cdins\|`
			`or implicitly through \verb\|psb_spins\|. The descriptor is finalized`
			`with a call to \verb\|psb_cdasb\| and a sparse matrix with a call to`
			`\verb\|psb_spasb\|. After \verb\|psb_cdasb\| each process $i$ will have`
			defined a set of ``halo'' (or ``ghost'') indices
			`$n_{\hbox{row}_i}+1\dots n_{\hbox{col}_i}$, denoting elements of the index`
			`space that are \emph{not} assigned to process $i$; however the`
			`variables associated with them are needed to complete computations`
			`associated with the sparse matrix $A$, and thus they have to be`
			`fetched from (neighbouring) processes. The descriptor of the index`
			`space is built exactly for the purpose of properly sequencing the`
			`communication steps required to achieve this objective.`
* empty log message * 19 years ago
One more iteration on the documentation. Also minor changes to precset. 19 years ago			`A simple application structure will walk through the index space`
			`allocation, matrix/vector creation and linear system solution as`
			`follows:`
			`\begin{enumerate}`
Created new routines for environment handling, modified documentation accordingly. 19 years ago			`\item Initialize parallel environment with \verb\|psb_init\|`
One more iteration on the documentation. Also minor changes to precset. 19 years ago			`\item Initialize index space with \verb\|psb_cdall\|`
			`\item Allocate sparse matrix and dense vectors with \verb\|psb_spall\|`
			`and \verb\|psb_geall\|`
			`\item Loop over all local rows, generate matrix and vector entries,`
			`and insert them with \verb\|psb_spins\| and \verb\|psb_geins\|`
			`\item Assemble the various entities:`
			`\begin{enumerate}`
			`\item \verb\|psb_cdasb\|`
			`\item \verb\|psb_spasb\|`
			`\item \verb\|psb_geasb\|`
			`\end{enumerate}`
			`\item Choose the preconditioner to be used with \verb\|psb_precset\| and`
			`build it with \verb\|psb_precbld\|`
			`\item Call the iterative method of choice, e.g. \verb\|psb_bicgstab\|`
			`\end{enumerate}`
			`This is the structure of the sample program`
			`\verb\|test/pargen/ppde90.f90\|.`

			`For a simulation in which the same discretization mesh is used over`
			`multiple time steps, the following structure may be more appropriate:`
			`\begin{enumerate}`
Created new routines for environment handling, modified documentation accordingly. 19 years ago			`\item Initialize parallel environment with \verb\|psb_init\|`
One more iteration on the documentation. Also minor changes to precset. 19 years ago			`\item Initialize index space with \verb\|psb_cdall\|`
			`\item Loop over the topology of the discretization mesh and build the`
			`descriptor with \verb\|psb_cdins\|`
			`\item Assemble the descriptor with \verb\|psb_cdasb\|`
			`\item Allocate the sparse matrices and dense vectors with`
			`\verb\|psb_spall\| and \verb\|psb_geall\|`
			`\item Loop over the time steps:`
			`\begin{enumerate}`
			`\item If after first time step,`
			`reinitialize the sparse matrix with \verb\|psb_sprn\|; also zero out`
			`the dense vectors;`
			`\item Loop over the mesh, generate the coefficients and insert/update`
			`them with \verb\|psb_spins\| and \verb\|psb_geins\|`
			`\item Assemble with \verb\|psb_spasb\| and \verb\|psb_geasb\|`
			`\item Choose and build preconditioner with \verb\|psb_precset\| and`
			`\verb\|psb_precbld\|`
			`\item Call the iterative method of choice, e.g. \verb\|psb_bicgstab\|`
			`\end{enumerate}`
			`\end{enumerate}`
			`The insertion routines will be called as many times as needed; it is`
			`clear that they only need be called on the data that is actually`
			`allocated to the current process, i.e. each process generates its own`
			`data.`

			`In principle there is no specific order in the calls to`
			`\verb\|psb_spins\|, nor is there a requirement to build a matrix row in`
			`its entirety before calling the routine; this allows the application`
			`programmer to walk through the discretization mesh element by element,`
			`generating the main part of a given matrix row but also contributions`
			`to the rows corresponding to neighbouring elements.`

			`From a functional point of view it is even possible to execute one`
			`call for each nonzero coefficient; however this would have a`
			`substantial computational overhead. It is therefore advisable to pack`
			`a certain amount of data into each call to the insertion routine, say`
			`touching on a few tens of rows; the best performng value would depend`
			`on both the architecture of the computer being used and on the problem`
			`structure.`
			`At the opposite extreme, it would be possible to generate the entire`
			`part of a coefficient matrix residing on a process and pass it in a`
			`single call to \verb\|psb_spins\|; this, however, would entail a`
			`doubling of memory occupation, and thus would be almost always far`
			`from optimal.`

			`\subsection{Programming model}`

			`The PSBLAS librarary is based on the Single Program Multiple Data`
			`(SPMD) programming model: each process participating in the`
			`computation performs the same actions on a chunk of data. Parallelism`
			`is thus data-driven.`

			`Because of this structure, practically all subroutines \emph{must} be`
			`called simultaneously by all processes participating in the`
			`computation, i.e each subroutine call acts implicitly as a`
			`synchronization point. The exceptions to this rule are:`
			`\begin{itemize}`
			`\item The insertion routines \verb\|psb_cdins\|, \verb\|psb_spins\| and`
			`\verb\|psb_geins\|;`
			`\item The error handling routines.`
			`\end{itemize}`
			`In particular, as per the discussion in the previous section, the`
			`insertion routines may be called a different number of times on each`
			`process, depending on the data distribution chosen by the user.`

(Module): 19 years ago			`%%% Local Variables:`
			`%%% mode: latex`
			`%%% TeX-master: "userguide"`
			`%%% End:`