You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
422 lines
20 KiB
TeX
422 lines
20 KiB
TeX
\section{Introduction}\label{sec:intro}
|
|
|
|
The PSBLAS library, developed with the aim to facilitate the
|
|
parallelization of computationally intensive scientific applications,
|
|
is designed to address parallel implementation of iterative solvers
|
|
for sparse linear systems through the distributed memory paradigm. It
|
|
includes routines for multiplying sparse matrices by dense matrices,
|
|
solving block diagonal systems with triangular diagonal entries,
|
|
preprocessing sparse matrices, and contains additional routines for
|
|
dense matrix operations. The current implementation of PSBLAS
|
|
addresses a distributed memory execution model operating with message
|
|
passing.
|
|
|
|
The PSBLAS library is internally implemented in a mixture of
|
|
Fortran~77 and Fortran~95~\cite{metcalf} programming languages.
|
|
A similar approach has been advocated by a number of authors,
|
|
e.g.~\cite{machiels}. Moreover, the Fortran~95 facilities for dynamic
|
|
memory management and interface overloading greatly enhance the
|
|
usability of the PSBLAS
|
|
subroutines. In this way, the library can take care of runtime memory
|
|
requirements that are quite difficult or even impossible to predict at
|
|
implementation or compilation time.
|
|
In the current release we rely on the availability of the so-called
|
|
allocatable extensions, specified in TR~15581. Strictly speaking they
|
|
are outside the Fortran~95 standard; however they have been included
|
|
in the Fortran~2003 language standard, and are available in
|
|
practically all Fortran~95 compilers on the market, including the GNU
|
|
Fortran compiler from the Free Software Foundation (as of version 4.2).
|
|
The presentation of the
|
|
PSBLAS library follows the general structure of the proposal for
|
|
serial Sparse BLAS~\cite{sblas97,sblas02}, which in its turn is based on the
|
|
proposal for BLAS on dense matrices~\cite{BLAS1,BLAS2,BLAS3}.
|
|
|
|
The applicability of sparse iterative solvers to many different areas
|
|
causes some terminology problems because the same concept may be
|
|
denoted through different names depending on the application area. The
|
|
PSBLAS features presented in this document will be discussed referring
|
|
to a finite difference discretization of a Partial Differential
|
|
Equation (PDE). However, the scope of the library is wider than
|
|
that: for example, it can be applied to finite element discretizations
|
|
of PDEs, and even to different classes of problems such as nonlinear
|
|
optimization, for example in optimal control problems.
|
|
|
|
The design of a solver for sparse linear systems is driven by many
|
|
conflicting objectives, such as limiting occupation of storage
|
|
resources, exploiting regularities in the input data, exploiting
|
|
hardware characteristics of the parallel platform. To achieve an
|
|
optimal communication to computation ratio on distributed memory
|
|
machines it is essential to keep the {\em data locality} as high as
|
|
possible; this can be done through an appropriate data allocation
|
|
strategy. The choice of the preconditioner is another very important
|
|
factor that affects efficiency of the implemented application. Optimal
|
|
data distribution requirements for a given preconditioner may conflict
|
|
with distribution requirements of the rest of the solver. Finding the
|
|
optimal trade-off may be very difficult because it is application
|
|
dependent. Possible solutions to these problems and other important
|
|
inputs to the development of the PSBLAS software package have come from
|
|
an established experience in applying the PSBLAS solvers to
|
|
computational fluid dynamics applications.
|
|
|
|
\section{General overview}
|
|
\label{sec:overview}
|
|
The PSBLAS library is designed to handle the implementation of
|
|
iterative solvers for sparse linear systems on distributed memory
|
|
parallel computers. The system coefficient matrix $A$ must be square;
|
|
it may be real or complex, nonsymmetric, and its sparsity pattern
|
|
needs not to be symmetric. The serial computation parts are based on
|
|
the serial sparse BLAS, so that any extension made to the data
|
|
structures of the serial kernels is available to the parallel
|
|
version. The overall design and parallelization strategy have been
|
|
influenced by the structure of the ScaLAPACK parallel
|
|
library. The layered structure of the PSBLAS library
|
|
is shown in figure~\ref{fig:psblas} ; lower layers of the library
|
|
indicate an encapsulation relationship with upper layers. The ongoing
|
|
discussion focuses on the Fortran~95 layer immediately below the
|
|
application layer.
|
|
The serial parts of the computation on each process are executed through
|
|
calls to the serial sparse BLAS subroutines. In a similar way, the
|
|
inter-process message exchanges are implemented through the Basic
|
|
Linear Algebra Communication Subroutines (BLACS) library~\cite{BLACS}
|
|
that guarantees a portable and efficient communication layer. The
|
|
Message Passing Interface code is encapsulated within the BLACS
|
|
layer. However, in some cases, MPI routines are directly used either
|
|
to improve efficiency or to implement communication patterns for which
|
|
the BLACS package doesn't provide any method.
|
|
|
|
In any case we provide wrappers around the BLACS routines so that the
|
|
user does not need to delve into their details (see Sec.~\ref{sec:parenv}).
|
|
%% We assume that the user program has initialized a BLACS process grid
|
|
%% with one column and as many rows as there are processes; the PSBLAS
|
|
%% initialization routines will take the communication context for this
|
|
%% grid and store internally for further use.
|
|
|
|
\begin{figure}[h] \begin{center}
|
|
\includegraphics[scale=0.45]{figures/psblas}
|
|
\end{center}
|
|
\caption{PSBLAS library components hierarchy.\label{fig:psblas}}
|
|
\end{figure}
|
|
|
|
|
|
The type of linear system matrices that we address typically arise in the
|
|
numerical solution of PDEs; in such a context,
|
|
it is necessary to pay special attention to the
|
|
structure of the problem from which the application originates.
|
|
The nonzero pattern of a matrix arising from the
|
|
discretization of a PDE is influenced by various factors, such as the
|
|
shape of the domain, the discretization strategy, and
|
|
the equation/unknown ordering. The matrix itself can be interpreted as
|
|
the adjacency matrix of the graph associated with the discretization
|
|
mesh.
|
|
|
|
The distribution of the coefficient matrix for the linear system is
|
|
based on the ``owner computes'' rule:
|
|
the variable associated to each mesh point is assigned to a process
|
|
that will own the corresponding row in the coefficient matrix and
|
|
will carry out all related computations. This allocation strategy
|
|
is equivalent to a partition of the discretization mesh into {\em
|
|
sub-domains}.
|
|
Our library supports any distribution that keeps together
|
|
the coefficients of each matrix row; there are no other constraints on
|
|
the variable assignment.
|
|
This choice is consistent with data distributions commonly used in
|
|
ScaLAPACK such as \verb|CYCLIC(N)| and \verb|BLOCK|,
|
|
as well as completely arbitrary assignments of
|
|
equation indices to processes. In particular it is consistent with the
|
|
usage of graph partitioning tools commonly available in the
|
|
literature, e.g. METIS~\cite{METIS}.
|
|
Dense vectors conform to sparse
|
|
matrices, that is, the entries of a vector follow the same distribution
|
|
of the matrix rows.
|
|
|
|
We assume that the sparse matrix is built in parallel, where each
|
|
process generates its own portion. We never require that the entire
|
|
matrix be available on a single node. However, it is possible
|
|
to hold the entire matrix in one process and distribute it
|
|
explicitly\footnote{In our prototype implementation we provide
|
|
sample scatter/gather routines.}, even though the resulting
|
|
bottleneck would make this option unattractive in most cases.
|
|
|
|
|
|
\subsection{Basic Nomenclature}
|
|
|
|
|
|
Our computational model implies that the data allocation on the
|
|
parallel distributed memory machine is guided by the structure of the
|
|
physical model, and specifically by the discretization mesh of the
|
|
PDE.
|
|
|
|
Each point of the discretization mesh will have (at least) one
|
|
associated equation/variable, and therefore one index. We say that
|
|
point $i$ {\em depends\/} on point $j$ if the equation for a
|
|
variable associated with $i$ contains a term in $j$, or equivalently
|
|
if $a_{ij} \ne0$.
|
|
After the partition of the discretization mesh into {\em sub-domains\/}
|
|
assigned to the parallel processes,
|
|
we classify the points of a given sub-domain as following.
|
|
\begin{description}
|
|
\item[Internal.] An internal point of
|
|
a given domain {\em depends} only on points of the
|
|
same domain.
|
|
If all points of a domain are assigned to one
|
|
process, then a computational step (e.g., a
|
|
matrix-vector product) of the
|
|
equations associated with the internal points requires no data
|
|
items from other domains and no communications.
|
|
|
|
\item[Boundary.] A point of
|
|
a given domain is a boundary point if it {\em depends} on points
|
|
belonging to other domains.
|
|
|
|
\item[Halo.] A halo point for a given domain is a point belonging to
|
|
another domain such that there is a boundary point which {\em depends\/}
|
|
on it. Whenever performing a computational step, such as a
|
|
matrix-vector product, the values associated with halo points are
|
|
requested from other domains. A boundary point of a given
|
|
domain is a halo point for (at least) another domain; therefore
|
|
the cardinality of the boundary points set denotes the amount of data
|
|
sent to other domains.
|
|
\item[Overlap.] An overlap point is a boundary point assigned to
|
|
multiple domains. Any operation that involves an overlap point
|
|
has to be replicated for each assignment.
|
|
\end{description}
|
|
Overlap points do not usually exist in the basic data
|
|
distribution, but they are a feature of Domain Decomposition
|
|
Schwarz preconditioners which we are in the process of including in
|
|
our distribution~\cite{PARA04,APNUM06}.
|
|
|
|
We denote the sets of internal, boundary and halo points for a given
|
|
subdomain by $\cal I$, $\cal B$ and $\cal H$.
|
|
Each subdomain is assigned to one process; each process usually
|
|
owns one subdomain, although the user may choose to assign more than
|
|
one subdomain to a process. If each process $i$ owns one
|
|
subdomain, the number of rows in the local sparse matrix is
|
|
$|{\cal I}_i| + |{\cal B}_i|$, and the number of local columns
|
|
(i.e. those for which there exists at least one non-zero entry in the
|
|
local rows) is $|{\cal I}_i| + |{\cal B}_i| +|{\cal H}_i|$.
|
|
|
|
\begin{figure}[h]
|
|
\begin{center}
|
|
\rotatebox{-90}{\includegraphics[scale=0.65]{figures/points}}
|
|
\end{center}
|
|
\caption{Point classfication.\label{fig:points}}
|
|
\end{figure}
|
|
|
|
This classification of mesh points guides the naming scheme that we
|
|
adopted in the library internals and in the data structures. We
|
|
explicitly note that ``Halo'' points are also often called ``ghost''
|
|
points in the literature.
|
|
|
|
|
|
|
|
\subsection{Library contents}
|
|
|
|
The PSBLAS library consists of various classes of subroutines:
|
|
\begin{description}
|
|
\item[Computational routines] comprising:
|
|
\begin{itemize}
|
|
\item Sparse matrix by dense matrix product;
|
|
\item Sparse triangular
|
|
systems solution for block diagonal matrices;
|
|
\item Vector and matrix norms;
|
|
\item Dense matrix sums;
|
|
\item Dot products.
|
|
\end{itemize}
|
|
\item[Communication routines] handling halo and overlap
|
|
communications;
|
|
\item[Data management and auxiliary routines] including:
|
|
\begin{itemize}
|
|
\item Parallel environment management
|
|
\item Communication descriptors allocation;
|
|
\item Dense and sparse matrix allocation;
|
|
\item Dense and sparse matrix build and update;
|
|
\item Sparse matrix and data distribution preprocessing.
|
|
\end{itemize}
|
|
\item[Preconditioner routines]
|
|
\item[Iterative methods] a subset of Krylov subspace iterative
|
|
methods
|
|
\end{description}
|
|
The following naming scheme has been adopted for all the symbols
|
|
internally defined in the PSBLAS software package:
|
|
\begin{itemize}
|
|
\item all the symbols (i.e. subroutine names, data types...) are
|
|
prefixed by \verb|psb_|
|
|
\item all the data type names are suffixed by \verb|_type|
|
|
\item all the constant values are suffixed by \verb|_|
|
|
\item all the subroutine names follow the rule \verb|psb_xxname| where
|
|
\verb|xx| can be either:
|
|
\begin{itemize}
|
|
\item \verb|ge|: the routine is related to dense data,
|
|
\item \verb|sp|: the routine is related to sparse data,
|
|
\item \verb|cd|: the routine is related to communication descriptor
|
|
(see~\ref{sec:datastruct}).
|
|
\end{itemize}
|
|
For example the \verb|psb_geins|, \verb|psb_spins| and
|
|
\verb|psb_cdins| perform the same action (see~\ref{sec:toolsrout}) on
|
|
dense matrices, sparse matrices and communication descriptors
|
|
respectively.
|
|
Interface overloading allows the usage of the same subroutine
|
|
interfaces for both real and complex data.
|
|
\end{itemize}
|
|
In the description of the subroutines, arguments or argument entries
|
|
are classified as:
|
|
\begin{description}
|
|
\item[global] For input arguments, the value must be the same on all processes
|
|
participating in the subroutine call; for output arguments the value
|
|
is guaranteed to be the same.
|
|
\item[local] Each process has its own value(s) independently.
|
|
\end{description}
|
|
|
|
\subsection{Application structure}
|
|
|
|
The main underlying principle of the PSBLAS library is that the
|
|
library objects are created and exist with reference to a discretized
|
|
space to which there corresponds an index space and a matrix sparsity
|
|
pattern. As an example, consider a cell-centered finite-volume
|
|
discretization of the Navier-Stokes equations on a simulation domain;
|
|
the index space $1\dots n$ is isomorphic to the set of cell centers,
|
|
whereas the pattern of the associated linear system matrix is
|
|
isomorphic to the adjacency graph imposed on the discretization mesh
|
|
by the discretization stencil.
|
|
|
|
Thus the first order of business is to establish an index space, and
|
|
this is done with a call to \verb|psb_cdall| in which we specify the
|
|
size of the index space $n$ and the allocation of the elements of the
|
|
index space to the various processes making up the MPI (virtual)
|
|
parallel machine.
|
|
|
|
The index space is partitioned among processes, and this creates a
|
|
mapping from the ``global'' numbering $1\dots n$ to a numbering
|
|
``local'' to each process; each process $i$ will own a certain subset
|
|
$1\dots n_{\hbox{row}_i}$, each element of which corresponds to a certain
|
|
element of $1\dots n$. The user does not set explicitly this mapping;
|
|
when the application needs to indicate to which element of the index
|
|
space a certain item is related, such as the row and column index of a
|
|
matrix coefficient, it does so in the ``global'' numbering, and the
|
|
library will translate into the appropriate ``local'' numbering.
|
|
|
|
For a given index space $1\dots n$ there are many possible associated
|
|
topologies, i.e. many different discretization stencils; thus the
|
|
description of the index space is not completed until the user has
|
|
defined a sparsity pattern, either explicitly through \verb|psb_cdins|
|
|
or implicitly through \verb|psb_spins|. The descriptor is finalized
|
|
with a call to \verb|psb_cdasb| and a sparse matrix with a call to
|
|
\verb|psb_spasb|. After \verb|psb_cdasb| each process $i$ will have
|
|
defined a set of ``halo'' (or ``ghost'') indices
|
|
$n_{\hbox{row}_i}+1\dots n_{\hbox{col}_i}$, denoting elements of the index
|
|
space that are \emph{not} assigned to process $i$; however the
|
|
variables associated with them are needed to complete computations
|
|
associated with the sparse matrix $A$, and thus they have to be
|
|
fetched from (neighbouring) processes. The descriptor of the index
|
|
space is built exactly for the purpose of properly sequencing the
|
|
communication steps required to achieve this objective.
|
|
|
|
A simple application structure will walk through the index space
|
|
allocation, matrix/vector creation and linear system solution as
|
|
follows:
|
|
\begin{enumerate}
|
|
\item Initialize parallel environment with \verb|psb_init|
|
|
\item Initialize index space with \verb|psb_cdall|
|
|
\item Allocate sparse matrix and dense vectors with \verb|psb_spall|
|
|
and \verb|psb_geall|
|
|
\item Loop over all local rows, generate matrix and vector entries,
|
|
and insert them with \verb|psb_spins| and \verb|psb_geins|
|
|
\item Assemble the various entities:
|
|
\begin{enumerate}
|
|
\item \verb|psb_cdasb|
|
|
\item \verb|psb_spasb|
|
|
\item \verb|psb_geasb|
|
|
\end{enumerate}
|
|
\item Choose the preconditioner to be used with \verb|psb_precset| and
|
|
build it with \verb|psb_precbld|
|
|
\item Call the iterative method of choice, e.g. \verb|psb_bicgstab|
|
|
\end{enumerate}
|
|
This is the structure of the sample program
|
|
\verb|test/pargen/ppde90.f90|.
|
|
|
|
For a simulation in which the same discretization mesh is used over
|
|
multiple time steps, the following structure may be more appropriate:
|
|
\begin{enumerate}
|
|
\item Initialize parallel environment with \verb|psb_init|
|
|
\item Initialize index space with \verb|psb_cdall|
|
|
\item Loop over the topology of the discretization mesh and build the
|
|
descriptor with \verb|psb_cdins|
|
|
\item Assemble the descriptor with \verb|psb_cdasb|
|
|
\item Allocate the sparse matrices and dense vectors with
|
|
\verb|psb_spall| and \verb|psb_geall|
|
|
\item Loop over the time steps:
|
|
\begin{enumerate}
|
|
\item If after first time step,
|
|
reinitialize the sparse matrix with \verb|psb_sprn|; also zero out
|
|
the dense vectors;
|
|
\item Loop over the mesh, generate the coefficients and insert/update
|
|
them with \verb|psb_spins| and \verb|psb_geins|
|
|
\item Assemble with \verb|psb_spasb| and \verb|psb_geasb|
|
|
\item Choose and build preconditioner with \verb|psb_precset| and
|
|
\verb|psb_precbld|
|
|
\item Call the iterative method of choice, e.g. \verb|psb_bicgstab|
|
|
\end{enumerate}
|
|
\end{enumerate}
|
|
The insertion routines will be called as many times as needed;
|
|
they only need to be called on the data that is actually
|
|
allocated to the current process, i.e. each process generates its own
|
|
data.
|
|
|
|
In principle there is no specific order in the calls to
|
|
\verb|psb_spins|, nor is there a requirement to build a matrix row in
|
|
its entirety before calling the routine; this allows the application
|
|
programmer to walk through the discretization mesh element by element,
|
|
generating the main part of a given matrix row but also contributions
|
|
to the rows corresponding to neighbouring elements.
|
|
|
|
From a functional point of view it is even possible to execute one
|
|
call for each nonzero coefficient; however this would have a
|
|
substantial computational overhead. It is therefore advisable to pack
|
|
a certain amount of data into each call to the insertion routine, say
|
|
touching on a few tens of rows; the best performng value would depend
|
|
on both the architecture of the computer being used and on the problem
|
|
structure.
|
|
At the opposite extreme, it would be possible to generate the entire
|
|
part of a coefficient matrix residing on a process and pass it in a
|
|
single call to \verb|psb_spins|; this, however, would entail a
|
|
doubling of memory occupation, and thus would be almost always far
|
|
from optimal.
|
|
|
|
\subsection{Programming model}
|
|
|
|
The PSBLAS librarary is based on the Single Program Multiple Data
|
|
(SPMD) programming model: each process participating in the
|
|
computation performs the same actions on a chunk of data. Parallelism
|
|
is thus data-driven.
|
|
|
|
Because of this structure, many subroutines coordinate their action
|
|
across the various processes, thus providing an implicit
|
|
synchronization point, and therefore \emph{must} be
|
|
called simultaneously by all processes participating in the
|
|
computation. This is certainly true for the data allocation and
|
|
assembly routines, for all the computational routines and for some of
|
|
the tools routines.
|
|
|
|
However there are many cases where no synchronization, and indeed no
|
|
communication among processes, is implied; for instance, all the routines in
|
|
sec.~\ref{sec:dataquery} are only acting on the local data structures,
|
|
and thus may be called independently. The most important case is that
|
|
of the coefficient insertion routines: since the number of
|
|
coefficients in the sparse and dense matrices varies among the
|
|
processors, and since the user is free to choose an arbitrary order in
|
|
builiding the matrix entries, these routines cannot imply a
|
|
synchronization.
|
|
|
|
Throughout this user's guide each subroutine will be clearly indicated
|
|
as:
|
|
\begin{description}
|
|
\item[Synchronous:] must be called simultaneously by all the
|
|
processes in the relevant communication context;
|
|
\item[Asynchronous:] may be called in a totally independent manner.
|
|
\end{description}
|
|
|
|
%%% Local Variables:
|
|
%%% mode: latex
|
|
%%% TeX-master: "userguide"
|
|
%%% End:
|