\section{Introduction}

The PSBLAS library, developed with the aim of facilitating the
parallelization of computationally intensive scientific applications,
is designed to address the parallel implementation of iterative solvers
for sparse linear systems through the distributed memory paradigm. It
includes routines for multiplying sparse matrices by dense matrices,
solving block diagonal systems with triangular diagonal entries,
preprocessing sparse matrices, and contains additional routines for
dense matrix operations. The current implementation of PSBLAS
addresses a distributed memory execution model operating with message
passing.

The PSBLAS library is internally implemented in a mixture of the
Fortran~77 and Fortran~95~\cite{metcalf} programming languages. A
similar approach has been advocated by a number of authors,
e.g.~\cite{machiels}. Moreover, the Fortran~95 facilities for dynamic
memory management and interface overloading greatly enhance the
usability of the PSBLAS subroutines. In this way, the library can
take care of runtime memory requirements that are quite difficult or
even impossible to predict at implementation or compilation time. The
following presentation of the PSBLAS library follows the general
structure of the proposal for serial Sparse
BLAS~\cite{sblas97,sblas02}, which in turn is based on the proposal
for BLAS on dense matrices~\cite{BLAS1,BLAS2,BLAS3}.

The applicability of sparse iterative solvers to many different areas
causes some terminology problems, because the same concept may be
denoted by different names depending on the application area. The
PSBLAS features presented in this section will be discussed mainly in
terms of finite difference discretizations of Partial Differential
Equations (PDEs). However, the scope of the library is wider than
that: for example, it can be applied to finite element
discretizations of PDEs, and even to different classes of problems
such as nonlinear optimization, as in optimal control problems.

The design of a solver for sparse linear systems is driven by many
conflicting objectives, such as limiting the occupation of storage
resources, exploiting regularities in the input data, and exploiting
the hardware characteristics of the parallel platform. To achieve an
optimal communication to computation ratio on distributed memory
machines it is essential to keep the {\em data locality} as high as
possible; this can be done through an appropriate data allocation
strategy. The choice of the preconditioner is another very important
factor affecting the efficiency of the implemented application. The
optimal data distribution requirements for a given preconditioner may
conflict with the distribution requirements of the rest of the
solver. Finding the optimal trade-off may be very difficult, because
it is application dependent. Possible solutions to these problems,
and other important inputs to the development of the PSBLAS software
package, have come from an established experience in applying the
PSBLAS solvers to computational fluid dynamics applications.

\section{General overview}
\label{sec:overview}
The PSBLAS library is designed to handle the implementation of
iterative solvers for sparse linear systems on distributed memory
parallel computers. The system coefficient matrix $A$ must be square;
it may be real or complex, nonsymmetric, and its sparsity pattern
need not be symmetric. The serial computation parts are based on the
serial sparse BLAS, so that any extension made to the data structures
of the serial kernels is available to the parallel version. The
overall design and parallelization strategy have been influenced by
the structure of the ScaLAPACK parallel library. The layered
structure of the PSBLAS library is shown in
figure~\ref{fig:psblas}; lower layers of the library indicate an
encapsulation relationship with upper layers. The ongoing discussion
focuses on the Fortran~95 layer immediately below the application
layer.
The serial parts of the computation on each process are executed through
calls to the serial sparse BLAS subroutines. In a similar way, the
inter-process message exchanges are implemented through the Basic
Linear Algebra Communication Subroutines (BLACS) library~\cite{BLACS},
which guarantees a portable and efficient communication layer. The
Message Passing Interface code is encapsulated within the BLACS
layer. However, in some cases, MPI routines are used directly, either
to improve efficiency or to implement communication patterns for which
the BLACS package does not provide any method.
%% We assume that the user program has initialized a BLACS process grid
%% with one column and as many rows as there are processes; the PSBLAS
%% initialization routines will take the communication context for this
%% grid and store internally for further use.

\begin{figure}[h] \begin{center}
\includegraphics[scale=0.45]{figures/psblas}
\end{center}
\caption{PSBLAS library components hierarchy.\label{fig:psblas}}
\end{figure}

The PSBLAS library consists of various classes of subroutines:
\begin{description}
\item[Computational routines] comprising:
\begin{itemize}
\item Sparse matrix by dense matrix product;
\item Sparse triangular systems solution for block diagonal matrices;
\item Vector and matrix norms;
\item Dense matrix sums;
\item Dot products.
\end{itemize}
\item[Communication routines] handling halo and overlap
communications;
\item[Data management and auxiliary routines] including:
\begin{itemize}
\item Parallel environment management;
\item Communication descriptors allocation;
\item Dense and sparse matrix allocation;
\item Dense and sparse matrix build and update;
\item Sparse matrix and data distribution preprocessing.
\end{itemize}
\item[Preconditioner routines]
\item[Iterative methods] a subset of Krylov subspace iterative
methods.
\end{description}
The following naming scheme has been adopted for all the symbols
internally defined in the PSBLAS software package:
\begin{itemize}
\item all the symbols (i.e.\ subroutine names, data types, \dots) are
prefixed by \verb|psb_|;
\item all the data type names are suffixed by \verb|_type|;
\item all the constant values are suffixed by \verb|_|;
\item all the subroutine names follow the rule \verb|psb_xxname|, where
\verb|xx| can be either:
\begin{itemize}
\item \verb|ge|: the routine is related to dense data,
\item \verb|sp|: the routine is related to sparse data,
\item \verb|cd|: the routine is related to communication descriptors (see~\ref{sec:datastruct}).
\end{itemize}
For example, \verb|psb_geins|, \verb|psb_spins| and
\verb|psb_cdins| perform the same action (see~\ref{sec:toolsrout}) on
dense matrices, sparse matrices and communication descriptors,
respectively.
Interface overloading allows the usage of the same subroutine
interfaces for both real and complex data.
\end{itemize}
In the description of the subroutines, arguments or argument entries
are classified as:
\begin{description}
\item[global] For input arguments, the value must be the same on all processes
participating in the subroutine call; for output arguments the value
is guaranteed to be the same.
\item[local] Each process has its own value(s) independently.
\end{description}

\subsection{Application structure}

The main underlying principle of the PSBLAS library is that the
library objects are created and exist with reference to a discretized
space to which there corresponds an index space and a matrix sparsity
pattern. As an example, consider a cell-centered finite-volume
discretization of the Navier-Stokes equations on a simulation domain;
the index space $1\dots n$ is isomorphic to the set of cell centers,
whereas the pattern of the associated linear system matrix is
isomorphic to the adjacency graph imposed on the discretization mesh
by the discretization stencil.

Thus the first order of business is to establish an index space, and
this is done with a call to \verb|psb_cdall|, in which we specify the
size of the index space $n$ and the allocation of the elements of the
index space to the various processes making up the MPI (virtual)
parallel machine.

The index space is partitioned among processes, and this creates a
mapping from the ``global'' numbering $1\dots n$ to a numbering
``local'' to each process; each process $i$ will own a certain subset
$1\dots n_{\hbox{row}_i}$, each element of which corresponds to a certain
element of $1\dots n$. The user does not set this mapping explicitly;
when the application needs to indicate to which element of the index
space a certain item is related, such as the row and column index of a
matrix coefficient, it does so in the ``global'' numbering, and the
library will translate it into the appropriate ``local'' numbering.

For a given index space $1\dots n$ there are many possible associated
topologies, i.e.\ many different discretization stencils; thus the
description of the index space is not complete until the user has
defined a sparsity pattern, either explicitly through \verb|psb_cdins|
or implicitly through \verb|psb_spins|. The descriptor is finalized
with a call to \verb|psb_cdasb| and a sparse matrix with a call to
\verb|psb_spasb|. After \verb|psb_cdasb| each process $i$ will have
defined a set of ``halo'' (or ``ghost'') indices
$n_{\hbox{row}_i}+1\dots n_{\hbox{col}_i}$, denoting elements of the index
space that are \emph{not} assigned to process $i$; however the
variables associated with them are needed to complete computations
associated with the sparse matrix $A$, and thus they have to be
fetched from (neighbouring) processes. The descriptor of the index
space is built exactly for the purpose of properly sequencing the
communication steps required to achieve this objective.

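As a concrete illustration of halo indices, consider a hypothetical
minimal case (constructed here for exposition, not taken from the
library): a 1D mesh of $n=8$ points with a three-point stencil, split
between two processes in contiguous blocks.
\begin{itemize}
\item process 1 owns global indices $1\dots 4$, numbered locally
$1\dots 4$, so $n_{\hbox{row}_1}=4$;
\item process 2 owns global indices $5\dots 8$, numbered locally
$1\dots 4$, so $n_{\hbox{row}_2}=4$;
\item the stencil couples points $4$ and $5$; hence process 1 acquires
global index $5$ as a halo index with local number $5$ (so
$n_{\hbox{col}_1}=5$), and symmetrically process 2 acquires global
index $4$ as a halo index with local number $5$ (so
$n_{\hbox{col}_2}=5$).
\end{itemize}
The values associated with the halo indices are fetched from the
neighbouring process according to the communication pattern stored in
the descriptor.
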
A simple application structure will walk through the index space
allocation, matrix/vector creation and linear system solution as
follows:
\begin{enumerate}
\item Initialize parallel environment with \verb|psb_init|
\item Initialize index space with \verb|psb_cdall|
\item Allocate sparse matrix and dense vectors with \verb|psb_spall|
and \verb|psb_geall|
\item Loop over all local rows, generate matrix and vector entries,
and insert them with \verb|psb_spins| and \verb|psb_geins|
\item Assemble the various entities:
\begin{enumerate}
\item \verb|psb_cdasb|
\item \verb|psb_spasb|
\item \verb|psb_geasb|
\end{enumerate}
\item Choose the preconditioner to be used with \verb|psb_precset| and
build it with \verb|psb_precbld|
\item Call the iterative method of choice, e.g.\ \verb|psb_bicgstab|
\end{enumerate}
This is the structure of the sample program
\verb|test/pargen/ppde90.f90|.

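In outline, the steps above translate into a program skeleton like the
following; this is only a schematic sketch, with all argument lists
elided, since the actual interfaces are documented in the relevant
sections and in the sample program.
\begin{verbatim}
! Schematic sketch only: argument lists elided, error checking omitted.
call psb_init(...)        ! 1. initialize parallel environment
call psb_cdall(...)       ! 2. create descriptor for index space of size n
call psb_spall(...)       ! 3. allocate sparse matrix A
call psb_geall(...)       !    allocate dense vectors x and b
do                        ! 4. loop over local rows
  call psb_spins(...)     !    insert a batch of matrix coefficients
  call psb_geins(...)     !    insert the corresponding vector entries
end do
call psb_cdasb(...)       ! 5a. assemble descriptor
call psb_spasb(...)       ! 5b. assemble sparse matrix
call psb_geasb(...)       ! 5c. assemble dense vectors
call psb_precset(...)     ! 6. choose preconditioner...
call psb_precbld(...)     !    ...and build it
call psb_bicgstab(...)    ! 7. solve Ax = b
\end{verbatim}
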
For a simulation in which the same discretization mesh is used over
multiple time steps, the following structure may be more appropriate:
\begin{enumerate}
\item Initialize parallel environment with \verb|psb_init|
\item Initialize index space with \verb|psb_cdall|
\item Loop over the topology of the discretization mesh and build the
descriptor with \verb|psb_cdins|
\item Assemble the descriptor with \verb|psb_cdasb|
\item Allocate the sparse matrices and dense vectors with
\verb|psb_spall| and \verb|psb_geall|
\item Loop over the time steps:
\begin{enumerate}
\item At each time step after the first,
reinitialize the sparse matrix with \verb|psb_sprn|; also zero out
the dense vectors;
\item Loop over the mesh, generate the coefficients and insert/update
them with \verb|psb_spins| and \verb|psb_geins|
\item Assemble with \verb|psb_spasb| and \verb|psb_geasb|
\item Choose and build preconditioner with \verb|psb_precset| and
\verb|psb_precbld|
\item Call the iterative method of choice, e.g.\ \verb|psb_bicgstab|
\end{enumerate}
\end{enumerate}
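The time-stepping structure can likewise be sketched in outline (again
only schematically, with argument lists elided):
\begin{verbatim}
! Schematic sketch only: argument lists elided, error checking omitted.
call psb_init(...)                 ! initialize parallel environment
call psb_cdall(...)                ! create descriptor
do                                 ! loop over mesh topology
  call psb_cdins(...)              ! build descriptor from the stencil
end do
call psb_cdasb(...)                ! assemble descriptor once
call psb_spall(...)                ! allocate sparse matrix
call psb_geall(...)                ! allocate dense vectors
do istep = 1, nsteps
  if (istep > 1) call psb_sprn(...)  ! reuse pattern: reset coefficients
  ! ... generate coefficients for this time step ...
  call psb_spins(...)              ! insert/update matrix coefficients
  call psb_geins(...)              ! insert/update vector entries
  call psb_spasb(...)              ! assemble matrix
  call psb_geasb(...)              ! assemble vectors
  call psb_precset(...)            ! choose preconditioner
  call psb_precbld(...)            ! build preconditioner
  call psb_bicgstab(...)           ! solve for this time step
end do
\end{verbatim}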
The insertion routines will be called as many times as needed; they
need only be called on the data that is actually allocated to the
current process, i.e.\ each process generates its own data.

In principle there is no specific order in the calls to
\verb|psb_spins|, nor is there a requirement to build a matrix row in
its entirety before calling the routine; this allows the application
programmer to walk through the discretization mesh element by element,
generating the main part of a given matrix row but also contributions
to the rows corresponding to neighbouring elements.

From a functional point of view it is even possible to execute one
call for each nonzero coefficient; however this would have a
substantial computational overhead. It is therefore advisable to pack
a certain amount of data into each call to the insertion routine, say
touching on a few tens of rows; the best performing value depends on
both the architecture of the computer being used and on the problem
structure.
At the opposite extreme, it would be possible to generate the entire
part of a coefficient matrix residing on a process and pass it in a
single call to \verb|psb_spins|; this, however, would entail a
doubling of memory occupation, and thus would almost always be far
from optimal.

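The packed-insertion strategy can be sketched as follows; the sketch
is indicative only, with the argument list elided and the
buffer-generation step left abstract, and the block size \verb|nb| is
a hypothetical tuning parameter, not a library constant.
\begin{verbatim}
! Indicative sketch: buffer the coefficients of a block of rows,
! then insert them with a single call per block.
nb = 20                        ! rows per insertion call (tunable)
do ib = 1, nrows_local, nb
  ! ... generate the coordinate-format entries of rows
  !     ib .. min(ib+nb-1, nrows_local) into a local buffer ...
  call psb_spins(...)          ! one insertion call per block of rows
end do
\end{verbatim}
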
\subsection{Programming model}

The PSBLAS library is based on the Single Program Multiple Data
(SPMD) programming model: each process participating in the
computation performs the same actions on a chunk of data. Parallelism
is thus data-driven.

Because of this structure, practically all subroutines \emph{must} be
called simultaneously by all processes participating in the
computation, i.e.\ each subroutine call acts implicitly as a
synchronization point. The exceptions to this rule are:
\begin{itemize}
\item The insertion routines \verb|psb_cdins|, \verb|psb_spins| and
\verb|psb_geins|;
\item The error handling routines.
\end{itemize}
In particular, as per the discussion in the previous section, the
insertion routines may be called a different number of times on each
process, depending on the data distribution chosen by the user.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "userguide"
%%% End: