\section{Introduction}

The PSBLAS library, developed with the aim of facilitating the parallelization of computationally intensive scientific applications, is designed to address the parallel implementation of iterative solvers for sparse linear systems through the distributed memory paradigm. It includes routines for multiplying sparse matrices by dense matrices, solving block diagonal systems with triangular diagonal entries, and preprocessing sparse matrices, as well as additional routines for dense matrix operations. The current implementation of PSBLAS addresses a distributed memory execution model operating with message passing. However, the overall design does not preclude different implementation paradigms, such as those based on a shared memory model.

The PSBLAS library is internally implemented in a mixture of the Fortran~77 and Fortran~95~\cite{metcalf} programming languages. A similar approach has been advocated by a number of authors, e.g.~\cite{machiels}. Moreover, the Fortran~95 facilities for dynamic memory management and interface overloading greatly enhance the usability of the PSBLAS subroutines. In this way, the library can take care of runtime memory requirements that are quite difficult, or even impossible, to predict at implementation or compilation time.

The following presentation of the PSBLAS library follows the general structure of the proposal for the serial Sparse BLAS~\cite{sblas97}, which in turn is based on the proposals for BLAS on dense matrices~\cite{BLAS1,BLAS2,BLAS3}.

The applicability of sparse iterative solvers to many different areas causes some terminology problems, because the same concept may be denoted by different names depending on the application area. The PSBLAS features presented in this section will be discussed mainly in terms of finite difference discretizations of Partial Differential Equations (PDEs). However, the scope of the library is wider than that: for example, it can be applied to finite element discretizations of PDEs, and even to different classes of problems such as nonlinear optimization, for example in optimal control problems.

The design of a solver for sparse linear systems is driven by many conflicting objectives, such as limiting the occupation of storage resources, exploiting regularities in the input data, and exploiting the hardware characteristics of the parallel platform. To achieve an optimal communication to computation ratio on distributed memory machines it is essential to keep the {\em data locality} as high as possible; this can be done through an appropriate data allocation strategy. The choice of the preconditioner is another very important factor affecting the efficiency of the implemented application. Optimal data distribution requirements for a given preconditioner may conflict with the distribution requirements of the rest of the solver; finding the optimal trade-off may be very difficult, because it is application dependent. Possible solutions to these problems, and other important inputs to the development of the PSBLAS software package, have come from established experience in applying the PSBLAS solvers to computational fluid dynamics applications.

\section{General overview}
\label{sec:overview}

The PSBLAS library is designed to handle the implementation of iterative solvers for sparse linear systems on distributed memory parallel computers. The system coefficient matrix $A$ must be square; it may be real or complex, nonsymmetric, and its sparsity pattern need not be symmetric.
The serial computation parts are based on the serial sparse BLAS, so that any extension made to the data structures of the serial kernels is available to the parallel version. The overall design and parallelization strategy have been influenced by the structure of the ScaLAPACK parallel library~\cite{scalapack}. The layered structure of the PSBLAS library is shown in figure~\ref{fig:psblas}; lower layers of the library indicate an encapsulation relationship with upper layers. The following discussion focuses on the Fortran~95 layer immediately below the application layer. The serial parts of the computation on each process are executed through calls to the serial sparse BLAS subroutines. In a similar way, the inter-process message exchanges are implemented through the Basic Linear Algebra Communication Subroutines (BLACS) library~\cite{BLACS}, which guarantees a portable and efficient communication layer. The Message Passing Interface code is encapsulated within the BLACS layer. However, in some cases MPI routines are used directly, either to improve efficiency or to implement communication patterns for which the BLACS package does not provide any method.
\begin{figure}[h]
\begin{center}
\includegraphics[scale=0.45]{figures/psblas}
\end{center}
\caption{PSBLAS library components hierarchy.\label{fig:psblas}}
\end{figure}

The PSBLAS library consists of two classes of subroutines, that is, the {\em computational routines} and the {\em auxiliary routines}. The computational routine set includes:
\begin{itemize}
\item Sparse matrix by dense matrix product;
\item Sparse triangular system solution for block diagonal matrices;
\item Vector and matrix norms;
\item Dense matrix sums;
\item Dot products.
\end{itemize}
The auxiliary routine set includes:
\begin{itemize}
\item Communication descriptor allocation;
\item Dense and sparse matrix allocation;
\item Dense and sparse matrix build and update;
\item Sparse matrix and data distribution preprocessing.
\end{itemize}
The following naming scheme has been adopted for all the symbols internally defined in the PSBLAS software package:
\begin{itemize}
\item all the symbols (i.e.\ subroutine names, data types, \ldots) are prefixed by \verb|psb_|;
\item all the data type names are suffixed by \verb|_type|;
\item all the constant values are suffixed by \verb|_|;
\item all the subroutine names follow the rule \verb|psb_xxname| where \verb|xx| can be either:
\begin{itemize}
\item \verb|ge|: the routine is related to dense data,
\item \verb|sp|: the routine is related to sparse data,
\item \verb|cd|: the routine is related to the communication descriptor (see~\ref{sec:datastruct}).
\end{itemize}
For example, \verb|psb_geins|, \verb|psb_spins| and \verb|psb_cdins| perform the same action (see~\ref{sec:toolsrout}) on dense matrices, sparse matrices and communication descriptors, respectively. Interface overloading allows the use of the same subroutine names for both real and complex data. A short sketch combining some of the computational routines is given at the end of this section.
\end{itemize}
In the description of the subroutines, arguments or argument entries are classified as:
\begin{description}
\item[global] For input arguments, the value must be the same on all processes participating in the subroutine call; for output arguments, the value is guaranteed to be the same on all processes.
\item[local] Each process has its own value(s), independently of the other processes.
\end{description}
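As an illustration of the naming scheme and of the computational routines listed above, the following sketch computes the residual $r = b - Ax$ and its 2-norm by combining a dense matrix sum (\verb|psb_geaxpby|), a sparse matrix by dense matrix product (\verb|psb_spmm|) and a vector norm (\verb|psb_genrm2|); the sparse matrix and the communication descriptor are assumed to have already been built and assembled through the auxiliary routines. The subroutine name and the top-level module name \verb|psb_base_mod| used here are illustrative; the exact argument lists, including the optional arguments omitted here, are documented in the per-routine sections of this guide.
\begin{verbatim}
subroutine residual_norm(a, desc_a, x, b, r, resnrm, info)
  ! Illustrative sketch: r = b - A*x and resnrm = ||r||_2.
  ! Assumes a and desc_a have been assembled by the auxiliary routines.
  use psb_base_mod                   ! assumed top-level module name
  implicit none
  type(psb_dspmat_type), intent(in) :: a       ! sparse data  ("sp" routines)
  type(psb_desc_type), intent(in)   :: desc_a  ! communication descriptor ("cd")
  real(kind(1.d0)), intent(inout)   :: x(:)    ! halo entries may be updated
  real(kind(1.d0)), intent(in)      :: b(:)
  real(kind(1.d0)), intent(inout)   :: r(:)    ! dense data   ("ge" routines)
  real(kind(1.d0)), intent(out)     :: resnrm
  integer, intent(out)              :: info

  real(kind(1.d0)), parameter :: one = 1.d0, zero = 0.d0

  ! r <- b                     (dense matrix sum)
  call psb_geaxpby(one, b, zero, r, desc_a, info)
  ! r <- -A*x + r              (sparse matrix by dense matrix product)
  if (info == 0) call psb_spmm(-one, a, x, one, r, desc_a, info)
  ! resnrm <- ||r||_2          (vector norm; a global operation)
  if (info == 0) resnrm = psb_genrm2(r, desc_a, info)
end subroutine residual_norm
\end{verbatim}
Note that the sketch follows the conventions described above: all names carry the \verb|psb_| prefix, the data types carry the \verb|_type| suffix, and each routine name encodes whether it acts on dense data, sparse data or the communication descriptor.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "userguide"
%%% End: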