Application structure

The basic principle underlying the PSBLAS library is that library objects are created and exist with reference to a discretized space, to which there correspond an index space and a matrix sparsity pattern. As an example, consider a cell-centered finite-volume discretization of the Navier-Stokes equations on a simulation domain; the index space $1\dots n$ is isomorphic to the set of cell centers, whereas the pattern of the associated linear system matrix is isomorphic to the adjacency graph imposed on the discretization mesh by the discretization stencil.

Thus the first order of business is to establish an index space, and this is done with a call to psb_cdall in which we specify the size of the index space $n$ and the allocation of the elements of the index space to the various processes making up the MPI (virtual) parallel machine.
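
As an illustration, a minimal sketch of this step is shown below. It assumes the interfaces of a recent PSBLAS 3 release (in particular the psb_ctxt_type context handle; older releases used a plain integer) and a simple block distribution in which each process declares its number of contiguous local indices through the nl argument; psb_cdall also accepts other allocation forms (e.g. vg, vl, parts), for which the reference documentation should be consulted.

  program cdall_sketch
    use psb_base_mod
    implicit none
    type(psb_ctxt_type) :: ctxt
    type(psb_desc_type) :: desc_a
    integer(psb_ipk_)   :: info, iam, np, nl, n

    ! Initialize the parallel environment and query rank/size
    call psb_init(ctxt)
    call psb_info(ctxt, iam, np)

    ! Global index space of size n, split into contiguous blocks:
    ! each process declares how many indices it owns via nl=
    n  = 1000
    nl = n/np
    if (iam < mod(n,np)) nl = nl + 1
    call psb_cdall(ctxt, desc_a, info, nl=nl)

    call psb_exit(ctxt)
  end program cdall_sketch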

The index space is partitioned among processes, and this creates a mapping from the “global” numbering $1\dots n$ to a numbering “local” to each process; each process $i$ will own a certain subset $1\dots n_{\hbox{row}_i}$, each element of which corresponds to a certain element of $1\dots n$. The user does not set this mapping explicitly; when the application needs to indicate to which element of the index space a certain item is related, such as the row and column index of a matrix coefficient, it does so in the “global” numbering, and the library will translate it into the appropriate “local” numbering.

For a given index space $1\dots n$ there are many possible associated topologies, i.e. many different discretization stencils; thus the description of the index space is not complete until the user has defined a sparsity pattern, either explicitly through psb_cdins or implicitly through psb_spins. The descriptor is finalized with a call to psb_cdasb and a sparse matrix with a call to psb_spasb. After psb_cdasb each process $i$ will have defined a set of “halo” (or “ghost”) indices $n_{\hbox{row}_i}+1\dots n_{\hbox{col}_i}$, denoting elements of the index space that are not assigned to process $i$; however, the variables associated with them are needed to complete computations associated with the sparse matrix $A$, and thus they have to be fetched from (neighbouring) processes. The descriptor of the index space is built exactly for the purpose of properly sequencing the communication steps required to achieve this objective.
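
For instance, once the descriptor has been assembled, the two local ranges can be queried from it; the fragment below (not a complete program) uses the accessor methods of the PSBLAS 3 object interface.

  integer(psb_ipk_) :: nrow, ncol

  ! After psb_cdasb, the descriptor exposes both local index ranges
  nrow = desc_a%get_local_rows()   ! owned indices:        1 .. nrow
  ncol = desc_a%get_local_cols()   ! owned + halo indices: 1 .. ncol
  ! Halo (ghost) variables occupy positions nrow+1 .. ncol of local
  ! vectors; they are filled from the neighbouring processes by the
  ! data exchange routines (e.g. psb_halo) during the computation.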

A simple application structure will walk through the index space allocation, matrix/vector creation and linear system solution as follows:

  1. Initialize parallel environment with psb_init
  2. Initialize index space with psb_cdall
  3. Allocate sparse matrix and dense vectors with psb_spall and psb_geall
  4. Loop over all local rows, generate matrix and vector entries, and insert them with psb_spins and psb_geins
  5. Assemble the various entities:
    1. psb_cdasb
    2. psb_spasb
    3. psb_geasb
  6. Choose the preconditioner to be used with prec%init and build it with prec%build.
  7. Call the iterative method of choice, e.g. psb_bicgstab
This is the structure of the sample program test/pargen/psb_d_pde3d.f90.
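
A condensed sketch of this structure is given below for a toy problem: a 1D Laplacian with a constant right-hand side, distributed in contiguous blocks of nl rows per process. The calling sequences follow a recent PSBLAS 3 release; the block-Jacobi preconditioner ('BJAC'), the use of the generic psb_krylov driver to invoke BiCGSTAB, and the omission of all checks on info are illustrative simplifications, and the sample program cited above remains the authoritative reference.

  program psb_app_skeleton
    use psb_base_mod
    use psb_prec_mod
    use psb_krylov_mod
    implicit none

    type(psb_ctxt_type)   :: ctxt
    type(psb_desc_type)   :: desc_a
    type(psb_dspmat_type) :: a
    type(psb_d_vect_type) :: x, b
    type(psb_dprec_type)  :: prec
    integer(psb_ipk_)     :: info, iam, np, nl, i, nz, iter, itmax
    integer(psb_lpk_)     :: n, ig, ia(3), ja(3), ib(1)
    real(psb_dpk_)        :: val(3), vb(1), eps

    ! 1. Initialize the parallel environment
    call psb_init(ctxt)
    call psb_info(ctxt, iam, np)

    ! 2. Initialize the index space: nl contiguous global indices per process
    nl = 100
    n  = nl*np
    call psb_cdall(ctxt, desc_a, info, nl=nl)

    ! 3. Allocate sparse matrix and dense vectors on the descriptor
    call psb_spall(a, desc_a, info)
    call psb_geall(x, desc_a, info)
    call psb_geall(b, desc_a, info)

    ! 4. Generate and insert the local rows of a 1D Laplacian; row and
    !    column indices are always given in the global numbering
    do i = 1, nl
      ig = iam*nl + i                       ! global index of the i-th local row
      nz = 0
      if (ig > 1) then
        nz = nz + 1; ia(nz) = ig; ja(nz) = ig - 1; val(nz) = -1.0_psb_dpk_
      end if
      nz = nz + 1;   ia(nz) = ig; ja(nz) = ig;     val(nz) =  2.0_psb_dpk_
      if (ig < n) then
        nz = nz + 1; ia(nz) = ig; ja(nz) = ig + 1; val(nz) = -1.0_psb_dpk_
      end if
      call psb_spins(nz, ia(1:nz), ja(1:nz), val(1:nz), a, desc_a, info)
      ib(1) = ig; vb(1) = 1.0_psb_dpk_      ! constant right-hand side
      call psb_geins(1_psb_ipk_, ib, vb, b, desc_a, info)
    end do

    ! 5. Assemble descriptor, matrix and vectors
    call psb_cdasb(desc_a, info)
    call psb_spasb(a, desc_a, info)
    call psb_geasb(b, desc_a, info)
    call psb_geasb(x, desc_a, info)

    ! 6. Choose and build the preconditioner (block Jacobi, as an example)
    call prec%init(ctxt, 'BJAC', info)
    call prec%build(a, desc_a, info)

    ! 7. Solve with the Krylov driver, here BiCGSTAB
    eps   = 1.0d-6
    itmax = 1000
    call psb_krylov('BICGSTAB', a, prec, b, x, eps, desc_a, info, &
         & itmax=itmax, iter=iter)

    call psb_exit(ctxt)
  end program psb_app_skeleton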

For a simulation in which the same discretization mesh is used over multiple time steps, the following structure may be more appropriate:

  1. Initialize parallel environment with psb_init
  2. Initialize index space with psb_cdall
  3. Loop over the topology of the discretization mesh and build the descriptor with psb_cdins
  4. Assemble the descriptor with psb_cdasb
  5. Allocate the sparse matrices and dense vectors with psb_spall and psb_geall
  6. Loop over the time steps:
    1. If past the first time step, reinitialize the sparse matrix with psb_sprn and zero out the dense vectors;
    2. Loop over the mesh, generate the coefficients and insert/update them with psb_spins and psb_geins
    3. Assemble with psb_spasb and psb_geasb
    4. Choose and build preconditioner with prec%init and prec%build
    5. Call the iterative method of choice, e.g. psb_bicgstab
The insertion routines will be called as many times as needed; they need only be called for the data actually assigned to the current process, i.e. each process generates its own data.
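
A sketch of the time-stepping loop corresponding to this structure is given below as a fragment, reusing the declarations of the earlier skeleton (plus integer step counters it and nsteps) and assuming that the descriptor has already been assembled and the matrix and vectors allocated; the b%set(dzero) call used to zero the right-hand side and the single prec%init outside the loop are assumptions based on the PSBLAS 3 object interfaces.

  ! Descriptor assembled, matrix and vectors allocated before this point
  call prec%init(ctxt, 'BJAC', info)       ! choose the preconditioner type once

  do it = 1, nsteps
    if (it > 1) then
      ! Reuse the sparse matrix storage: reset the values, keep the pattern
      call psb_sprn(a, desc_a, info)
      ! Zero out the right-hand side before regenerating it (assumed set method)
      call b%set(dzero)
    end if

    ! Regenerate the coefficients for this time step and insert them,
    ! again using global row/column indices
    do i = 1, nl
      ! ... fill nz, ia, ja, val, ib, vb for the i-th local row ...
      call psb_spins(nz, ia(1:nz), ja(1:nz), val(1:nz), a, desc_a, info)
      call psb_geins(1_psb_ipk_, ib, vb, b, desc_a, info)
    end do

    ! Assemble and rebuild the preconditioner on the updated coefficients
    call psb_spasb(a, desc_a, info)
    call psb_geasb(b, desc_a, info)
    call prec%build(a, desc_a, info)

    call psb_krylov('BICGSTAB', a, prec, b, x, eps, desc_a, info, iter=iter)
  end do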

In principle there is no specific order in the calls to psb_spins, nor is there a requirement to build a matrix row in its entirety before calling the routine; this allows the application programmer to walk through the discretization mesh element by element, generating the main part of a given matrix row but also contributions to the rows corresponding to neighbouring elements.

From a functional point of view it is even possible to execute one call for each nonzero coefficient; however, this would incur a substantial computational overhead. It is therefore advisable to pack a certain amount of data into each call to the insertion routine, say touching on a few tens of rows; the best performing value depends on both the architecture of the computer being used and on the problem structure. At the opposite extreme, it would be possible to generate the entire part of a coefficient matrix residing on a process and pass it in a single call to psb_spins; this, however, would entail a doubling of memory occupation, and thus would almost always be far from optimal.
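
As an illustration, the row-generation loop of the earlier skeleton could be reorganized so as to pack a chunk of rows into each insertion call; the fragment below reuses nl, iam, n, a, desc_a and info from that skeleton, and the chunk size nb is an arbitrary tuning parameter.

  integer(psb_ipk_), parameter :: nb = 20          ! rows per insertion call (tuning parameter)
  integer(psb_lpk_) :: ia(3*nb), ja(3*nb), ig
  real(psb_dpk_)    :: val(3*nb)
  integer(psb_ipk_) :: i, k, nz

  do i = 1, nl, nb
    nz = 0
    do k = i, min(i+nb-1, nl)                      ! pack up to nb rows into one buffer
      ig = iam*nl + k                              ! global index of the k-th local row
      if (ig > 1) then
        nz = nz + 1; ia(nz) = ig; ja(nz) = ig - 1; val(nz) = -1.0_psb_dpk_
      end if
      nz = nz + 1;   ia(nz) = ig; ja(nz) = ig;     val(nz) =  2.0_psb_dpk_
      if (ig < n) then
        nz = nz + 1; ia(nz) = ig; ja(nz) = ig + 1; val(nz) = -1.0_psb_dpk_
      end if
    end do
    ! A single insertion call covers the whole chunk of rows
    call psb_spins(nz, ia(1:nz), ja(1:nz), val(1:nz), a, desc_a, info)
  end do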


