docs: add developer guide internals layer

Introduce docs/internals/ as a two-layer complement to the user manual:
- README.md: index plus the user-manual vs developer-guide rationale
- communication.md: the psb_halo/psb_ovrl -> psi_swapdata dispatch path,
  the psb_comm_handle_type hierarchy and factory, the five MPI schemes
  (isend/irecv, neighbor alltoallv, persistent, RMA pull/push), and the
  start/wait/sync swap-status state machine
communication_v2
Stack-1 3 days ago
parent a140a1050c
commit dad4d3b894

@@ -0,0 +1,26 @@
# PSBLAS Developer Guide (Internals)
This directory documents the **internal architecture** of PSBLAS: how the
classes relate to each other and how the library is implemented. It is aimed
at developers who modify or extend PSBLAS, not at end users.

The split is deliberate and two-layered:
- **User manual** (`docs/src/*.tex`, built into `psblas-3.9.pdf`) describes the
public API: what each routine does, its arguments, and the semantics a user
must know to call it correctly. For example, it documents that `psb_halo`
accepts a `mode` flag and what synchronous vs. split-phase exchange means for
the caller.
- **Developer guide** (this directory) describes *how* those features are
implemented: the dispatch path, the communication-handle class hierarchy, the
MPI mechanisms behind each scheme, and the extension points. A user never
needs to read this to use `psb_halo`.

When you add a feature, update the layer that matches its audience. A new public
flag goes in the user manual; a new internal communication scheme goes here.
## Contents
- [communication.md](communication.md) — the communication subsystem: the
`psb_halo`/`psb_ovrl` → `psi_swapdata` → communication-handle dispatch path,
the available MPI schemes, and the swap-status state machine.

@@ -0,0 +1,179 @@
# The communication subsystem
This page describes how a halo (or overlap) exchange is actually carried out
inside PSBLAS, from the public routine down to the MPI calls. It covers the
dispatch path, the communication-handle class hierarchy, the available MPI
schemes, and the swap-status state machine introduced on the
`communication_v2` branch.
> **Audience:** developers modifying or extending PSBLAS communication.
> End users only need the `psb_halo` / `psb_ovrl` entries in the user manual.
## 1. The call stack
A data exchange flows through four layers. The type-specific names below use the
`d` (double precision) variant; the same pattern exists for the other type
prefixes (`s`, `c`, `z`, `i`, `l`, and so on).
```
psb_halo / psb_ovrl              base/comm/psb_dhalo.f90, psb_dovrl.f90
   |   (public, type-bound generic; one per data type)
   v
psi_swapdata / psi_swaptran      base/comm/internals/psi_dswapdata.F90
   |   (generic internal exchange; resolves the descriptor index list,
   |    allocates/looks up the communication handle, dispatches on scheme)
   v
psi_dswap_<scheme>_vect          same file / helper routines
   |   (one routine per MPI scheme: baseline, neighbor, persistent, rma)
   v
MPI Isend/Irecv, Ineighbor_alltoallv, RMA, ...
```
`psb_halo` itself does very little: it validates the vector, picks the index
list selector (`data`, default `psb_comm_halo_`), the transpose flag (`tran`)
and the communication mode (`mode`, default `psb_comm_status_sync_`), then calls
`psi_swapdata` (or `psi_swaptran` for the transposed exchange).

The real dispatch happens in `psi_dswapdata_vect`
([base/comm/internals/psi_dswapdata.F90](../../base/comm/internals/psi_dswapdata.F90)):
1. `desc_a%get_list_p(data_, comm_indexes, num_neighbors, total_recv, total_send, info)`
resolves the requested index list (halo / overlap / ext / mov) into the
per-neighbour send/recv structure.
2. The `swap_status` argument (`mode`) is validated: it must be one of
`psb_comm_status_start_`, `psb_comm_status_wait_`, `psb_comm_status_sync_`.
3. If the vector does not yet carry a communication handle, one is built from
the descriptor's scheme: `call psb_comm_set(desc_a%comm_type, y%comm_handle, info)`.
4. The status is stored on the handle (`y%comm_handle%set_swap_status`).
5. A `select case (y%comm_handle%comm_type)` dispatches to the matching
`psi_dswap_<scheme>_vect` implementation.
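
Steps 2 to 5 can be modelled in a few lines. The sketch below is a self-contained toy, not the PSBLAS source: `handle_t`, `swapdata`, the `scheme_*` and `status_*` constants and all their integer values are hypothetical stand-ins, and step 1 (index-list resolution) is left out.

```fortran
! Toy model of the dispatch steps; every name and value here is hypothetical.
module dispatch_sketch
  implicit none
  ! Stand-ins for the psb_comm_status_* constants (values assumed).
  integer, parameter :: status_start = 1, status_wait = 2, status_sync = 3
  ! Stand-ins for some comm_type enumerators (values assumed).
  integer, parameter :: scheme_unknown = -1, scheme_baseline = 0
  integer, parameter :: scheme_neighbor = 1, scheme_rma = 2

  type :: handle_t
    integer :: comm_type   = scheme_unknown
    integer :: swap_status = 0
  end type handle_t
contains
  subroutine swapdata(desc_comm_type, handle, mode, info)
    integer, intent(in) :: desc_comm_type, mode
    type(handle_t), allocatable, intent(inout) :: handle
    integer, intent(out) :: info

    info = 0
    ! Step 2: reject anything that is not start, wait or sync.
    if (.not. any(mode == [status_start, status_wait, status_sync])) then
      info = -1
      return
    end if
    ! Step 3: build the handle lazily from the descriptor's scheme.
    if (.not. allocated(handle)) then
      allocate(handle)
      handle%comm_type = desc_comm_type
    end if
    ! Step 4: record the requested status on the handle.
    handle%swap_status = mode
    ! Step 5: dispatch on the scheme stored in the handle.
    select case (handle%comm_type)
    case (scheme_baseline)
      print '(a,i0)', 'baseline swap, status ', mode
    case (scheme_neighbor)
      print '(a,i0)', 'neighbor swap, status ', mode
    case (scheme_rma)
      print '(a,i0)', 'rma swap, status ', mode
    case default
      info = -2   ! unknown scheme
    end select
  end subroutine swapdata
end module dispatch_sketch

program dispatch_demo
  use dispatch_sketch
  implicit none
  type(handle_t), allocatable :: h
  integer :: info

  call swapdata(scheme_neighbor, h, status_start, info)  ! builds the handle
  call swapdata(scheme_neighbor, h, status_wait, info)   ! reuses the handle
  call swapdata(scheme_neighbor, h, 42, info)            ! invalid mode
  print '(a,i0)', 'info after invalid mode: ', info
end program dispatch_demo
```

The point of the model is the ordering: validation happens before any handle is created, and the `select case` reads the scheme from the handle rather than from the descriptor, so a vector keeps using the scheme its handle was built with.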
## 2. The communication-handle abstraction
Each communication scheme is a class derived from the abstract type
`psb_comm_handle_type`
([base/modules/comm/comm_schemes/psb_comm_schemes_mod.F90](../../base/modules/comm/comm_schemes/psb_comm_schemes_mod.F90)):
```fortran
type, abstract :: psb_comm_handle_type
integer(psb_ipk_) :: id = -1
integer(psb_ipk_) :: comm_type = psb_comm_unknown_
integer(psb_ipk_) :: swap_status = psb_comm_status_unknown_
contains
procedure(psb_comm_set), deferred :: init
procedure(psb_comm_free), deferred :: free
procedure(psb_comm_set_swap_status), deferred :: set_swap_status
procedure(psb_comm_get_swap_status), deferred :: get_swap_status
end type
```
The handle stores all state that must survive between a split-phase `start` and
its matching `wait`: MPI requests, buffers, communicators, windows. It is
**cached on the vector** (`y%comm_handle`), so repeated exchanges of the same
vector reuse the same buffers and the same (possibly persistent) MPI objects.

A small **factory** builds and recycles handles
([base/modules/comm/comm_schemes/psb_comm_factory_mod.F90](../../base/modules/comm/comm_schemes/psb_comm_factory_mod.F90)):
- `psb_comm_set(comm_type, handle, info)` — allocate (or re-initialise in place)
the concrete handle matching the integer `comm_type`. If the handle already
exists with the same `comm_type` it is reset rather than reallocated,
preserving `id` and `swap_status`.
- `psb_comm_free(handle, info)` — release MPI resources and deallocate.
- `psb_comm_set_swap_status` / `psb_comm_get_swap_status` — thin wrappers over
the type-bound methods.
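
The allocate-or-reset behaviour of the factory can be illustrated with a polymorphic allocatable. This is a minimal sketch under stated assumptions, not the contents of `psb_comm_factory_mod.F90`: `handle_t`, `baseline_t`, `rma_t`, `free_if` and `comm_set` are hypothetical names, the integer constants are invented, only one deferred method is modelled, and "reset" is approximated by releasing scheme-specific state through `free`.

```fortran
module factory_sketch
  implicit none
  integer, parameter :: scheme_baseline = 0, scheme_rma = 1  ! assumed values

  type, abstract :: handle_t
    integer :: id = -1
    integer :: comm_type = -1
    integer :: swap_status = 0
  contains
    procedure(free_if), deferred :: free
  end type handle_t

  abstract interface
    subroutine free_if(this, info)
      import :: handle_t
      class(handle_t), intent(inout) :: this
      integer, intent(out) :: info
    end subroutine free_if
  end interface

  type, extends(handle_t) :: baseline_t
    integer, allocatable :: comid(:,:)   ! request IDs
  contains
    procedure :: free => baseline_free
  end type baseline_t

  type, extends(handle_t) :: rma_t
    integer :: win = -1                  ! stand-in for an MPI window
  contains
    procedure :: free => rma_free
  end type rma_t
contains
  subroutine baseline_free(this, info)
    class(baseline_t), intent(inout) :: this
    integer, intent(out) :: info
    if (allocated(this%comid)) deallocate(this%comid)
    info = 0
  end subroutine baseline_free

  subroutine rma_free(this, info)
    class(rma_t), intent(inout) :: this
    integer, intent(out) :: info
    this%win = -1
    info = 0
  end subroutine rma_free

  subroutine comm_set(comm_type, handle, info)
    integer, intent(in) :: comm_type
    class(handle_t), allocatable, intent(inout) :: handle
    integer, intent(out) :: info

    info = 0
    if (allocated(handle)) then
      if (handle%comm_type == comm_type) then
        ! Same scheme: clear scheme-specific state, keep id and swap_status.
        call handle%free(info)
        return
      end if
      ! Different scheme: release everything and rebuild below.
      call handle%free(info)
      deallocate(handle)
    end if
    select case (comm_type)
    case (scheme_baseline)
      allocate(baseline_t :: handle)
    case (scheme_rma)
      allocate(rma_t :: handle)
    case default
      info = -1
      return
    end select
    handle%comm_type = comm_type
  end subroutine comm_set
end module factory_sketch

program factory_demo
  use factory_sketch
  implicit none
  class(handle_t), allocatable :: h
  integer :: info

  call comm_set(scheme_baseline, h, info)
  h%id = 7
  call comm_set(scheme_baseline, h, info)   ! same type: reset in place
  print '(a,i0)', 'id after reset:   ', h%id   ! 7
  call comm_set(scheme_rma, h, info)        ! different type: rebuilt
  print '(a,i0)', 'id after rebuild: ', h%id   ! -1
end program factory_demo
```

Keeping the handle as a `class(...), allocatable` component is what lets one vector switch scheme without the caller knowing the concrete type.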
## 3. The available schemes
The concrete schemes are enumerated in `psb_comm_schemes_mod` and implemented in
one module each:
| `comm_type` value | Handle type | Module | MPI mechanism |
|---|---|---|---|
| `psb_comm_isend_irecv_` | `psb_comm_baseline_handle` | `psb_comm_baseline_mod.F90` | Point-to-point non-blocking `MPI_Isend` / `MPI_Irecv`; request IDs kept in `comid(:,:)`. This is the default and the historical PSBLAS behaviour. |
| `psb_comm_ineighbor_alltoallv_` | `psb_comm_neighbor_handle` | `psb_comm_neighbor_impl_mod.F90` | Distributed-graph (neighbourhood) collective `MPI_Ineighbor_alltoallv` over an `MPI_Dist_graph` communicator built from the true neighbours. |
| `psb_comm_persistent_ineighbor_alltoallv_` | `psb_comm_neighbor_handle` (with `use_persistent_buffers = .true.`) | `psb_comm_neighbor_impl_mod.F90` | As above, but using a **persistent** neighbour collective request initialised once and reused across exchanges. |
| `psb_comm_rma_pull_` | `psb_comm_rma_handle` | `psb_comm_rma_mod.F90` | One-sided RMA over an `MPI_Win`; each process *gets* (pulls) its halo from peers. |
| `psb_comm_rma_push_` | `psb_comm_rma_handle` | `psb_comm_rma_mod.F90` | One-sided RMA over an `MPI_Win`; each process *puts* (pushes) data into peers. |

All three handle types extend `psb_comm_handle_type` and add scheme-specific
state, for example:
- **baseline**: `comid(num_neighbors, 2)`, the Isend/Irecv request IDs.
- **neighbor**: `graph_comm`, per-neighbour `send_counts/recv_counts` and
  displacements, contiguous `send_indexes/recv_indexes`, plus persistent-request
  bookkeeping (`persistent_request`, `persistent_in_flight`, ...). The topology
  is built lazily on the first exchange and reused.
- **rma**: `win`, the peer layout arrays (`peer_send_*`, `peer_recv_*`,
  `peer_remote_*_displs`) and notification buffers.
## 4. The swap-status state machine
`mode` selects whether an exchange is a single synchronous operation or is split
into two phases so that communication can be overlapped with computation. The
status values live in `psb_comm_schemes_mod`:
```
psb_comm_status_unknown_   (handle not yet used)
psb_comm_status_start_     (post sends/recvs and return)
psb_comm_status_wait_      (complete a previously started exchange)
psb_comm_status_sync_      (start + wait in one call; the default)
```
```mermaid
stateDiagram-v2
[*] --> unknown
unknown --> sync: mode = sync
unknown --> start: mode = start
sync --> sync: repeated synchronous exchanges
start --> wait: mode = wait
wait --> start: next split-phase exchange
wait --> sync: switch back to synchronous
```
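
The diagram can also be read as a transition table. The sketch below encodes exactly the edges drawn above and nothing more; it is an illustration, not a check that PSBLAS is known to perform. The `st_*` constants and their values, and the function `allowed`, are hypothetical. Note that the diagram has no `sync --> start` edge, so the sketch reports that transition as not allowed.

```fortran
program swap_status_transitions
  implicit none
  ! Stand-ins for psb_comm_status_*; the integer values are assumptions.
  integer, parameter :: st_unknown = 0, st_start = 1, st_wait = 2, st_sync = 3

  print '(a,l1)', 'unknown -> start : ', allowed(st_unknown, st_start)  ! T
  print '(a,l1)', 'start   -> wait  : ', allowed(st_start, st_wait)     ! T
  print '(a,l1)', 'start   -> start : ', allowed(st_start, st_start)    ! F
  print '(a,l1)', 'sync    -> start : ', allowed(st_sync, st_start)     ! F
contains
  ! True when the diagram has an edge from prev to next.
  pure logical function allowed(prev, next)
    integer, intent(in) :: prev, next
    select case (prev)
    case (st_unknown, st_wait)
      allowed = (next == st_start) .or. (next == st_sync)
    case (st_sync)
      allowed = (next == st_sync)
    case (st_start)
      allowed = (next == st_wait)
    case default
      allowed = .false.
    end select
  end function allowed
end program swap_status_transitions
```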
Each scheme implementation maps the two phases onto its own post/complete pair
of MPI operations. The baseline scheme, for example, derives two flags from the
status:
```fortran
do_send = (swap_status == psb_comm_status_start_) .or. (swap_status == psb_comm_status_sync_)
do_recv = (swap_status == psb_comm_status_wait_) .or. (swap_status == psb_comm_status_sync_)
```
so `sync` posts and completes in one call, while `start` only posts and `wait`
only completes. **Contract:** between a `start` and its matching `wait` the halo
entries of the vector must not be read or written, and the same vector (which
carries the handle and its in-flight requests) must be passed to both calls.
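
The post/complete split is the usual non-blocking MPI idiom. The following stand-alone program is a generic illustration of that idiom on a ring of processes, written against the standard `mpi` module; it is not taken from the PSBLAS baseline scheme, and the program and variable names are made up for the example. It needs an MPI installation and is launched with the MPI launcher (for instance `mpirun`).

```fortran
program split_phase_ring
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, left, right, reqs(2)
  double precision :: send_val, halo_val, interior

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  left  = modulo(rank - 1, nprocs)
  right = modulo(rank + 1, nprocs)
  send_val = dble(rank)

  ! "start": post the receive and the send, then return immediately.
  call MPI_Irecv(halo_val, 1, MPI_DOUBLE_PRECISION, left, 0, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_Isend(send_val, 1, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, reqs(2), ierr)

  ! Overlapped work: reads only local data and never touches halo_val.
  interior = 2.0d0 * send_val

  ! "wait": complete both requests; halo_val is valid only after this call.
  call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
  print '(a,i0,a,f0.1,a,f0.1)', 'rank ', rank, ' halo=', halo_val, ' interior=', interior

  call MPI_Finalize(ierr)
end program split_phase_ring
```

The request array `reqs` plays the role of the state the handle carries between the two calls, which is why the same vector must be passed to both `start` and `wait`.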
> Numerical compatibility note: the legacy `psb_swap_send_`/`psb_swap_recv_` bit
> flags still exist in `psb_desc_const_mod`. `IOR(psb_swap_send_, psb_swap_recv_)`
> equals 3, which is the same integer as `psb_comm_status_sync_`, so old callers
> that passed the OR-ed flags keep getting a synchronous exchange.
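
The arithmetic behind the note is a single bitwise OR. In the sketch below the flag values 1 and 2 are assumptions chosen to be consistent with the stated result of 3; the real definitions live in `psb_desc_const_mod` and `psb_comm_schemes_mod`, and the local names are stand-ins.

```fortran
program legacy_flag_compat
  implicit none
  ! Assumed values: distinct single bits whose OR is 3, as the note states.
  integer, parameter :: swap_send = 1, swap_recv = 2
  integer, parameter :: status_sync = 3

  ! Bits 01 and 10 combine to 11, i.e. the integer 3.
  print '(a,i0)', 'ior(send, recv) = ', ior(swap_send, swap_recv)
  print '(a,l1)', 'equals sync     = ', ior(swap_send, swap_recv) == status_sync
end program legacy_flag_compat
```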
## 5. Selecting a scheme
The scheme is a property of the **descriptor**, not of the call. The default is
`psb_comm_isend_irecv_` (set in `psb_desc_type`,
[base/modules/desc/psb_desc_mod.F90](../../base/modules/desc/psb_desc_mod.F90)).
To change it:
```fortran
call desc%set_comm_scheme(psb_comm_ineighbor_alltoallv_, info)
```
This only records `desc%comm_type`; the matching handle is built lazily on the
next `psi_swapdata` call (step 3 in §1). Because the choice is global to the
descriptor and orthogonal to the public API, it is intentionally **not** part of
the user manual: end users get correct behaviour from the default, and the
scheme is an advanced tuning knob.
## 6. Adding a new scheme
1. Add an enumerator to `psb_comm_schemes_mod` (`psb_comm_*_`).
2. Create a module `psb_comm_<name>_mod.F90` with a type extending
`psb_comm_handle_type` and implementing the four deferred methods
(`init`, `free`, `set_swap_status`, `get_swap_status`).
3. Register it in the factory `psb_comm_set` (`psb_comm_factory_mod.F90`).
4. Add a `psi_<x>swap_<name>_vect` implementation and a `case` for it in the
`select case` of `psi_<x>swapdata_vect` for every data type, honouring the
start/wait/sync contract.
5. Wire the new module into the build (`base/modules/Makefile` /
`base/CMakeLists.txt`).