From dad4d3b894349479927057c7d8c83f4b2c278e60 Mon Sep 17 00:00:00 2001
From: Stack-1
Date: Sat, 13 Jun 2026 15:28:23 +0200
Subject: [PATCH] docs: add developer guide internals layer

Introduce docs/internals/ as a two-layer complement to the user manual:
- README.md: index plus the user-manual vs developer-guide rationale
- communication.md: the psb_halo/psb_ovrl -> psi_swapdata dispatch path,
  the psb_comm_handle_type hierarchy and factory, the five MPI schemes
  (isend/irecv, neighbor alltoallv, persistent, RMA pull/push), and the
  start/wait/sync swap-status state machine
---
 docs/internals/README.md        |  26 +++++
 docs/internals/communication.md | 179 ++++++++++++++++++++++++++++++++
 2 files changed, 205 insertions(+)
 create mode 100644 docs/internals/README.md
 create mode 100644 docs/internals/communication.md

diff --git a/docs/internals/README.md b/docs/internals/README.md
new file mode 100644
index 000000000..be5ea3896
--- /dev/null
+++ b/docs/internals/README.md
@@ -0,0 +1,26 @@
+# PSBLAS Developer Guide (Internals)
+
+This directory documents the **internal architecture** of PSBLAS: how the
+classes relate to each other and how the library is implemented. It is aimed
+at developers who modify or extend PSBLAS, not at end users.
+
+The split is deliberate and two-layered:
+
+- **User manual** (`docs/src/*.tex`, built into `psblas-3.9.pdf`) describes the
+  public API: what each routine does, its arguments, and the semantics a user
+  must know to call it correctly. For example, it documents that `psb_halo`
+  accepts a `mode` flag and what synchronous vs. split-phase exchange means for
+  the caller.
+- **Developer guide** (this directory) describes *how* those features are
+  implemented: the dispatch path, the communication-handle class hierarchy, the
+  MPI mechanisms behind each scheme, and the extension points. A user never
+  needs to read this to use `psb_halo`.
+
+When you add a feature, update the layer that matches its audience. A new public
+flag goes in the user manual; a new internal communication scheme goes here.
+
+## Contents
+
+- [communication.md](communication.md) — the communication subsystem: the
+  `psb_halo`/`psb_ovrl` → `psi_swapdata` → communication-handle dispatch path,
+  the available MPI schemes, and the swap-status state machine.
diff --git a/docs/internals/communication.md b/docs/internals/communication.md
new file mode 100644
index 000000000..f4cc264db
--- /dev/null
+++ b/docs/internals/communication.md
@@ -0,0 +1,179 @@
+# The communication subsystem
+
+This page describes how a halo (or overlap) exchange is actually carried out
+inside PSBLAS, from the public routine down to the MPI calls. It covers the
+dispatch path, the communication-handle class hierarchy, the available MPI
+schemes, and the swap-status state machine introduced on the
+`communication_v2` branch.
+
+> **Audience:** developers modifying or extending PSBLAS communication.
+> End users only need the `psb_halo` / `psb_ovrl` entries in the user manual.
+
+## 1. The call stack
+
+A data exchange flows through four layers. The type-specific names below use the
+`d` (double) variant; the same pattern exists for `s/c/z/i/l` etc.
+
+```
+psb_halo / psb_ovrl            base/comm/psb_dhalo.f90, psb_dovrl.f90
+      |   (public, type-bound generic; one per data type)
+      v
+psi_swapdata / psi_swaptran    base/comm/internals/psi_dswapdata.F90
+      |   (generic internal exchange; resolves the descriptor index list,
+      |    allocates/looks up the communication handle, dispatches on scheme)
+      v
+psi_dswap__vect                same file / helper routines
+      |   (one routine per MPI scheme: baseline, neighbor, persistent, rma)
+      v
+MPI Isend/Irecv, Ineighbor_alltoallv, RMA, ...
+```
+
+`psb_halo` itself does very little: it validates the vector, picks the index
+list selector (`data`, default `psb_comm_halo_`), the transpose flag (`tran`)
+and the communication mode (`mode`, default `psb_comm_status_sync_`), then calls
+`psi_swapdata` (or `psi_swaptran` for the transposed exchange).
+
+The real dispatch happens in `psi_dswapdata_vect`
+([base/comm/internals/psi_dswapdata.F90](../../base/comm/internals/psi_dswapdata.F90)):
+
+1. `desc_a%get_list_p(data_, comm_indexes, num_neighbors, total_recv, total_send, info)`
+   resolves the requested index list (halo / overlap / ext / mov) into the
+   per-neighbour send/recv structure.
+2. The `swap_status` argument (`mode`) is validated: it must be one of
+   `psb_comm_status_start_`, `psb_comm_status_wait_`, `psb_comm_status_sync_`.
+3. If the vector does not yet carry a communication handle, one is built from
+   the descriptor's scheme: `call psb_comm_set(desc_a%comm_type, y%comm_handle, info)`.
+4. The status is stored on the handle (`y%comm_handle%set_swap_status`).
+5. A `select case (y%comm_handle%comm_type)` dispatches to the matching
+   `psi_dswap__vect` implementation.
+
+## 2. The communication-handle abstraction
+
+Each communication scheme is a class derived from the abstract type
+`psb_comm_handle_type`
+([base/modules/comm/comm_schemes/psb_comm_schemes_mod.F90](../../base/modules/comm/comm_schemes/psb_comm_schemes_mod.F90)):
+
+```fortran
+type, abstract :: psb_comm_handle_type
+  integer(psb_ipk_) :: id = -1
+  integer(psb_ipk_) :: comm_type = psb_comm_unknown_
+  integer(psb_ipk_) :: swap_status = psb_comm_status_unknown_
+contains
+  procedure(psb_comm_set), deferred :: init
+  procedure(psb_comm_free), deferred :: free
+  procedure(psb_comm_set_swap_status), deferred :: set_swap_status
+  procedure(psb_comm_get_swap_status), deferred :: get_swap_status
+end type
+```
+
+The handle stores all state that must survive between a split-phase `start` and
+its matching `wait`: MPI requests, buffers, communicators, windows. It is
+**cached on the vector** (`y%comm_handle`), so repeated exchanges of the same
+vector reuse the same buffers and the same (possibly persistent) MPI objects.
+
+A small **factory** builds and recycles handles
+([base/modules/comm/comm_schemes/psb_comm_factory_mod.F90](../../base/modules/comm/comm_schemes/psb_comm_factory_mod.F90)):
+
+- `psb_comm_set(comm_type, handle, info)` — allocate (or re-initialise in place)
+  the concrete handle matching the integer `comm_type`. If the handle already
+  exists with the same `comm_type` it is reset rather than reallocated,
+  preserving `id` and `swap_status`.
+- `psb_comm_free(handle, info)` — release MPI resources and deallocate.
+- `psb_comm_set_swap_status` / `psb_comm_get_swap_status` — thin wrappers over
+  the type-bound methods.
+
+## 3. The available schemes
+
+The concrete schemes are enumerated in `psb_comm_schemes_mod` and implemented in
+one module each:
+
+| `comm_type` value | Handle type | Module | MPI mechanism |
+|---|---|---|---|
+| `psb_comm_isend_irecv_` | `psb_comm_baseline_handle` | `psb_comm_baseline_mod.F90` | Point-to-point non-blocking `MPI_Isend` / `MPI_Irecv`; request IDs kept in `comid(:,:)`. This is the default and the historical PSBLAS behaviour. |
+| `psb_comm_ineighbor_alltoallv_` | `psb_comm_neighbor_handle` | `psb_comm_neighbor_impl_mod.F90` | Distributed-graph (neighbourhood) collective `MPI_Ineighbor_alltoallv` over an `MPI_Dist_graph` communicator built from the true neighbours. |
+| `psb_comm_persistent_ineighbor_alltoallv_` | `psb_comm_neighbor_handle` (with `use_persistent_buffers = .true.`) | `psb_comm_neighbor_impl_mod.F90` | As above, but using a **persistent** neighbour collective request initialised once and reused across exchanges. |
+| `psb_comm_rma_pull_` | `psb_comm_rma_handle` | `psb_comm_rma_mod.F90` | One-sided RMA over an `MPI_Win`; each process *gets* (pulls) its halo from peers. |
+| `psb_comm_rma_push_` | `psb_comm_rma_handle` | `psb_comm_rma_mod.F90` | One-sided RMA over an `MPI_Win`; each process *puts* (pushes) data into peers. |
+
+All three handle types extend `psb_comm_handle_type` and add scheme-specific
+state, for example:
+
+- **baseline** — `comid(num_neighbors, 2)`, the Isend/Irecv request IDs.
+- **neighbor** — `graph_comm`, per-neighbour `send_counts/recv_counts` and
+  displacements, contiguous `send_indexes/recv_indexes`, plus persistent-request
+  bookkeeping (`persistent_request`, `persistent_in_flight`, ...). The topology
+  is built lazily on the first exchange and reused.
+- **rma** — `win`, the peer layout arrays (`peer_send_*`, `peer_recv_*`,
+  `peer_remote_*_displs`) and notification buffers.
+
+## 4. The swap-status state machine
+
+`mode` selects whether an exchange is a single synchronous operation or is split
+into two phases so that communication can be overlapped with computation. The
+status values live in `psb_comm_schemes_mod`:
+
+```
+psb_comm_status_unknown_   (handle not yet used)
+psb_comm_status_start_     (post sends/recvs and return)
+psb_comm_status_wait_      (complete a previously started exchange)
+psb_comm_status_sync_      (start + wait in one call; the default)
+```
+
+```mermaid
+stateDiagram-v2
+    [*] --> unknown
+    unknown --> sync: mode = sync
+    unknown --> start: mode = start
+    sync --> sync: repeated synchronous exchanges
+    start --> wait: mode = wait
+    wait --> start: next split-phase exchange
+    wait --> sync: switch back to synchronous
+```
+
+In the scheme implementations the two phases map onto the obvious MPI pairs, for
+example in the baseline scheme:
+
+```fortran
+do_send = (swap_status == psb_comm_status_start_) .or. (swap_status == psb_comm_status_sync_)
+do_recv = (swap_status == psb_comm_status_wait_) .or. (swap_status == psb_comm_status_sync_)
+```
+
+so `sync` posts and completes in one call, while `start` only posts and `wait`
+only completes. **Contract:** between a `start` and its matching `wait` the halo
+entries of the vector must not be read or written, and the same vector (which
+carries the handle and its in-flight requests) must be passed to both calls.
+
+> Numerical compatibility note: the legacy `psb_swap_send_`/`psb_swap_recv_` bit
+> flags still exist in `psb_desc_const_mod`. `IOR(psb_swap_send_, psb_swap_recv_)`
+> equals 3, which is the same integer as `psb_comm_status_sync_`, so old callers
+> that passed the OR-ed flags keep getting a synchronous exchange.
+
+## 5. Selecting a scheme
+
+The scheme is a property of the **descriptor**, not of the call. The default is
+`psb_comm_isend_irecv_` (set in `psb_desc_type`,
+[base/modules/desc/psb_desc_mod.F90](../../base/modules/desc/psb_desc_mod.F90)).
+To change it:
+
+```fortran
+call desc%set_comm_scheme(psb_comm_ineighbor_alltoallv_, info)
+```
+
+This only records `desc%comm_type`; the matching handle is built lazily on the
+next `psi_swapdata` call (step 3 in §1). Because the choice is global to the
+descriptor and orthogonal to the public API, it is intentionally **not** part of
+the user manual: end users get correct behaviour from the default, and the
+scheme is an advanced tuning knob.
+
+## 6. Adding a new scheme
+
+1. Add an enumerator to `psb_comm_schemes_mod` (`psb_comm_*_`).
+2. Create a module `psb_comm__mod.F90` with a type extending
+   `psb_comm_handle_type` and implementing the four deferred methods
+   (`init`, `free`, `set_swap_status`, `get_swap_status`).
+3. Register it in the factory `psb_comm_set` (`psb_comm_factory_mod.F90`).
+4. Add a `psi_swap__vect` implementation and a `case` for it in the
+   `select case` of `psi_swapdata_vect` for every data type, honouring the
+   start/wait/sync contract.
+5. Wire the new module into the build (`base/modules/Makefile` /
+   `base/CMakeLists.txt`).
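
The start/wait/sync contract that step 4 asks every new scheme to honour can be sketched in isolation. The standalone program below is only an illustration, not PSBLAS code: it reuses the `do_send`/`do_recv` predicate pair shown for the baseline scheme in §4 and prints which phases run for each mode. The `demo_status_*_` constants and the `phases` helper are hypothetical names. The values 1 and 2 for start and wait are assumptions (only `psb_comm_status_sync_` being 3 is stated in the compatibility note); the real constants live in `psb_comm_schemes_mod`.

```fortran
program swap_status_demo
  implicit none
  ! Placeholder values for illustration only. The real constants are defined
  ! in psb_comm_schemes_mod; only sync = 3 is stated in the compatibility note.
  integer, parameter :: demo_status_start_ = 1
  integer, parameter :: demo_status_wait_  = 2
  integer, parameter :: demo_status_sync_  = 3
  integer, parameter :: modes(3) = &
       [demo_status_start_, demo_status_wait_, demo_status_sync_]
  character(len=5), parameter :: names(3) = &
       [character(len=5) :: 'start', 'wait', 'sync']
  logical :: do_send, do_recv
  integer :: i

  ! Report, for each mode, whether a call posts and/or completes the exchange.
  do i = 1, size(modes)
    call phases(modes(i), do_send, do_recv)
    print '(a5,2(a,l1))', names(i), '  post=', do_send, '  complete=', do_recv
  end do

contains

  ! Same predicate pair as the baseline scheme: start and sync post the
  ! exchange, wait and sync complete it.
  subroutine phases(swap_status, do_send, do_recv)
    integer, intent(in)  :: swap_status
    logical, intent(out) :: do_send, do_recv
    do_send = (swap_status == demo_status_start_) .or. &
              (swap_status == demo_status_sync_)
    do_recv = (swap_status == demo_status_wait_) .or. &
              (swap_status == demo_status_sync_)
  end subroutine phases

end program swap_status_demo
```

A new `psi_swap__vect` routine would typically guard its posting code with `do_send` and its completion code with `do_recv`, so that a `start` call returns with requests in flight and the matching `wait` call on the same vector completes them.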