# The communication subsystem

This page describes how a halo (or overlap) exchange is actually carried out
inside PSBLAS, from the public routine down to the MPI calls. It covers the
dispatch path, the communication-handle class hierarchy, the available MPI
schemes, and the swap-status state machine introduced on the
`communication_v2` branch.

> **Audience:** developers modifying or extending PSBLAS communication.
> End users only need the `psb_halo` / `psb_ovrl` entries in the user manual.

## 1. The call stack

A data exchange flows through four layers. The type-specific names below use the
`d` (double) variant; the same pattern exists for `s/c/z/i/l` etc.

```
psb_halo / psb_ovrl             base/comm/psb_dhalo.f90, psb_dovrl.f90
      |   (public, type-bound generic; one per data type)
      v
psi_swapdata / psi_swaptran     base/comm/internals/psi_dswapdata.F90
      |   (generic internal exchange; resolves the descriptor index list,
      |    allocates/looks up the communication handle, dispatches on scheme)
      v
psi_dswap_<scheme>_vect         same file / helper routines
      |   (one routine per MPI scheme: baseline, neighbor, persistent, rma)
      v
MPI Isend/Irecv, Ineighbor_alltoallv, RMA, ...
```

`psb_halo` itself does very little: it validates the vector, picks the index
list selector (`data`, default `psb_comm_halo_`), the transpose flag (`tran`)
and the communication mode (`mode`, default `psb_comm_status_sync_`), then calls
`psi_swapdata` (or `psi_swaptran` for the transposed exchange).

The real dispatch happens in `psi_dswapdata_vect`
([base/comm/internals/psi_dswapdata.F90](../../base/comm/internals/psi_dswapdata.F90)):

1. `desc_a%get_list_p(data_, comm_indexes, num_neighbors, total_recv, total_send, info)`
   resolves the requested index list (halo / overlap / ext / mov) into the
   per-neighbour send/recv structure.
2. The `swap_status` argument (`mode`) is validated: it must be one of
   `psb_comm_status_start_`, `psb_comm_status_wait_`, `psb_comm_status_sync_`.
3. If the vector does not yet carry a communication handle, one is built from
   the descriptor's scheme: `call psb_comm_set(desc_a%comm_type, y%comm_handle, info)`.
4. The status is stored on the handle (`y%comm_handle%set_swap_status`).
5. A `select case (y%comm_handle%comm_type)` dispatches to the matching
   `psi_dswap_<scheme>_vect` implementation.

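
The dispatch in step 5 can be pictured with a small standalone model. This is a toy sketch, not PSBLAS code: the `toy_*` constants and their integer values are hypothetical stand-ins for the enumerators in `psb_comm_schemes_mod`, and the routine names are only assumed from the `psi_dswap_<scheme>_vect` pattern shown in the call stack.

```fortran
module toy_dispatch_mod
  implicit none
  ! Hypothetical values: the real enumerators live in psb_comm_schemes_mod
  ! and their integer values are not stated on this page.
  integer, parameter :: toy_isend_irecv_                    = 1
  integer, parameter :: toy_ineighbor_alltoallv_            = 2
  integer, parameter :: toy_persistent_ineighbor_alltoallv_ = 3
  integer, parameter :: toy_rma_pull_                       = 4
  integer, parameter :: toy_rma_push_                       = 5
contains
  ! Return the name of the scheme routine that a given comm_type would
  ! select, or 'invalid' for an unrecognised value.
  ! Names are assumed from the psi_dswap_<scheme>_vect pattern.
  function toy_dispatch(comm_type) result(name)
    integer, intent(in) :: comm_type
    character(len=:), allocatable :: name
    select case (comm_type)
    case (toy_isend_irecv_)
      name = 'psi_dswap_baseline_vect'
    case (toy_ineighbor_alltoallv_)
      name = 'psi_dswap_neighbor_vect'
    case (toy_persistent_ineighbor_alltoallv_)
      name = 'psi_dswap_persistent_vect'
    case (toy_rma_pull_, toy_rma_push_)
      ! Both RMA directions are assumed to share one routine here.
      name = 'psi_dswap_rma_vect'
    case default
      name = 'invalid'
    end select
  end function toy_dispatch
end module toy_dispatch_mod
```

A caller would `use toy_dispatch_mod` and pass the scheme stored on the handle, for example `toy_dispatch(toy_rma_pull_)`. The `case default` branch stands in for the error path taken when the handle carries an unknown `comm_type`.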
## 2. The communication-handle abstraction

Each communication scheme is a class derived from the abstract type
`psb_comm_handle_type`
([base/modules/comm/comm_schemes/psb_comm_schemes_mod.F90](../../base/modules/comm/comm_schemes/psb_comm_schemes_mod.F90)):

```fortran
type, abstract :: psb_comm_handle_type
  integer(psb_ipk_) :: id          = -1
  integer(psb_ipk_) :: comm_type   = psb_comm_unknown_
  integer(psb_ipk_) :: swap_status = psb_comm_status_unknown_
contains
  procedure(psb_comm_set),             deferred :: init
  procedure(psb_comm_free),            deferred :: free
  procedure(psb_comm_set_swap_status), deferred :: set_swap_status
  procedure(psb_comm_get_swap_status), deferred :: get_swap_status
end type
```

The handle stores all state that must survive between a split-phase `start` and
its matching `wait`: MPI requests, buffers, communicators, windows. It is
**cached on the vector** (`y%comm_handle`), so repeated exchanges of the same
vector reuse the same buffers and the same (possibly persistent) MPI objects.

A small **factory** builds and recycles handles
([base/modules/comm/comm_schemes/psb_comm_factory_mod.F90](../../base/modules/comm/comm_schemes/psb_comm_factory_mod.F90)):

- `psb_comm_set(comm_type, handle, info)`: allocate (or re-initialise in place)
  the concrete handle matching the integer `comm_type`. If the handle already
  exists with the same `comm_type` it is reset rather than reallocated,
  preserving `id` and `swap_status`.
- `psb_comm_free(handle, info)`: release MPI resources and deallocate.
- `psb_comm_set_swap_status` / `psb_comm_get_swap_status`: thin wrappers over
  the type-bound methods.

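
The allocate-or-reuse behaviour of the factory can be sketched in a self-contained model. Everything prefixed `toy_` is hypothetical and exists only for illustration; the real `psb_comm_set` manages MPI resources and more state than shown. The `n_init` counter is an invented field that makes in-place reuse observable.

```fortran
module toy_handle_mod
  implicit none
  ! Hypothetical scheme identifiers, not the PSBLAS enumerators.
  integer, parameter :: toy_unknown_ = -1, toy_baseline_ = 1, toy_rma_ = 2

  type, abstract :: toy_handle_type
    integer :: comm_type = toy_unknown_
    integer :: n_init    = 0        ! invented: counts init calls
  contains
    procedure(toy_init_if), deferred :: init
  end type toy_handle_type

  abstract interface
    subroutine toy_init_if(this, info)
      import :: toy_handle_type
      class(toy_handle_type), intent(inout) :: this
      integer, intent(out) :: info
    end subroutine toy_init_if
  end interface

  type, extends(toy_handle_type) :: toy_baseline_handle
  contains
    procedure :: init => toy_baseline_init
  end type toy_baseline_handle

  type, extends(toy_handle_type) :: toy_rma_handle
  contains
    procedure :: init => toy_rma_init
  end type toy_rma_handle

contains

  subroutine toy_baseline_init(this, info)
    class(toy_baseline_handle), intent(inout) :: this
    integer, intent(out) :: info
    this%comm_type = toy_baseline_
    this%n_init    = this%n_init + 1
    info = 0
  end subroutine toy_baseline_init

  subroutine toy_rma_init(this, info)
    class(toy_rma_handle), intent(inout) :: this
    integer, intent(out) :: info
    this%comm_type = toy_rma_
    this%n_init    = this%n_init + 1
    info = 0
  end subroutine toy_rma_init

  ! Factory: keep the existing handle when the scheme matches, otherwise
  ! replace it with a freshly allocated concrete type, then (re)initialise.
  subroutine toy_comm_set(comm_type, handle, info)
    integer, intent(in) :: comm_type
    class(toy_handle_type), allocatable, intent(inout) :: handle
    integer, intent(out) :: info
    info = 0
    if (allocated(handle)) then
      if (handle%comm_type /= comm_type) deallocate(handle)
    end if
    if (.not. allocated(handle)) then
      select case (comm_type)
      case (toy_baseline_)
        allocate(toy_baseline_handle :: handle)
      case (toy_rma_)
        allocate(toy_rma_handle :: handle)
      case default
        info = -1                   ! unknown scheme: leave handle unallocated
        return
      end select
    end if
    call handle%init(info)
  end subroutine toy_comm_set
end module toy_handle_mod
```

Calling `toy_comm_set` twice with the same scheme re-initialises the existing object, while switching scheme swaps the dynamic type behind the same `class(...), allocatable` variable. That is the property that lets a vector carry one polymorphic handle component regardless of scheme.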
## 3. The available schemes

The concrete schemes are enumerated in `psb_comm_schemes_mod`; each handle type
is implemented in its own module:

| `comm_type` value | Handle type | Module | MPI mechanism |
|---|---|---|---|
| `psb_comm_isend_irecv_` | `psb_comm_baseline_handle` | `psb_comm_baseline_mod.F90` | Point-to-point non-blocking `MPI_Isend` / `MPI_Irecv`; request IDs kept in `comid(:,:)`. This is the default and the historical PSBLAS behaviour. |
| `psb_comm_ineighbor_alltoallv_` | `psb_comm_neighbor_handle` | `psb_comm_neighbor_impl_mod.F90` | Distributed-graph (neighbourhood) collective `MPI_Ineighbor_alltoallv` over an `MPI_Dist_graph` communicator built from the true neighbours. |
| `psb_comm_persistent_ineighbor_alltoallv_` | `psb_comm_neighbor_handle` (with `use_persistent_buffers = .true.`) | `psb_comm_neighbor_impl_mod.F90` | As above, but using a **persistent** neighbour collective request initialised once and reused across exchanges. |
| `psb_comm_rma_pull_` | `psb_comm_rma_handle` | `psb_comm_rma_mod.F90` | One-sided RMA over an `MPI_Win`; each process *gets* (pulls) its halo from peers. |
| `psb_comm_rma_push_` | `psb_comm_rma_handle` | `psb_comm_rma_mod.F90` | One-sided RMA over an `MPI_Win`; each process *puts* (pushes) data into peers. |

All three handle types extend `psb_comm_handle_type` and add scheme-specific
state, for example:

- **baseline**: `comid(num_neighbors, 2)`, the Isend/Irecv request IDs.
- **neighbor**: `graph_comm`, per-neighbour `send_counts/recv_counts` and
  displacements, contiguous `send_indexes/recv_indexes`, plus persistent-request
  bookkeeping (`persistent_request`, `persistent_in_flight`, ...). The topology
  is built lazily on the first exchange and reused.
- **rma**: `win`, the peer layout arrays (`peer_send_*`, `peer_recv_*`,
  `peer_remote_*_displs`) and notification buffers.

## 4. The swap-status state machine

`mode` selects whether an exchange is a single synchronous operation or is split
into two phases so that communication can be overlapped with computation. The
status values live in `psb_comm_schemes_mod`:

```
psb_comm_status_unknown_   (handle not yet used)
psb_comm_status_start_     (post sends/recvs and return)
psb_comm_status_wait_      (complete a previously started exchange)
psb_comm_status_sync_      (start + wait in one call; the default)
```

```
synchronous (default):  unknown --> sync --> sync --> ...
split-phase:            unknown --> start --> wait --> start --> wait --> ...

transitions (driven by the mode argument):
  unknown --(mode=sync)-->  sync    single-call exchange (post + complete)
  unknown --(mode=start)--> start   post sends/recvs, then return
  start   --(mode=wait)-->  wait    complete the started exchange
  wait    --(mode=start)--> start   begin the next split-phase exchange
  wait    --(mode=sync)-->  sync    switch back to synchronous
  sync    --(mode=sync)-->  sync    repeated synchronous exchanges
```

In the scheme implementations the two phases map onto posting and completing
the MPI operations. The status is turned into two flags, for example in the
baseline scheme:

```fortran
do_send = (swap_status == psb_comm_status_start_) .or. (swap_status == psb_comm_status_sync_)
do_recv = (swap_status == psb_comm_status_wait_)  .or. (swap_status == psb_comm_status_sync_)
```

so `sync` posts and completes in one call, while `start` only posts and `wait`
only completes. **Contract:** between a `start` and its matching `wait` the halo
entries of the vector must not be read or written, and the same vector (which
carries the handle and its in-flight requests) must be passed to both calls.

> Numerical compatibility note: the legacy `psb_swap_send_`/`psb_swap_recv_` bit
> flags still exist in `psb_desc_const_mod`. `IOR(psb_swap_send_, psb_swap_recv_)`
> equals 3, which is the same integer as `psb_comm_status_sync_`, so old callers
> that passed the OR-ed flags keep getting a synchronous exchange.

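
The transition table and the phase flags can be captured in a short standalone model. This is a sketch under stated assumptions, not the PSBLAS implementation: the `toy_*` names are hypothetical, only the value 3 for `sync` is taken from the compatibility note, and the other integer values are invented. The checker accepts exactly the six transitions listed in this section, which may be stricter than what the real code enforces.

```fortran
module toy_swap_status_mod
  implicit none
  integer, parameter :: toy_status_unknown_ = -1  ! assumed value
  integer, parameter :: toy_status_start_   = 1   ! assumed value
  integer, parameter :: toy_status_wait_    = 2   ! assumed value
  integer, parameter :: toy_status_sync_    = 3   ! from the compatibility note
contains

  ! True if moving from the current status to the requested mode is one
  ! of the six transitions listed in the table; false otherwise.
  logical function toy_valid_transition(current, mode) result(ok)
    integer, intent(in) :: current, mode
    select case (current)
    case (toy_status_unknown_)
      ok = (mode == toy_status_sync_) .or. (mode == toy_status_start_)
    case (toy_status_start_)
      ok = (mode == toy_status_wait_)
    case (toy_status_wait_)
      ok = (mode == toy_status_start_) .or. (mode == toy_status_sync_)
    case (toy_status_sync_)
      ok = (mode == toy_status_sync_)
    case default
      ok = .false.
    end select
  end function toy_valid_transition

  ! Same flag logic as the baseline-scheme excerpt: start posts,
  ! wait completes, sync does both.
  subroutine toy_phase_flags(swap_status, do_send, do_recv)
    integer, intent(in)  :: swap_status
    logical, intent(out) :: do_send, do_recv
    do_send = (swap_status == toy_status_start_) .or. (swap_status == toy_status_sync_)
    do_recv = (swap_status == toy_status_wait_)  .or. (swap_status == toy_status_sync_)
  end subroutine toy_phase_flags
end module toy_swap_status_mod
```

With these definitions a second `start` issued before the matching `wait` is rejected (`toy_valid_transition(toy_status_start_, toy_status_start_)` is false), which is the situation the contract forbids.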
## 5. Selecting a scheme

The scheme is a property of the **descriptor**, not of the call. The default is
`psb_comm_isend_irecv_` (set in `psb_desc_type`,
[base/modules/desc/psb_desc_mod.F90](../../base/modules/desc/psb_desc_mod.F90)).
To change it:

```fortran
call desc%set_comm_scheme(psb_comm_ineighbor_alltoallv_, info)
```

This only records `desc%comm_type`; the matching handle is built lazily on the
next `psi_swapdata` call (step 3 in §1). Because the choice is global to the
descriptor and orthogonal to the public API, it is intentionally **not** part of
the user manual: end users get correct behaviour from the default, and the
scheme is an advanced tuning knob.

## 6. Adding a new scheme

1. Add an enumerator to `psb_comm_schemes_mod` (`psb_comm_*_`).
2. Create a module `psb_comm_<name>_mod.F90` with a type extending
   `psb_comm_handle_type` and implementing the four deferred methods
   (`init`, `free`, `set_swap_status`, `get_swap_status`).
3. Register it in the factory `psb_comm_set` (`psb_comm_factory_mod.F90`).
4. Add a `psi_<x>swap_<name>_vect` implementation and a `case` for it in the
   `select case` of `psi_<x>swapdata_vect` for every data type, honouring the
   start/wait/sync contract.
5. Wire the new module into the build (`base/modules/Makefile` /
   `base/CMakeLists.txt`).