<!--l. 4--><pclass="noindent">For computing with CUDA we define a dual memorization strategy in which each
variable on the CPU (“host”) side has a GPU (“device”) side. When a GPU-type
variable is initialized, the data contained is (usually) the same on both sides. Each
@ -801,10 +812,10 @@ operator invoked on the variable may change the data so that only the host side
the device side are up-to-date.
<!--l. 11--><pclass="indent"> Keeping track of the updates to data in the variables is essential: we want to
perform most computations on the GPU, but we cannot afford the time needed to
move data between the host memory and the device memory because the bandwidth
of the interconnection bus would become the main bottleneck of the computation.
Thus, each and every computational routine in the library is built according to the
following principles:
<ulclass="itemize1">
<liclass="itemize">
<!--l. 18--><pclass="noindent">If the data type being handled is GPU-enabled, make sure that its device
@ -818,20 +829,20 @@ following principles:
<dlclass="description"><dtclass="description">
<!--l. 25--><pclass="noindent">
<span
class="cmbx-10">explicitly</span></dt><dd
class="pplb7t-">explicitly</span></dt><dd
class="description">
<!--l. 25--><pclass="noindent">by invoking a synchronization method;
</dd><dtclass="description">
<!--l. 26--><pclass="noindent">
<span
class="cmbx-10">implicitly</span></dt><dd
class="pplb7t-">implicitly</span></dt><dd
class="description">
<!--l. 26--><pclass="noindent">by invoking a method that involves other data items that are not
GPU-enabled, e.g., by assignment ov a vector to a normal array.</dd></dl>
</li></ul>
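<p class="noindent">As a sketch of the two synchronization paths for a double-precision vector (the <code class="lstinline"><span style="color:#000000">sync</span></code> and <code class="lstinline"><span style="color:#000000">get_vect</span></code> method names are assumptions taken from the base PSBLAS vector interface and should be checked against the installed version):
<pre class="lstlisting">
subroutine sync_sketch(x)
  ! Hedged sketch: explicit vs. implicit synchronization of a GPU-enabled vector.
  ! Method names (sync, get_vect) are assumptions from the base vector interface.
  use psb_base_mod
  implicit none
  type(psb_d_vect_type), intent(inout) :: x
  real(psb_dpk_), allocatable :: v(:)

  ! Explicit: ask the vector to bring its host copy up to date.
  call x%sync()

  ! Implicit: copying the entries into a plain Fortran array needs host-side
  ! data, so the library refreshes the host copy before returning it.
  v = x%get_vect()
end subroutine sync_sketch
</pre>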
<!--l. 31--><pclass="noindent">In this way, data items are put on the GPU memory “on demand” and remain there as
long as “normal” computations are carried out. As an example, the following call to a
matrix-vector product
<divclass="center"
>
<!--l. 39--><pclass="noindent">
@ -850,11 +861,11 @@ then
<!--l. 52--><pclass="noindent">The first kernel invocation will find the data in main memory, and will
copy it to the GPU memory, thus incurring a significant overhead; the
result is however <span
class="cmti-10">not </span>copied back, and therefore:
class="pplri7t-">not </span>copied back, and therefore:
</li>
<liclass="itemize">
<!--l. 56--><pclass="noindent">Subsequent kernel invocations involving the same vector will find the data
on the GPU side so that they will run at full speed.</li></ul>
<!--l. 56--><pclass="noindent">Subsequent kernel invocations involving the same vector will find the
data on the GPU side so that they will run at full speed.</li></ul>
<!--l. 60--><pclass="noindent">For all invocations after the first the only data that will have to be transferred to/from
the main memory will be the scalars <codeclass="lstinline"><spanstyle="color:#000000">alpha</span></code> and <codeclass="lstinline"><spanstyle="color:#000000">beta</span></code>, and the return code
@ -862,7 +873,7 @@ the main memory will be the scalars <code class="lstinline"><span style="color:#
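<p class="noindent">A sketch of this usage pattern, assuming <code class="lstinline"><span style="color:#000000">a</span></code>, <code class="lstinline"><span style="color:#000000">x</span></code>, <code class="lstinline"><span style="color:#000000">y</span></code>, <code class="lstinline"><span style="color:#000000">desc_a</span></code> and the scalars <code class="lstinline"><span style="color:#000000">alpha</span></code>, <code class="lstinline"><span style="color:#000000">beta</span></code> have already been set up with GPU-enabled storage:
<pre class="lstlisting">
! Hedged sketch: repeated products on GPU-enabled data. Only the first call
! pays the host-to-device copy; later iterations move just alpha, beta, info.
integer(psb_ipk_) :: it, nsteps, info

nsteps = 100
do it = 1, nsteps
  call psb_spmm(alpha, a, x, beta, y, desc_a, info)
  if (info /= psb_success_) exit
end do
</pre>
<p class="noindent">The GPU-enabled data types currently provided are the following: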
<dlclass="description"><dtclass="description">
<!--l. 65--><pclass="noindent">
<span
class="cmbx-10">Vectors:</span></dt><dd
class="pplb7t-">Vectors:</span></dt><dd
class="description">
<!--l. 65--><pclass="noindent">The data type <codeclass="lstinline"><spanstyle="color:#000000">psb_T_vect_gpu</span></code> provides a GPU-enabled extension of
the inner type <codeclass="lstinline"><spanstyle="color:#000000">psb_T_base_vect_type</span></code>, and must be used together with
@ -871,23 +882,23 @@ class="description">
</dd><dt class="description">
<!--l. 69--><p class="noindent">
<span
class="pplb7t-">CSR:</span></dt><dd
class="description">
<!--l. 69--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_csrg_sparse_mat</span></code> provides an interface to the GPU
version of CSR available in the NVIDIA cuSPARSE library;
</dd><dt class="description">
<!--l. 72--><p class="noindent">
<span
class="pplb7t-">HYB:</span></dt><dd
class="description">
<!--l. 72--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_hybg_sparse_mat</span></code> provides an interface to the HYB
GPU storage available in the NVIDIA cuSPARSE library. The internal
structure is opaque, hence the host side is just CSR; the HYB data format
is only available up to CUDA version 10.
</dd><dt class="description">
<!--l. 77--><p class="noindent">
<span
class="pplb7t-">ELL:</span></dt><dd
class="description">
<!--l. 77--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_elg_sparse_mat</span></code> provides an interface to the
ELLPACK implementation from SPGPU;
</dd><dt class="description">
<!--l. 80--><p class="noindent">
<span
class="pplb7t-">HLL:</span></dt><dd
class="description">
<!--l. 80--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_hlg_sparse_mat</span></code> provides an interface to the
Hacked ELLPACK implementation from SPGPU;
</dd><dt class="description">
<!--l. 82--><p class="noindent">
<span
class="pplb7t-">HDIA:</span></dt><dd
class="description">
<!--l. 82--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_hdiag_sparse_mat</span></code> provides an interface to the
Hacked DIAgonals implementation from SPGPU;</dd></dl>
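<p class="noindent">These storage formats are typically requested at assembly time through the <code class="lstinline"><span style="color:#000000">mold</span></code> arguments of <code class="lstinline"><span style="color:#000000">psb_spasb</span></code> and <code class="lstinline"><span style="color:#000000">psb_geasb</span></code>. A sketch for the double-precision case follows; the module name <code class="lstinline"><span style="color:#000000">psb_cuda_mod</span></code> is an assumption and should be checked against the installed library:
<pre class="lstlisting">
subroutine assemble_on_gpu(a, x, y, desc_a, info)
  ! Hedged sketch: request GPU-enabled storage at assembly time via "mold".
  use psb_base_mod
  use psb_cuda_mod      ! assumed name of the module providing the GPU types
  implicit none
  type(psb_dspmat_type), intent(inout) :: a
  type(psb_d_vect_type), intent(inout) :: x, y
  type(psb_desc_type),   intent(inout) :: desc_a
  integer(psb_ipk_),     intent(out)   :: info
  type(psb_d_hlg_sparse_mat) :: amold   ! Hacked ELLPACK device storage
  type(psb_d_vect_gpu)       :: vmold   ! GPU-enabled vector holder

  ! a, x, y are assumed to be in the build state here.
  call psb_spasb(a, desc_a, info, mold=amold)
  if (info == psb_success_) call psb_geasb(x, desc_a, info, mold=vmold)
  if (info == psb_success_) call psb_geasb(y, desc_a, info, mold=vmold)
  ! Subsequent psb_spmm calls on these objects then run on the device.
end subroutine assemble_on_gpu
</pre>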
<!--l. 112--><pclass="noindent">ID of CUDA device to attach to.<br
class="newline" />Scope: <span
class="cmbx-10">local</span>.<br
class="pplb7t-">local</span>.<br
class="newline" />Type: <span
class="cmbx-10">optional</span>.<br
class="pplb7t-">optional</span>.<br
class="newline" />Intent: <span
class="cmbx-10">in</span>.<br
class="pplb7t-">in</span>.<br
class="newline" />Specified as: an integer value.  Default: use <codeclass="lstinline"><spanstyle="color:#000000">mod</span><spanstyle="color:#000000">(</span><spanstyle="color:#000000">iam</span><spanstyle="color:#000000">,</span><spanstyle="color:#000000">ngpu</span><spanstyle="color:#000000">)</span></code> where <codeclass="lstinline"><spanstyle="color:#000000">iam</span></code> is
the calling process index and <codeclass="lstinline"><spanstyle="color:#000000">ngpu</span></code> is the total number of CUDA devices
available on the current node.</dd></dl>
<!--l. 123--><pclass="noindent"><span
class="cmbx-12">Notes</span>
class="pplb7t-x-x-120">Notes</span>
<olclass="enumerate1">
<li
class="enumerate" id="x20-155002x1">
class="enumerate" id="x20-156002x1">
<!--l. 125--><pclass="noindent">A call to this routine must precede any other PSBLAS-CUDA call.</li></ol>
<!--l. 129--><pclass="noindent">
<h4class="likesubsectionHead"><a
id="x20-156000"></a>psb_cuda_exit — Exit from PSBLAS-CUDA environment</h4>
id="x20-157000"></a>psb_cuda_exit — Exit from PSBLAS-CUDA environment</h4>
class="newline" />An integer value; 0 means no error has been detected.</dd></dl>
<!--l. 209--><pclass="noindent"><span
class="cmbx-12">Notes</span>
class="pplb7t-x-x-120">Notes</span>
<!--l. 211--><pclass="indent"> If this function is called on a matrix <codeclass="lstinline"><spanstyle="color:#000000">a</span></code> on a distributed communicator only the
local part is written in output. To get a single MatrixMarket file with the whole
matrix when appropriate, e.g. for debugging purposes, one could <span
class="cmti-10">gather </span>the whole
class="pplri7t-">gather </span>the whole
matrix on a single rank and then write it. Consider the following example for a
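<p class="noindent">A sketch of the idea for the double-precision case; the <code class="lstinline"><span style="color:#000000">psb_gather</span></code> interface for sparse matrices and the argument order of <code class="lstinline"><span style="color:#000000">mm_mat_write</span></code> are assumptions to be checked against <code class="lstinline"><span style="color:#000000">psb_util_mod</span></code>, and the file name is purely illustrative:
<pre class="lstlisting">
subroutine write_global_matrix(a, desc_a, iam, info)
  ! Hedged sketch: gather the distributed sparse matrix on rank 0 and write a
  ! single MatrixMarket file there.
  use psb_base_mod
  use psb_util_mod
  implicit none
  type(psb_dspmat_type), intent(in)  :: a
  type(psb_desc_type),   intent(in)  :: desc_a
  integer(psb_ipk_),     intent(in)  :: iam    ! rank of the calling process
  integer(psb_ipk_),     intent(out) :: info
  type(psb_dspmat_type) :: aglob

  call psb_gather(aglob, a, desc_a, info, root=0)
  if ((info == psb_success_).and.(iam == 0)) &
       & call mm_mat_write(aglob, 'Gathered matrix', info, filename='amat.mtx')
end subroutine write_global_matrix
</pre>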
<!--l. 282--><pclass="noindent">The Fortran file unit number.<br
class="newline" />Type:<span
class="cmbx-10">optional</span>.<br
class="pplb7t-">optional</span>.<br
class="newline" />Specified as: an integer value. Only meaningful if filename is not <spanclass="obeylines-h"><spanclass="verb"><span
class="cmtt-10">-</span></span></span>.</dd></dl>
<!--l. 287--><pclass="noindent">
<dlclass="description"><dtclass="description">
<!--l. 288--><pclass="noindent">
<span
class="cmbx-10">On Return</span></dt><dd
class="pplb7t-">On Return</span></dt><dd
class="description">
<!--l. 288--><pclass="noindent">
</dd><dtclass="description">
<!--l. 289--><pclass="noindent">
<span
class="cmbx-10">iret</span></dt><dd
class="pplb7t-">iret</span></dt><dd
class="description">
<!--l. 289--><pclass="noindent">Error code.<br
class="newline" />Type: <span
class="cmbx-10">required </span><br
class="pplb7t-">required </span><br
class="newline" />An integer value; 0 means no error has been detected.</dd></dl>
<!--l. 294--><pclass="noindent"><span
class="cmbx-12">Notes</span>
class="pplb7t-x-x-120">Notes</span>
<!--l. 296--><pclass="indent"> If this function is called on a vector <codeclass="lstinline"><spanstyle="color:#000000">v</span></code> on a distributed communicator only the
local part is written in output. To get a single MatrixMarket file with the whole
vector when appropriate, e.g. for debugging purposes, one could <span
class="cmti-10">gather </span>the whole
class="pplri7t-">gather </span>the whole
vector on a single rank and then write it. Consider the following example for a <span
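<p class="noindent">A sketch analogous to the matrix case above; the <code class="lstinline"><span style="color:#000000">mm_array_write</span></code> name and argument order are assumptions to be checked against <code class="lstinline"><span style="color:#000000">psb_util_mod</span></code>:
<pre class="lstlisting">
! Hedged sketch: gather the distributed vector v on rank 0 and write it there.
real(psb_dpk_), allocatable :: vglob(:)

allocate(vglob(desc_a%get_global_rows()))
call psb_gather(vglob, v, desc_a, info, root=0)
if ((info == psb_success_).and.(iam == 0)) &
     & call mm_array_write(vglob, 'Gathered vector', info, filename='v.mtx')
</pre>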