<!--l. 4--><p class="noindent">For computing with CUDA we define a dual memorization strategy in which each
variable on the CPU (“host”) side has a GPU (“device”) side. When a GPU-type
variable is initialized, the data contained is (usually) the same on both sides. Each
operator invoked on the variable may change the data so that only the host side or
the device side is up-to-date.
<!--l. 11--><p class="indent"> Keeping track of the updates to data in the variables is essential: we want to
perform most computations on the GPU, but we cannot afford the time needed to
move data between the host memory and the device memory because the bandwidth
of the interconnection bus would become the main bottleneck of the computation.
Thus, each and every computational routine in the library is built according to the
following principles:
<ul class="itemize1">
<li class="itemize">
<!--l. 18--><p class="noindent">If the data type being handled is GPU-enabled, make sure that its device
copy is up to date and run the computation on the device side; if the data is
altered, the host copy becomes outdated. The host copy is updated only when
requested, either
<dl class="description"><dt class="description">
<!--l. 25--><p class="noindent">
<span
class="cmbx-10">explicitly</span></dt><dd
class="description">
<!--l. 25--><p class="noindent">by invoking a synchronization method;
</dd><dt class="description">
<!--l. 26--><p class="noindent">
<span
class="cmbx-10">implicitly</span></dt><dd
class="description">
<!--l. 26--><p class="noindent">by invoking a method that involves other data items that are not
GPU-enabled, e.g., by assignment of a vector to a normal array.</dd></dl>
</li></ul>
<!--l. 31--><p class="noindent">In this way, data items are put on the GPU memory “on demand” and remain there as
long as “normal” computations are carried out. As an example, the following call to a
matrix-vector product
<div class="center">
<!--l. 39--><p class="noindent">
<code class="lstinline"><span style="color:#000000">call psb_spmm(alpha,a,x,beta,y,desc_a,info)</span></code>
</div>
<p class="noindent">will be executed on the GPU whenever all the data items involved are GPU-enabled;
if a program performs the same call repeatedly, then:
<ul class="itemize1">
<li class="itemize">
<!--l. 52--><p class="noindent">The first kernel invocation will find the data in main memory, and will
copy it to the GPU memory, thus incurring a significant overhead; the
result is however <span
class="cmti-10">not </span>copied back, and therefore:
</li>
<li class="itemize">
<!--l. 56--><p class="noindent">Subsequent kernel invocations involving the same vector will find the data
on the GPU side so that they will run at full speed.</li></ul>
<!--l. 60--><p class="noindent">For all invocations after the first, the only data that will have to be transferred to/from
the main memory will be the scalars <code class="lstinline"><span style="color:#000000">alpha</span></code> and <code class="lstinline"><span style="color:#000000">beta</span></code>, and the return code
<code class="lstinline"><span style="color:#000000">info</span></code>.
<p class="noindent">The GPU-enabled data types currently available are the following:
<dl class="description"><dt class="description">
<!--l. 65--><p class="noindent">
<span
class="cmbx-10">Vectors:</span></dt><dd
class="description">
<!--l. 65--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_vect_gpu</span></code> provides a GPU-enabled extension of
the inner type <code class="lstinline"><span style="color:#000000">psb_T_base_vect_type</span></code>, and must be used together with
the GPU-enabled sparse matrix types;
</dd><dt class="description">
<!--l. 69--><p class="noindent">
<span
class="cmbx-10">CSR:</span></dt><dd
class="description">
<!--l. 69--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_csrg_sparse_mat</span></code> provides an interface to the GPU
version of CSR available in the NVIDIA CuSPARSE library;
</dd><dt class="description">
<!--l. 72--><p class="noindent">
<span
class="cmbx-10">HYB:</span></dt><dd
class="description">
<!--l. 72--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_hybg_sparse_mat</span></code> provides an interface to the HYB
GPU storage available in the NVIDIA CuSPARSE library. The internal
structure is opaque, hence the host-side copy is kept in CSR format;
</dd><dt class="description">
<!--l. 77--><p class="noindent">
<span
class="cmbx-10">ELL:</span></dt><dd
class="description">
<!--l. 77--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_elg_sparse_mat</span></code> provides an interface to the
ELLPACK implementation from SPGPU;
</dd><dt class="description">
<!--l. 80--><p class="noindent">
<span
class="cmbx-10">HLL:</span></dt><dd
class="description">
<!--l. 80--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_hlg_sparse_mat</span></code> provides an interface to the Hacked
ELLPACK implementation from SPGPU;
</dd><dt class="description">
<!--l. 82--><p class="noindent">
<span
class="cmbx-10">HDIA:</span></dt><dd
class="description">
<!--l. 82--><p class="noindent">The data type <code class="lstinline"><span style="color:#000000">psb_T_hdiag_sparse_mat</span></code> provides an interface to the
Hacked DIAgonals implementation from SPGPU;</dd></dl>
<!--l. 112--><p class="noindent">ID of CUDA device to attach to.<br
class="newline" />Scope: <span
class="cmbx-10">local</span>.<br
class="newline" />Type: <span
class="cmbx-10">optional</span>.<br
class="newline" />Intent: <span
class="cmbx-10">in</span>.<br
class="newline" />Specified as: an integer value. Default: use <code class="lstinline"><span style="color:#000000">mod</span><span style="color:#000000">(</span><span style="color:#000000">iam</span><span style="color:#000000">,</span><span style="color:#000000">ngpu</span><span style="color:#000000">)</span></code> where <code class="lstinline"><span style="color:#000000">iam</span></code> is
the calling process index and <code class="lstinline"><span style="color:#000000">ngpu</span></code> is the total number of CUDA devices
available on the current node.</dd></dl>
<!--l. 123--><p class="noindent"><span
class="cmbx-12">Notes</span>
<ol class="enumerate1">
<li
class="enumerate" id="x20-155002x1">
<!--l. 125--><p class="noindent">A call to this routine must precede any other PSBLAS-CUDA call.</li></ol>
<!--l. 129--><p class="noindent">
<h4 class="likesubsectionHead"><a
id="x20-156000"></a>psb_cuda_exit — Exit from PSBLAS-CUDA environment</h4>
class="newline" />An integer value; 0 means no error has been detected.</dd></dl>
<!--l. 209--><p class="noindent"><span
class="cmbx-12">Notes</span>
<!--l. 211--><p class="indent"> If this function is called on a matrix <code class="lstinline"><span style="color:#000000">a</span></code> on a distributed communicator, only the
local part is written in output. To get a single MatrixMarket file with the whole
matrix when appropriate, e.g. for debugging purposes, one could <span
class="cmti-10">gather </span>the whole
matrix on a single rank and then write it. Consider the following example for a
<!--l. 282--><p class="noindent">The Fortran file unit number.<br
class="newline" />Type: <span
class="cmbx-10">optional</span>.<br
class="newline" />Specified as: an integer value. Only meaningful if filename is not <span class="obeylines-h"><span class="verb"><span
class="cmtt-10">-</span></span></span>.</dd></dl>
<!--l. 287--><p class="noindent">
<dl class="description"><dt class="description">
<!--l. 288--><p class="noindent">
<span
class="cmbx-10">On Return</span></dt><dd
class="description">
<!--l. 288--><p class="noindent">
</dd><dt class="description">
<!--l. 289--><p class="noindent">
<span
class="cmbx-10">iret</span></dt><dd
class="description">
<!--l. 289--><p class="noindent">Error code.<br
class="newline" />Type: <span
class="cmbx-10">required </span><br
class="newline" />An integer value; 0 means no error has been detected.</dd></dl>
<!--l. 294--><p class="noindent"><span
class="cmbx-12">Notes</span>
<!--l. 296--><p class="indent"> If this function is called on a vector <code class="lstinline"><span style="color:#000000">v</span></code> on a distributed communicator, only the
local part is written in output. To get a single MatrixMarket file with the whole
vector when appropriate, e.g. for debugging purposes, one could <span
class="cmti-10">gather </span>the whole
vector on a single rank and then write it. Consider the following example for a <span