\subsection{CUDA-class extensions}
\label{sec:cudastruct}
For computing with CUDA we define a dual storage strategy in
which each variable on the CPU (``host'') side has a counterpart on
the GPU (``device'') side. When a GPU-type variable is initialized,
the data contained is (usually) the same on both sides. Each operator
invoked on the variable may change the data so that only the host side
or only the device side is up to date.

Keeping track of the updates to data in the variables is essential: we want
to perform most computations on the GPU, but we cannot afford to move
data between the host memory and the device memory at every operation,
because the bandwidth of the interconnection bus would then become the main
bottleneck of the computation. Thus, each and every computational
routine in the library is built according to the following principles:
\begin{itemize}
\item If the data type being handled is {GPU}-enabled, make sure that
  its device copy is up to date, perform any arithmetic operation on
  the {GPU}, and if the data has been altered as a result, mark
  the main-memory copy as outdated.
\item The main-memory copy is never updated unless this is requested
  by the user either 
\begin{description}
\item[explicitly] by invoking a synchronization method;
\item[implicitly] by invoking a method that involves other data items
  that are not {GPU}-enabled, e.g., by assignment of a vector to a
  normal array.
\end{description}
\end{itemize}
In this way, data items are placed in the {GPU} memory ``on demand'' and
remain there as long as ``normal'' computations are carried out.
As an example, the following call to a matrix-vector product
\ifpdf
\begin{minted}[breaklines=true,bgcolor=bg,fontsize=\small]{fortran}
    call psb_spmm(alpha,a,x,beta,y,desc_a,info)
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
    call psb_spmm(alpha,a,x,beta,y,desc_a,info)
\end{verbatim}
    \end{minipage}
  \end{center}
\fi
will transparently and automatically be performed on the {GPU} whenever
all three data inputs \fortinline|a|, \fortinline|x|  and
\fortinline|y| are {GPU}-enabled. If a program makes many such calls
sequentially, then 
\begin{itemize}
\item The first kernel invocation will find the data in main memory,
  and will copy it to the {GPU} memory, thus incurring a significant
  overhead; the result is however \emph{not} copied back, and
  therefore:
\item Subsequent kernel invocations involving the same vector will
  find the data on the {GPU} side so that they will run at full
  speed.
\end{itemize}
For all invocations after the first, the only data that have to be
transferred to/from the main memory are the scalars \fortinline|alpha|
and \fortinline|beta|, and the return code \fortinline|info|.
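
The on-demand policy described above can be sketched as follows. This
is a minimal sketch under stated assumptions, not a complete program:
it takes \fortinline|T| to be \fortinline|d| (double precision real),
picks \fortinline|psb_d_hlg_sparse_mat| merely as one possible
GPU-enabled matrix format, assumes that the GPU types are exported by a
module named \fortinline|psb_cuda_mod|, that the descriptor has
already been assembled, and that \fortinline|a|, \fortinline|x| and
\fortinline|y| are still in the build state. The subroutine name
\fortinline|gpu_spmv_loop| is hypothetical.
\ifpdf
\begin{minted}[breaklines=true,bgcolor=bg,fontsize=\small]{fortran}
subroutine gpu_spmv_loop(a, x, y, desc_a, niter, info)
  use psb_base_mod
  use psb_cuda_mod   ! assumed name of the module with the GPU types
  implicit none
  type(psb_dspmat_type), intent(inout) :: a
  type(psb_d_vect_type), intent(inout) :: x, y
  type(psb_desc_type),   intent(inout) :: desc_a
  integer(psb_ipk_),     intent(in)    :: niter
  integer(psb_ipk_),     intent(out)   :: info
  type(psb_d_hlg_sparse_mat) :: amold  ! GPU-enabled matrix template
  type(psb_d_vect_gpu)       :: vmold  ! GPU-enabled vector template
  integer(psb_ipk_) :: it

  ! Assemble all three operands with GPU-enabled inner types
  call psb_spasb(a, desc_a, info, mold=amold)
  if (info == psb_success_) call psb_geasb(x, desc_a, info, mold=vmold)
  if (info == psb_success_) call psb_geasb(y, desc_a, info, mold=vmold)
  if (info /= psb_success_) return

  ! Repeated products: operands are expected to stay on the device
  do it = 1, niter
    call psb_spmm(done, a, x, dzero, y, desc_a, info)
    if (info /= psb_success_) return
  end do

  ! Explicit synchronization: bring the host copy of y up to date
  call y%sync()
end subroutine gpu_spmv_loop
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
subroutine gpu_spmv_loop(a, x, y, desc_a, niter, info)
  use psb_base_mod
  use psb_cuda_mod   ! assumed name of the module with the GPU types
  implicit none
  type(psb_dspmat_type), intent(inout) :: a
  type(psb_d_vect_type), intent(inout) :: x, y
  type(psb_desc_type),   intent(inout) :: desc_a
  integer(psb_ipk_),     intent(in)    :: niter
  integer(psb_ipk_),     intent(out)   :: info
  type(psb_d_hlg_sparse_mat) :: amold  ! GPU-enabled matrix template
  type(psb_d_vect_gpu)       :: vmold  ! GPU-enabled vector template
  integer(psb_ipk_) :: it

  ! Assemble all three operands with GPU-enabled inner types
  call psb_spasb(a, desc_a, info, mold=amold)
  if (info == psb_success_) call psb_geasb(x, desc_a, info, mold=vmold)
  if (info == psb_success_) call psb_geasb(y, desc_a, info, mold=vmold)
  if (info /= psb_success_) return

  ! Repeated products: operands are expected to stay on the device
  do it = 1, niter
    call psb_spmm(done, a, x, dzero, y, desc_a, info)
    if (info /= psb_success_) return
  end do

  ! Explicit synchronization: bring the host copy of y up to date
  call y%sync()
end subroutine gpu_spmv_loop
\end{verbatim}
    \end{minipage}
  \end{center}
\fi
Only the final \fortinline|sync| call asks for the result to be copied
back to main memory; the loop itself involves no operand that is not
{GPU}-enabled.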

\begin{description}
\item[Vectors:] The data type \fortinline|psb_T_vect_gpu| provides a
  GPU-enabled extension of the inner type \fortinline|psb_T_base_vect_type|,
  and must be used together with one of the inner matrix types listed
  below to make full use of the GPU computational capabilities;
\item[CSR:] The data type \fortinline|psb_T_csrg_sparse_mat| provides an
  interface to the GPU version of CSR available in the NVIDIA cuSPARSE
  library;
\item[HYB:] The data type \fortinline|psb_T_hybg_sparse_mat| provides an
  interface to the HYB GPU storage available in the NVIDIA cuSPARSE
  library. The internal structure is opaque, hence the host side is
  just CSR; the HYB data format is only available up to CUDA version
  10;
\item[ELL:] The data type \fortinline|psb_T_elg_sparse_mat| provides an
  interface to the ELLPACK implementation from SPGPU;
\item[HLL:] The data type \fortinline|psb_T_hlg_sparse_mat| provides an
  interface to the Hacked ELLPACK implementation from SPGPU;
\item[HDIA:] The data type \fortinline|psb_T_hdiag_sparse_mat| provides an
  interface to the Hacked DIAgonals implementation from SPGPU.
\end{description}
\end{description}
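
As an illustration of how one of the storage formats above might be
selected at run time, the following sketch converts an already
assembled matrix into a GPU-enabled copy. It assumes
\fortinline|T| $=$ \fortinline|d|, a module named
\fortinline|psb_cuda_mod| exporting the types, and the out-of-place
form of the \fortinline|cscnv| method with a \fortinline|mold|
argument; the subroutine name and the format labels
(\fortinline|'CSRG'|, \fortinline|'ELG'|, \fortinline|'HLG'|) are
hypothetical.
\ifpdf
\begin{minted}[breaklines=true,bgcolor=bg,fontsize=\small]{fortran}
subroutine to_gpu_format(fmt, a, agpu, info)
  use psb_base_mod
  use psb_cuda_mod   ! assumed name of the module with the GPU types
  implicit none
  character(len=*),      intent(in)    :: fmt
  type(psb_dspmat_type), intent(in)    :: a     ! assembled host matrix
  type(psb_dspmat_type), intent(inout) :: agpu  ! GPU-enabled copy
  integer(psb_ipk_),     intent(out)   :: info
  ! One template object per candidate format
  type(psb_d_csrg_sparse_mat), target :: acsrg
  type(psb_d_elg_sparse_mat),  target :: aelg
  type(psb_d_hlg_sparse_mat),  target :: ahlg
  class(psb_d_base_sparse_mat), pointer :: amold

  select case (fmt)
  case ('CSRG')
    amold => acsrg
  case ('ELG')
    amold => aelg
  case default        ! fall back to Hacked ELLPACK
    amold => ahlg
  end select

  ! Copy a into agpu, stored in the dynamic type of amold
  call a%cscnv(agpu, info, mold=amold)
end subroutine to_gpu_format
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
subroutine to_gpu_format(fmt, a, agpu, info)
  use psb_base_mod
  use psb_cuda_mod   ! assumed name of the module with the GPU types
  implicit none
  character(len=*),      intent(in)    :: fmt
  type(psb_dspmat_type), intent(in)    :: a     ! assembled host matrix
  type(psb_dspmat_type), intent(inout) :: agpu  ! GPU-enabled copy
  integer(psb_ipk_),     intent(out)   :: info
  ! One template object per candidate format
  type(psb_d_csrg_sparse_mat), target :: acsrg
  type(psb_d_elg_sparse_mat),  target :: aelg
  type(psb_d_hlg_sparse_mat),  target :: ahlg
  class(psb_d_base_sparse_mat), pointer :: amold

  select case (fmt)
  case ('CSRG')
    amold => acsrg
  case ('ELG')
    amold => aelg
  case default        ! fall back to Hacked ELLPACK
    amold => ahlg
  end select

  ! Copy a into agpu, stored in the dynamic type of amold
  call a%cscnv(agpu, info, mold=amold)
end subroutine to_gpu_format
\end{verbatim}
    \end{minipage}
  \end{center}
\fi
Keeping the host matrix \fortinline|a| untouched makes it easy to
compare several formats on the same data, at the price of holding two
copies of the matrix.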


\section{CUDA Environment Routines}
\label{sec:cudaenv}

\subsection*{psb\_cuda\_init --- Initializes PSBLAS-CUDA
  environment}
\addcontentsline{toc}{subsection}{psb\_cuda\_init}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
call psb_cuda_init(ctxt [, device])
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
call psb_cuda_init(ctxt [, device])
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

This subroutine initializes the PSBLAS-CUDA  environment. 
\begin{description}
\item[Type:] Synchronous.
\item[\bf  On Entry ]
\item[ctxt] the communication context identifying the virtual
  parallel machine.\\
Scope: {\bf global}.\\
Type: {\bf required}.\\
Intent: {\bf in}.\\
Specified as: an integer variable.
\item[device] ID of CUDA device to attach to.\\
Scope: {\bf local}.\\
Type: {\bf optional}.\\
Intent: {\bf in}.\\
Specified as: an integer value.\\
Default: use \fortinline|mod(iam,ngpu)| where \fortinline|iam| is the calling
process index and \fortinline|ngpu| is the total number of CUDA devices
available on the current node. 
\end{description}


{\par\noindent\large\bfseries Notes}
\begin{enumerate}
\item A call to this routine must precede any other PSBLAS-CUDA call. 
\end{enumerate}
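
A typical start-up and shutdown sequence might look like the sketch
below. The module name \fortinline|psb_cuda_mod| and the program name
are assumptions; the context is declared here as
\fortinline|type(psb_ctxt_type)|, which is also an assumption, and
installations in which the context is a plain integer need an integer
declaration instead.
\ifpdf
\begin{minted}[breaklines=true]{fortran}
program cuda_env_sketch
  use psb_base_mod
  use psb_cuda_mod             ! assumed module name
  implicit none
  type(psb_ctxt_type) :: ctxt  ! assumption: see the text above
  integer(psb_ipk_)   :: iam, np

  call psb_init(ctxt)          ! start the PSBLAS parallel environment
  call psb_info(ctxt, iam, np) ! process index and process count
  call psb_cuda_init(ctxt)     ! no device given: default choice applies

  ! ... build data structures and call GPU-enabled routines here ...

  call psb_cuda_exit(ctxt)     ! leave the PSBLAS-CUDA environment
  call psb_exit(ctxt)          ! leave the PSBLAS environment
end program cuda_env_sketch
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
program cuda_env_sketch
  use psb_base_mod
  use psb_cuda_mod             ! assumed module name
  implicit none
  type(psb_ctxt_type) :: ctxt  ! assumption: see the text above
  integer(psb_ipk_)   :: iam, np

  call psb_init(ctxt)          ! start the PSBLAS parallel environment
  call psb_info(ctxt, iam, np) ! process index and process count
  call psb_cuda_init(ctxt)     ! no device given: default choice applies

  ! ... build data structures and call GPU-enabled routines here ...

  call psb_cuda_exit(ctxt)     ! leave the PSBLAS-CUDA environment
  call psb_exit(ctxt)          ! leave the PSBLAS environment
end program cuda_env_sketch
\end{verbatim}
    \end{minipage}
  \end{center}
\fi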

\subsection*{psb\_cuda\_exit --- Exit from  PSBLAS-CUDA
  environment}
\addcontentsline{toc}{subsection}{psb\_cuda\_exit}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
call psb_cuda_exit(ctxt)
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
call psb_cuda_exit(ctxt)
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

This subroutine exits from the PSBLAS-CUDA environment.
\begin{description}
\item[Type:] Synchronous.
\item[\bf  On Entry ]
\item[ctxt] the communication context identifying the virtual
  parallel machine.\\
Scope: {\bf global}.\\
Type: {\bf required}.\\
Intent: {\bf in}.\\
Specified as: an integer variable.
\end{description}


\subsection*{psb\_cuda\_DeviceSync ---  Synchronize CUDA device}
\addcontentsline{toc}{subsection}{psb\_cuda\_DeviceSync}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
call psb_cuda_DeviceSync()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
call psb_cuda_DeviceSync()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

This subroutine ensures that all previously invoked kernels, i.e., all
invocations of CUDA-side code, have completed.
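
Since CUDA kernel launches are in general asynchronous with respect to
the host, one plausible use of this routine is to bracket a timed
region, so that the clock is neither started while earlier kernels are
still running nor stopped before the timed ones have finished. The
following is a minimal sketch under that assumption; the subroutine
name \fortinline|time_spmv|, the module name
\fortinline|psb_cuda_mod| and the declaration of
\fortinline|ctxt| as \fortinline|type(psb_ctxt_type)| are assumptions.
\ifpdf
\begin{minted}[breaklines=true]{fortran}
subroutine time_spmv(ctxt, a, x, y, desc_a, ntimes, t, info)
  use psb_base_mod
  use psb_cuda_mod   ! assumed module name
  implicit none
  type(psb_ctxt_type),   intent(in)    :: ctxt
  type(psb_dspmat_type), intent(inout) :: a
  type(psb_d_vect_type), intent(inout) :: x, y
  type(psb_desc_type),   intent(inout) :: desc_a
  integer(psb_ipk_),     intent(in)    :: ntimes
  real(psb_dpk_),        intent(out)   :: t      ! elapsed seconds
  integer(psb_ipk_),     intent(out)   :: info
  integer(psb_ipk_) :: i

  info = psb_success_
  call psb_cuda_DeviceSync()   ! wait for any earlier kernels
  call psb_barrier(ctxt)       ! align all processes
  t = psb_wtime()
  do i = 1, ntimes
    call psb_spmm(done, a, x, dzero, y, desc_a, info)
  end do
  call psb_cuda_DeviceSync()   ! wait for the timed kernels
  call psb_barrier(ctxt)
  t = psb_wtime() - t
end subroutine time_spmv
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
subroutine time_spmv(ctxt, a, x, y, desc_a, ntimes, t, info)
  use psb_base_mod
  use psb_cuda_mod   ! assumed module name
  implicit none
  type(psb_ctxt_type),   intent(in)    :: ctxt
  type(psb_dspmat_type), intent(inout) :: a
  type(psb_d_vect_type), intent(inout) :: x, y
  type(psb_desc_type),   intent(inout) :: desc_a
  integer(psb_ipk_),     intent(in)    :: ntimes
  real(psb_dpk_),        intent(out)   :: t      ! elapsed seconds
  integer(psb_ipk_),     intent(out)   :: info
  integer(psb_ipk_) :: i

  info = psb_success_
  call psb_cuda_DeviceSync()   ! wait for any earlier kernels
  call psb_barrier(ctxt)       ! align all processes
  t = psb_wtime()
  do i = 1, ntimes
    call psb_spmm(done, a, x, dzero, y, desc_a, info)
  end do
  call psb_cuda_DeviceSync()   ! wait for the timed kernels
  call psb_barrier(ctxt)
  t = psb_wtime() - t
end subroutine time_spmv
\end{verbatim}
    \end{minipage}
  \end{center}
\fi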


\subsection*{psb\_cuda\_getDeviceCount }
\addcontentsline{toc}{subsection}{psb\_cuda\_getDeviceCount}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
ngpus =  psb_cuda_getDeviceCount()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
ngpus =  psb_cuda_getDeviceCount()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns the number of CUDA devices available on the current computing node.

\subsection*{psb\_cuda\_getDevice }
\addcontentsline{toc}{subsection}{psb\_cuda\_getDevice}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
dev =  psb_cuda_getDevice()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
dev =  psb_cuda_getDevice()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns the device in use by the current process.

\subsection*{psb\_cuda\_setDevice }
\addcontentsline{toc}{subsection}{psb\_cuda\_setDevice}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
info = psb_cuda_setDevice(dev)
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
info = psb_cuda_setDevice(dev)
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Sets the device to be used by the current process.
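
The three device-selection routines can be combined to reproduce by
hand the round-robin assignment that \fortinline|psb_cuda_init| uses
by default. The sketch below is illustrative only: the subroutine name
\fortinline|pick_device| and the module name
\fortinline|psb_cuda_mod| are assumptions, default integers are
assumed for the arguments and results of the three routines, and the
return code of \fortinline|psb_cuda_setDevice| is passed back without
interpretation.
\ifpdf
\begin{minted}[breaklines=true]{fortran}
subroutine pick_device(iam, info)
  use psb_base_mod
  use psb_cuda_mod   ! assumed module name
  implicit none
  integer(psb_ipk_), intent(in)  :: iam   ! calling process index
  integer,           intent(out) :: info  ! code from psb_cuda_setDevice
  integer :: ngpus, dev                   ! assumed default integers

  info  = 0
  ngpus = psb_cuda_getDeviceCount()
  if (ngpus <= 0) then
    write(*,*) 'Process', iam, ': no CUDA device found'
    return
  end if

  dev  = mod(int(iam), ngpus)             ! round-robin over the node
  info = psb_cuda_setDevice(dev)
  write(*,*) 'Process', iam, ': device in use', psb_cuda_getDevice()
end subroutine pick_device
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
subroutine pick_device(iam, info)
  use psb_base_mod
  use psb_cuda_mod   ! assumed module name
  implicit none
  integer(psb_ipk_), intent(in)  :: iam   ! calling process index
  integer,           intent(out) :: info  ! code from psb_cuda_setDevice
  integer :: ngpus, dev                   ! assumed default integers

  info  = 0
  ngpus = psb_cuda_getDeviceCount()
  if (ngpus <= 0) then
    write(*,*) 'Process', iam, ': no CUDA device found'
    return
  end if

  dev  = mod(int(iam), ngpus)             ! round-robin over the node
  info = psb_cuda_setDevice(dev)
  write(*,*) 'Process', iam, ': device in use', psb_cuda_getDevice()
end subroutine pick_device
\end{verbatim}
    \end{minipage}
  \end{center}
\fi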

\subsection*{psb\_cuda\_DeviceHasUVA }
\addcontentsline{toc}{subsection}{psb\_cuda\_DeviceHasUVA}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
hasUva = psb_cuda_DeviceHasUVA()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
hasUva = psb_cuda_DeviceHasUVA()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns true if the device currently in use supports UVA
(Unified Virtual Addressing).

\subsection*{psb\_cuda\_WarpSize }
\addcontentsline{toc}{subsection}{psb\_cuda\_WarpSize}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
nw = psb_cuda_WarpSize()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
nw = psb_cuda_WarpSize()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns the warp size.


\subsection*{psb\_cuda\_MultiProcessors }
\addcontentsline{toc}{subsection}{psb\_cuda\_MultiProcessors}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
nmp = psb_cuda_MultiProcessors()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
nmp = psb_cuda_MultiProcessors()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns the number of multiprocessors in the CUDA device.

\subsection*{psb\_cuda\_MaxThreadsPerMP }
\addcontentsline{toc}{subsection}{psb\_cuda\_MaxThreadsPerMP}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
nt = psb_cuda_MaxThreadsPerMP()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
nt = psb_cuda_MaxThreadsPerMP()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns the maximum number of threads per multiprocessor. 


\subsection*{psb\_cuda\_MaxRegistersPerBlock }
\addcontentsline{toc}{subsection}{psb\_cuda\_MaxRegistersPerBlock}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
nr = psb_cuda_MaxRegistersPerBlock()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
nr = psb_cuda_MaxRegistersPerBlock()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns the maximum number of registers per thread block.


\subsection*{psb\_cuda\_MemoryClockRate }
\addcontentsline{toc}{subsection}{psb\_cuda\_MemoryClockRate}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
cl = psb_cuda_MemoryClockRate()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
cl = psb_cuda_MemoryClockRate()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns the memory clock rate in kHz, as an integer.

\subsection*{psb\_cuda\_MemoryBusWidth }
\addcontentsline{toc}{subsection}{psb\_cuda\_MemoryBusWidth}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
nb = psb_cuda_MemoryBusWidth()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
nb = psb_cuda_MemoryBusWidth()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi

Returns the memory bus width in bits.

\subsection*{psb\_cuda\_MemoryPeakBandwidth }
\addcontentsline{toc}{subsection}{psb\_cuda\_MemoryPeakBandwidth}

\ifpdf
\begin{minted}[breaklines=true]{fortran}
bw = psb_cuda_MemoryPeakBandwidth()
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
bw = psb_cuda_MemoryPeakBandwidth()
\end{verbatim}
    \end{minipage}
  \end{center}
\fi
Returns the peak memory bandwidth in MB/s (real double precision).
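
The inquiry functions above can be gathered into a short device
report, as in the following sketch. The subroutine name
\fortinline|report_device| and the module name
\fortinline|psb_cuda_mod| are assumptions. The last line prints a
common estimate of the peak bandwidth for double-data-rate memory,
$2 \times \mathrm{clock\ rate} \times \mathrm{bus\ width}/8$; it is
included only as a cross-check, and it is an assumption, not a
documented fact, that \fortinline|psb_cuda_MemoryPeakBandwidth|
computes its result in the same way.
\ifpdf
\begin{minted}[breaklines=true]{fortran}
subroutine report_device()
  use psb_base_mod
  use psb_cuda_mod   ! assumed module name
  implicit none
  real(psb_dpk_) :: clock_khz, bus_bits, bw_est

  write(*,*) 'Device in use:        ', psb_cuda_getDevice()
  write(*,*) 'Supports UVA:         ', psb_cuda_DeviceHasUVA()
  write(*,*) 'Warp size:            ', psb_cuda_WarpSize()
  write(*,*) 'Multiprocessors:      ', psb_cuda_MultiProcessors()
  write(*,*) 'Max threads per MP:   ', psb_cuda_MaxThreadsPerMP()
  write(*,*) 'Max registers/block:  ', psb_cuda_MaxRegistersPerBlock()
  write(*,*) 'Memory clock (kHz):   ', psb_cuda_MemoryClockRate()
  write(*,*) 'Memory bus (bits):    ', psb_cuda_MemoryBusWidth()
  write(*,*) 'Peak bandwidth (MB/s):', psb_cuda_MemoryPeakBandwidth()

  ! Cross-check (assumed formula): 2 transfers per clock,
  ! kHz -> Hz is *1e3, bits -> bytes is /8, B/s -> MB/s is /1e6
  clock_khz = real(psb_cuda_MemoryClockRate(), psb_dpk_)
  bus_bits  = real(psb_cuda_MemoryBusWidth(), psb_dpk_)
  bw_est    = 2.0_psb_dpk_ * clock_khz * (bus_bits / 8.0_psb_dpk_) &
       &      * 1.0e-3_psb_dpk_
  write(*,*) 'Estimated peak (MB/s):', bw_est
end subroutine report_device
\end{minted}
\else
\begin{center}
    \begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim} 
subroutine report_device()
  use psb_base_mod
  use psb_cuda_mod   ! assumed module name
  implicit none
  real(psb_dpk_) :: clock_khz, bus_bits, bw_est

  write(*,*) 'Device in use:        ', psb_cuda_getDevice()
  write(*,*) 'Supports UVA:         ', psb_cuda_DeviceHasUVA()
  write(*,*) 'Warp size:            ', psb_cuda_WarpSize()
  write(*,*) 'Multiprocessors:      ', psb_cuda_MultiProcessors()
  write(*,*) 'Max threads per MP:   ', psb_cuda_MaxThreadsPerMP()
  write(*,*) 'Max registers/block:  ', psb_cuda_MaxRegistersPerBlock()
  write(*,*) 'Memory clock (kHz):   ', psb_cuda_MemoryClockRate()
  write(*,*) 'Memory bus (bits):    ', psb_cuda_MemoryBusWidth()
  write(*,*) 'Peak bandwidth (MB/s):', psb_cuda_MemoryPeakBandwidth()

  ! Cross-check (assumed formula): 2 transfers per clock,
  ! kHz -> Hz is *1e3, bits -> bytes is /8, B/s -> MB/s is /1e6
  clock_khz = real(psb_cuda_MemoryClockRate(), psb_dpk_)
  bus_bits  = real(psb_cuda_MemoryBusWidth(), psb_dpk_)
  bw_est    = 2.0_psb_dpk_ * clock_khz * (bus_bits / 8.0_psb_dpk_) &
       &      * 1.0e-3_psb_dpk_
  write(*,*) 'Estimated peak (MB/s):', bw_est
end subroutine report_device
\end{verbatim}
    \end{minipage}
  \end{center}
\fi
List-directed output is used throughout so that the sketch does not
depend on the exact result type of each inquiry function.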