\subsection{CUDA-class extensions}
\label{sec:cudastruct}
For computing with CUDA we define a dual memory strategy in
which each variable on the CPU (``host'') side has a counterpart on
the GPU (``device'') side. When a GPU-type variable is initialized,
the data it contains is (usually) the same on both sides. Each
operator invoked on the variable may change the data, so that only the
host side or only the device side is up-to-date.
Keeping track of the updates to data in the variables is essential: we want
to perform most computations on the GPU, but we cannot afford the time
needed to move data between the host memory and the device memory
because the bandwidth of the interconnection bus would become the main
bottleneck of the computation. Thus, each and every computational
routine in the library is built according to the following principles:
\begin{itemize}
\item If the data type being handled is {GPU}-enabled, make sure that
its device copy is up to date, perform any arithmetic operation on
the {GPU}, and if the data has been altered as a result, mark
the main-memory copy as outdated.
\item The main-memory copy is never updated unless this is requested
by the user either
\begin{description}
\item[explicitly] by invoking a synchronization method;
\item[implicitly] by invoking a method that involves other data items
that are not {GPU}-enabled, e.g., by assignment of a vector to a
normal array.
\end{description}
\end{itemize}
In this way, data items are put on the {GPU} memory ``on demand'' and
remain there as long as ``normal'' computations are carried out.
As an example, the following call to a matrix-vector product
\ifpdf
\begin{minted}[breaklines=true,bgcolor=bg,fontsize=\small]{fortran}
call psb_spmm(alpha,a,x,beta,y,desc_a,info)
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
call psb_spmm(alpha,a,x,beta,y,desc_a,info)
\end{verbatim}
\end{minipage}
\end{center}
\fi
will transparently and automatically be performed on the {GPU} whenever
all three data inputs \fortinline|a|, \fortinline|x| and
\fortinline|y| are {GPU}-enabled. If a program makes many such calls
sequentially, then
\begin{itemize}
\item The first kernel invocation will find the data in main memory,
and will copy it to the {GPU} memory, thus incurring a significant
overhead; the result is however \emph{not} copied back, and
therefore:
\item Subsequent kernel invocations involving the same vector will
find the data on the {GPU} side so that they will run at full
speed.
\end{itemize}
For all invocations after the first the only data that will have to be
transferred to/from the main memory will be the scalars \fortinline|alpha|
and \fortinline|beta|, and the return code \fortinline|info|.
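The lazy-transfer behavior can be illustrated with a hypothetical
iteration loop; note that the \fortinline|sync| method name used below
is an assumption, standing for whatever synchronization method the
GPU-enabled vector type provides:
\ifpdf
\begin{minted}[breaklines=true,bgcolor=bg,fontsize=\small]{fortran}
! Sketch only: assumes a, x and y are GPU-enabled and desc_a is assembled.
do it = 1, niters
  ! The first call copies data to the device; later calls find it there.
  call psb_spmm(alpha,a,x,beta,y,desc_a,info)
end do
! Explicitly bring the result back to host memory (method name assumed).
call y%sync()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
! Sketch only: assumes a, x and y are GPU-enabled and desc_a is assembled.
do it = 1, niters
  ! The first call copies data to the device; later calls find it there.
  call psb_spmm(alpha,a,x,beta,y,desc_a,info)
end do
! Explicitly bring the result back to host memory (method name assumed).
call y%sync()
\end{verbatim}
\end{minipage}
\end{center}
\fi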
\begin{description}
\item[Vectors:] The data type \fortinline|psb_T_vect_gpu| provides a
GPU-enabled extension of the inner type \fortinline|psb_T_base_vect_type|,
and must be used together with one of the GPU-enabled inner matrix
types below to make full use of the GPU computational capabilities;
\item[CSR:] The data type \fortinline|psb_T_csrg_sparse_mat| provides an
interface to the GPU version of CSR available in the NVIDIA CuSPARSE
library;
\item[HYB:] The data type \fortinline|psb_T_hybg_sparse_mat| provides an
interface to the HYB GPU storage available in the NVIDIA CuSPARSE
library. The internal structure is opaque, hence the host side is
just CSR; the HYB data format is only available up to CUDA version
10.
\item[ELL:] The data type \fortinline|psb_T_elg_sparse_mat| provides an
interface to the ELLPACK implementation from SPGPU;
\item[HLL:] The data type \fortinline|psb_T_hlg_sparse_mat| provides an
interface to the Hacked ELLPACK implementation from SPGPU;
\item[HDIA:] The data type \fortinline|psb_T_hdiag_sparse_mat| provides an
interface to the Hacked DIAgonals implementation from SPGPU.
\end{description}
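These extended types are typically selected at assembly time through
the usual PSBLAS \fortinline|mold| arguments; the following sketch
assumes the double-precision (\fortinline|d|) variants and the
standard \fortinline|psb_spasb|/\fortinline|psb_geasb| assembly
interfaces:
\ifpdf
\begin{minted}[breaklines=true,bgcolor=bg,fontsize=\small]{fortran}
! Sketch: assemble a and x with GPU-enabled storage (d-precision assumed).
type(psb_d_csrg_sparse_mat) :: acsrg
type(psb_d_vect_gpu)        :: vgpu
! ... build a and x as usual ...
call psb_spasb(a,desc_a,info,mold=acsrg)  ! matrix in CuSPARSE CSR format
call psb_geasb(x,desc_a,info,mold=vgpu)   ! GPU-enabled vector
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
! Sketch: assemble a and x with GPU-enabled storage (d-precision assumed).
type(psb_d_csrg_sparse_mat) :: acsrg
type(psb_d_vect_gpu)        :: vgpu
! ... build a and x as usual ...
call psb_spasb(a,desc_a,info,mold=acsrg)  ! matrix in CuSPARSE CSR format
call psb_geasb(x,desc_a,info,mold=vgpu)   ! GPU-enabled vector
\end{verbatim}
\end{minipage}
\end{center}
\fi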
\section{CUDA Environment Routines}
\label{sec:cudaenv}
\subsection*{psb\_cuda\_init --- Initializes PSBLAS-CUDA
environment}
\addcontentsline{toc}{subsection}{psb\_cuda\_init}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
call psb_cuda_init(ctxt [, device])
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
call psb_cuda_init(ctxt [, device])
\end{verbatim}
\end{minipage}
\end{center}
\fi
This subroutine initializes the PSBLAS-CUDA environment.
\begin{description}
\item[Type:] Synchronous.
\item[\bf On Entry ]
\item[ctxt] the communication context identifying the virtual
parallel machine.\\
Scope: {\bf global}.\\
Type: {\bf required}.\\
Intent: {\bf in}.\\
Specified as: an integer variable.
\item[device] ID of CUDA device to attach to.\\
Scope: {\bf local}.\\
Type: {\bf optional}.\\
Intent: {\bf in}.\\
Specified as: an integer value.\\
Default: use \fortinline|mod(iam,ngpu)| where \fortinline|iam| is the calling
process index and \fortinline|ngpu| is the total number of CUDA devices
available on the current node.
\end{description}
{\par\noindent\large\bfseries Notes}
\begin{enumerate}
\item A call to this routine must precede any other PSBLAS-CUDA call.
\end{enumerate}
\subsection*{psb\_cuda\_exit --- Exit from PSBLAS-CUDA
environment}
\addcontentsline{toc}{subsection}{psb\_cuda\_exit}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
call psb_cuda_exit(ctxt)
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
call psb_cuda_exit(ctxt)
\end{verbatim}
\end{minipage}
\end{center}
\fi
This subroutine exits from the PSBLAS-CUDA context.
\begin{description}
\item[Type:] Synchronous.
\item[\bf On Entry ]
\item[ctxt] the communication context identifying the virtual
parallel machine.\\
Scope: {\bf global}.\\
Type: {\bf required}.\\
Intent: {\bf in}.\\
Specified as: an integer variable.
\end{description}
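Together, \fortinline|psb_cuda_init| and \fortinline|psb_cuda_exit|
frame every PSBLAS-CUDA program; a minimal skeleton might look as
follows (a sketch: the module name \fortinline|psb_cuda_mod| is an
assumption here):
\ifpdf
\begin{minted}[breaklines=true]{fortran}
program cuda_skel
  use psb_base_mod
  use psb_cuda_mod          ! assumed module name for the CUDA extensions
  implicit none
  type(psb_ctxt_type) :: ctxt
  integer(psb_ipk_)   :: iam, np
  call psb_init(ctxt)
  call psb_info(ctxt,iam,np)
  call psb_cuda_init(ctxt)  ! must precede any other PSBLAS-CUDA call
  ! ... computations ...
  call psb_cuda_exit(ctxt)
  call psb_exit(ctxt)
end program cuda_skel
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
program cuda_skel
  use psb_base_mod
  use psb_cuda_mod          ! assumed module name for the CUDA extensions
  implicit none
  type(psb_ctxt_type) :: ctxt
  integer(psb_ipk_)   :: iam, np
  call psb_init(ctxt)
  call psb_info(ctxt,iam,np)
  call psb_cuda_init(ctxt)  ! must precede any other PSBLAS-CUDA call
  ! ... computations ...
  call psb_cuda_exit(ctxt)
  call psb_exit(ctxt)
end program cuda_skel
\end{verbatim}
\end{minipage}
\end{center}
\fi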
\subsection*{psb\_cuda\_DeviceSync --- Synchronize CUDA device}
\addcontentsline{toc}{subsection}{psb\_cuda\_DeviceSync}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
call psb_cuda_DeviceSync()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
call psb_cuda_DeviceSync()
\end{verbatim}
\end{minipage}
\end{center}
\fi
This subroutine ensures that all previously invoked kernels, i.e., all
invocations of CUDA-side code, have completed.
\subsection*{psb\_cuda\_getDeviceCount }
\addcontentsline{toc}{subsection}{psb\_cuda\_getDeviceCount}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
ngpus = psb_cuda_getDeviceCount()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
ngpus = psb_cuda_getDeviceCount()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Get the number of devices available on the current computing node.
\subsection*{psb\_cuda\_getDevice }
\addcontentsline{toc}{subsection}{psb\_cuda\_getDevice}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
dev = psb_cuda_getDevice()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
dev = psb_cuda_getDevice()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Get the device in use by the current process.
\subsection*{psb\_cuda\_setDevice }
\addcontentsline{toc}{subsection}{psb\_cuda\_setDevice}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
info = psb_cuda_setDevice(dev)
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
info = psb_cuda_setDevice(dev)
\end{verbatim}
\end{minipage}
\end{center}
\fi
Set the device to be used by the current process.
\subsection*{psb\_cuda\_DeviceHasUVA }
\addcontentsline{toc}{subsection}{psb\_cuda\_DeviceHasUVA}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
hasUva = psb_cuda_DeviceHasUVA()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
hasUva = psb_cuda_DeviceHasUVA()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Returns true if the device currently in use supports UVA
(Unified Virtual Addressing).
\subsection*{psb\_cuda\_WarpSize }
\addcontentsline{toc}{subsection}{psb\_cuda\_WarpSize}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
nw = psb_cuda_WarpSize()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
nw = psb_cuda_WarpSize()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Returns the warp size.
\subsection*{psb\_cuda\_MultiProcessors }
\addcontentsline{toc}{subsection}{psb\_cuda\_MultiProcessors}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
nmp = psb_cuda_MultiProcessors()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
nmp = psb_cuda_MultiProcessors()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Returns the number of multiprocessors in the CUDA device.
\subsection*{psb\_cuda\_MaxThreadsPerMP }
\addcontentsline{toc}{subsection}{psb\_cuda\_MaxThreadsPerMP}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
nt = psb_cuda_MaxThreadsPerMP()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
nt = psb_cuda_MaxThreadsPerMP()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Returns the maximum number of threads per multiprocessor.
\subsection*{psb\_cuda\_MaxRegistersPerBlock }
\addcontentsline{toc}{subsection}{psb\_cuda\_MaxRegistersPerBlock}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
nr = psb_cuda_MaxRegistersPerBlock()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
nr = psb_cuda_MaxRegistersPerBlock()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Returns the maximum number of registers per thread block.
\subsection*{psb\_cuda\_MemoryClockRate }
\addcontentsline{toc}{subsection}{psb\_cuda\_MemoryClockRate}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
cl = psb_cuda_MemoryClockRate()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
cl = psb_cuda_MemoryClockRate()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Returns the memory clock rate in kHz, as an integer.
\subsection*{psb\_cuda\_MemoryBusWidth }
\addcontentsline{toc}{subsection}{psb\_cuda\_MemoryBusWidth}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
nb = psb_cuda_MemoryBusWidth()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
nb = psb_cuda_MemoryBusWidth()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Returns the memory bus width in bits.
\subsection*{psb\_cuda\_MemoryPeakBandwidth }
\addcontentsline{toc}{subsection}{psb\_cuda\_MemoryPeakBandwidth}
\ifpdf
\begin{minted}[breaklines=true]{fortran}
bw = psb_cuda_MemoryPeakBandwidth()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
bw = psb_cuda_MemoryPeakBandwidth()
\end{verbatim}
\end{minipage}
\end{center}
\fi
Returns the peak memory bandwidth in MB/s, as a double-precision real.
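The query routines in this section can be combined to print a short
summary of the attached device; a sketch, to be called after
\fortinline|psb_cuda_init|:
\ifpdf
\begin{minted}[breaklines=true]{fortran}
! Sketch: report properties of the device attached by psb_cuda_init.
write(*,*) 'Devices on node      : ', psb_cuda_getDeviceCount()
write(*,*) 'Device in use        : ', psb_cuda_getDevice()
write(*,*) 'Warp size            : ', psb_cuda_WarpSize()
write(*,*) 'Multiprocessors      : ', psb_cuda_MultiProcessors()
write(*,*) 'Peak bandwidth (MB/s): ', psb_cuda_MemoryPeakBandwidth()
\end{minted}
\else
\begin{center}
\begin{minipage}[tl]{0.9\textwidth}
\begin{verbatim}
! Sketch: report properties of the device attached by psb_cuda_init.
write(*,*) 'Devices on node      : ', psb_cuda_getDeviceCount()
write(*,*) 'Device in use        : ', psb_cuda_getDevice()
write(*,*) 'Warp size            : ', psb_cuda_WarpSize()
write(*,*) 'Multiprocessors     : ', psb_cuda_MultiProcessors()
write(*,*) 'Peak bandwidth (MB/s): ', psb_cuda_MemoryPeakBandwidth()
\end{verbatim}
\end{minipage}
\end{center}
\fi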