\subsection{CUDA-class extensions} \label{sec:cudastruct} For computing with CUDA we define a dual memorization strategy in which each variable on the CPU (``host'') side has a GPU (``device'') side. When a GPU-type variable is initialized, the data contained is (usually) the same on both sides. Each operator invoked on the variable may change the data so that only the host side or the device side are up-to-date. Keeping track of the updates to data in the variables is essential: we want to perform most computations on the GPU, but we cannot afford the time needed to move data between the host memory and the device memory because the bandwidth of the interconnection bus would become the main bottleneck of the computation. Thus, each and every computational routine in the library is built according to the following principles: \begin{itemize} \item If the data type being handled is {GPU}-enabled, make sure that its device copy is up to date, perform any arithmetic operation on the {GPU}, and if the data has been altered as a result, mark the main-memory copy as outdated. \item The main-memory copy is never updated unless this is requested by the user either \begin{description} \item[explicitly] by invoking a synchronization method; \item[implicitly] by invoking a method that involves other data items that are not {GPU}-enabled, e.g., by assignment ov a vector to a normal array. \end{description} \end{itemize} In this way, data items are put on the {GPU} memory ``on demand'' and remain there as long as ``normal'' computations are carried out. As an example, the following call to a matrix-vector product \ifpdf \begin{minted}[breaklines=true,bgcolor=bg,fontsize=\small]{fortran} call psb_spmm(alpha,a,x,beta,y,desc_a,info) \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} call psb_spmm(alpha,a,x,beta,y,desc_a,info) \end{verbatim} \end{minipage} \end{center} \fi will transparently and automatically be performed on the {GPU} whenever all three data inputs \fortinline|a|, \fortinline|x| and \fortinline|y| are {GPU}-enabled. If a program makes many such calls sequentially, then \begin{itemize} \item The first kernel invocation will find the data in main memory, and will copy it to the {GPU} memory, thus incurring a significant overhead; the result is however \emph{not} copied back, and therefore: \item Subsequent kernel invocations involving the same vector will find the data on the {GPU} side so that they will run at full speed. \end{itemize} For all invocations after the first the only data that will have to be transferred to/from the main memory will be the scalars \fortinline|alpha| and \fortinline|beta|, and the return code \fortinline|info|. \begin{description} \item[Vectors:] The data type \fortinline|psb_T_vect_gpu| provides a GPU-enabled extension of the inner type \fortinline|psb_T_base_vect_type|, and must be used together with the other inner matrix type to make full use of the GPU computational capabilities; \item[CSR:] The data type \fortinline|psb_T_csrg_sparse_mat| provides an interface to the GPU version of CSR available in the NVIDIA CuSPARSE library; \item[HYB:] The data type \fortinline|psb_T_hybg_sparse_mat| provides an interface to the HYB GPU storage available in the NVIDIA CuSPARSE library. The internal structure is opaque, hence the host side is just CSR; the HYB data format is only available up to CUDA version 10. \item[ELL:] The data type \fortinline|psb_T_elg_sparse_mat| provides an interface to the ELLPACK implementation from SPGPU; \item[HLL:] The data type \fortinline|psb_T_hlg_sparse_mat| provides an interface to the Hacked ELLPACK implementation from SPGPU; \item[HDIA:] The data type \fortinline|psb_T_hdiag_sparse_mat| provides an interface to the Hacked DIAgonals implementation from SPGPU; \end{description} \section{CUDA Environment Routines} \label{sec:cudaenv} \subsection*{psb\_cuda\_init --- Initializes PSBLAS-CUDA environment} \addcontentsline{toc}{subsection}{psb\_cuda\_init} \ifpdf \begin{minted}[breaklines=true]{fortran} call psb_cuda_init(ctxt [, device]) \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} call psb_cuda_init(ctxt [, device]) \end{verbatim} \end{minipage} \end{center} \fi This subroutine initializes the PSBLAS-CUDA environment. \begin{description} \item[Type:] Synchronous. \item[\bf On Entry ] \item[device] ID of CUDA device to attach to.\\ Scope: {\bf local}.\\ Type: {\bf optional}.\\ Intent: {\bf in}.\\ Specified as: an integer value. \ Default: use \fortinline|mod(iam,ngpu)| where \fortinline|iam| is the calling process index and \fortinline|ngpu| is the total number of CUDA devices available on the current node. \end{description} {\par\noindent\large\bfseries Notes} \begin{enumerate} \item A call to this routine must precede any other PSBLAS-CUDA call. \end{enumerate} \subsection*{psb\_cuda\_exit --- Exit from PSBLAS-CUDA environment} \addcontentsline{toc}{subsection}{psb\_cuda\_exit} \ifpdf \begin{minted}[breaklines=true]{fortran} call psb_cuda_exit(ctxt) \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} call psb_cuda_exit(ctxt) \end{verbatim} \end{minipage} \end{center} \fi This subroutine exits from the PSBLAS CUDA context. \begin{description} \item[Type:] Synchronous. \item[\bf On Entry ] \item[ctxt] the communication context identifying the virtual parallel machine.\\ Scope: {\bf global}.\\ Type: {\bf required}.\\ Intent: {\bf in}.\\ Specified as: an integer variable. \end{description} \subsection*{psb\_cuda\_DeviceSync --- Synchronize CUDA device} \addcontentsline{toc}{subsection}{psb\_cuda\_DeviceSync} \ifpdf \begin{minted}[breaklines=true]{fortran} call psb_cuda_DeviceSync() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} call psb_cuda_DeviceSync() \end{verbatim} \end{minipage} \end{center} \fi This subroutine ensures that all previosly invoked kernels, i.e. all invocation of CUDA-side code, have completed. \subsection*{psb\_cuda\_getDeviceCount } \addcontentsline{toc}{subsection}{psb\_cuda\_getDeviceCount} \ifpdf \begin{minted}[breaklines=true]{fortran} ngpus = psb_cuda_getDeviceCount() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} ngpus = psb_cuda_getDeviceCount() \end{verbatim} \end{minipage} \end{center} \fi Get number of devices available on current computing node. \subsection*{psb\_cuda\_getDevice } \addcontentsline{toc}{subsection}{psb\_cuda\_getDevice} \ifpdf \begin{minted}[breaklines=true]{fortran} dev = psb_cuda_getDevice() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} dev = psb_cuda_getDevice() \end{verbatim} \end{minipage} \end{center} \fi Get device in use by current process. \subsection*{psb\_cuda\_setDevice } \addcontentsline{toc}{subsection}{psb\_cuda\_setDevice} \ifpdf \begin{minted}[breaklines=true]{fortran} info = psb_cuda_setDevice(dev) \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} info = psb_cuda_setDevice(dev) \end{verbatim} \end{minipage} \end{center} \fi Set device to be used by current process. \subsection*{psb\_cuda\_DeviceHasUVA } \addcontentsline{toc}{subsection}{psb\_cuda\_DeviceHasUVA} \ifpdf \begin{minted}[breaklines=true]{fortran} hasUva = psb_cuda_DeviceHasUVA() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} hasUva = psb_cuda_DeviceHasUVA() \end{verbatim} \end{minipage} \end{center} \fi Returns true if device currently in use supports UVA (Unified Virtual Addressing). \subsection*{psb\_cuda\_WarpSize } \addcontentsline{toc}{subsection}{psb\_cuda\_WarpSize} \ifpdf \begin{minted}[breaklines=true]{fortran} nw = psb_cuda_WarpSize() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} nw = psb_cuda_WarpSize() \end{verbatim} \end{minipage} \end{center} \fi Returns the warp size. \subsection*{psb\_cuda\_MultiProcessors } \addcontentsline{toc}{subsection}{psb\_cuda\_MultiProcessors} \ifpdf \begin{minted}[breaklines=true]{fortran} nmp = psb_cuda_MultiProcessors() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} nmp = psb_cuda_MultiProcessors() \end{verbatim} \end{minipage} \end{center} \fi Returns the number of multiprocessors in the CUDA device. \subsection*{psb\_cuda\_MaxThreadsPerMP } \addcontentsline{toc}{subsection}{psb\_cuda\_MaxThreadsPerMP} \ifpdf \begin{minted}[breaklines=true]{fortran} nt = psb_cuda_MaxThreadsPerMP() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} nt = psb_cuda_MaxThreadsPerMP() \end{verbatim} \end{minipage} \end{center} \fi Returns the maximum number of threads per multiprocessor. \subsection*{psb\_cuda\_MaxRegistersPerBlock } \addcontentsline{toc}{subsection}{psb\_cuda\_MaxRegisterPerBlock} \ifpdf \begin{minted}[breaklines=true]{fortran} nr = psb_cuda_MaxRegistersPerBlock() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} nr = psb_cuda_MaxRegistersPerBlock() \end{verbatim} \end{minipage} \end{center} \fi Returns the maximum number of register per thread block. \subsection*{psb\_cuda\_MemoryClockRate } \addcontentsline{toc}{subsection}{psb\_cuda\_MemoryClockRate} \ifpdf \begin{minted}[breaklines=true]{fortran} cl = psb_cuda_MemoryClockRate() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} cl = psb_cuda_MemoryClockRate() \end{verbatim} \end{minipage} \end{center} \fi Returns the memory clock rate in KHz, as an integer. \subsection*{psb\_cuda\_MemoryBusWidth } \addcontentsline{toc}{subsection}{psb\_cuda\_MemoryBusWidth} \ifpdf \begin{minted}[breaklines=true]{fortran} nb = psb_cuda_MemoryBusWidth() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} nb = psb_cuda_MemoryBusWidth() \end{verbatim} \end{minipage} \end{center} \fi Returns the memory bus width in bits. \subsection*{psb\_cuda\_MemoryPeakBandwidth } \addcontentsline{toc}{subsection}{psb\_cuda\_MemoryPeakBandwidth} \ifpdf \begin{minted}[breaklines=true]{fortran} bw = psb_cuda_MemoryPeakBandwidth() \end{minted} \else \begin{center} \begin{minipage}[tl]{0.9\textwidth} \begin{verbatim} bw = psb_cuda_MemoryPeakBandwidth() \end{verbatim} \end{minipage} \end{center} \fi Returns the peak memory bandwidth in MB/s (real double precision).