File size: 2 MB
Uploaded: 2019-10-08
Description: The cuBLAS documentation from CUDA 9.0, an essential basic booklet for anyone working in CUDA development; not to be missed.
Chapter 1
The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms)
on top of the NVIDIA CUDA runtime. It allows the user to access the computational
resources of NVIDIA Graphics Processing Unit (GPU).
Starting with CUDA 6.0, the cuBLAS Library now exposes two sets of API, the regular
cuBLAS API, which is simply called cuBLAS API in this document, and the CUBLASXT API.
To use the cuBLAS API, the application must allocate the required matrices and vectors
in the GPU memory space, fill them with data, call the sequence of desired cuBLAS
functions, and then upload the results from the GPU memory space back to the host.
The cuBLAS API also provides helper functions for writing and retrieving data from the
GPU.
To use the CUBLASXT API, the application must keep the data on the Host and the
Library will take care of dispatching the operation to one or multiple GPUs present in
the system, depending on the user request.
1.1. Data layout
For maximum compatibility with existing Fortran environments, the cuBLAS library
uses column-major storage, and 1-based indexing. Since C and C++ use row-major
storage, applications written in these languages cannot use the native array semantics
for two-dimensional arrays. Instead, macros or inline functions should be defined to
implement matrices on top of one-dimensional arrays. For Fortran code ported to C
in mechanical fashion, one may choose to retain 1-based indexing to avoid the need to
transform loops. In this case, the array index of a matrix element in row "i" and column
"j" can be computed via the following macro:
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
Here, ld refers to the leading dimension of the matrix, which in the case of column-major
storage is the number of rows of the allocated matrix (even if only a submatrix of it is
being used). For natively written C and C++ code, one would most likely choose 0-based
indexing, in which case the array index of a matrix element in row "i" and column "j"
can be computed via the following macro:
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
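
As a quick check of the two macros, the sketch below stores a small column-major matrix in a one-dimensional array and reads an element back with both conventions. The helper names (at_1based, at_0based) and the 3-by-2 sample matrix are illustrative assumptions and are not part of cuBLAS.

```c
#include <assert.h>

/* Index macros from the text; ld is the leading dimension (allocated rows). */
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))  /* 1-based row i, column j */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))          /* 0-based row i, column j */

/* Hypothetical helper: read element (i,j), 1-based, from column-major storage. */
static float at_1based(const float *a, int i, int j, int ld) {
    assert(i >= 1 && i <= ld && j >= 1);
    return a[IDX2F(i, j, ld)];
}

/* Hypothetical helper: read element (i,j), 0-based, from column-major storage. */
static float at_0based(const float *a, int i, int j, int ld) {
    assert(i >= 0 && i < ld && j >= 0);
    return a[IDX2C(i, j, ld)];
}

/* A 3x2 matrix stored column by column:
 *   | 1 4 |
 *   | 2 5 |
 *   | 3 6 |
 */
static const float example_3x2[6] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
```

With ld = 3, element (2,2) in 1-based terms and element (1,1) in 0-based terms both map to offset 4 of the array, so both helpers return 5 for the sample matrix.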
1.2. New and Legacy cuBLAS API
Starting with version 4.0, the cuBLAS Library provides a new updated API, in addition
to the existing legacy API. This section discusses why a new API is provided, the
advantages of using it, and the differences with the existing legacy API.
The new cuBLAS library API can be used by including the header file "cublas_v2.h". It
has the following features that the legacy cuBLAS API does not have:
- The handle to the cuBLAS library context is initialized using the function
  cublasCreate() and is explicitly passed to every subsequent library function call.
  This allows the user to have more control over the library setup when using multiple
  host threads and multiple GPUs. This also allows the cuBLAS APIs to be reentrant.
- The scalars α and β can be passed by reference on the host or the device, instead of
  only being allowed to be passed by value on the host. This change allows library
  functions to execute asynchronously using streams even when α and β are generated
  by a previous kernel.
- When a library routine returns a scalar result, it can be returned by reference on
  the host or the device, instead of only being allowed to be returned by value on the
  host. This change allows library routines to be called asynchronously when the scalar
  result is generated and returned by reference on the device, resulting in maximum
  parallelism.
- The error status cublasStatus_t is returned by all cuBLAS library function calls.
  This change facilitates debugging and simplifies software development. Note that
  cublasStatus was renamed cublasStatus_t to be more consistent with other types in
  the cuBLAS library.
- The cublasAlloc() and cublasFree() functions have been deprecated. This change
  removes these unnecessary wrappers around cudaMalloc() and cudaFree(), respectively.
- The function cublasSetKernelStream() was renamed cublasSetStream() to be more
  consistent with the other CUDA libraries.
The legacy cuBLAS API, explained in more detail in Appendix A, can be used by
including the header file "cublas.h". Since the legacy API is identical to the previously
released cuBLAS library API, existing applications will work out of the box and
automatically use this legacy API without any source code changes. In general, new
applications should not use the legacy cuBLAS API, and existing applications
should convert to the new API if they require sophisticated and optimal stream
parallelism or if they call cuBLAS routines concurrently from multiple threads. For the rest
of the document, the new cuBLAS Library API will simply be referred to as the cuBLAS
Library API.
As mentioned earlier, the interfaces to the legacy and the new cuBLAS library APIs are the
header files "cublas.h" and "cublas_v2.h", respectively. In addition, applications using
the cuBLAS library need to link against the DSO cublas.so (Linux), the DLL cublas.dll
(Windows), or the dynamic library cublas.dylib (Mac OS X). Note: the same dynamic
library implements both the new and legacy cuBLAS APIs.
1.3. Example code
For sample code references please see the two examples below. They show an
application written in C using the cuBLAS library API with two indexing styles
(Example 1. "Application Using C and CUBLAS: 1-based indexing" and Example 2.
"Application Using C and CUBLAS: 0-based Indexing").
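//Example 1. Application Using C and CUBLAS: 1-based indexing
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
// Scale row p (columns q..n) by alpha, then column q (rows p..ldm) by beta.
static inline void modify (cublasHandle_t handle, float *m, int ldm, int n,
                           int p, int q, float alpha, float beta) {
    // n-q+1 elements remain in row p from column q on; stride ldm steps across columns
    cublasSscal (handle, n-q+1, &alpha, &m[IDX2F(p, q, ldm)], ldm);
    // ldm-p+1 elements remain in column q from row p on; stride 1 steps down rows
    cublasSscal (handle, ldm-p+1, &beta, &m[IDX2F(p, q, ldm)], 1);
}
int main (void) {
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float* devPtrA;
    float* a = 0;
    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed");
        return EXIT_FAILURE;
    }
    // Fill the host matrix in column-major order with 1-based indices
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            a[IDX2F(i, j, M)] = (float)((i-1) * M + j);
        }
    }
    cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
    if (cudaStat != cudaSuccess) {
        printf ("device memory allocation failed");
        return EXIT_FAILURE;
    }
    stat = cublasCreate (&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("CUBLAS initialization failed\n");
        return EXIT_FAILURE;
    }
    // Copy the matrix from the host to the device
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed");
        cudaFree (devPtrA);
        cublasDestroy (handle);
        return EXIT_FAILURE;
    }
    modify (handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    // Copy the modified matrix back to the host
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed");
        cudaFree (devPtrA);
        cublasDestroy (handle);
        return EXIT_FAILURE;
    }
    cudaFree (devPtrA);
    cublasDestroy (handle);
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            printf ("%7.0f", a[IDX2F(i, j, M)]);
        }
        printf ("\n");
    }
    free (a);
    return EXIT_SUCCESS;
}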
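
To make the role of the stride argument in Example 1 easier to see, here is a host-only sketch that mirrors modify() without any GPU call. The functions ref_sscal, ref_modify and ref_fill are hypothetical stand-ins written for this illustration; the sketch assumes the row scaling covers columns q through n (n-q+1 elements), which keeps every access inside the ldm-by-n allocation.

```c
#include <assert.h>

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

/* Host-only stand-in for cublasSscal: x[k*incx] *= alpha for k = 0..n-1. */
static void ref_sscal(int n, float alpha, float *x, int incx) {
    assert(n >= 0 && incx > 0);
    for (int k = 0; k < n; ++k) {
        x[k * incx] *= alpha;
    }
}

/* Host-only mirror of modify() from Example 1 (p and q are 1-based). */
static void ref_modify(float *m, int ldm, int n, int p, int q,
                       float alpha, float beta) {
    /* Stride ldm moves one column to the right: scales row p, columns q..n. */
    ref_sscal(n - q + 1, alpha, &m[IDX2F(p, q, ldm)], ldm);
    /* Stride 1 moves one row down: scales column q, rows p..ldm. */
    ref_sscal(ldm - p + 1, beta, &m[IDX2F(p, q, ldm)], 1);
}

/* Fill an ldm x n column-major matrix the same way Example 1 does. */
static void ref_fill(float *a, int ldm, int n) {
    for (int j = 1; j <= n; ++j) {
        for (int i = 1; i <= ldm; ++i) {
            a[IDX2F(i, j, ldm)] = (float)((i - 1) * ldm + j);
        }
    }
}
```

With the arguments used in Example 1 (a 6-by-5 matrix, p = 2, q = 3, alpha = 16, beta = 12), element (2,3) lies on both the scaled row and the scaled column, so its initial value 9 becomes 9 * 16 * 12 = 1728, while (2,4) is only multiplied by 16 and (3,3) only by 12.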
//Example 2. Application Using C and CUBLAS: 0-based indexing
Chapter 2.
2.1. General description
This section describes how to use the cuBLAS library API. It does not contain a detailed
reference for all API datatypes and functions; those are provided in subsequent chapters.
The Legacy cuBLAS API is also not covered in this section; that is handled in an
appendix.
2.1.1. Error status
All cuBLAS library function calls return the error status cublasStatus_t.
2.1.2. cuBLAS context
The application must initialize the handle to the cuBLAS library context by calling the
cublasCreate() function. Then, the handle is explicitly passed to every subsequent library
function call. Once the application finishes using the library, it must call the function
cublasDestroy() to release the resources associated with the cuBLAS library context.
This approach allows the user to explicitly control the library setup when using
multiple host threads and multiple GPUs. For example, the application can use
cudaSetDevice() to associate different devices with different host threads and in each
of those host threads it can initialize a unique handle to the cuBLAS library context,
which will use the particular device associated with that host thread. Then, the cuBLAS
library function calls made with different handles will automatically dispatch the
computation to different devices.
The device associated with a particular cuBLAS context is assumed to remain
unchanged between the corresponding cublasCreate() and cublasDestroy() calls.
In order for the cuBLAS library to use a different device in the same host thread, the
application must set the new device to be used by calling cudaSetDevice() and then
create another cuBLAS context, which will be associated with the new device, by calling
cublasCreate().
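
The context rules above can be sketched as a small single-threaded program that creates one handle per device and releases each one with its device selected again. This is a minimal sketch, not a reference implementation: the MAX_DEVICES cap and the overall program structure are assumptions made for the illustration, and it needs a CUDA toolkit and at least one GPU to run.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

#define MAX_DEVICES 16  /* illustrative cap for this sketch, not a cuBLAS limit */

int main(void) {
    cublasHandle_t handles[MAX_DEVICES];
    int count = 0, created = 0, status = EXIT_SUCCESS;

    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 1) {
        printf("no CUDA device found\n");
        return EXIT_FAILURE;
    }
    if (count > MAX_DEVICES) count = MAX_DEVICES;

    /* One context per device: cublasCreate() binds the new handle to the
       device that is current when it is called. */
    for (created = 0; created < count; ++created) {
        if (cudaSetDevice(created) != cudaSuccess ||
            cublasCreate(&handles[created]) != CUBLAS_STATUS_SUCCESS) {
            printf("context setup failed on device %d\n", created);
            status = EXIT_FAILURE;
            break;
        }
    }

    /* ... cuBLAS calls made with handles[d] would run on device d ... */

    /* Release each context with its own device selected again. */
    for (int d = 0; d < created; ++d) {
        cudaSetDevice(d);
        cublasDestroy(handles[d]);
    }
    return status;
}
```

Every call is checked against cudaSuccess or CUBLAS_STATUS_SUCCESS, which is also the basic pattern for handling the error status described in section 2.1.1.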
cuBLAS Library
Using the cuBLAS API
2.1.3. Thread safety
The library is thread safe and its functions can be called from multiple host threads,
even with the same handle. When multiple threads share the same handle, extreme care
needs to be taken when the handle configuration is changed because that change will
potentially affect subsequent cuBLAS calls in all threads. This is even more true for the
destruction of the handle. So it is not recommended that multiple threads share the same
cuBLAS handle.
2.1.4. Results reproducibility
By design, all cuBLAS API routines from a given toolkit version generate the same bit-wise
results at every run when executed on GPUs with the same architecture and the
same number of SMs. However, bit-wise reproducibility is not guaranteed across toolkit
versions because the implementation might change from one version to the next.
For some routines such as cublas<t>symv and cublas<t>hemv, an
alternate, significantly faster routine can be chosen using the routine
cublasSetAtomicsMode(). In that case, the results are not guaranteed to be bit-wise
reproducible because atomics are used for the computation.
2.1.5. Scalar Parameters
There are two categories of functions that use scalar parameters:
- functions that take alpha and/or beta parameters by reference on the host or the
  device as scaling factors, such as gemm
- functions that return a scalar result on the host or the device, such as amax(),
  amin(), asum(), rotg(), rotmg(), dot() and nrm2()
For the functions of the first category, when the pointer mode is set to
CUBLAS_POINTER_MODE_HOST, the scalar parameters alpha and/or beta can be
on the stack or allocated on the heap. Underneath, the CUDA kernels related to
those functions will be launched with the value of alpha and/or beta. Therefore, if
they were allocated on the heap, they can be freed just after the return of the call
even though the kernel launch is asynchronous. When the pointer mode is set to
CUBLAS_POINTER_MODE_DEVICE, alpha and/or beta must be accessible on the
device and their values should not be modified until the kernel is done. Note that since
cudaFree() does an implicit cudaDeviceSynchronize(), cudaFree() can still be
called on alpha and/or beta just after the call, but it would defeat the purpose of using
this pointer mode in that case.
For the functions of the second category, when the pointer mode is set to
CUBLAS_POINTER_MODE_HOST, these functions block the CPU until the GPU has
completed its computation and the results have been copied back to the host. When
the pointer mode is set to CUBLAS_POINTER_MODE_DEVICE, these functions return
immediately. In this case, similarly to matrix and vector results, the scalar result is ready
only when execution of the routine on the GPU has completed. This requires proper
synchronization in order to read the result from the host.
In either case, the pointer mode CUBLAS_POINTER_MODE_DEVICE allows the library
functions to execute completely asynchronously from the host even when alpha
and/or beta are generated by a previous kernel. For example, this situation can arise
when iterative methods for solution of linear systems and eigenvalue problems are
implemented using the cuBLAS library.
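
The device pointer mode can be illustrated with a scalar that never leaves the GPU: a dot product writes its result to device memory, and that result is then used as the scaling factor of a following call. This is a minimal sketch under stated assumptions: the vector contents, the variable names and the choice of cublasSdot followed by cublasSscal are illustrative only, and the program needs a CUDA toolkit and a GPU to run.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

#define N 4

int main(void) {
    const float hx[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    float hy[N] = {1.0f, 1.0f, 1.0f, 1.0f};
    float *dx = NULL, *dy = NULL, *dscalar = NULL;
    cublasHandle_t handle;
    int ok = 0;

    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return EXIT_FAILURE;

    if (cudaMalloc((void **)&dx, N * sizeof(float)) == cudaSuccess &&
        cudaMalloc((void **)&dy, N * sizeof(float)) == cudaSuccess &&
        cudaMalloc((void **)&dscalar, sizeof(float)) == cudaSuccess &&
        cublasSetVector(N, sizeof(float), hx, 1, dx, 1) == CUBLAS_STATUS_SUCCESS &&
        cublasSetVector(N, sizeof(float), hy, 1, dy, 1) == CUBLAS_STATUS_SUCCESS &&
        /* From here on, scalar arguments are device pointers. */
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE)
            == CUBLAS_STATUS_SUCCESS &&
        /* dscalar = dot(x, x); the call returns without waiting for the GPU. */
        cublasSdot(handle, N, dx, 1, dx, 1, dscalar) == CUBLAS_STATUS_SUCCESS &&
        /* y = dscalar * y; alpha is read on the device, no host round trip. */
        cublasSscal(handle, N, dscalar, dy, 1) == CUBLAS_STATUS_SUCCESS &&
        /* The copy back to the host completes before hy is read below. */
        cublasGetVector(N, sizeof(float), dy, 1, hy, 1) == CUBLAS_STATUS_SUCCESS) {
        ok = 1;
    }

    if (ok) printf("y[0] = %g\n", hy[0]);  /* y[i] = dot(x, x) * 1 */

    cudaFree(dscalar);
    cudaFree(dy);
    cudaFree(dx);
    cublasDestroy(handle);
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Because both library calls are issued on the same stream, the scaling call sees the finished dot product without any explicit synchronization from the host; the only point where the host waits is the final copy of y.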
2.1.6. Parallelism with streams
If the application uses the results computed by multiple independent tasks, CUDA
streams can be used to overlap the computation performed in these tasks.
The application can conceptually associate each stream with each task. In order to
achieve the overlap of computation between the tasks, the user should create CUDA
streams using the function cudaStreamCreate() and set the stream to be used by each
individual cuBLAS library routine by calling cublasSetStream() just before calling
the actual cuBLAS routine. Then, the computation performed in separate streams would
be overlapped automatically when possible on the GPU. This approach is especially
useful when the computation performed by a single task is relatively small and is not
enough to fill the GPU with work.
We recommend using the new cuBLAS API with scalar parameters and results passed
by reference in the device memory to achieve maximum overlap of the computation
when using streams.
A particular application of streams, batching of multiple small kernels, is described
in the following section.
2.1.7. Batching Kernels
In this section, we explain how to use streams to batch the execution of small
kernels. For instance, suppose that we have an application where we need to make many
small independent matrix-matrix multiplications with dense matrices.
It is clear that even with millions of small independent matrices we will not be able to
achieve the same GFLOPS rate as with one large matrix. For example, a single $n \times n$
large matrix-matrix multiplication performs $n^3$ operations for $n^2$ input size, while 1024
$\frac{n}{32} \times \frac{n}{32}$ small matrix-matrix multiplications perform
$1024 \left(\frac{n}{32}\right)^3 = \frac{n^3}{32}$ operations for the
same input size. However, it is also clear that we can achieve a significantly better
performance with many small independent matrices compared with a single small
matrix.
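
The operation counts in this comparison can be checked with a few lines of arithmetic. The sketch below uses the same simplified cost model as the text (an n-by-n product counts as n^3 operations); the function names ops_single, ops_batched and input_size are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Cost model used in the text: one n x n matrix product counts n^3 operations. */
static unsigned long long ops_single(unsigned long long n) {
    return n * n * n;
}

/* Total count for `count` independent products with side length n / side_div. */
static unsigned long long ops_batched(unsigned long long n,
                                      unsigned long long count,
                                      unsigned long long side_div) {
    assert(side_div > 0 && n % side_div == 0);
    unsigned long long s = n / side_div;
    return count * s * s * s;
}

/* Elements per operand for `count` square matrices with the given side. */
static unsigned long long input_size(unsigned long long side,
                                     unsigned long long count) {
    return count * side * side;
}
```

For n = 1024, one large product counts 1024^3 = 1073741824 operations, while 1024 products of 32-by-32 matrices count 1024 * 32^3 = 33554432 operations on the same number of input elements, which is the factor of 32 stated above.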
The architecture family of GPUs allows us to execute multiple kernels simultaneously.
Hence, in order to batch the execution of independent kernels, we can run each of
them in a separate stream. In particular, in the above example we could create 1024
CUDA streams using the function cudaStreamCreate(), then preface each call to
cublas<t>gemm() with a call to cublasSetStream() with a different stream for each
of the matrix-matrix multiplications. This will ensure that when possible the different
computations will be executed concurrently. Although the user can create many streams,
in practice it is not possible to have more than 16 concurrent kernels executing at the
same time.
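
The stream-per-multiplication pattern described here can be sketched as follows. This is a minimal sketch under stated assumptions: the batch size of 8 (instead of 1024), the 32-by-32 matrix size, the zero-filled inputs and the single-precision cublasSgemm call are choices made for the illustration, and the program needs a CUDA toolkit and a GPU to run.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

#define BATCH 8   /* illustrative; the text uses 1024 */
#define DIM   32  /* each matrix is DIM x DIM, column-major */

int main(void) {
    const float alpha = 1.0f, beta = 0.0f;
    const size_t elems = (size_t)DIM * DIM;
    const size_t bytes = BATCH * elems * sizeof(float);
    cublasHandle_t handle;
    cudaStream_t streams[BATCH];
    float *dA = NULL, *dB = NULL, *dC = NULL;
    int created = 0, status = EXIT_FAILURE;

    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return EXIT_FAILURE;

    /* All matrices of the batch live in three contiguous device buffers. */
    if (cudaMalloc((void **)&dA, bytes) == cudaSuccess &&
        cudaMalloc((void **)&dB, bytes) == cudaSuccess &&
        cudaMalloc((void **)&dC, bytes) == cudaSuccess &&
        cudaMemset(dA, 0, bytes) == cudaSuccess &&
        cudaMemset(dB, 0, bytes) == cudaSuccess) {

        for (created = 0; created < BATCH; ++created) {
            if (cudaStreamCreate(&streams[created]) != cudaSuccess) break;
        }

        if (created == BATCH) {
            status = EXIT_SUCCESS;
            for (int b = 0; b < BATCH; ++b) {
                size_t off = (size_t)b * elems;
                /* Select the stream just before the routine it should run in. */
                if (cublasSetStream(handle, streams[b]) != CUBLAS_STATUS_SUCCESS ||
                    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, DIM, DIM, DIM,
                                &alpha, dA + off, DIM, dB + off, DIM,
                                &beta, dC + off, DIM) != CUBLAS_STATUS_SUCCESS) {
                    status = EXIT_FAILURE;
                    break;
                }
            }
            /* Wait for every stream before the results are used or freed. */
            if (cudaDeviceSynchronize() != cudaSuccess) status = EXIT_FAILURE;
        }
    }

    cublasSetStream(handle, 0);  /* detach the handle from the last stream */
    for (int s = 0; s < created; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dC);
    cudaFree(dB);
    cudaFree(dA);
    cublasDestroy(handle);
    printf("%s\n", status == EXIT_SUCCESS ? "batch completed" : "batch failed");
    return status;
}
```

Whether the multiplications actually overlap depends on the device and on how much of the GPU each kernel occupies; the streams only make concurrent execution possible.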
2.1.8. Cache configuration
On some devices, L1 cache and shared memory use the same hardware resources.
The cache configuration can be set directly with the CUDA Runtime function
cudaDeviceSetCacheConfig. The cache configuration can also be set specifically for
some functions using the routine cudaFuncSetCacheConfig. Please refer to the CUDA
Runtime API documentation for details about the cache configuration settings.
Because switching from one configuration to another can affect kernel concurrency,
the cuBLAS Library does not set any cache configuration preference and relies on the
current setting. However, some cuBLAS routines, especially Level-3 routines, rely
heavily on shared memory. Thus, the cache preference setting might adversely affect
their performance.
2.1.9. Device API Library
Starting with release 5.0, the CUDA Toolkit now provides a static cuBLAS Library
cublas_device.a that contains device routines with the same API as the regular cuBLAS
Library. Those routines internally use the Dynamic Parallelism feature to launch kernels
from within and thus are only available for devices with compute capability at least equal
to 3.5.
In order to use those library routines from the device, the user must include the header
file "cublas_v2.h" corresponding to the new cuBLAS API and link against the static
cuBLAS library cublas_device.a.
Those device cuBLAS library routines are called from the device in exactly the same way
they are called from the host, with the following exceptions:
- The legacy cuBLAS API is not supported on the device.
- The pointer mode is not supported on the device; in other words, scalar input and
  output parameters must be allocated in device memory.
Furthermore, the input and output scalar parameters must be allocated and released
on the device using the cudaMalloc() and cudaFree() routines from the host, respectively,
or the malloc() and free() routines from the device; in other words, they cannot be passed by
reference from the local memory to the routines.
2.1.10. Static Library support
Starting with release 6.5, the cuBLAS Library is also delivered in a static form as
libcublas_static.a on Linux and Mac OSes. The static cuBLAS library and all other static
maths libraries depend on a common thread abstraction layer library called libculibos.a.
For example, on Linux, to compile a small application using cuBLAS against the
dynamic library, the following command can be used:
nvcc myCublasApp.c -lcublas -o myCublasApp
