Chapter 1.
INTRODUCTION
The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows the user to access the computational resources of the NVIDIA Graphics Processing Unit (GPU).

Starting with CUDA 6.0, the cuBLAS Library exposes two sets of API: the regular cuBLAS API, which is simply called the cuBLAS API in this document, and the CUBLASXT API.

To use the cuBLAS API, the application must allocate the required matrices and vectors in the GPU memory space, fill them with data, call the sequence of desired cuBLAS functions, and then upload the results from the GPU memory space back to the host. The cuBLAS API also provides helper functions for writing and retrieving data from the GPU.

To use the CUBLASXT API, the application must keep the data on the host; the library will take care of dispatching the operation to one or multiple GPUs present in the system, depending on the user request.
1.1. Data layout
For maximum compatibility with existing Fortran environments, the cuBLAS library uses column-major storage and 1-based indexing. Since C and C++ use row-major storage, applications written in these languages cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays. For Fortran code ported to C in mechanical fashion, one may choose to retain 1-based indexing to avoid the need to transform loops. In this case, the array index of a matrix element in row "i" and column "j" can be computed via the following macro:

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

Here, ld refers to the leading dimension of the matrix, which in the case of column-major storage is the number of rows of the allocated matrix (even if only a submatrix of it is being used). For natively written C and C++ code, one would most likely choose 0-based
indexing, in which case the array index of a matrix element in row "i" and column "j" can be computed via the following macro:

#define IDX2C(i,j,ld) (((j)*(ld))+(i))
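As a brief illustration, the following snippet (not part of the original manual; the array name and dimensions are chosen purely for illustration) fills a 3x2 column-major matrix with 1-based indexing and reads the same element back with both macros:

#include <stdio.h>

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main (void){
    float a[3*2];                     /* 3 rows, 2 columns, column-major */
    int i, j;
    for (j = 1; j <= 2; j++)          /* 1-based fill via IDX2F */
        for (i = 1; i <= 3; i++)
            a[IDX2F(i,j,3)] = (float)(10*i + j);
    /* row 2, column 1: 1-based (2,1) is the same element as 0-based (1,0) */
    printf ("%g %g\n", a[IDX2F(2,1,3)], a[IDX2C(1,0,3)]);  /* prints 21 21 */
    return 0;
}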
1.2. New and Legacy cuBLAS API

Starting with version 4.0, the cuBLAS Library provides a new updated API, in addition to the existing legacy API. This section discusses why a new API is provided, the advantages of using it, and the differences with the existing legacy API.
The new cuBLAS library API can be used by including the header file "cublas_v2.h". It has the following features that the legacy cuBLAS API does not have:

- The handle to the cuBLAS library context is initialized using the function cublasCreate() and is explicitly passed to every subsequent library function call. This allows the user to have more control over the library setup when using multiple host threads and multiple GPUs. This also allows the cuBLAS APIs to be reentrant.
- The scalars α and β can be passed by reference on the host or the device, instead of only being allowed to be passed by value on the host. This change allows library functions to execute asynchronously using streams even when α and β are generated by a previous kernel.
- When a library routine returns a scalar result, it can be returned by reference on the host or the device, instead of only being allowed to be returned by value on the host. This change allows library routines to be called asynchronously when the scalar result is generated and returned by reference on the device, resulting in maximum parallelism.
- The error status cublasStatus_t is returned by all cuBLAS library function calls. This change facilitates debugging and simplifies software development. Note that cublasStatus was renamed cublasStatus_t to be more consistent with other types in the cuBLAS library.
- The cublasAlloc() and cublasFree() functions have been deprecated. This change removes these unnecessary wrappers around cudaMalloc() and cudaFree(), respectively.
- The function cublasSetKernelStream() was renamed cublasSetStream() to be more consistent with the other CUDA libraries.
The legacy cuBLAS API, explained in more detail in Appendix A, can be used by including the header file "cublas.h". Since the legacy API is identical to the previously released cuBLAS library API, existing applications will work out of the box and automatically use this legacy API without any source code changes. In general, new applications should not use the legacy cuBLAS API, and existing applications should convert to using the new API if they require sophisticated and optimal stream parallelism, or if they call cuBLAS routines concurrently from multiple threads. For the rest of the document, the new cuBLAS Library API will simply be referred to as the cuBLAS Library API.

As mentioned earlier, the interfaces to the legacy and the cuBLAS library APIs are the header files "cublas.h" and "cublas_v2.h", respectively. In addition, applications using the cuBLAS library need to link against the DSO cublas.so (Linux), the DLL cublas.dll
(Windows), or the dynamic library cublas.dylib (Mac OS X). Note: the same dynamic library implements both the new and legacy cuBLAS APIs.
1.3. Example code

For sample code references please see the two examples below. They show an application written in C using the cuBLAS library API with two indexing styles
(Example 1. "Application Using C and CUBLAS: 1-based indexing" and Example 2. "Application Using C and CUBLAS: 0-based Indexing").
//Example 1. Application Using C and CUBLAS: 1-based indexing
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

/* scale part of row p by alpha and part of column q by beta, in place */
static __inline__ void modify (cublasHandle_t handle, float *m, int ldm,
                               int n, int p, int q, float alpha, float beta){
    cublasSscal (handle, n-p+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);
    cublasSscal (handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);
}

int main (void){
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float* devPtrA;
    float* a = 0;
    a = (float *)malloc (M * N * sizeof (*a));   /* host matrix, column-major */
    if (!a) {
        printf ("host memory allocation failed");
        return EXIT_FAILURE;
    }
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            a[IDX2F(i,j,M)] = (float)((i-1) * M + j);
        }
    }
    cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
    if (cudaStat != cudaSuccess) {
        printf ("device memory allocation failed");
        return EXIT_FAILURE;
    }
    stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("CUBLAS initialization failed\n");
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);  /* host -> device */
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed");
        cudaFree (devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    modify (handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);  /* device -> host */
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed");
        cudaFree (devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    cudaFree (devPtrA);
    cublasDestroy(handle);
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            printf ("%7.0f", a[IDX2F(i,j,M)]);
        }
        printf ("\n");
    }
    free(a);
    return EXIT_SUCCESS;
}
//Example 2. Application Using C and CUBLAS: 0-based indexing
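The listing below reconstructs the 0-based variant by mirroring Example 1 with the IDX2C macro and 0-based loops; the fill formula and the constants passed to modify() are patterned on Example 1 and should be read as assumptions rather than a verbatim copy of the manual.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

/* scale part of row p by alpha and part of column q by beta, in place */
static __inline__ void modify (cublasHandle_t handle, float *m, int ldm,
                               int n, int p, int q, float alpha, float beta){
    cublasSscal (handle, n-q, &alpha, &m[IDX2C(p,q,ldm)], ldm);
    cublasSscal (handle, ldm-p, &beta, &m[IDX2C(p,q,ldm)], 1);
}

int main (void){
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float* devPtrA;
    float* a = 0;
    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed");
        return EXIT_FAILURE;
    }
    for (j = 0; j < N; j++) {                /* 0-based fill via IDX2C */
        for (i = 0; i < M; i++) {
            a[IDX2C(i,j,M)] = (float)(i * M + j + 1);
        }
    }
    cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
    if (cudaStat != cudaSuccess) {
        printf ("device memory allocation failed");
        return EXIT_FAILURE;
    }
    stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("CUBLAS initialization failed\n");
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed");
        cudaFree (devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    modify (handle, devPtrA, M, N, 1, 2, 16.0f, 12.0f);
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed");
        cudaFree (devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    cudaFree (devPtrA);
    cublasDestroy(handle);
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            printf ("%7.0f", a[IDX2C(i,j,M)]);
        }
        printf ("\n");
    }
    free(a);
    return EXIT_SUCCESS;
}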
Chapter 2.
USING THE CUBLAS API

2.1. General description

This section describes how to use the cuBLAS library API. It does not contain a detailed reference for all API datatypes and functions; those are provided in subsequent chapters. The legacy cuBLAS API is also not covered in this section; that is handled in Appendix A.
2.1.1. Error status

All cuBLAS library function calls return the error status cublasStatus_t.
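Since every call returns a status, a small checking helper keeps call sites readable. The macro below is an illustrative sketch, not part of the cuBLAS API:

#include <stdio.h>
#include <stdlib.h>
#include "cublas_v2.h"

/* Illustrative helper (not part of cuBLAS): abort on any non-success status. */
#define CHECK_CUBLAS(call) do {                                   \
    cublasStatus_t s_ = (call);                                   \
    if (s_ != CUBLAS_STATUS_SUCCESS) {                            \
        fprintf (stderr, "cuBLAS error %d at %s:%d\n",            \
                 (int)s_, __FILE__, __LINE__);                    \
        exit (EXIT_FAILURE);                                      \
    }                                                             \
} while (0)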
2.1.2. cuBLAS context

The application must initialize the handle to the cuBLAS library context by calling the cublasCreate() function. Then, the handle is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the cuBLAS library context.

This approach allows the user to explicitly control the library setup when using multiple host threads and multiple GPUs. For example, the application can use cudaSetDevice() to associate different devices with different host threads, and in each of those host threads it can initialize a unique handle to the cuBLAS library context, which will use the particular device associated with that host thread. Then the cuBLAS library function calls made with different handles will automatically dispatch the computation to different devices.

The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls. In order for the cuBLAS library to use a different device in the same host thread, the application must set the new device to be used by calling cudaSetDevice() and then create another cuBLAS context, which will be associated with the new device, by calling cublasCreate().
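As a sketch of that pattern, one handle per device created from a single host thread; it assumes at least two CUDA devices are visible, and the function name is illustrative:

#include <cuda_runtime.h>
#include "cublas_v2.h"

/* Sketch: one cuBLAS handle per device. Each cublasCreate() binds its
   context to the device current at creation time. */
int create_handles (cublasHandle_t h[2]) {
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice (dev);                   /* select device dev */
        if (cublasCreate (&h[dev]) != CUBLAS_STATUS_SUCCESS)
            return -1;
    }
    return 0;   /* calls made with h[0] run on device 0, h[1] on device 1 */
}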
2.1.3. Thread safety

The library is thread safe and its functions can be called from multiple host threads, even with the same handle. When multiple threads share the same handle, extreme care needs to be taken when the handle configuration is changed, because that change will potentially affect subsequent cuBLAS calls in all threads. This is even more true for the destruction of the handle. So it is not recommended that multiple threads share the same cuBLAS handle.
2.1.4. Results reproducibility

By design, all cuBLAS API routines from a given toolkit version generate the same bit-wise results at every run when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility is not guaranteed across toolkit versions because the implementation might differ due to some implementation changes.

For some routines such as cublas<t>symv and cublas<t>hemv, an alternate significantly faster routine can be chosen using the routine cublasSetAtomicsMode(). In that case, the results are not guaranteed to be bit-wise reproducible because atomics are used for the computation.
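A minimal sketch of opting into the faster, non-reproducible path and then restoring the default; the surrounding function is illustrative:

#include "cublas_v2.h"

/* Sketch: allow atomics for speed, then restore the reproducible default. */
void run_symv_fast (cublasHandle_t handle) {
    cublasSetAtomicsMode (handle, CUBLAS_ATOMICS_ALLOWED);
    /* ... call cublasSsymv()/cublasChemv() here ... */
    cublasSetAtomicsMode (handle, CUBLAS_ATOMICS_NOT_ALLOWED);
}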
2.1.5. Scalar Parameters

There are two categories of the functions that use scalar parameters:

- functions that take alpha and/or beta parameters by reference on the host or the device as scaling factors, such as gemm();
- functions that return a scalar result on the host or the device, such as amax(), amin(), asum(), rotg(), rotmg(), dot() and nrm2().

For the functions of the first category, when the pointer mode is set to CUBLAS_POINTER_MODE_HOST, the scalar parameters alpha and/or beta can be on the stack or allocated on the heap. Underneath, the CUDA kernels related to those functions will be launched with the value of alpha and/or beta. Therefore, if they were allocated on the heap, they can be freed just after the return of the call, even though the kernel launch is asynchronous. When the pointer mode is set to CUBLAS_POINTER_MODE_DEVICE, alpha and/or beta must be accessible on the device, and their values should not be modified until the kernel is done. Note that since cudaFree() does an implicit cudaDeviceSynchronize(), cudaFree() can still be called on alpha and/or beta just after the call, but it would defeat the purpose of using this pointer mode in that case.

For the functions of the second category, when the pointer mode is set to CUBLAS_POINTER_MODE_HOST, these functions block the CPU until the GPU has completed its computation and the results have been copied back to the host. When the pointer mode is set to CUBLAS_POINTER_MODE_DEVICE, these functions return immediately. In this case, similarly to matrix and vector results, the scalar result is ready only when execution of the routine on the GPU has completed. This requires proper synchronization in order to read the result from the host.
In either case, the pointer mode CUBLAS_POINTER_MODE_DEVICE allows the library functions to execute completely asynchronously from the host, even when alpha and/or beta are generated by a previous kernel. For example, this situation can arise when iterative methods for solution of linear systems and eigenvalue problems are implemented using the cuBLAS library.
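A minimal sketch of the second category with a device-side result; d_x is assumed to be a device vector of length n already filled with data, and the wrapper function is illustrative:

#include <cuda_runtime.h>
#include "cublas_v2.h"

/* Sketch: nrm2 with the result left in device memory. The routine returns
   immediately; the result is valid only after synchronization. */
float norm_via_device_result (cublasHandle_t handle, const float *d_x, int n) {
    float *d_result, h_result;
    cudaMalloc ((void**)&d_result, sizeof(float));
    cublasSetPointerMode (handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasSnrm2 (handle, n, d_x, 1, d_result);   /* non-blocking */
    cudaDeviceSynchronize ();                    /* result ready after sync */
    cudaMemcpy (&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree (d_result);
    cublasSetPointerMode (handle, CUBLAS_POINTER_MODE_HOST);
    return h_result;
}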
2.1.6. Parallelism with streams

If the application uses the results computed by multiple independent tasks, CUDA streams can be used to overlap the computation performed in these tasks.

The application can conceptually associate each stream with each task. In order to achieve the overlap of computation between the tasks, the user should create CUDA streams using the function cudaStreamCreate() and set the stream to be used by each individual cuBLAS library routine by calling cublasSetStream() just before calling the actual cuBLAS routine. Then the computation performed in separate streams will be overlapped automatically when possible on the GPU. This approach is especially useful when the computation performed by a single task is relatively small and is not enough to fill the GPU with work.

We recommend using the new cuBLAS API with scalar parameters and results passed by reference in the device memory to achieve maximum overlap of the computation when using streams.
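As a small sketch of the pattern, two independent vector scalings issued in two streams; the vector names and the helper function are illustrative:

#include <cuda_runtime.h>
#include "cublas_v2.h"

/* Sketch: overlap two independent scalings. d_x and d_y are assumed device
   vectors of length n. */
void scale_two_vectors (cublasHandle_t handle, float *d_x, float *d_y, int n) {
    cudaStream_t s0, s1;
    const float a0 = 2.0f, a1 = 3.0f;
    cudaStreamCreate (&s0);
    cudaStreamCreate (&s1);
    cublasSetStream (handle, s0);          /* route the next call to s0 */
    cublasSscal (handle, n, &a0, d_x, 1);
    cublasSetStream (handle, s1);          /* route the next call to s1 */
    cublasSscal (handle, n, &a1, d_y, 1);
    cudaStreamSynchronize (s0);
    cudaStreamSynchronize (s1);
    cudaStreamDestroy (s0);
    cudaStreamDestroy (s1);
}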
A particular application of streams, batching of multiple small kernels, is described below.
2.1.7. Batching Kernels

In this section we will explain how to use streams to batch the execution of small kernels. For instance, suppose that we have an application where we need to make many small independent matrix-matrix multiplications with dense matrices.
It is clear that even with millions of small independent matrices we will not be able to achieve the same GFLOPS rate as with one large matrix. For example, a single n×n large matrix-matrix multiplication performs n^3 operations for n^2 input size, while 1024 (n/32)×(n/32) small matrix-matrix multiplications perform 1024·(n/32)^3 = n^3/32 operations for the same input size. However, it is also clear that we can achieve a significantly better performance with many small independent matrices compared with a single small matrix.
The architecture family of GPUs allows us to execute multiple kernels simultaneously. Hence, in order to batch the execution of independent kernels, we can run each of them in a separate stream. In particular, in the above example we could create 1024 CUDA streams using the function cudaStreamCreate(), then preface each call to cublas<t>gemm() with a call to cublasSetStream() with a different stream for each of the matrix-matrix multiplications. This will ensure that when possible the different computations will be executed concurrently. Although the user can create many streams, in practice it is not possible to have more than 16 concurrent kernels executing at the same time.
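A condensed sketch of that recipe; the array names d_A, d_B, d_C and the assumption of n×n column-major device matrices with leading dimension n are illustrative, not from the manual:

#include <cuda_runtime.h>
#include "cublas_v2.h"

#define BATCH 1024

/* Sketch: one stream per small gemm so that independent multiplications
   can run concurrently when the hardware allows it. */
void batched_gemm (cublasHandle_t handle, int n,
                   float *d_A[], float *d_B[], float *d_C[]) {
    static cudaStream_t streams[BATCH];
    const float alpha = 1.0f, beta = 0.0f;
    for (int k = 0; k < BATCH; k++)
        cudaStreamCreate (&streams[k]);
    for (int k = 0; k < BATCH; k++) {
        cublasSetStream (handle, streams[k]);   /* route this gemm */
        cublasSgemm (handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                     &alpha, d_A[k], n, d_B[k], n, &beta, d_C[k], n);
    }
    cudaDeviceSynchronize ();                   /* wait for all batches */
    for (int k = 0; k < BATCH; k++)
        cudaStreamDestroy (streams[k]);
}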
2.1.8. Cache configuration

On some devices, L1 cache and shared memory use the same hardware resources. The cache configuration can be set directly with the CUDA Runtime function cudaDeviceSetCacheConfig. The cache configuration can also be set specifically for some functions using the routine cudaFuncSetCacheConfig. Please refer to the CUDA Runtime API documentation for details about the cache configuration settings.

Because switching from one configuration to another can affect kernels concurrency, the cuBLAS Library does not set any cache configuration preference and relies on the current setting. However, some cuBLAS routines, especially Level-3 routines, rely heavily on shared memory. Thus the cache preference setting might affect adversely their performance.
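Accordingly, an application dominated by Level-3 cuBLAS calls might opt into a shared-memory-heavy configuration before calling into the library; a one-call sketch, to be validated by measurement:

#include <cuda_runtime.h>

/* Sketch: favor shared memory over L1 for the whole device, which may
   help shared-memory-heavy Level-3 routines. */
void prefer_shared_memory (void) {
    cudaDeviceSetCacheConfig (cudaFuncCachePreferShared);
}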
2.1.9. Device API Library

Starting with release 5.0, the CUDA Toolkit now provides a static cuBLAS Library cublas_device.a that contains device routines with the same API as the regular cuBLAS Library. Those routines use internally the Dynamic Parallelism feature to launch kernels from within, and thus are only available for devices with compute capability at least equal to 3.5.

In order to use those library routines from the device, the user must include the header file "cublas_v2.h" corresponding to the new cuBLAS API and link against the static cuBLAS library cublas_device.a.

Those device cuBLAS library routines are called from the device in exactly the same way they are called from the host, with the following exceptions (a sketch of the call shape follows this list):

- The legacy cuBLAS API is not supported on the device.
- The pointer mode is not supported on the device; in other words, scalar input and output parameters must be allocated in the device memory.

Furthermore, the input and output scalar parameters must be allocated and released on the device using the cudaMalloc() and cudaFree() routines from the host, respectively, or the malloc() and free() routines from the device; in other words, they cannot be passed by reference from the local memory to the routines.
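A rough sketch of the device-side call shape, patterned on the CUDA simpleDevLibCUBLAS sample rather than quoted from this manual; the kernel name and sizes are illustrative, and the code must be compiled with relocatable device code (e.g. -arch=sm_35 -rdc=true) and linked with -lcublas_device:

#include "cublas_v2.h"

/* Sketch: a kernel that creates its own cuBLAS context and issues a gemm
   from the device. d_alpha and d_beta point to device-visible memory,
   since pointer mode is not supported here. */
__global__ void gemm_from_device (int n, const float *d_A, const float *d_B,
                                  float *d_C, const float *d_alpha,
                                  const float *d_beta) {
    cublasHandle_t handle;
    if (cublasCreate (&handle) != CUBLAS_STATUS_SUCCESS) return;
    cublasSgemm (handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 d_alpha, d_A, n, d_B, n, d_beta, d_C, n);
    cublasDestroy (handle);
}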
2.1.10. Static Library support

Starting with release 6.5, the cuBLAS Library is also delivered in a static form as libcublas_static.a on Linux and Mac OSes. The static cuBLAS library and all other static maths libraries depend on a common thread abstraction layer library called libculibos.a.

For example, on Linux, to compile a small application using cuBLAS against the dynamic library, the following command can be used:

nvcc myCublasApp.c -lcublas -o myCublasApp
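For the static counterpart, the link line would presumably pull in both the static archive and the thread abstraction layer named above; a sketch:

nvcc myCublasApp.c -lcublas_static -lculibos -o myCublasApp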