A Framework for Generating High Throughput CNN Implementations on FPGAs
Description: an FPGA hardware acceleration scheme for deep learning that achieves high-throughput CNN implementations. (FPGA '18, February 25-27, Monterey, CA, USA; Session 3: Deep Learning.)
maps. Let b, n and m index into the Batch, fin and fout dimensions. Equation 4 specifies the operations of a convolution layer.

[Table 1: Variation of model parameters: for each CNN, the number of convolution layers and the maximum/minimum of fin, limg and kern.]
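The equation itself is garbled in this extraction. A standard form consistent with the stated indexing (b over Batch, n over fin, m over fout) would be the following hedged reconstruction; the exact notation in the paper may differ:

```latex
% Hedged reconstruction of Equation 4 (the extracted text is garbled).
% out^{(b,m)} is the m-th output map of image b, in^{(b,n)} the n-th input map,
% kern^{(m,n)} the kernel connecting input map n to output map m, * is 2D convolution.
\[
  \mathrm{out}^{(b,\,m)} \;=\; \sum_{n=1}^{f_{in}} \mathrm{in}^{(b,\,n)} \ast \mathrm{kern}^{(m,\,n)},
  \qquad 1 \le b \le \mathrm{Batch},\; 1 \le m \le f_{out}.
\]
```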
3.2 CaP for Reducing Wasted Computations
OaA requires the shape of each partition to be N × N. The analysis of the equation ignores the useless computation on the zero paddings of P. Such an approximation is not always valid, as can be seen in Figure 2a when N = 32 or 64. Two examples are shown in Figure 3. Scenario 1 is for deep layers when limg is small, and scenario 2 happens when limg is larger.

One possible solution is to select an appropriate N which fits the limg of most layers well. The first problem is, this technique significantly …

(2) Duality of the OaA and CaP operations: OaA partitions images, and CaP combines images. OaA processes a set of matrices by overlapping pixels (Step 4 in Figure 1), and CaP processes a set of matrices by padding pixels (Step CaP in Figure 4). Since CaP is a dual of OaA, we can extend the ⊕ operator (Section 2.1). If the superscript x is negative, then we use ⊕ to compute Step b in Figure 1. If x is positive, we use ⊕ to compute I′ = ⊕(D) in CaP.

In summary, we CaP the layer array so that an input of Batch × fin × limg² is reshaped to (Batch/d²) × fin × l′img², where l′img = d · limg + (d − 1)(kern − 1). We then apply OaA to I′. Abbreviate such operations as CaP-OaA.
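The CaP reshaping can be made concrete with a minimal NumPy sketch. It assumes CaP packs a d × d grid of equally sized feature maps with (kern − 1) zero rows/columns between neighbours, which reproduces the l′img = d·limg + (d − 1)(kern − 1) size given above; the function name and layout convention are illustrative, not the paper's code.

```python
import numpy as np

def cap_combine(images, kern):
    """Concatenate-and-Pad (CaP) sketch: pack a d*d batch of limg x limg
    feature maps into one large map, inserting (kern - 1) zero pixels
    between neighbouring images so their convolution responses do not mix."""
    d2, limg, _ = images.shape
    d = int(round(d2 ** 0.5))
    assert d * d == d2, "expects a d*d batch"
    pad = kern - 1
    big = d * limg + (d - 1) * pad          # l'img = d*limg + (d-1)(kern-1)
    out = np.zeros((big, big), dtype=images.dtype)
    for i in range(d):
        for j in range(d):
            r, c = i * (limg + pad), j * (limg + pad)
            out[r:r + limg, c:c + limg] = images[i * d + j]
    return out

# Example: 4 images of size 6x6 and a 3x3 kernel -> one 14x14 image (2*6 + 1*2).
batch = np.arange(4 * 6 * 6, dtype=np.float32).reshape(4, 6, 6)
print(cap_combine(batch, kern=3).shape)     # (14, 14)
```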
[Figure 5: Comparison of computation complexity (native frequency domain convolution, OaA with N = 16 and 32, and CaP-OaA with N = 16 and 32).]

It is worth noticing that the various frequency domain convolution algorithms discussed so far are closely related to each other. CaP-OaA reduces to OaA when d = 1. OaA further reduces to native frequency domain convolution when N ≥ limg + kern − 1. Therefore, CaP-OaA is the most general version among these frequency domain convolution algorithms. CaP-OaA also achieves the highest hardware efficiency.

We further quantitatively analyze the computation complexity of CaP-OaA. CaP introduces a new variable d whose value can be set to approximate the ceiling function in Equation 5. It can be shown that by setting d based on N − (kern − 1) and the gcd of the relevant dimensions (where gcd means Greatest Common Divisor), the complexity of CaP-OaA is …
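The closed-form expression is lost in this extraction. As a rough, assumption-laden proxy for it, the sketch below simply counts N × N FFT tiles per image for plain OaA versus CaP-OaA, searching the batch folding factor d up to the d ≤ 15 bound used later in Section 7.1. When (N − kern + 1) divides the folded image size the ceiling rounds away and the wasted padding disappears, which is what the gcd-based choice of d aims for. This is not the paper's complexity model, only an illustration of the tile-count arithmetic behind Figure 5.

```python
import math

def oaa_tiles(limg, kern, N):
    """Plain OaA: number of N x N FFT tiles for one limg x limg image,
    assuming an output stride of N - kern + 1 per dimension."""
    stride = N - kern + 1
    return math.ceil(limg / stride) ** 2

def cap_oaa_tiles(limg, kern, N, d_max=15):
    """CaP-OaA sketch: fold a d x d batch into one image of size
    d*limg + (d - 1)*(kern - 1), tile it, and amortize over d*d images.
    Returns (best_d, tiles_per_image)."""
    stride = N - kern + 1
    best_d, best = 1, oaa_tiles(limg, kern, N)
    for d in range(2, d_max + 1):
        big = d * limg + (d - 1) * (kern - 1)
        per_image = math.ceil(big / stride) ** 2 / (d * d)
        if per_image < best:
            best_d, best = d, per_image
    return best_d, best

# Gains are largest for small (deep-layer) feature maps and accumulate
# over all layers of a network.
for limg in (224, 240, 15):
    print(limg, oaa_tiles(limg, 3, 16), cap_oaa_tiles(limg, 3, 16))
```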
3.3 Frequency Domain Loop Tiling
The CaP-OaA technique manipulates the data dimensions limg and kern. To block the data of convolution layers into identical shapes, we still need optimization on the fin and fout dimensions.

We revisit Algorithm 1. Tiling of the loop dimensions in lines 5 and 6 performs partitioning of fin and fout. At runtime, the kernel filters and image data are partitioned into fixed shapes, and the tiles are loaded onto the FPGA. Tiling on top of CaP-OaA makes the data flow of diverse CNNs on a target device identical to each other. The tiling factor f is the same for various convolution layers. After CaP-OaA transforms the kernel filters and images to a uniform N × N shape, the value of f becomes independent of the CNN model parameters, and is solely bound by the on-chip memory size. The motivation for loop tiling is to reduce the communication volume to external memory by increased reuse of on-chip data [4]. For frequency domain convolution, a tradeoff exists between N and f to balance computation complexity and data reuse. Analysis of the algorithm-architecture co-design is given in Section 5.

Although loop optimization for CNNs on FPGAs has been extensively studied, previous work [4, 8, 12] focused on convolution in the space domain. Existing techniques cannot be directly applied to frequency domain CNNs, since the data flow of sliding window operations is different from that of Hadamard product operations. On the other hand, our three techniques proposed in Sections 3.1, 3.2 and 3.3 can all be understood as loop optimizations in the frequency domain: OaA is analogous to loop tiling of limg, and CaP is analogous to loop tiling and unrolling of the Batch dimension.

With the optimizations in Sections 3.1, 3.2 and 3.3, we derive …
[Figure 7: Design chart for hardware mapping. The x-axis is the FFT size N; design points fall in computation-bound and communication-bound regions.]

Devices with their roofline intersections falling at the blue vertical lines (N = 8, 16, 32, 64) have perfectly balanced resources in terms of L, M and B (e.g., devices with K_FPGA = 0.2, 0.5 or 1.18). For devices with K_FPGA falling between 0.2 to 0.35, 0.52 to 1.05, and 1.45 to 2.37, the designs are computation bound; the optimal N is 8, 16 and 32 respectively (these are shown by the three red marks a, b, c). Similarly, for other devices, the designs are communication bound (design points falling between b and b', and c and c'). Using the design chart, we can identify target devices that are best suited for our architecture.
6 AUTOMATIC CODE GENERATION
We have developed a tool [20] to automatically generate the architecture on the target device. Figure 8 shows the workflow of the tool. The inputs are the CNN model parameters of each convolution layer (limg, kern, fin and fout) and the meta data of the target device (B, L and M). The outputs include C++ code for book-keeping the data blocks (lines 1-5 and 14-18, Algorithm 2), and synthesizable Verilog performing the computationally expensive convolution (lines 6-13, Algorithm 2). The Mapping Engine feeds the CaP-OaA parameters (N, d) and the tiling factor (f) into the Software Generation Engine, and feeds architectural parameters into the Hardware Generation Engine. Optionally, users can specify additional constraints to the tool, such as the available FFT sizes and the maximum d.

[Figure 8: Tool workflow (Software Generation Engine emitting C++; Hardware Generation Engine emitting Verilog; Assembler).]

Software Generation. Although the optimal batch folding factor d varies across convolution layers, we use a uniform d for all layers of a CNN in the implementation. This ensures that the output of the previous layer can be directly fed into the following layer without further layout rearrangement.

Hardware Generation. The 2D FFT module consists of 1D FFT pipelines and Streaming Permutation Networks (SPN) for matrix transpose. We take the 1D FFT template from [13]. The SPN is a folded CLOS network including two spatial permutation stages and one temporal permutation stage. We implement the in-place permutation in time algorithm [2] to generate the control bits. The HAC includes a memory controller to fetch data from BUFI and BUFK. Since our architecture does not involve any runtime reconfiguration, the Hardware Generation Engine statically computes all the SPN control bits and HAC input addresses at design time. The Assembler connects the 2D FFT, 2D IFFT, HAC, BUFI and BUFK based on Figure 6.
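The decomposition of the 2D FFT into 1D FFT pipelines plus transposes (realized on chip by the SPN) can be checked numerically. The NumPy sketch below is purely illustrative and is unrelated to the generated Verilog:

```python
import numpy as np

# The accelerator builds its 2D FFT from 1D FFT pipelines and a matrix
# transpose. Numerically, that is: FFT2(X) = T( FFT_rows( T( FFT_rows(X) ) ) ).
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))

row_fft = np.fft.fft(X, axis=1)              # 1D FFTs over rows
two_d = np.fft.fft(row_fft.T, axis=1).T      # transpose, row FFTs, transpose back

assert np.allclose(two_d, np.fft.fft2(X))
print("2D FFT == row FFTs + transpose + row FFTs:", np.allclose(two_d, np.fft.fft2(X)))
```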
7 EXPERIMENTAL RESULTS

7.1 Experimental Setup
We use the Intel Heterogeneous Research Platform (HARP) [1] for evaluation. HARP has shared memory accessible to the CPU and FPGA. The FPGA is an Intel Stratix V GXA7 device, with 5 GB/s bandwidth to external memory, 6.25 MB on-chip memory, 256 DSPs and 234720 ALMs. The CPU of HARP is a 10-core Intel Xeon E5-2600 v2 processor. We use 16-bit fixed-point data representation to compute CNNs. The designs were synthesized by Quartus II (version 13.1.0).

In the following, throughput is calculated as the total number of operations for spatial convolution divided by the average execution time per image for our frequency domain approach. The numerator based on spatial convolution lets us make a fair comparison with other works. The denominator is the actual execution time on HARP.
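To make this metric concrete, here is a small sketch, assuming the common convention of counting one multiply-accumulate of spatial convolution as two operations (the paper's exact counting convention is not stated in this excerpt); the example layer is hypothetical:

```python
def spatial_conv_ops(layers):
    """Operation count of direct (spatial) convolution for one image,
    counting each multiply-accumulate as 2 operations (assumed convention)."""
    return sum(2 * lyr["kern"] ** 2 * lyr["fin"] * lyr["fout"] * lyr["out"] ** 2
               for lyr in layers)

def throughput_gops(layers, avg_time_per_image_s):
    # Numerator: spatial-convolution operations (for fair comparison with
    # other works); denominator: measured frequency-domain time per image.
    return spatial_conv_ops(layers) / avg_time_per_image_s / 1e9

# e.g. one hypothetical 3x3 layer, 256 -> 256 maps, 14 x 14 outputs:
layer = {"kern": 3, "fin": 256, "fout": 256, "out": 14}
print(throughput_gops([layer], avg_time_per_image_s=0.5e-3))
```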
The architecture for all CNNs under evaluation is configured as N = 16, f = 64, Uk = 3, Frow = 4, Fcol = 16. We set an upper limit for d (≤ 15) to bound the batch size.

For the workload distribution between FPGA and CPU: the FPGA executes all convolution layers of AlexNet, VGG16 and FCN-16s except the first convolution layer of AlexNet, while the CPU executes all the remaining layers (pooling, ReLU, fully connected, and the first convolution of AlexNet). In summary, the CPU executes 15%, 1% and 1% of the total computation for AlexNet, VGG16 and FCN-16s respectively. We implement the first convolution layer of AlexNet using the BLAS [17] library. By a simple batch processing pipeline, the execution time of the CPU is completely hidden by the FPGA.

7.2 Impact of Algorithmic Optimizations
To vary the input image size, we use AlexNet and VGG16 to execute feature extraction by skipping their fully-connected layers. We execute all layers of FCN-16s.

Effect of CaP. We use the architecture configuration as specified in Section 7.1. We vary limg of the first convolution layer from 160
to 304 for AlexNet and VGG16, and from 320 to 608 for FCN-16s (in other words, limg of the last convolution layer for the three CNNs varies from 10 to 19). Figure 9 shows the comparison of computation complexity for frequency domain convolution using CaP-OaA, OaA and spatial convolution. Each bar is vertically stacked by the number of operations for each convolution layer of the CNN. Figure 10 shows the comparison of the measured throughput on HARP.

When limg is divisible by (N − kern + 1) (e.g., limg = 224 for AlexNet and VGG16), the performance of OaA is identical to CaP-OaA. However, in other cases, CaP-OaA delivers much better performance than OaA. For example, when limg = 240 for AlexNet and VGG16, and limg = 352 for FCN-16s, CaP-OaA leads to 2.3×, 1.5× and 1.7× complexity reduction, and 2.3×, 1.5× and 1.7× throughput improvement. Furthermore, we observe that the performance of OaA is highly sensitive to image sizes. For AlexNet and VGG16, performance drops significantly when the image size increases from 224 to 240. This reflects the padding effect of OaA.
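As a concrete check of the divisibility condition, assume the Section 7.1 configuration N = 16 with a 3 × 3 kernel, so the OaA output stride is N − kern + 1 = 14. For limg = 224 = 14 × 16 the image tiles exactly, so CaP has nothing to recover; for limg = 240 = 14 × 17 + 2, plain OaA must round up to 18 tiles per dimension, and the last row and column of tiles are mostly zero padding. It is this wasted work that CaP-OaA reduces.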
Effect of N. Next, we experiment with how selecting various N affects the throughput of the system. Since the parameters N and f are together dependent on the on-chip memory size, by varying N we are exploring the effect of loop tiling as well. Figure 11 shows the normalized throughput of the three CNNs on the design chart when using N = 8, 16, 32. The corresponding f values are 128, 64 and 32. As predicted by the design chart, N = 16 is the best configuration on the Stratix-V GXA7 device. When N = 8, the increased computation complexity degrades the performance. When N = 32, the low data reuse makes external bandwidth the bottleneck. Furthermore, despite the dramatically different network structures, the normalized throughput of the three CNNs is very close to each other. This demonstrates the effect of our algorithmic optimization.

7.3 Comparison with State-of-the-Art
For AlexNet and VGG16, we use the ImageNet dataset (limg = 224). Table 2 summarizes the comparison with state-of-the-art designs. All the designs except [21] use similar or lower precision data representation than our designs. In [21], frequency domain convolution using the OaA technique was employed. However, their analysis was based on a metric called "delay-multiplier product", evaluating convolution of a single image rather than a complete layer. Using the same FPGA, we show 9.4× (AlexNet) and 5.4× (VGG16) speedup in throughput as a result of a deeper analysis of frequency domain convolution. All other works are based on spatial convolution.

Compared with [18], which uses the same target FPGA and data representation as this project, we achieve 5.8× speedup. Compared with [7], [8] and [12], when we use the same data representation (16-bit fixed point), our designs achieve 1.4×, 4.9× and 1.0× speedup, even though our target device has 14.0×, 3.4× and 5.9× fewer DSP resources. Using a device with 5.9× more DSPs, [22] achieves 2.7× higher throughput than us. One main reason is the difference in the clock rate. We cannot achieve a higher clock rate, since HARP requires the FPGA to operate at exactly 200 MHz.

To understand such significant improvement in throughput, we use [18] as an example to show the improvement breakdown. Out of the 5.8× improvement, approximately 3× comes from the reduction in computation complexity (Figure 2b). The remaining 2× comes from the clock rate improvement. The Hadamard product operation leads to a much smaller number of operations and a much simpler data flow compared with the sliding window operation.

To the best of our knowledge, this is the first work that accelerates FCN-16s on FPGAs. As shown in Figure 10, a throughput of approximately 550 GOPS is achieved for images of various sizes.

8 RELATED WORK
Accelerating spatial convolution has been extensively studied from the perspective of loop operation optimization [4, 12] and data flow optimization [3]. Work in [4] proposed a roofline model to capture various techniques including loop tiling, unrolling and interchanging. [12] further optimized performance by a thorough design space exploration. [22] boosted throughput under the OpenCL framework. Spatial convolution based approaches will eventually be bound by the computation complexity of the convolution algorithm. On the other hand, alternatives such as convolution by Winograd transform and frequency domain convolution have been proposed and implemented [10, 14, 21]. Winograd based approaches do not easily generalize to CNNs with various kernel window sizes. While the approaches based on frequency domain convolution are more flexible, further optimizations to [21] can be performed when processing the high dimensional data of convolution layers (this work).

9 CONCLUSION
We presented a framework for generating high throughput CNN accelerators. Combining the CaP, OaA and frequency domain loop tiling techniques, our framework generates architectures accelerating diverse CNNs without runtime reconfiguration.

In the future, we will explore a hybrid algorithm combining convolution in the space and frequency domains. Spatial convolution is as efficient as frequency domain convolution for 1 × 1 kernels. In such cases, we may switch to spatial convolution, which leads to better hardware utilization. In addition, as techniques have been developed to make use of the sparsity in spatial convolution, we will explore whether similar techniques can be applied in the frequency domain.

10 ACKNOWLEDGEMENTS
This work was supported by the US NSF under grants CNS-1643351, ACI-1339756 and CCF-1320211. This work is also supported in part by Intel Strategic Research Alliance funding. The equipment grant from the Intel Hardware Accelerator Research Program is gratefully acknowledged.
[Figure 9: Number of operations performed by various convolution algorithms (CaP-OaA, OaA and spatial) for AlexNet and VGG16 (image sizes 160-304) and FCN-16s (image sizes 320-608).]
Table 2: Comparison with state-of-the-art AlexNet and VGG16 implementations (FX: fixed point, FT: floating point)

Design            | [7]          | [18]           | [21]           | [8]        | [12]            | [22]            | [21]           | Proposed       | Proposed
CNN               | AlexNet      | AlexNet        | AlexNet        | VGG16      | VGG16           | VGG16           | VGG16          | AlexNet        | VGG16
FPGA              | Virtex VC709 | Stratix-V GXA7 | Stratix-V GXA7 | XC7Z045    | Arria-10 GX1150 | Arria-10 GX1150 | Stratix-V GXA7 | Stratix-V GXA7 | Stratix-V GXA7
Frequency (MHz)   | 156          | n/a            | n/a            | 150        | n/a             | n/a             | n/a            | 200            | 200
Precision         | 16-bit FX    | 8-16-bit FX    | 32-bit FT      | 16-bit FX  | 8-16-bit FX     | 16-bit FX       | 32-bit FT      | 16-bit FX      | 16-bit FX
DSP usage         | 2144 (60%)   | 256 (100%)     | 224 (88%)      | 780 (89%)  | 1518 (100%)     | 1378 (91%)      | 224 (88%)      | 256 (100%)     | 256 (100%)
Logic usage       | 274K (63%)   | 121K (52%)     | 200K (85%)     | 183K (84%) | 161K (38%)      | n/a             | 200K (85%)     | 107K (46%)     | 107K (46%)
On-chip RAM       | 956 (65%)    | 1152 (61%)     | 1208 (64%)     | 486 (87%)  | 1900 (70%)      | 1450 (53%)      | 1208 (64%)     | 1377 (73%)     | 1377 (73%)
Throughput (GOPS) | 565.9        | 134.1          | 83.0           | 137.0      | 645.3           | 1790            | 123.5          | 780.6          | 669.1
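As a quick consistency check, the speedup figures quoted in Section 7.3 can be recomputed directly from the throughput row of Table 2 (values in GOPS); the snippet below is illustrative only:

```python
# Cross-check of the speedups quoted in Section 7.3 against Table 2 (GOPS).
ours = {"AlexNet": 780.6, "VGG16": 669.1}
others = {
    "[18] AlexNet": 134.1, "[21] AlexNet": 83.0, "[7] AlexNet": 565.9,
    "[8] VGG16": 137.0, "[12] VGG16": 645.3, "[21] VGG16": 123.5,
    "[22] VGG16": 1790.0,
}
for name, gops in others.items():
    cnn = name.split()[-1]
    print(f"{name}: {ours[cnn] / gops:.1f}x")
# -> 5.8x, 9.4x, 1.4x, 4.9x, 1.0x, 5.4x, and 0.4x (i.e. [22] is about 2.7x faster).
```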
[Figure 10: Throughput of AlexNet, VGG16 and FCN-16s with CaP-OaA and with OaA, for various image sizes.]

[Figure 11: Actual throughput (normalized) for various N, for AlexNet, VGG16 and FCN-16s.]

REFERENCES
[1] 2015. Intel Inc. Xeon+FPGA Platform for the Data Center. (2015). https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd Intl. Conf. on Field Programmable Logic and Applications (FPL).
[3] Y. H. Chen, J. Emer, and V. Sze. 2017. Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators. IEEE Micro 37, 3 (2017).
[4] Chen Zhang, et al. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM.
[5] Ali Daher, et al. 2010. Overlap-save and overlap-add filters: Optimal design and comparison. IEEE Transactions on Signal Processing 58, 6 (2010).
[6] P. Duhamel and H. Hollmann. 1984. 'Split radix' FFT algorithm. Electronics Letters 20, 1 (January 1984).
[7] Huimin Li, et al. 2016. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL).
[8] Jiantao Qiu, et al. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS '12.
[10] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR.
[11] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2014. Fully Convolutional Networks for Semantic Segmentation. CoRR abs/1411.4038 (2014).
[12] Yufei Ma, et al. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17).
[13] Markus Püschel, et al. 2005. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation" 93 (2005).
[14] A. Podili, C. Zhang, and V. Prasanna. 2017. Fast and efficient implementation of Convolutional Neural Networks on FPGA. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP).
[15] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842 (2014).
[17] Xianyi Zhang, et al. 2017. OpenBLAS. (2017). www.openblas.net
[18] Yufei Ma, et al. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL).
[19] Hanqing Zeng, Ren Chen, and Viktor K. Prasanna. 2017. Optimizing Frequency Domain Implementation of CNNs on FPGAs. Technical Report. University of Southern California. http://ceng.usc.edu/techreports/2017/prasanna%20ceng-2017-3.pdf
[20] Hanqing Zeng, Chi Zhang, and Viktor Prasanna. 2017. Fast Generation of High Throughput Customized Deep Learning Accelerators on FPGAs. In 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig).
[21] C. Zhang and V. Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. In Proceedings of the 2017 ACM/SIGDA Intl. Symp. on Field-Programmable Gate Arrays (FPGA '17).
[22] Jialiang Zhang and Jing Li. 2017. Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network. In Proceedings of the 2017 ACM/SIGDA Intl. Symposium on Field-Programmable Gate Arrays (FPGA '17).