A Framework for Generating High Throughput CNN Implementations on FPGAs
Description: an FPGA hardware acceleration scheme for deep learning that achieves high-throughput CNN implementations. (FPGA '18, February 25-27, Monterey, CA, USA; Session 3: Deep Learning.)
maps. Let b, n and m index into the Batch, fin and fout dimensions. Equation 4 specifies the operations of a convolution layer.

[Table 1: Variation of model parameters: for each CNN, the number of convolution layers and the maximum/minimum of fin, limg and kern.]
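The equation itself is garbled in this extraction. A standard form consistent with the stated indexing (b over Batch, n over fin, m over fout) would be the following hedged reconstruction; the exact notation in the paper may differ:

```latex
% Hedged reconstruction of Equation 4 (the extracted text is garbled).
% out^{(b,m)} is the m-th output map of image b, in^{(b,n)} the n-th input map,
% kern^{(m,n)} the kernel connecting input map n to output map m, * is 2D convolution.
\[
  \mathrm{out}^{(b,\,m)} \;=\; \sum_{n=1}^{f_{in}} \mathrm{in}^{(b,\,n)} \ast \mathrm{kern}^{(m,\,n)},
  \qquad 1 \le b \le \mathrm{Batch},\; 1 \le m \le f_{out}.
\]
```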
3.2 CaP for Reducing Wasted Computations
OaA requires the shape of each partition to be N × N. The analysis of the equation ignores the useless computation on the zero paddings of P. Such an approximation is not always valid, as can be seen in Figure 2a when N = 32 or 64. Two examples are shown in Figure 3. Scenario 1 is for deep layers when limg is small, and scenario 2 happens when limg is larger.

One possible solution is to select an appropriate N which fits the limg of most layers well. The first problem is, this technique significantly …

(2) Duality of the OaA and CaP operations: OaA partitions images, and CaP combines images. OaA processes a set of matrices by overlapping pixels (Step 4 in Figure 1), and CaP processes a set of matrices by padding pixels (Step CaP in Figure 4). Since CaP is a dual of OaA, we can extend the ⊕ operator (Section 2.1). If the superscript x is negative, then we use ⊕ to compute Step b in Figure 1. If x is positive, we use ⊕ to compute I′ = ⊕(D) in CaP.

In summary, we CaP the layer array so that an input of Batch × fin × limg² is reshaped to (Batch/d²) × fin × l′img², where l′img = d · limg + (d − 1)(kern − 1). We then apply OaA to I′. Abbreviate such operations as CaP-OaA.
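The CaP reshaping can be made concrete with a minimal NumPy sketch. It assumes CaP packs a d × d grid of equally sized feature maps with (kern − 1) zero rows/columns between neighbours, which reproduces the l′img = d·limg + (d − 1)(kern − 1) size given above; the function name and layout convention are illustrative, not the paper's code.

```python
import numpy as np

def cap_combine(images, kern):
    """Concatenate-and-Pad (CaP) sketch: pack a d*d batch of limg x limg
    feature maps into one large map, inserting (kern - 1) zero pixels
    between neighbouring images so their convolution responses do not mix."""
    d2, limg, _ = images.shape
    d = int(round(d2 ** 0.5))
    assert d * d == d2, "expects a d*d batch"
    pad = kern - 1
    big = d * limg + (d - 1) * pad          # l'img = d*limg + (d-1)(kern-1)
    out = np.zeros((big, big), dtype=images.dtype)
    for i in range(d):
        for j in range(d):
            r, c = i * (limg + pad), j * (limg + pad)
            out[r:r + limg, c:c + limg] = images[i * d + j]
    return out

# Example: 4 images of size 6x6 and a 3x3 kernel -> one 14x14 image (2*6 + 1*2).
batch = np.arange(4 * 6 * 6, dtype=np.float32).reshape(4, 6, 6)
print(cap_combine(batch, kern=3).shape)     # (14, 14)
```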
[Figure 5: Comparison of computation complexity (native frequency domain convolution, OaA with N = 16 and 32, and CaP-OaA with N = 16 and 32).]

It is worth noticing that the various frequency domain convolution algorithms discussed so far are closely related to each other. CaP-OaA reduces to OaA when d = 1. OaA further reduces to native frequency domain convolution when N ≥ limg + kern − 1. Therefore, CaP-OaA is the most general version among these frequency domain convolution algorithms. CaP-OaA also achieves the highest hardware efficiency.

We further quantitatively analyze the computation complexity of CaP-OaA. CaP introduces a new variable d whose value can be set to approximate the ceiling function in Equation 5. It can be shown that by setting d based on N − (kern − 1) and the gcd of the relevant dimensions (where gcd means Greatest Common Divisor), the complexity of CaP-OaA is …
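The closed-form expression is lost in this extraction. As a rough, assumption-laden proxy for it, the sketch below simply counts N × N FFT tiles per image for plain OaA versus CaP-OaA, searching the batch folding factor d up to the d ≤ 15 bound used later in Section 7.1. When (N − kern + 1) divides the folded image size the ceiling rounds away and the wasted padding disappears, which is what the gcd-based choice of d aims for. This is not the paper's complexity model, only an illustration of the tile-count arithmetic behind Figure 5.

```python
import math

def oaa_tiles(limg, kern, N):
    """Plain OaA: number of N x N FFT tiles for one limg x limg image,
    assuming an output stride of N - kern + 1 per dimension."""
    stride = N - kern + 1
    return math.ceil(limg / stride) ** 2

def cap_oaa_tiles(limg, kern, N, d_max=15):
    """CaP-OaA sketch: fold a d x d batch into one image of size
    d*limg + (d - 1)*(kern - 1), tile it, and amortize over d*d images.
    Returns (best_d, tiles_per_image)."""
    stride = N - kern + 1
    best_d, best = 1, oaa_tiles(limg, kern, N)
    for d in range(2, d_max + 1):
        big = d * limg + (d - 1) * (kern - 1)
        per_image = math.ceil(big / stride) ** 2 / (d * d)
        if per_image < best:
            best_d, best = d, per_image
    return best_d, best

# Gains are largest for small (deep-layer) feature maps and accumulate
# over all layers of a network.
for limg in (224, 240, 15):
    print(limg, oaa_tiles(limg, 3, 16), cap_oaa_tiles(limg, 3, 16))
```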
3.3 Frequency Domain Loop Tiling
The CaP-OaA technique manipulates the data dimensions limg and kern. To block the data of convolution layers into identical shapes, we still need optimization on the fin and fout dimensions.

We revisit Algorithm 1. Tiling of the loop dimensions in lines 5 and 6 performs partitioning of fin and fout. At runtime, the kernel filters and image data are partitioned into fixed shapes, and the tiles are loaded onto the FPGA. Tiling on top of CaP-OaA makes the data flow of diverse CNNs on a target device identical to each other. The tiling factor f is the same for various convolution layers. After CaP-OaA transforms the kernel filters and images to a uniform N × N shape, the value of f becomes independent of the CNN model parameters, and is solely bound by the on-chip memory size. The motivation for loop tiling is to reduce the communication volume to external memory by increased reuse of on-chip data [4]. For frequency domain convolution, a tradeoff exists between N and f to balance computation complexity and data reuse. Analysis of the algorithm-architecture co-design is given in Section 5.

Although loop optimization for CNNs on FPGAs has been extensively studied, previous work [4, 8, 12] focused on convolution in the space domain. Existing techniques cannot be directly applied to frequency domain CNNs, since the data flow of sliding window operations is different from that of Hadamard product operations. On the other hand, our three techniques proposed in Sections 3.1, 3.2 and 3.3 can all be understood as loop optimizations in the frequency domain: OaA is analogous to loop tiling of limg, and CaP is analogous to loop tiling and unrolling of the Batch dimension.

With the optimizations in Sections 3.1, 3.2 and 3.3, we derive …
[Figure 7: Design chart for hardware mapping. The x-axis is the FFT size N; design points fall in computation-bound and communication-bound regions.]

Devices with their roofline intersections falling at the blue vertical lines (N = 8, 16, 32, 64) have perfectly balanced resources in terms of L, M and B (e.g., devices with K_FPGA = 0.2, 0.5 or 1.18). For devices with K_FPGA falling between 0.2 to 0.35, 0.52 to 1.05, and 1.45 to 2.37, the designs are computation bound; the optimal N is 8, 16 and 32 respectively (these are shown by the three red marks a, b, c). Similarly, for other devices, the designs are communication bound (design points falling between b and b', and c and c'). Using the design chart, we can identify target devices that are best suited for our architecture.
6 AUTOMATIC CODE GENERATION
We have developed a tool [20] to automatically generate the architecture on the target device. Figure 8 shows the workflow of the tool. The inputs are the CNN model parameters of each convolution layer (limg, kern, fin and fout) and the meta data of the target device (B, L and M). The outputs include C++ code for book-keeping the data blocks (lines 1-5 and 14-18, Algorithm 2), and synthesizable Verilog performing the computationally expensive convolution (lines 6-13, Algorithm 2). The Mapping Engine feeds the CaP-OaA parameters (N, d) and the tiling factor (f) into the Software Generation Engine, and feeds architectural parameters into the Hardware Generation Engine. Optionally, users can specify additional constraints to the tool, such as the available FFT sizes and the maximum d.

[Figure 8: Tool workflow (Software Generation Engine emitting C++; Hardware Generation Engine emitting Verilog; Assembler).]

Software Generation. Although the optimal batch folding factor d varies across convolution layers, we use a uniform d for all layers of a CNN in the implementation. This ensures that the output of the previous layer can be directly fed into the following layer without further layout rearrangement.

Hardware Generation. The 2D FFT module consists of 1D FFT pipelines and Streaming Permutation Networks (SPN) for matrix transpose. We take the 1D FFT template from [13]. The SPN is a folded CLOS network including two spatial permutation stages and one temporal permutation stage. We implement the in-place permutation in time algorithm [2] to generate the control bits. The HAC includes a memory controller to fetch data from BUFI and BUFK. Since our architecture does not involve any runtime reconfiguration, the Hardware Generation Engine statically computes all the SPN control bits and HAC input addresses at design time. The Assembler connects the 2D FFT, 2D IFFT, HAC, BUFI and BUFK based on Figure 6.
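The decomposition of the 2D FFT into 1D FFT pipelines plus transposes (realized on chip by the SPN) can be checked numerically. The NumPy sketch below is purely illustrative and is unrelated to the generated Verilog:

```python
import numpy as np

# The accelerator builds its 2D FFT from 1D FFT pipelines and a matrix
# transpose. Numerically, that is: FFT2(X) = T( FFT_rows( T( FFT_rows(X) ) ) ).
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))

row_fft = np.fft.fft(X, axis=1)              # 1D FFTs over rows
two_d = np.fft.fft(row_fft.T, axis=1).T      # transpose, row FFTs, transpose back

assert np.allclose(two_d, np.fft.fft2(X))
print("2D FFT == row FFTs + transpose + row FFTs:", np.allclose(two_d, np.fft.fft2(X)))
```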
7 EXPERIMENTAL RESULTS

7.1 Experimental Setup
We use the Intel Heterogeneous Research Platform (HARP) [1] for evaluation. HARP has shared memory accessible to the CPU and FPGA. The FPGA is an Intel Stratix V GXA7 device, with 5 GB/s bandwidth to external memory, 6.25 MB on-chip memory, 256 DSPs and 234720 ALMs. The CPU of HARP is a 10-core Intel Xeon E5-2600 v2 processor. We use 16-bit fixed-point data representation to compute CNNs. The designs were synthesized by Quartus II (version 13.1.0).

In the following, throughput is calculated as the total number of operations for spatial convolution divided by the average execution time per image for our frequency domain approach. The numerator based on spatial convolution lets us make a fair comparison with other works. The denominator is the actual execution time on HARP.
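To make this metric concrete, here is a small sketch, assuming the common convention of counting one multiply-accumulate of spatial convolution as two operations (the paper's exact counting convention is not stated in this excerpt); the example layer is hypothetical:

```python
def spatial_conv_ops(layers):
    """Operation count of direct (spatial) convolution for one image,
    counting each multiply-accumulate as 2 operations (assumed convention)."""
    return sum(2 * lyr["kern"] ** 2 * lyr["fin"] * lyr["fout"] * lyr["out"] ** 2
               for lyr in layers)

def throughput_gops(layers, avg_time_per_image_s):
    # Numerator: spatial-convolution operations (for fair comparison with
    # other works); denominator: measured frequency-domain time per image.
    return spatial_conv_ops(layers) / avg_time_per_image_s / 1e9

# e.g. one hypothetical 3x3 layer, 256 -> 256 maps, 14 x 14 outputs:
layer = {"kern": 3, "fin": 256, "fout": 256, "out": 14}
print(throughput_gops([layer], avg_time_per_image_s=0.5e-3))
```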
The architecture for all CNNs under evaluation is configured as N = 16, f = 64, Uk = 3, Frow = 4, Fcol = 16. We set an upper limit for d (≤ 15) to bound the batch size.

For the workload distribution between FPGA and CPU: the FPGA executes all convolution layers of AlexNet, VGG16 and FCN-16s except the first convolution layer of AlexNet, while the CPU executes all the remaining layers (pooling, ReLU, fully connected, and the first convolution of AlexNet). In summary, the CPU executes 15%, 1% and 1% of the total computation for AlexNet, VGG16 and FCN-16s respectively. We implement the first convolution layer of AlexNet using the BLAS [17] library. By a simple batch processing pipeline, the execution time of the CPU is completely hidden by the FPGA.

7.2 Impact of Algorithmic Optimizations
To vary the input image size, we use AlexNet and VGG16 to execute feature extraction by skipping their fully-connected layers. We execute all layers of FCN-16s.

Effect of CaP. We use the architecture configuration as specified in Section 7.1. We vary limg of the first convolution layer from 160
to 304 for AlexNet and VGG16, and from 320 to 608 for FCN-16s (in other words, limg of the last convolution layer for the three CNNs varies from 10 to 19). Figure 9 shows the comparison of computation complexity for frequency domain convolution using CaP-OaA, OaA and spatial convolution. Each bar is vertically stacked by the number of operations for each convolution layer of the CNN. Figure 10 shows the comparison of the measured throughput on HARP.

When limg is divisible by (N − kern + 1) (e.g., limg = 224 for AlexNet and VGG16), the performance of OaA is identical to CaP-OaA. However, in other cases, CaP-OaA delivers much better performance than OaA. For example, when limg = 240 for AlexNet and VGG16, and limg = 352 for FCN-16s, CaP-OaA leads to 2.3×, 1.5× and 1.7× complexity reduction, and 2.3×, 1.5× and 1.7× throughput improvement. Furthermore, we observe that the performance of OaA is highly sensitive to image sizes. For AlexNet and VGG16, performance drops significantly when the image size increases from 224 to 240. This reflects the padding effect of OaA.
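As a concrete check of the divisibility condition, assume the Section 7.1 configuration N = 16 with a 3 × 3 kernel, so the OaA output stride is N − kern + 1 = 14. For limg = 224 = 14 × 16 the image tiles exactly, so CaP has nothing to recover; for limg = 240 = 14 × 17 + 2, plain OaA must round up to 18 tiles per dimension, and the last row and column of tiles are mostly zero padding. It is this wasted work that CaP-OaA reduces.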
Effect of N. Next, we experiment with how selecting various N affects the throughput of the system. Since the parameters N and f are together dependent on the on-chip memory size, by varying N we are exploring the effect of loop tiling as well. Figure 11 shows the normalized throughput of the three CNNs on the design chart when using N = 8, 16, 32. The corresponding f values are 128, 64 and 32. As predicted by the design chart, N = 16 is the best configuration on the Stratix-V GXA7 device. When N = 8, the increased computation complexity degrades the performance. When N = 32, the low data reuse makes external bandwidth the bottleneck. Furthermore, despite the dramatically different network structures, the normalized throughput of the three CNNs is very close to each other. This demonstrates the effect of our algorithmic optimization.

7.3 Comparison with State-of-the-Art
For AlexNet and VGG16, we use the ImageNet dataset (limg = 224). Table 2 summarizes the comparison with state-of-the-art designs. All the designs except [21] use similar or lower precision data representation than our designs. In [21], frequency domain convolution using the OaA technique was employed. However, their analysis was based on a metric called "delay-multiplier product", evaluating convolution of a single image rather than a complete layer. Using the same FPGA, we show 9.4× (AlexNet) and 5.4× (VGG16) speedup in throughput as a result of a deeper analysis of frequency domain convolution. All other works are based on spatial convolution.

Compared with [18], which uses the same target FPGA and data representation as this project, we achieve 5.8× speedup. Compared with [7], [8] and [12], when we use the same data representation (16-bit fixed point), our designs achieve 1.4×, 4.9× and 1.0× speedup, even though our target device has 14.0×, 3.4× and 5.9× fewer DSP resources. Using a device with 5.9× more DSPs, [22] achieves 2.7× higher throughput than us. One main reason is the difference in the clock rate. We cannot achieve a higher clock rate, since HARP requires the FPGA to operate at exactly 200 MHz.

To understand such significant improvement in throughput, we use [18] as an example to show the improvement breakdown. Out of the 5.8× improvement, approximately 3× comes from the reduction in computation complexity (Figure 2b). The remaining 2× comes from the clock rate improvement. The Hadamard product operation leads to a much smaller number of operations and a much simpler data flow compared with the sliding window operation.

To the best of our knowledge, this is the first work that accelerates FCN-16s on FPGAs. As shown in Figure 10, a throughput of approximately 550 GOPS is achieved for images of various sizes.

8 RELATED WORK
Accelerating spatial convolution has been extensively studied from the perspective of loop operation optimization [4, 12] and data flow optimization [3]. Work in [4] proposed a roofline model to capture various techniques including loop tiling, unrolling and interchanging. [12] further optimized performance by a thorough design space exploration. [22] boosted throughput under the OpenCL framework. Spatial convolution based approaches will eventually be bound by the computation complexity of the convolution algorithm. On the other hand, alternatives such as convolution by Winograd transform and frequency domain convolution have been proposed and implemented [10, 14, 21]. Winograd based approaches do not easily generalize to CNNs with various kernel window sizes. While the approaches based on frequency domain convolution are more flexible, further optimizations to [21] can be performed when processing the high dimensional data of convolution layers (this work).

9 CONCLUSION
We presented a framework for generating high throughput CNN accelerators. Combining the CaP, OaA and frequency domain loop tiling techniques, our framework generates architectures accelerating diverse CNNs without runtime reconfiguration.

In the future, we will explore a hybrid algorithm combining convolution in the space and frequency domains. Spatial convolution is as efficient as frequency domain convolution for 1 × 1 kernels. In such cases, we may switch to spatial convolution, which leads to better hardware utilization. In addition, as techniques have been developed to make use of the sparsity in spatial convolution, we will explore whether similar techniques can be applied in the frequency domain.

10 ACKNOWLEDGEMENTS
This work was supported by the US NSF under grants CNS-1643351, ACI-1339756 and CCF-1320211. This work is also supported in part by Intel Strategic Research Alliance funding. The equipment grant from the Intel Hardware Accelerator Research Program is gratefully acknowledged.
[Figure 9: Number of operations performed by various convolution algorithms (CaP-OaA, OaA and spatial) for AlexNet and VGG16 (image sizes 160-304) and FCN-16s (image sizes 320-608).]
Table 2: Comparison with state-of-the-art AlexNet and VGG16 implementations (FX: fixed point, FT: floating point)

Design            | [7]          | [18]           | [21]           | [8]        | [12]            | [22]            | [21]           | Proposed       | Proposed
CNN               | AlexNet      | AlexNet        | AlexNet        | VGG16      | VGG16           | VGG16           | VGG16          | AlexNet        | VGG16
FPGA              | Virtex VC709 | Stratix-V GXA7 | Stratix-V GXA7 | XC7Z045    | Arria-10 GX1150 | Arria-10 GX1150 | Stratix-V GXA7 | Stratix-V GXA7 | Stratix-V GXA7
Frequency (MHz)   | 156          | n/a            | n/a            | 150        | n/a             | n/a             | n/a            | 200            | 200
Precision         | 16-bit FX    | 8-16-bit FX    | 32-bit FT      | 16-bit FX  | 8-16-bit FX     | 16-bit FX       | 32-bit FT      | 16-bit FX      | 16-bit FX
DSP usage         | 2144 (60%)   | 256 (100%)     | 224 (88%)      | 780 (89%)  | 1518 (100%)     | 1378 (91%)      | 224 (88%)      | 256 (100%)     | 256 (100%)
Logic usage       | 274K (63%)   | 121K (52%)     | 200K (85%)     | 183K (84%) | 161K (38%)      | n/a             | 200K (85%)     | 107K (46%)     | 107K (46%)
On-chip RAM       | 956 (65%)    | 1152 (61%)     | 1208 (64%)     | 486 (87%)  | 1900 (70%)      | 1450 (53%)      | 1208 (64%)     | 1377 (73%)     | 1377 (73%)
Throughput (GOPS) | 565.9        | 134.1          | 83.0           | 137.0      | 645.3           | 1790            | 123.5          | 780.6          | 669.1
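As a quick consistency check, the speedup figures quoted in Section 7.3 can be recomputed directly from the throughput row of Table 2 (values in GOPS); the snippet below is illustrative only:

```python
# Cross-check of the speedups quoted in Section 7.3 against Table 2 (GOPS).
ours = {"AlexNet": 780.6, "VGG16": 669.1}
others = {
    "[18] AlexNet": 134.1, "[21] AlexNet": 83.0, "[7] AlexNet": 565.9,
    "[8] VGG16": 137.0, "[12] VGG16": 645.3, "[21] VGG16": 123.5,
    "[22] VGG16": 1790.0,
}
for name, gops in others.items():
    cnn = name.split()[-1]
    print(f"{name}: {ours[cnn] / gops:.1f}x")
# -> 5.8x, 9.4x, 1.4x, 4.9x, 1.0x, 5.4x, and 0.4x (i.e. [22] is about 2.7x faster).
```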
[Figure 10: Throughput of AlexNet, VGG16 and FCN-16s with CaP-OaA and with OaA, for various image sizes.]

[Figure 11: Actual throughput (normalized) for various N, for AlexNet, VGG16 and FCN-16s.]

REFERENCES
[1] 2015. Intel Inc. Xeon+FPGA Platform for the Data Center. (2015). https://www.ece.cmu.edu/calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[2] R. Chen, H. Le, and V. K. Prasanna. 2013. Energy efficient parameterized FFT architecture. In 2013 23rd Intl. Conf. on Field Programmable Logic and Applications (FPL).
[3] Y. H. Chen, J. Emer, and V. Sze. 2017. Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators. IEEE Micro 37, 3 (2017).
[4] Chen Zhang, et al. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). ACM.
[5] Ali Daher, et al. 2010. Overlap-save and overlap-add filters: Optimal design and comparison. IEEE Transactions on Signal Processing 58, 6 (2010).
[6] P. Duhamel and H. Hollmann. 1984. 'Split radix' FFT algorithm. Electronics Letters 20, 1 (January 1984).
[7] Huimin Li, et al. 2016. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL).
[8] Jiantao Qiu, et al. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS '12.
[10] Andrew Lavin. 2015. Fast Algorithms for Convolutional Neural Networks. CoRR.
[11] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2014. Fully Convolutional Networks for Semantic Segmentation. CoRR abs/1411.4038 (2014).
[12] Yufei Ma, et al. 2017. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17).
[13] Markus Püschel, et al. 2005. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation" 93 (2005).
[14] A. Podili, C. Zhang, and V. Prasanna. 2017. Fast and efficient implementation of Convolutional Neural Networks on FPGA. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP).
[15] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. CoRR abs/1409.4842 (2014).
[17] Xianyi Zhang, et al. 2017. OpenBLAS. (2017). www.openblas.net
[18] Yufei Ma, et al. 2016. Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL).
[19] Hanqing Zeng, Ren Chen, and Viktor K. Prasanna. 2017. Optimizing Frequency Domain Implementation of CNNs on FPGAs. Technical Report. University of Southern California. http://ceng.usc.edu/techreports/2017/prasanna%20ceng-2017-3.pdf
[20] Hanqing Zeng, Chi Zhang, and Viktor Prasanna. 2017. Fast Generation of High Throughput Customized Deep Learning Accelerators on FPGAs. In 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig).
[21] C. Zhang and V. Prasanna. 2017. Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. In Proceedings of the 2017 ACM/SIGDA Intl. Symp. on Field-Programmable Gate Arrays (FPGA '17).
[22] Jialiang Zhang and Jing Li. 2017. Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network. In Proceedings of the 2017 ACM/SIGDA Intl. Symposium on Field-Programmable Gate Arrays (FPGA '17).