Description: a paper published by the Cambricon team at ISCA 2016, which designs a general-purpose instruction set for neural networks.

Table I. An overview of Cambricon instructions
Instruction Type | Sub-type | Examples | Operands
Control | | jump, conditional branch | register (scalar value), immediate
Data Transfer | Matrix | matrix load/store/move | register (matrix address/size, scalar value), immediate
Data Transfer | Vector | vector load/store/move | register (vector address/size, scalar value), immediate
Data Transfer | Scalar | scalar load/store/move | register (scalar value), immediate
Computational | Matrix | matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, matrix subtract matrix | register (matrix/vector address/size, scalar value)
Computational | Vector | vector elementary arithmetics (add, subtract, multiply, divide), vector transcendental functions (exponential, logarithmic), dot product, random vector generator, maximum/minimum of a vector | register (vector address/size, scalar value), immediate
Computational | Scalar | scalar elementary arithmetics, scalar transcendental functions | register (scalar value), immediate
Logical | Vector | vector compare (greater than, equal), vector logical operations (and, or, inverter), vector greater than merge | register (vector address/size, scalar)
Logical | Scalar | scalar compare, scalar logical operations | register (scalar), immediate
...with load/store instructions. Cambricon contains 64 32-bit General-Purpose Registers (GPRs) for scalars, which can be used in register-indirect addressing of the on-chip scratchpad memory, as well as temporally keeping scalar data.

Type of Instructions. The Cambricon contains four types of instructions: computational, logical, control, and data transfer instructions. Although different instructions may differ in their numbers of valid bits, the instruction length is fixed to be 64-bit for memory alignment and for design simplicity of the load/store/decoding logic. In this section, we only offer a brief introduction to the control and data transfer instructions, because they are similar to their corresponding MIPS instructions, though they have been adapted to fit NN techniques. For computational instructions (including matrix, vector and scalar instructions) and logical instructions, however, the details will be provided in the next section (Section III).

Control Instructions. The Cambricon has two control instructions, jump and conditional branch, as illustrated in Fig. 1. The jump instruction specifies the offset via either an immediate or a GPR value, which will be accumulated to the Program Counter (PC). The conditional branch instruction specifies the predictor (stored in a GPR) in addition to the offset, and the branch target (either PC + {offset} or PC + 1) is determined by a comparison between the predictor and zero.

Figure 1. Top: Jump instruction. Bottom: Conditional Branch (CB) instruction.

Data Transfer Instructions. Data transfer instructions in Cambricon support variable data size in order to flexibly support matrix and vector computational/logical instructions (see Section III for such instructions). Specifically, these instructions can load/store variable-size data blocks (specified by the data-width operand in data transfer instructions) from/to the main memory to/from the on-chip scratchpad memory, or move data between the on-chip scratchpad memory and scalar GPRs. Fig. 2 illustrates the Vector LOAD (VLOAD) instruction, which can load a vector with the size of Vsize from the main memory to the vector scratchpad memory, where the source address in main memory is the sum of the base address saved in a GPR and an immediate number. The formats of Vector STORE (VSTORE), Matrix LOAD (MLOAD), and Matrix STORE (MSTORE) instructions are similar to that of VLOAD.

Figure 2. Vector Load (VLOAD) instruction (fields: opcode VLOAD, Dest_addr, V_size, Src_base, Src_offset).

On-chip Scratchpad Memory. Cambricon does not use any vector register file, but directly keeps data in on-chip scratchpad memory, which is made visible to programmers/compilers. In other words, the role of on-chip scratchpad memory in Cambricon is similar to that of the vector register file in traditional ISAs, and sizes of vector operands are no longer limited by fixed-width vector register files. Therefore, vector/matrix sizes are variable in Cambricon instructions, and the only notable restriction is that the vector/matrix operands in the same instruction cannot exceed the capacity of the scratchpad memory. In case they do exceed it, the compiler will decompose long vectors/matrices into short pieces/blocks and generate multiple instructions to process them.

Just like the 32x512b vector registers have been baked into Intel AVX-512 [18], capacities of on-chip memories for both vector and matrix instructions must be fixed in Cambricon. More specifically, Cambricon fixes the memory capacity at 64KB for vector instructions and 768KB for matrix instructions. Yet, Cambricon does not impose a specific restriction on the bank numbers of scratchpad memory, leaving significant freedom to microarchitecture-level implementations.
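To make the capacity restriction concrete, here is a small Python sketch (ours, not from the paper; the piece-size arithmetic and helper names are assumptions) of how a compiler might split an oversized vector addition into scratchpad-sized pieces, assuming 16-bit elements as in the prototype of Section IV:

# Hypothetical sketch: splitting a long vector add into scratchpad-sized pieces.
# Assumes 16-bit elements and that the 64KB vector scratchpad is shared by
# two source vectors and one destination vector.

VECTOR_SCRATCHPAD_BYTES = 64 * 1024
BYTES_PER_ELEMENT = 2          # 16-bit fixed point
OPERANDS_PER_VAV = 3           # two sources + one destination

def emit_tiled_vav(total_len):
    """Return a list of pseudo-instructions covering a VAV of total_len elements."""
    piece = VECTOR_SCRATCHPAD_BYTES // (OPERANDS_PER_VAV * BYTES_PER_ELEMENT)
    program = []
    for start in range(0, total_len, piece):
        n = min(piece, total_len - start)
        program.append(("VLOAD", "src0", start, n))
        program.append(("VLOAD", "src1", start, n))
        program.append(("VAV", n))
        program.append(("VSTORE", "dst", start, n))
    return program

print(len(emit_tiled_vav(100_000)))  # several pieces, each fitting the scratchpad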
III. COMPUTATIONAL/LOGICAL INSTRUCTIONS

In neural networks, most arithmetic operations (e.g., additions, multiplications and activation functions) can be aggregated as vector operations [10], [45], and the ratio can be as high as 99.992% according to our quantitative observations on a state-of-the-art convolutional neural network (GoogLeNet) winning the 2014 ImageNet competition (ILSVRC14) [43]. In the meantime, we also discover that 99.791% of the vector operations (such as the dot product operation) in the GoogLeNet can be aggregated further as matrix operations (such as vector-matrix multiplication). In a nutshell, NNs can be naturally decomposed into scalar, vector, and matrix operations, and the ISA design must effectively take advantage of the potential data-level parallelism and data locality.

A. Matrix Instructions

We conduct a thorough and comprehensive review of existing NN techniques, and design a total of six matrix instructions for Cambricon. Here we take a Multi-Layer Perceptron (MLP) [50], a well-known and representative NN, as an example, and show how it is supported by the matrix instructions. Technically, an MLP usually has multiple layers, each of which computes the values of some neurons (i.e., output neurons) according to some neurons whose values are known (i.e., input neurons). We illustrate the feedforward run of one such layer in Fig. 3. More specifically, the output neuron y_i (i = 1, 2, 3) in Fig. 3 can be computed as y_i = f(Σ_{j=1}^{3} w_ij x_j + b_i), where x_j is the j-th input neuron, w_ij is the weight between the i-th output neuron and the j-th input neuron, b_i is the bias of the i-th output neuron, and f is the activation function. The output neurons can be computed as a vector y = (y1, y2, y3):

y = f(Wx + b),    (1)

where x = (x1, x2, x3) and b = (b1, b2, b3) are vectors of input neurons and biases, respectively, W = (w_ij) is the weight matrix, and f is the element-wise version of the activation function (see Section III-B).

Figure 3. Typical operations in NNs (the feedforward run of one MLP layer).

A critical step in Eq. 1 is to compute Wx, which will be performed by the Matrix-Mult-Vector (MMV) instruction in Cambricon. We illustrate this instruction in Fig. 4, where Reg0 specifies the base scratchpad memory address of the vector output (Vout_addr); Reg1 specifies the size of the vector output (Vout_size); and Reg2, Reg3, and Reg4 specify the base address of the matrix input (Min_addr), the base address of the vector input (Vin_addr), and the size of the vector input (Vin_size; note that it is variable), respectively. The MMV instruction can support matrix-vector multiplication at arbitrary scales, as long as all the input and output data can be kept simultaneously in the scratchpad memory. We choose to compute Wx with the dedicated MMV instruction instead of decomposing it into multiple vector dot products, because the latter approach requires additional efforts (e.g., explicit synchronization, concurrent read/write requests to the same address) to reuse the input vector x among different row vectors of W, which is less efficient.

Figure 4. Matrix-Mult-Vector (MMV) instruction (fields: opcode MMV, Vout_addr, Vout_size, Min_addr, Vin_addr, Vin_size).

Unlike the feedforward case, however, the MMV instruction no longer provides efficient support to the backward training process of an NN. More specifically, a critical step of the well-known Back-Propagation (BP) algorithm is to compute the gradient vector [20], which can be formulated as a vector multiplied by a matrix. If we implemented it with the MMV instruction, we would need an additional instruction implementing matrix transpose, which is rather expensive in data movements. To avoid that, Cambricon provides a Vector-Mult-Matrix (VMM) instruction that is directly applicable to the backward training process. The VMM instruction has the same fields as the MMV instruction, except for the opcode.

Moreover, in training an NN, the weight matrix W often needs to be incrementally updated with W = W + η·ΔW, where η is the learning rate and ΔW is estimated as the outer product of two vectors. Cambricon provides an Outer-Product (OP) instruction (the output is a matrix), a Matrix-Mult-Scalar (MMS) instruction, and a Matrix-Add-Matrix (MAM) instruction to collaboratively perform the weight update. In addition, Cambricon also provides a Matrix-Subtract-Matrix (MSM) instruction to support the weight update in the Restricted Boltzmann Machine (RBM) [39].
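As a non-normative illustration, the following NumPy sketch emulates the semantics that OP, MMS, and MAM jointly realize for the update W = W + η·ΔW; the helper names are ours, not Cambricon mnemonics:

import numpy as np

# Each helper mirrors the described semantics of one Cambricon matrix
# instruction; these are illustrative emulations, not the ISA itself.

def op_outer_product(v0, v1):          # OP: the output is a matrix
    return np.outer(v0, v1)

def mms_matrix_mult_scalar(m, s):      # MMS: scale every matrix element
    return m * s

def mam_matrix_add_matrix(m0, m1):     # MAM: element-wise matrix addition
    return m0 + m1

# W = W + eta * (x outer delta): the incremental weight update used in training.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x, delta = rng.standard_normal(3), rng.standard_normal(4)
eta = 0.01

dW = op_outer_product(x, delta)                                # OP
W = mam_matrix_add_matrix(W, mms_matrix_mult_scalar(dW, eta))  # MMS, then MAM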
B. Vector Instructions

Using Eq. 1 as an example, one can observe that the matrix instructions defined in the prior subsection are still insufficient to perform all the computations: we still need to add up the vector output of Wx and the bias vector b, and then apply an element-wise activation to Wx + b.

While Cambricon directly provides a Vector-Add-Vector (VAV) instruction for vector additions, it requires multiple instructions to support the element-wise activation. Without losing any generality, here we take the widely used sigmoid activation, f(a) = e^a / (1 + e^a), as an example. The element-wise sigmoid activation performed on each element of an input vector (say, a) can be decomposed into 3 consecutive steps, supported by 3 instructions, respectively:
1. Computing the exponential e^{a_i} for each element (a_i, i = 1, ..., n) in the input vector a. Cambricon provides a Vector-Exponential (VEXP) instruction for the element-wise exponential of a vector.

2. Adding the constant 1 to each element of the vector (e^{a_1}, ..., e^{a_n}). Cambricon provides a Vector-Add-Scalar (VAS) instruction, where the scalar can be an immediate or specified by a GPR.

3. Dividing e^{a_i} by 1 + e^{a_i} for each vector index i = 1, ..., n. Cambricon provides a Vector-Div-Vector (VDV) instruction for element-wise division between vectors.
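The three-step decomposition above can be checked with a small NumPy emulation (ours, for illustration); each helper stands in for one instruction:

import numpy as np

def vexp(a):            # VEXP: element-wise exponential
    return np.exp(a)

def vas(v, s):          # VAS: add a scalar to every element
    return v + s

def vdv(v0, v1):        # VDV: element-wise division
    return v0 / v1

a = np.array([-2.0, 0.0, 2.0])
e = vexp(a)                       # step 1: e^{a_i}
denom = vas(e, 1.0)               # step 2: 1 + e^{a_i}
y = vdv(e, denom)                 # step 3: e^{a_i} / (1 + e^{a_i})

assert np.allclose(y, 1.0 / (1.0 + np.exp(-a)))   # matches the sigmoid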
However, the sigmoid is not the only activation function utilized by existing NNs. To implement element-wise versions of various activation functions, Cambricon provides a series of vector arithmetic instructions, such as Vector-Mult-Vector (VMV), Vector-Sub-Vector (VSV), and Vector-Logarithm (VLOG). During the design of a hardware accelerator, instructions related to different transcendental functions (e.g., logarithmic, trigonometric and anti-trigonometric functions) can efficiently reuse the same functional block (involving addition, shift, and table-lookup operations), using the CORDIC technique [24]. Moreover, there are activation functions (e.g., max(0, a) and |a|) that partially rely on logical operations (e.g., comparison), and we will present the related Cambricon instructions (e.g., vector compare instructions) in Section III-C.

Furthermore, random vector generation is an important operation common to many NN techniques (e.g., dropout [8] and random sampling [39]), but it is not deemed a necessity in traditional linear algebra libraries designed for scientific computing (e.g., the BLAS library does not include this operation). Cambricon provides a dedicated instruction (Random-Vector, RV) that generates a vector of random numbers obeying the uniform distribution on the interval [0, 1]. Given uniform random vectors, we can further generate random vectors obeying other distributions (e.g., the Gaussian distribution) using the Ziggurat algorithm [31], with the help of vector arithmetic instructions and vector compare instructions in Cambricon.

C. Logical Instructions

The state-of-the-art NN techniques leverage a few operations that incorporate comparisons or other logical manipulations. The max-pooling operation is one such operation (see Fig. 5a for an illustration), which seeks the neuron having the largest output among the neurons within a pooling window, and repeats this action for corresponding pooling windows in different input feature maps (see Fig. 5b).

Figure 5. Max-pooling operation (a: a pooling window over an input feature map; b: multiple input/output feature maps; c: aggregation across input feature maps).

Cambricon supports the max-pooling operation with a Vector-Greater-Than-Merge (VGTM) instruction, see Fig. 6. The VGTM instruction designates each element of the output vector (Vout) by comparing corresponding elements of the input vector-0 (Vin0) and input vector-1 (Vin1), i.e., Vout[i] = (Vin0[i] > Vin1[i]) ? Vin0[i] : Vin1[i]. We present the Cambricon code of the max-pooling operation in Section III-E, which aggregates neurons at the same position of all input feature maps in the same input vector, iteratively performs VGTM, and obtains the final result (see also Fig. 5c for an illustration).

Figure 6. Vector-Greater-Than-Merge (VGTM) instruction (fields: opcode VGTM, Vout_addr, Vout_size, Vin0_addr, Vin1_addr).
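A minimal emulation (ours) of the VGTM semantics and of the iterative max-pooling pattern just described; the paper's actual Cambricon code appears in Section III-E:

import numpy as np

def vgtm(vin0, vin1):
    # VGTM: Vout[i] = (Vin0[i] > Vin1[i]) ? Vin0[i] : Vin1[i]
    return np.where(vin0 > vin1, vin0, vin1)

# One pooling window: each row holds the neurons at one (x, y) position
# across all input feature maps, so VGTM reduces the window element-wise.
window = np.array([[1.0, 5.0, 3.0],
                   [4.0, 2.0, 6.0],
                   [0.0, 7.0, 1.0]])   # 3 window positions x 3 feature maps

out = np.full(3, -np.inf)              # running per-feature-map maximum
for position in window:
    out = vgtm(position, out)          # iterative VGTM, as in Fig. 5c

assert np.array_equal(out, window.max(axis=0))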
In addition to the vector computational instructions, Cambricon also provides a Vector-Greater-Than (VGT) instruction, a Vector-Equal (VE) instruction, Vector AND/OR/NOT instructions (VAND/VOR/VNOT), scalar comparison, and scalar logical instructions to tackle branch conditions, i.e., computing the predictor for the aforementioned Conditional Branch (CB) instruction.

D. Scalar Instructions

Although we have observed that only 0.008% of the arithmetic operations of the GoogLeNet [43] cannot be supported with the matrix and vector instructions in Cambricon, there are also scalar operations that are indispensable to NNs, such as elementary arithmetic operations and scalar transcendental functions. We summarize them in Table I; they have been formally defined as Cambricon's scalar instructions.
E. Code Examples

To illustrate the usage of our proposed instruction set, we implement three simple yet representative components of NNs, an MLP feedforward layer [50], a pooling layer [22], and a Boltzmann Machines (BM) layer [39], using Cambricon instructions. For the sake of brevity, we omit scalar load/store instructions for all three layers, and only show the program fragment of a single pooling window (with multiple input and output feature maps) for the pooling layer. We illustrate the concrete Cambricon program fragments in Fig. 7, and we observe that the code density of Cambricon is significantly higher than that of x86 and MIPS (see Section V for a comprehensive evaluation).

MLP code:

// $0: input size, $1: output size, $2: matrix size
// $3: input address, $4: weight address
// $5: bias address, $6: output address
// $7-$10: temp variable address
VLOAD  $3, $0, #100        // load input vector from address (100)
MLOAD  $4, $2, #300        // load weight matrix from address (300)
MMV    $7, $1, $4, $3, $0  // Wx
VAV    $8, $1, $7, $5      // tmp = Wx + b
VEXP   $9, $1, $8          // exp(tmp)
VAS    $10, $1, $9, #1     // 1 + exp(tmp)
VDV    $6, $1, $9, $10     // y = exp(tmp) / (1 + exp(tmp))
VSTORE $6, $1, #200        // store output vector to address (200)
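For readability, here is what the MLP fragment computes, rewritten as a NumPy equivalent (our sketch; scratchpad addresses and registers drop out), namely y = sigmoid(Wx + b):

import numpy as np

def mlp_layer(W, x, b):
    # MMV: Wx; VAV: + b; VEXP/VAS/VDV: exp(tmp) / (1 + exp(tmp))
    tmp = W @ x + b
    e = np.exp(tmp)
    return e / (1.0 + e)

W = np.array([[0.1, 0.2], [0.3, 0.4]])
x = np.array([1.0, -1.0])
b = np.array([0.05, -0.05])
print(mlp_layer(W, x, b))   # the output vector y of the layer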
Pooling code:

// $0: feature map size, $1: input data size
// $2: output data size, $3: pooling window size
// $4: x-axis loop num, $5: y-axis loop num
// $6: input addr, $7: output addr
// $8: y-axis stride of input
    VLOAD  $6, $1, #100  // load input neurons from address (100)
    SMOVE  $5, $3        // init y
L0: SMOVE  $4, $3        // init x
L1: VGTM   $7, $0, $6, $7
    // for each feature map m: output[m] = (input[x][y][m] > output[m]) ? input[x][y][m] : output[m]
    SADD   $6, $6, $0    // update input address
    SADD   $4, $4, #-1   // x--
    CB     #L1, $4       // if (x > 0) goto L1
    SADD   $6, $6, $8    // update input address
    SADD   $5, $5, #-1   // y--
    CB     #L0, $5       // if (y > 0) goto L0
    VSTORE $7, $2, #200  // store output neurons to address (200)
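The same control flow in array form (our sketch): the two CB-terminated loops walk the pooling window, while VGTM keeps a running per-feature-map maximum:

import numpy as np

def max_pool_window(inp):
    # inp[x][y][m]: neuron of feature map m at position (x, y) of one window.
    xs, ys, num_maps = inp.shape
    out = np.full(num_maps, -np.inf)
    for y in range(ys):                # outer loop, closed by CB #L0, $5
        for x in range(xs):            # inner loop, closed by CB #L1, $4
            out = np.where(inp[x, y] > out, inp[x, y], out)   # VGTM
    return out

window = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
assert np.array_equal(max_pool_window(window), window.reshape(-1, 3).max(axis=0))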
BM code:

// $0: visible vector size, $1: hidden vector size, $2: v-h matrix (W) size
// $3: h-h matrix (L) size, $4: visible vector address, $5: W address
// $6: L address, $7: bias address, $8: hidden vector address
// $9-$17: temp variable address
VLOAD  $4, $0, #100         // load visible vector from address (100)
VLOAD  $9, $1, #200         // load hidden vector from address (200)
MLOAD  $5, $2, #300         // load W matrix from address (300)
MLOAD  $6, $3, #400         // load L matrix from address (400)
MMV    $10, $1, $5, $4, $0  // Wv
MMV    $11, $1, $6, $9, $1  // Lh
VAV    $12, $1, $10, $11    // Wv + Lh
VAV    $13, $1, $12, $7     // tmp = Wv + Lh + b
VEXP   $14, $1, $13         // exp(tmp)
VAS    $15, $1, $14, #1     // 1 + exp(tmp)
VDV    $16, $1, $14, $15    // y = exp(tmp) / (1 + exp(tmp))
RV     $17, $1              // for each i: r[i] = random(0, 1)
VGT    $8, $1, $17, $16     // for each i: h[i] = (r[i] > y[i]) ? 1 : 0
VSTORE $8, $1, #500         // store hidden vector to address (500)

Figure 7. Cambricon program fragments of MLP, pooling, and BM.
IV. A PROTOTYPE ACCELERATOR

In this section, we present a prototype accelerator of Cambricon. We illustrate the design in Fig. 8, which contains seven major instruction pipeline stages: fetching, decoding, issuing, register reading, execution, writing back, and committing. We use mature techniques such as scratchpad memory and DMA in this accelerator, since we found that these classic techniques have been sufficient to reflect the flexibility (Section V-B1), conciseness (Section V-B2) and efficiency (Section V-B3) of the ISA. We did not seek to explore emerging techniques (such as 3D stacking [51] and non-volatile memory [47], [46]) in our prototype design, but left such exploration as future work, because we believe that a promising ISA must be easy to implement and should not be tightly coupled with emerging techniques.

Figure 8. A prototype accelerator based on Cambricon (blocks visible in the figure: reorder buffer, L1 cache, vector functional unit with vector scratchpad memory and DMAs, matrix functional unit with matrix scratchpad memory and DMAs).

As illustrated in Fig. 8, after the fetching and decoding stages, an instruction is injected into an in-order issue queue. After successfully fetching the operands (scalar data, or address/size of vector/matrix data) from the scalar register file, an instruction will be sent to different units depending on the instruction type. Control instructions and scalar computational/logical instructions will be sent to the scalar functional unit for direct execution. After writing back to the scalar register file, such an instruction can be committed from the reorder buffer (note 1) as long as it has become the oldest uncommitted yet executed instruction.

Data transfer instructions, vector/matrix computational instructions, and vector logical instructions, which may access the L1 cache or the scratchpad memories, will be sent to the Address Generation Unit (AGU). Such an instruction needs to wait in an in-order memory queue to resolve potential memory dependencies (note 2) with earlier instructions in the memory queue. After that, load/store requests of scalar data transfer instructions will be sent to the L1 cache; data transfer/computational/logical instructions for vectors will be sent to the vector functional unit; and data transfer/computational instructions for matrices will be sent to the matrix functional unit. After the execution, such an instruction can be retired from the memory queue, and then be committed from the reorder buffer as long as it has become the oldest uncommitted yet executed instruction.

Note 1: We need a reorder buffer even though instructions are issued in order, because the execution stages of different instructions may take significantly different numbers of cycles.
Note 2: Here we say two instructions are memory dependent if they access an overlapping memory region, and at least one of them needs to write the memory region.
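The memory-dependence test of note 2 can be stated compactly; a sketch (ours) over byte ranges [base, base + size):

def regions_overlap(base0, size0, base1, size1):
    """True if the byte ranges [base, base+size) intersect."""
    return base0 < base1 + size1 and base1 < base0 + size0

def memory_dependent(instr0, instr1):
    # Each instruction is modeled as (base_address, size_in_bytes, is_write).
    (b0, s0, w0), (b1, s1, w1) = instr0, instr1
    # Dependent iff the regions overlap and at least one access is a write.
    return regions_overlap(b0, s0, b1, s1) and (w0 or w1)

# A later load of 0x100..0x13F must wait for an earlier store to 0x120..0x15F.
assert memory_dependent((0x120, 0x40, True), (0x100, 0x40, False))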
The accelerator implements both vector and matrix functional units. The vector unit contains 32 16-bit adders and 32 16-bit multipliers, and is equipped with a 64KB scratchpad memory. The matrix unit contains 1024 multipliers and 1024 adders, and has been divided into 32 separate computational blocks to avoid excessive wire congestion and power consumption on long-distance data movements. Each computational block is equipped with a separate 24KB scratchpad. The 32 computational blocks are connected through an h-tree bus that serves to broadcast input values to each block and to collect output values from each block.

A notable Cambricon feature is that it does not use any vector register file, but keeps data in on-chip scratchpad memories. To efficiently access the scratchpad memories, the vector/matrix functional unit of the prototype accelerator integrates three DMAs, each of which corresponds to one vector/matrix input/output of an instruction. In addition, the scratchpad memory is equipped with an IO DMA. However, each scratchpad memory itself only provides a single port for each bank, but may need to address up to four concurrent read/write requests. We design a specific structure for the scratchpad memory to tackle this issue (see Fig. 9). Concretely, we decompose the memory into four banks according to the low-order two bits of addresses, and connect them with four read/write ports via a crossbar, guaranteeing that no bank will be simultaneously accessed. Thanks to the dedicated hardware support, Cambricon does not need an expensive multi-port vector register file, and can flexibly and efficiently support different data widths using the on-chip scratchpad memory.

Figure 9. Structure of the matrix scratchpad memory (matrix DMAs and an IO DMA feed four read/write ports, connected via a crossbar to banks Bank-00, Bank-01, Bank-10, and Bank-11).
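As an illustration of the bank decomposition (our sketch, with an assumed addressing granularity), the low-order two bits of an address select one of four banks, so up to four concurrent requests proceed in parallel whenever their addresses fall into pairwise distinct banks:

def bank_of(addr):
    return addr & 0b11          # low-order two bits select the bank

def conflict_free(addrs):
    """True if the (up to four) concurrent requests hit pairwise distinct banks."""
    banks = [bank_of(a) for a in addrs]
    return len(set(banks)) == len(banks)

# Three DMA requests plus one IO-DMA request, one per bank: no conflict.
assert conflict_free([0x1000, 0x2001, 0x3002, 0x4003])
# Two requests falling into bank 0 would have to be serialized.
assert not conflict_free([0x1000, 0x2000])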
V. EXPERIMENTAL EVALUATION

In this section, we first describe the evaluation methodology, and then present the experimental results.

A. Methodology

Design evaluation. We synthesize the prototype accelerator of Cambricon (Cambricon-ACC, see Section IV) with Synopsys Design Compiler using the TSMC 65nm GP standard VT library, place and route the synthesized design with the Synopsys ICC compiler, simulate and verify it with Synopsys VCS, and estimate the power consumption with Synopsys Prime-Time PX according to the simulated Value Change Dump (VCD) file. We are planning an MPW tape-out of the prototype accelerator with a small area budget of 60 mm^2 at a 65nm process, with a targeted operating frequency of 1 GHz. Therefore, we adopt moderate functional unit sizes and scratchpad memory capacities in order to fit the area budget. Table II shows the details of the design parameters.

Table II. Parameters of our prototype accelerator

issue width |
depth of issue queue |
depth of memory queue |
depth of reorder buffer |
capacity of vector scratchpad memory | 64KB
capacity of matrix scratchpad memory | 768KB (24KB x 32)
bank width of scratchpad memory | 512 bits (32 x 16-bit fixed point)
operators in matrix function unit | 1024 (32 x 32) multipliers and adders
operators in vector function unit | 32 multipliers/dividers/adders/transcendental function operators
Baselines. We compare the Cambricon-ACC with three baselines. The first two are based on a general-purpose CPU and a GPU, and the last one is a state-of-the-art NN hardware accelerator.

CPU. The CPU baseline is an x86 CPU with 256-bit SIMD support (Intel Xeon E5-2620, 2.10GHz, 64 GB memory). We use the Intel MKL library [19] to implement vector and matrix primitives for the CPU baseline, and GCC v4.7.2 to compile all benchmarks, with the options -O2 -lm -march=native to enable SIMD instructions.

GPU. The GPU baseline is a modern GPU card (NVIDIA K40M, 12GB GDDR5, 4.29 TFlops peak at a 28nm process); we implement all benchmarks (see below) with the NVIDIA cuBLAS library [35], a state-of-the-art linear algebra library for GPU.

NN Accelerator. The baseline accelerator is DaDianNao, a state-of-the-art NN accelerator exhibiting remarkable energy-efficiency improvement over a GPU [5]. We re-implement the DaDianNao architecture at a 65nm process, but replace all eDRAMs with SRAMs because we do not have a 65nm eDRAM library. In addition, we re-size DaDianNao such that it has a comparable amount of arithmetic operators and on-chip SRAM capacity as our design, which enables a fair comparison of the two accelerators under the area budget (<60 mm^2) mentioned in the previous paragraph. The re-implemented version of DaDianNao has a single central tile and a total of 32 leaf tiles. The central tile has 64KB SRAM, 32 16-bit adders and 32 16-bit multipliers. Each leaf tile has 24KB SRAM, 32 16-bit adders and 32 16-bit multipliers. In other words, the total numbers of adders and multipliers, as well as the total SRAM capacity, of the re-implemented DaDianNao are the same as those of our prototype accelerator. Although we are constrained to give up eDRAMs in both accelerators, this is still a fair and reasonable experimental setting, because the flexibility of an accelerator is mainly determined by its ISA, not the concrete devices it integrates. In this sense, the flexibility gained from Cambricon will still be there even when we resort to large eDRAMs to remove main memory accesses and improve the performance of both accelerators.

Benchmarks. We take 10 representative NN techniques as our benchmarks, see Table III. Each benchmark is translated manually into assembly to execute on Cambricon-ACC and DaDianNao. We evaluate their cycle-level performance with Synopsys VCS.

B. Experimental Results

We compare Cambricon and Cambricon-ACC with the baselines in terms of metrics such as performance and energy. We also provide the detailed layout characteristics of the prototype accelerator.

1) Flexibility: In view of the apparent flexibility provided by general-purpose ISAs (e.g., x86, MIPS and GPU-ISA), here we restrict our discussion to ISAs of NN accelerators. DaDianNao [5] and DianNao [3] are the two unique NN accelerators that have explicit ISAs (other ones are often hardwired). They share similar ISAs, and our discussion is exemplified by DaDianNao, the one with better performance and multicore scaling. To be specific, the ISA of this accelerator only contains four 512-bit VLIW instructions corresponding to four popular layer types of neural networks (fully-connected classifier layer, convolutional layer, pooling layer, and local response normalization layer), rendering it a rather incomplete ISA for the NN domain. Among the 10 representative benchmark networks listed in Table III, the DaDianNao ISA is only capable of expressing MLP, CNN, and RBM, but fails to implement the remaining 7 benchmarks (RNN, LSTM, AutoEncoder, Sparse AutoEncoder, BM, SOM and HNN). An observation well explaining the failure of DaDianNao on the 7 representative networks is that they cannot be characterized as aggregations of the four types of layers (thus aggregations of DaDianNao instructions). In contrast, Cambricon defines a total of 43 64-bit scalar/control/vector/matrix instructions, and is sufficiently flexible to express all 10 networks.

2) Code Density: Code density is a meaningful ISA metric only when the ISA is flexible enough to cover a broad range of applications in the target domain. Therefore, we only compare the code density of Cambricon with GPU, MIPS, and x86, with the 10 benchmarks implemented in Cambricon, CUDA-C, and C, respectively. We manually write the Cambricon programs; we compile the CUDA-C programs with nvcc, and count the length of the generated ptx files after removing initialization and system-call instructions; we compile the C programs with x86 and MIPS compilers, respectively (with the option -O2), and then count the lengths of the two kinds of assembly. We illustrate in Fig. 10 Cambricon's reduction in code length over the other ISAs. On average, the code length of Cambricon is about 6.41x, 9.86x, and 13.38x shorter than GPU, x86, and MIPS, respectively. The observations are not surprising, because Cambricon aggregates many scalar operations into vector instructions, and further aggregates vector operations into matrix instructions, which significantly reduces the code length.

Specifically, on MLP, Cambricon can improve the code density by 13.62x, 22.62x, and 32.92x against GPU, x86, and MIPS, respectively. The main reason is that there are very few scalar instructions in the Cambricon code of MLP. However, on CNN, Cambricon achieves only 1.09x, 5.90x, and 8.27x reduction of code length against GPU, x86, and MIPS, respectively. This is because the main body of CNN is a deeply nested loop requiring many individual scalar operations to manipulate the loop variables. Hence, the advantage of aggregating scalar operations into vector operations yields only a small gain in code density.

Moreover, we collect the percentage breakdown of Cambricon instruction types over the 10 benchmarks. On average, 38.0% of instructions are data transfer instructions, 4.8% are control instructions, 12.6% are matrix instructions, 33.8% are vector instructions, and 10.9% are scalar instructions. This observation clearly shows that vector/matrix instructions play a critical role in NN techniques; thus, efficient implementations of these instructions are essential to the performance of a Cambricon-based accelerator.

3) Performance: We compare Cambricon-ACC against the x86-CPU and the GPU on all 10 benchmarks listed in Table III. Fig. 12 illustrates the speedup of Cambricon-ACC against x86-CPU, GPU, and DaDianNao. On average, Cambricon-ACC is about 91.72x and 3.09x faster than the x86-CPU and the GPU, respectively. This is not surprising, because Cambricon-ACC integrates dedicated functional units and scratchpad memory optimized for NN techniques.

On the other hand, due to its incomplete and restricted ISA, DaDianNao can only accommodate 3 out of the 10 benchmarks (i.e., MLP, CNN and RBM); thus its flexibility is significantly worse than that of Cambricon-ACC. In the meantime, the better flexibility of Cambricon-ACC does not lead to a significant performance loss. We compare Cambricon-ACC against DaDianNao on the three benchmarks that DaDianNao can support, and observe that Cambricon-ACC is only 4.5% slower than DaDianNao on average. The reason for the small performance loss of Cambricon-ACC relative to DaDianNao is that Cambricon decomposes the complex high-level functional instructions of DaDianNao (e.g., an instruction for a convolutional layer) into shorter, lower-level computational instructions (e.g., MMV and dot product), which may bring in additional pipeline bubbles between instructions. With the high code density provided by Cambricon, however, the amount of additional bubbles is moderate, and the corresponding performance loss is therefore negligible.
Table III. Benchmarks (H stands for hidden layer, C for convolutional layer, K for kernel, P for pooling layer, F for classifier layer, V for visible layer)

Technique | Network structure | Description
MLP | input(64)-H1(150)-H2(150)-output(14) | Using a Multi-Layer Perceptron (MLP) to perform anchorperson detection. [2]
CNN | input(1@32x32)-C1(6@28x28, K: 6@5x5)-S1(6@14x14, K: 2x2)-C2(16@10x10, K: 16@5x5)-S2(16@5x5, K: 2x2)-F(120)-F(84)-output(10) | Convolutional neural network (LeNet-5) for hand-written character recognition. [28]
RNN | input(26)-H(93)-output(61) | Recurrent neural network (RNN) on the TIMIT database. [15]
LSTM | input(26)-H(93)-output(61) | Long short-term memory (LSTM) neural network on the TIMIT database. [15]
AutoEncoder | input(320)-H1(200)-H2(100)-H3(50)-output(10) | A neural network pretrained by auto-encoder on the MNIST data set. [49]
Sparse AutoEncoder | input(320)-H1(200)-H2(100)-H3(50)-output(10) | A neural network pretrained by sparse auto-encoder on the MNIST data set. [49]
BM | V(500)-H(500) | Boltzmann machines (BM) on the MNIST data set.
RBM | V(500)-H(500) | Restricted Boltzmann machine (RBM) on the MNIST data set.
SOM | input data(64)-neurons(36) | Self-organizing map (SOM) based data mining.
HNN | vector(5), vector component(100) | Hopfield neural network (HNN) on a hand-written digits data set. [36]

Figure 10. The reduction in code length against GPU, x86-CPU, and MIPS-CPU.

Figure 11. The percentages of instruction types (data transfer, control, matrix, vector, scalar) among all benchmarks.

Figure 12. The speedup of Cambricon-ACC against x86-CPU, GPU, and DaDianNao.
4) Energy Consumption: We also compare the energy consumption of Cambricon-ACC, the GPU, and DaDianNao, which can be estimated as the product of power consumption (in Watts) and execution time (in seconds). The power consumption of the GPU is reported by nvprof, and the power consumptions of DaDianNao and Cambricon-ACC are estimated with Synopsys Prime-Time PX according to the simulated Value Change Dump (VCD) file. We do not have an energy comparison against the CPU baseline, because of the lack of hardware support for estimating the actual power of the CPU. Yet, it has recently been reported that a SIMD-CPU is an order of magnitude less energy efficient than a GPU (NVIDIA K20M) on neural network applications [4], which well complements our experiments.

As shown in Fig. 13, the energy consumptions of the GPU and DaDianNao are 130.53x and 0.916x that of Cambricon-ACC, respectively, where the energy of DaDianNao is averaged over 3 benchmarks because it can only accommodate 3 out of the 10 benchmarks. Compared with Cambricon-ACC, the power consumption of the GPU is much higher, as the GPU spends excessive hardware resources to flexibly support various workloads. On the other hand, the energy consumption of Cambricon-ACC is only slightly higher than that of DaDianNao, because both accelerators integrate the same sizes of functional units and on-chip storage, and work at the same frequency. The additional energy consumed by Cambricon-ACC mainly comes from the instruction pipeline logic, the memory queue, and the vector transcendental functional unit. In contrast, DaDianNao uses a low-precision but lightweight lookup table instead of transcendental functional units.
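The estimation itself is just the power-time product; a short sketch (ours, with clearly hypothetical example numbers) of how such normalized ratios are formed:

def energy_joules(power_watts, time_seconds):
    # Energy estimated as the product of power consumption and execution time.
    return power_watts * time_seconds

# Hypothetical example values, only to show the normalization step:
e_acc = energy_joules(1.695, 0.010)   # Cambricon-ACC at its peak power
e_gpu = energy_joules(235.0, 0.008)   # an assumed GPU power/time pair
print(e_gpu / e_acc)                  # ratio reported relative to Cambricon-ACC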
5) Chip Layout: We show the layout of Cambricon-ACC in Fig. 14, and list the area and power breakdowns in Table IV. The overall area of Cambricon-ACC is 56.24 mm^2, which is about 1.6% larger than that of DaDianNao (55.34 mm^2, re-implemented version). The combinational logic (mainly the vector and matrix functional units) consumes 32.15% of the area of Cambricon-ACC, and the on-chip memory (mainly the vector and matrix scratchpad memories) consumes about 15.05% of the area.

The matrix part (including the matrix functional unit and the matrix scratchpad memory) accounts for 62.69% of the area of Cambricon-ACC, while the core part (including the instruction pipeline logic, scalar functional unit, memory queue, and so on) and the vector part (including the vector functional unit and the vector scratchpad memory) together account for only 9.00% of the area. The remaining 28.31% of the area is consumed by the channel part, including wires connecting the core/vector part and the matrix part, and wires connecting together different blocks of the matrix part.

We also estimate the power consumption of the prototype design with Synopsys PrimePower. The peak power consumption is 1.695 W (under a 100% toggle rate), which is only about one percent of that of the K40M GPU. More specifically, the core/vector part and the matrix part consume 8.20% and 59.26% of the power, respectively. Moreover, data movements in the channel part consume 32.54% of the power, which is several times higher than the power of the core/vector part. It can be expected that the power consumption of the channel part would be much higher if we did not divide the matrix part into multiple blocks.

Table IV. Layout characteristics of Cambricon-ACC (1 GHz), implemented in TSMC 65nm technology

Component | Area (um^2) | Area (%) | Power (mW) | Power (%)
Whole chip | 56,241,000 | 100% | 1,695.60 | 100%
Core + Vector | 5,062,500 | 9.00% | 139.04 | 8.20%
Matrix | 35,259,840 | 62.69% | 1,004.81 | 59.26%
Channel | 15,918,660 | 28.31% | 551.75 | 32.54%
Combinational | 18,081,482 | 32.15% | 476.97 | 28.13%
Memory | 8,461,445 | 15.05% | 174.14 | 10.27%
Registers | 5,612,859 | 9.98% | 300.29 | 17.71%
Clock network | | | 744.20 | 43.89%
Filler cell | 23,207,862 | 41.26% | |

Figure 14. The layout of Cambricon-ACC, implemented in TSMC 65nm technology (visible regions include the core, vector, matrix, and channel parts).
VI. POTENTIAL EXTENSION TO BROADER TECHNIQUES

Although Cambricon is designed for existing neural network techniques, it can also support future neural network techniques, or even some classic statistical techniques, as long as they can be decomposed into the scalar/vector/matrix instructions of Cambricon. Here we take logistic regression [21] as an example, and illustrate how it can be supported by Cambricon. Technically, logistic regression contains two phases: a training phase and a prediction phase. The training phase employs a gradient descent algorithm similar to the training phase of the MLP technique, which can be supported by Cambricon. In the prediction phase, the output can be computed as y = sigmoid(Σ_{i=0}^{n} θ_i x_i), where x = (x_0, x_1, ..., x_n) is the input vector, x_0 always equals 1, and θ = (θ_0, θ_1, ..., θ_n) is the vector of model parameters. We can leverage the dot product instruction, scalar elementary arithmetic instructions, and the scalar exponential instruction of Cambricon to perform the prediction phase of logistic regression. Moreover, given a batch of n different input vectors, the MMV instruction, vector elementary arithmetic instructions, and the vector exponential instruction in Cambricon collaboratively allow the prediction phases of the n inputs to be computed in parallel.
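A NumPy sketch (ours) of the batched prediction phase just described: stacking n input vectors as rows of a matrix lets one matrix-vector product (the MMV role) plus element-wise vector instructions produce all n predictions at once:

import numpy as np

def predict_batch(X, theta):
    # X: n x (d+1) matrix of inputs, first column all ones (x_0 = 1).
    # theta: (d+1)-dimensional vector of model parameters.
    z = X @ theta                 # role of MMV: n dot products in one shot
    e = np.exp(z)                 # vector exponential (VEXP)
    return e / (1.0 + e)          # vector add-scalar and divide (VAS, VDV)

X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8]])
theta = np.array([0.1, 2.0, -0.7])
print(predict_batch(X, theta))    # one probability per input row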