Description: a paper published by the Cambricon team at ISCA 2016, which designs a general-purpose instruction set for neural networks.

Table I. An overview of Cambricon instructions
Instruction Type | Sub-type | Examples | Operands
Control | | jump, conditional branch | register (scalar value), immediate
Data Transfer | Matrix | matrix load/store/move | register (matrix address/size, scalar value), immediate
Data Transfer | Vector | vector load/store/move | register (vector address/size, scalar value), immediate
Data Transfer | Scalar | scalar load/store/move | register (scalar value), immediate
Computational | Matrix | matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, matrix subtract matrix | register (matrix/vector address/size, scalar value)
Computational | Vector | vector elementary arithmetics (add, subtract, multiply, divide), vector transcendental functions (exponential, logarithmic), dot product, random vector generator, maximum/minimum of a vector | register (vector address/size, scalar value), immediate
Computational | Scalar | scalar elementary arithmetics, scalar transcendental functions | register (scalar value), immediate
Logical | Vector | vector compare (greater than, equal), vector logical operations (and, or, inverter), vector greater than merge | register (vector address/size, scalar)
Logical | Scalar | scalar compare, scalar logical operations | register (scalar), immediate
...with load/store instructions. Cambricon contains 64 32-bit General-Purpose Registers (GPRs) for scalars, which can be used in register-indirect addressing of the on-chip scratchpad memory, as well as temporally keeping scalar data.

Type of Instructions. The Cambricon contains four types of instructions: computational, logical, control, and data transfer instructions. Although different instructions may differ in their numbers of valid bits, the instruction length is fixed to be 64-bit for memory alignment and for design simplicity of the load/store/decoding logic. In this section, we only offer a brief introduction to the control and data transfer instructions, because they are similar to their corresponding MIPS instructions, though they have been adapted to fit NN techniques. For computational instructions (including matrix, vector and scalar instructions) and logical instructions, however, the details will be provided in the next section (Section III).

Control Instructions. The Cambricon has two control instructions, jump and conditional branch, as illustrated in Fig. 1. The jump instruction specifies the offset via either an immediate or a GPR value, which will be accumulated to the Program Counter (PC). The conditional branch instruction specifies the predictor (stored in a GPR) in addition to the offset, and the branch target (either PC + {offset} or PC + 1) is determined by a comparison between the predictor and zero.

Figure 1. Top: Jump instruction. Bottom: Conditional Branch (CB) instruction.

Data Transfer Instructions. Data transfer instructions in Cambricon support variable data size in order to flexibly support matrix and vector computational/logical instructions (see Section III for such instructions). Specifically, these instructions can load/store variable-size data blocks (specified by the data-width operand in data transfer instructions) from/to the main memory to/from the on-chip scratchpad memory, or move data between the on-chip scratchpad memory and scalar GPRs. Fig. 2 illustrates the Vector LOAD (VLOAD) instruction, which can load a vector with the size of Vsize from the main memory to the vector scratchpad memory, where the source address in main memory is the sum of the base address saved in a GPR and an immediate number. The formats of Vector STORE (VSTORE), Matrix LOAD (MLOAD), and Matrix STORE (MSTORE) instructions are similar to that of VLOAD.

Figure 2. Vector Load (VLOAD) instruction (fields: opcode VLOAD, Dest_addr, V_size, Src_base, Src_offset).

On-chip Scratchpad Memory. Cambricon does not use any vector register file, but directly keeps data in on-chip scratchpad memory, which is made visible to programmers/compilers. In other words, the role of on-chip scratchpad memory in Cambricon is similar to that of the vector register file in traditional ISAs, and sizes of vector operands are no longer limited by fixed-width vector register files. Therefore, vector/matrix sizes are variable in Cambricon instructions, and the only notable restriction is that the vector/matrix operands in the same instruction cannot exceed the capacity of the scratchpad memory. In case they do exceed it, the compiler will decompose long vectors/matrices into short pieces/blocks and generate multiple instructions to process them.

Just like the 32x512b vector registers have been baked into Intel AVX-512 [18], capacities of on-chip memories for both vector and matrix instructions must be fixed in Cambricon. More specifically, Cambricon fixes the memory capacity at 64KB for vector instructions and 768KB for matrix instructions. Yet, Cambricon does not impose a specific restriction on the bank numbers of scratchpad memory, leaving significant freedom to microarchitecture-level implementations.
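To make the capacity restriction concrete, here is a small Python sketch (ours, not from the paper; the piece-size arithmetic and helper names are assumptions) of how a compiler might split an oversized vector addition into scratchpad-sized pieces, assuming 16-bit elements as in the prototype of Section IV:

# Hypothetical sketch: splitting a long vector add into scratchpad-sized pieces.
# Assumes 16-bit elements and that the 64KB vector scratchpad is shared by
# two source vectors and one destination vector.

VECTOR_SCRATCHPAD_BYTES = 64 * 1024
BYTES_PER_ELEMENT = 2          # 16-bit fixed point
OPERANDS_PER_VAV = 3           # two sources + one destination

def emit_tiled_vav(total_len):
    """Return a list of pseudo-instructions covering a VAV of total_len elements."""
    piece = VECTOR_SCRATCHPAD_BYTES // (OPERANDS_PER_VAV * BYTES_PER_ELEMENT)
    program = []
    for start in range(0, total_len, piece):
        n = min(piece, total_len - start)
        program.append(("VLOAD", "src0", start, n))
        program.append(("VLOAD", "src1", start, n))
        program.append(("VAV", n))
        program.append(("VSTORE", "dst", start, n))
    return program

print(len(emit_tiled_vav(100_000)))  # several pieces, each fitting the scratchpad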
III. COMPUTATIONAL/LOGICAL INSTRUCTIONS

In neural networks, most arithmetic operations (e.g., additions, multiplications and activation functions) can be aggregated as vector operations [10], [45], and the ratio can be as high as 99.992% according to our quantitative observations on a state-of-the-art convolutional neural network (GoogLeNet) winning the 2014 ImageNet competition (ILSVRC14) [43]. In the meantime, we also discover that 99.791% of the vector operations (such as the dot product operation) in the GoogLeNet can be aggregated further as matrix operations (such as vector-matrix multiplication). In a nutshell, NNs can be naturally decomposed into scalar, vector, and matrix operations, and the ISA design must effectively take advantage of the potential data-level parallelism and data locality.

A. Matrix Instructions

We conduct a thorough and comprehensive review of existing NN techniques, and design a total of six matrix instructions for Cambricon. Here we take a Multi-Layer Perceptron (MLP) [50], a well-known and representative NN, as an example, and show how it is supported by the matrix instructions. Technically, an MLP usually has multiple layers, each of which computes the values of some neurons (i.e., output neurons) according to some neurons whose values are known (i.e., input neurons). We illustrate the feedforward run of one such layer in Fig. 3. More specifically, the output neuron y_i (i = 1, 2, 3) in Fig. 3 can be computed as y_i = f(Σ_{j=1}^{3} w_ij x_j + b_i), where x_j is the j-th input neuron, w_ij is the weight between the i-th output neuron and the j-th input neuron, b_i is the bias of the i-th output neuron, and f is the activation function. The output neurons can be computed as a vector y = (y1, y2, y3):

y = f(Wx + b),    (1)

where x = (x1, x2, x3) and b = (b1, b2, b3) are vectors of input neurons and biases, respectively, W = (w_ij) is the weight matrix, and f is the element-wise version of the activation function (see Section III-B).

Figure 3. Typical operations in NNs (the feedforward run of one MLP layer).

A critical step in Eq. 1 is to compute Wx, which will be performed by the Matrix-Mult-Vector (MMV) instruction in Cambricon. We illustrate this instruction in Fig. 4, where Reg0 specifies the base scratchpad memory address of the vector output (Vout_addr); Reg1 specifies the size of the vector output (Vout_size); and Reg2, Reg3, and Reg4 specify the base address of the matrix input (Min_addr), the base address of the vector input (Vin_addr), and the size of the vector input (Vin_size; note that it is variable), respectively. The MMV instruction can support matrix-vector multiplication at arbitrary scales, as long as all the input and output data can be kept simultaneously in the scratchpad memory. We choose to compute Wx with the dedicated MMV instruction instead of decomposing it into multiple vector dot products, because the latter approach requires additional efforts (e.g., explicit synchronization, concurrent read/write requests to the same address) to reuse the input vector x among different row vectors of W, which is less efficient.

Figure 4. Matrix-Mult-Vector (MMV) instruction (fields: opcode MMV, Vout_addr, Vout_size, Min_addr, Vin_addr, Vin_size).

Unlike the feedforward case, however, the MMV instruction no longer provides efficient support to the backward training process of an NN. More specifically, a critical step of the well-known Back-Propagation (BP) algorithm is to compute the gradient vector [20], which can be formulated as a vector multiplied by a matrix. If we implemented it with the MMV instruction, we would need an additional instruction implementing matrix transpose, which is rather expensive in data movements. To avoid that, Cambricon provides a Vector-Mult-Matrix (VMM) instruction that is directly applicable to the backward training process. The VMM instruction has the same fields as the MMV instruction, except for the opcode.

Moreover, in training an NN, the weight matrix W often needs to be incrementally updated with W = W + η·ΔW, where η is the learning rate and ΔW is estimated as the outer product of two vectors. Cambricon provides an Outer-Product (OP) instruction (the output is a matrix), a Matrix-Mult-Scalar (MMS) instruction, and a Matrix-Add-Matrix (MAM) instruction to collaboratively perform the weight update. In addition, Cambricon also provides a Matrix-Subtract-Matrix (MSM) instruction to support the weight update in the Restricted Boltzmann Machine (RBM) [39].
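As a non-normative illustration, the following NumPy sketch emulates the semantics that OP, MMS, and MAM jointly realize for the update W = W + η·ΔW; the helper names are ours, not Cambricon mnemonics:

import numpy as np

# Each helper mirrors the described semantics of one Cambricon matrix
# instruction; these are illustrative emulations, not the ISA itself.

def op_outer_product(v0, v1):          # OP: the output is a matrix
    return np.outer(v0, v1)

def mms_matrix_mult_scalar(m, s):      # MMS: scale every matrix element
    return m * s

def mam_matrix_add_matrix(m0, m1):     # MAM: element-wise matrix addition
    return m0 + m1

# W = W + eta * (x outer delta): the incremental weight update used in training.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x, delta = rng.standard_normal(3), rng.standard_normal(4)
eta = 0.01

dW = op_outer_product(x, delta)                                # OP
W = mam_matrix_add_matrix(W, mms_matrix_mult_scalar(dW, eta))  # MMS, then MAM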
B. Vector Instructions

Using Eq. 1 as an example, one can observe that the matrix instructions defined in the prior subsection are still insufficient to perform all the computations: we still need to add up the vector output of Wx and the bias vector b, and then apply an element-wise activation to Wx + b.

While Cambricon directly provides a Vector-Add-Vector (VAV) instruction for vector additions, it requires multiple instructions to support the element-wise activation. Without losing any generality, here we take the widely used sigmoid activation, f(a) = e^a / (1 + e^a), as an example. The element-wise sigmoid activation performed on each element of an input vector (say, a) can be decomposed into 3 consecutive steps, supported by 3 instructions, respectively:
1. Computing the exponential e^{a_i} for each element (a_i, i = 1, ..., n) in the input vector a. Cambricon provides a Vector-Exponential (VEXP) instruction for the element-wise exponential of a vector.

2. Adding the constant 1 to each element of the vector (e^{a_1}, ..., e^{a_n}). Cambricon provides a Vector-Add-Scalar (VAS) instruction, where the scalar can be an immediate or specified by a GPR.

3. Dividing e^{a_i} by 1 + e^{a_i} for each vector index i = 1, ..., n. Cambricon provides a Vector-Div-Vector (VDV) instruction for element-wise division between vectors.
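The three-step decomposition above can be checked with a small NumPy emulation (ours, for illustration); each helper stands in for one instruction:

import numpy as np

def vexp(a):            # VEXP: element-wise exponential
    return np.exp(a)

def vas(v, s):          # VAS: add a scalar to every element
    return v + s

def vdv(v0, v1):        # VDV: element-wise division
    return v0 / v1

a = np.array([-2.0, 0.0, 2.0])
e = vexp(a)                       # step 1: e^{a_i}
denom = vas(e, 1.0)               # step 2: 1 + e^{a_i}
y = vdv(e, denom)                 # step 3: e^{a_i} / (1 + e^{a_i})

assert np.allclose(y, 1.0 / (1.0 + np.exp(-a)))   # matches the sigmoid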
However, the sigmoid is not the only activation function utilized by existing NNs. To implement element-wise versions of various activation functions, Cambricon provides a series of vector arithmetic instructions, such as Vector-Mult-Vector (VMV), Vector-Sub-Vector (VSV), and Vector-Logarithm (VLOG). During the design of a hardware accelerator, instructions related to different transcendental functions (e.g., logarithmic, trigonometric and anti-trigonometric functions) can efficiently reuse the same functional block (involving addition, shift, and table-lookup operations), using the CORDIC technique [24]. Moreover, there are activation functions (e.g., max(0, a) and |a|) that partially rely on logical operations (e.g., comparison), and we will present the related Cambricon instructions (e.g., vector compare instructions) in Section III-C.

Furthermore, random vector generation is an important operation common to many NN techniques (e.g., dropout [8] and random sampling [39]), but it is not deemed a necessity in traditional linear algebra libraries designed for scientific computing (e.g., the BLAS library does not include this operation). Cambricon provides a dedicated instruction (Random-Vector, RV) that generates a vector of random numbers obeying the uniform distribution on the interval [0, 1]. Given uniform random vectors, we can further generate random vectors obeying other distributions (e.g., the Gaussian distribution) using the Ziggurat algorithm [31], with the help of vector arithmetic instructions and vector compare instructions in Cambricon.

C. Logical Instructions

The state-of-the-art NN techniques leverage a few operations that incorporate comparisons or other logical manipulations. The max-pooling operation is one such operation (see Fig. 5a for an illustration), which seeks the neuron having the largest output among the neurons within a pooling window, and repeats this action for corresponding pooling windows in different input feature maps (see Fig. 5b).

Figure 5. Max-pooling operation (a: a pooling window over an input feature map; b: multiple input/output feature maps; c: aggregation across input feature maps).

Cambricon supports the max-pooling operation with a Vector-Greater-Than-Merge (VGTM) instruction, see Fig. 6. The VGTM instruction designates each element of the output vector (Vout) by comparing corresponding elements of the input vector-0 (Vin0) and input vector-1 (Vin1), i.e., Vout[i] = (Vin0[i] > Vin1[i]) ? Vin0[i] : Vin1[i]. We present the Cambricon code of the max-pooling operation in Section III-E, which aggregates neurons at the same position of all input feature maps in the same input vector, iteratively performs VGTM, and obtains the final result (see also Fig. 5c for an illustration).

Figure 6. Vector-Greater-Than-Merge (VGTM) instruction (fields: opcode VGTM, Vout_addr, Vout_size, Vin0_addr, Vin1_addr).
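A minimal emulation (ours) of the VGTM semantics and of the iterative max-pooling pattern just described; the paper's actual Cambricon code appears in Section III-E:

import numpy as np

def vgtm(vin0, vin1):
    # VGTM: Vout[i] = (Vin0[i] > Vin1[i]) ? Vin0[i] : Vin1[i]
    return np.where(vin0 > vin1, vin0, vin1)

# One pooling window: each row holds the neurons at one (x, y) position
# across all input feature maps, so VGTM reduces the window element-wise.
window = np.array([[1.0, 5.0, 3.0],
                   [4.0, 2.0, 6.0],
                   [0.0, 7.0, 1.0]])   # 3 window positions x 3 feature maps

out = np.full(3, -np.inf)              # running per-feature-map maximum
for position in window:
    out = vgtm(position, out)          # iterative VGTM, as in Fig. 5c

assert np.array_equal(out, window.max(axis=0))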
In addition to the vector computational instructions, Cambricon also provides a Vector-Greater-Than (VGT) instruction, a Vector-Equal (VE) instruction, Vector AND/OR/NOT instructions (VAND/VOR/VNOT), scalar comparison, and scalar logical instructions to tackle branch conditions, i.e., computing the predictor for the aforementioned Conditional Branch (CB) instruction.

D. Scalar Instructions

Although we have observed that only 0.008% of the arithmetic operations of the GoogLeNet [43] cannot be supported with the matrix and vector instructions in Cambricon, there are also scalar operations that are indispensable to NNs, such as elementary arithmetic operations and scalar transcendental functions. We summarize them in Table I; they have been formally defined as Cambricon's scalar instructions.
E. Code Examples

To illustrate the usage of our proposed instruction set, we implement three simple yet representative components of NNs, an MLP feedforward layer [50], a pooling layer [22], and a Boltzmann Machines (BM) layer [39], using Cambricon instructions. For the sake of brevity, we omit scalar load/store instructions for all three layers, and only show the program fragment of a single pooling window (with multiple input and output feature maps) for the pooling layer. We illustrate the concrete Cambricon program fragments in Fig. 7, and we observe that the code density of Cambricon is significantly higher than that of x86 and MIPS (see Section V for a comprehensive evaluation).

MLP code:

// $0: input size, $1: output size, $2: matrix size
// $3: input address, $4: weight address
// $5: bias address, $6: output address
// $7-$10: temp variable address
VLOAD  $3, $0, #100        // load input vector from address (100)
MLOAD  $4, $2, #300        // load weight matrix from address (300)
MMV    $7, $1, $4, $3, $0  // Wx
VAV    $8, $1, $7, $5      // tmp = Wx + b
VEXP   $9, $1, $8          // exp(tmp)
VAS    $10, $1, $9, #1     // 1 + exp(tmp)
VDV    $6, $1, $9, $10     // y = exp(tmp) / (1 + exp(tmp))
VSTORE $6, $1, #200        // store output vector to address (200)
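For readability, here is what the MLP fragment computes, rewritten as a NumPy equivalent (our sketch; scratchpad addresses and registers drop out), namely y = sigmoid(Wx + b):

import numpy as np

def mlp_layer(W, x, b):
    # MMV: Wx; VAV: + b; VEXP/VAS/VDV: exp(tmp) / (1 + exp(tmp))
    tmp = W @ x + b
    e = np.exp(tmp)
    return e / (1.0 + e)

W = np.array([[0.1, 0.2], [0.3, 0.4]])
x = np.array([1.0, -1.0])
b = np.array([0.05, -0.05])
print(mlp_layer(W, x, b))   # the output vector y of the layer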
Pooling code:

// $0: feature map size, $1: input data size
// $2: output data size, $3: pooling window size
// $4: x-axis loop num, $5: y-axis loop num
// $6: input addr, $7: output addr
// $8: y-axis stride of input
    VLOAD  $6, $1, #100  // load input neurons from address (100)
    SMOVE  $5, $3        // init y
L0: SMOVE  $4, $3        // init x
L1: VGTM   $7, $0, $6, $7
    // for each feature map m: output[m] = (input[x][y][m] > output[m]) ? input[x][y][m] : output[m]
    SADD   $6, $6, $0    // update input address
    SADD   $4, $4, #-1   // x--
    CB     #L1, $4       // if (x > 0) goto L1
    SADD   $6, $6, $8    // update input address
    SADD   $5, $5, #-1   // y--
    CB     #L0, $5       // if (y > 0) goto L0
    VSTORE $7, $2, #200  // store output neurons to address (200)
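The same control flow in array form (our sketch): the two CB-terminated loops walk the pooling window, while VGTM keeps a running per-feature-map maximum:

import numpy as np

def max_pool_window(inp):
    # inp[x][y][m]: neuron of feature map m at position (x, y) of one window.
    xs, ys, num_maps = inp.shape
    out = np.full(num_maps, -np.inf)
    for y in range(ys):                # outer loop, closed by CB #L0, $5
        for x in range(xs):            # inner loop, closed by CB #L1, $4
            out = np.where(inp[x, y] > out, inp[x, y], out)   # VGTM
    return out

window = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
assert np.array_equal(max_pool_window(window), window.reshape(-1, 3).max(axis=0))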
BM code:

// $0: visible vector size, $1: hidden vector size, $2: v-h matrix (W) size
// $3: h-h matrix (L) size, $4: visible vector address, $5: W address
// $6: L address, $7: bias address, $8: hidden vector address
// $9-$17: temp variable address
VLOAD  $4, $0, #100         // load visible vector from address (100)
VLOAD  $9, $1, #200         // load hidden vector from address (200)
MLOAD  $5, $2, #300         // load W matrix from address (300)
MLOAD  $6, $3, #400         // load L matrix from address (400)
MMV    $10, $1, $5, $4, $0  // Wv
MMV    $11, $1, $6, $9, $1  // Lh
VAV    $12, $1, $10, $11    // Wv + Lh
VAV    $13, $1, $12, $7     // tmp = Wv + Lh + b
VEXP   $14, $1, $13         // exp(tmp)
VAS    $15, $1, $14, #1     // 1 + exp(tmp)
VDV    $16, $1, $14, $15    // y = exp(tmp) / (1 + exp(tmp))
RV     $17, $1              // for each i: r[i] = random(0, 1)
VGT    $8, $1, $17, $16     // for each i: h[i] = (r[i] > y[i]) ? 1 : 0
VSTORE $8, $1, #500         // store hidden vector to address (500)

Figure 7. Cambricon program fragments of MLP, pooling, and BM.
IV. A PROTOTYPE ACCELERATOR

In this section, we present a prototype accelerator of Cambricon. We illustrate the design in Fig. 8, which contains seven major instruction pipeline stages: fetching, decoding, issuing, register reading, execution, writing back, and committing. We use mature techniques such as scratchpad memory and DMA in this accelerator, since we found that these classic techniques have been sufficient to reflect the flexibility (Section V-B1), conciseness (Section V-B2) and efficiency (Section V-B3) of the ISA. We did not seek to explore emerging techniques (such as 3D stacking [51] and non-volatile memory [47], [46]) in our prototype design, but left such exploration as future work, because we believe that a promising ISA must be easy to implement and should not be tightly coupled with emerging techniques.

Figure 8. A prototype accelerator based on Cambricon (blocks visible in the figure: reorder buffer, L1 cache, vector functional unit with vector scratchpad memory and DMAs, matrix functional unit with matrix scratchpad memory and DMAs).

As illustrated in Fig. 8, after the fetching and decoding stages, an instruction is injected into an in-order issue queue. After successfully fetching the operands (scalar data, or address/size of vector/matrix data) from the scalar register file, an instruction will be sent to different units depending on the instruction type. Control instructions and scalar computational/logical instructions will be sent to the scalar functional unit for direct execution. After writing back to the scalar register file, such an instruction can be committed from the reorder buffer (note 1) as long as it has become the oldest uncommitted yet executed instruction.

Data transfer instructions, vector/matrix computational instructions, and vector logical instructions, which may access the L1 cache or the scratchpad memories, will be sent to the Address Generation Unit (AGU). Such an instruction needs to wait in an in-order memory queue to resolve potential memory dependencies (note 2) with earlier instructions in the memory queue. After that, load/store requests of scalar data transfer instructions will be sent to the L1 cache; data transfer/computational/logical instructions for vectors will be sent to the vector functional unit; and data transfer/computational instructions for matrices will be sent to the matrix functional unit. After the execution, such an instruction can be retired from the memory queue, and then be committed from the reorder buffer as long as it has become the oldest uncommitted yet executed instruction.

Note 1: We need a reorder buffer even though instructions are issued in order, because the execution stages of different instructions may take significantly different numbers of cycles.
Note 2: Here we say two instructions are memory dependent if they access an overlapping memory region, and at least one of them needs to write the memory region.
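The memory-dependence test of note 2 can be stated compactly; a sketch (ours) over byte ranges [base, base + size):

def regions_overlap(base0, size0, base1, size1):
    """True if the byte ranges [base, base+size) intersect."""
    return base0 < base1 + size1 and base1 < base0 + size0

def memory_dependent(instr0, instr1):
    # Each instruction is modeled as (base_address, size_in_bytes, is_write).
    (b0, s0, w0), (b1, s1, w1) = instr0, instr1
    # Dependent iff the regions overlap and at least one access is a write.
    return regions_overlap(b0, s0, b1, s1) and (w0 or w1)

# A later load of 0x100..0x13F must wait for an earlier store to 0x120..0x15F.
assert memory_dependent((0x120, 0x40, True), (0x100, 0x40, False))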
The accelerator implements both vector and matrix functional units. The vector unit contains 32 16-bit adders and 32 16-bit multipliers, and is equipped with a 64KB scratchpad memory. The matrix unit contains 1024 multipliers and 1024 adders, and has been divided into 32 separate computational blocks to avoid excessive wire congestion and power consumption on long-distance data movements. Each computational block is equipped with a separate 24KB scratchpad. The 32 computational blocks are connected through an h-tree bus that serves to broadcast input values to each block and to collect output values from each block.

A notable Cambricon feature is that it does not use any vector register file, but keeps data in on-chip scratchpad memories. To efficiently access the scratchpad memories, the vector/matrix functional unit of the prototype accelerator integrates three DMAs, each of which corresponds to one vector/matrix input/output of an instruction. In addition, the scratchpad memory is equipped with an IO DMA. However, each scratchpad memory itself only provides a single port for each bank, but may need to address up to four concurrent read/write requests. We design a specific structure for the scratchpad memory to tackle this issue (see Fig. 9). Concretely, we decompose the memory into four banks according to the low-order two bits of addresses, and connect them with four read/write ports via a crossbar, guaranteeing that no bank will be simultaneously accessed. Thanks to the dedicated hardware support, Cambricon does not need an expensive multi-port vector register file, and can flexibly and efficiently support different data widths using the on-chip scratchpad memory.

Figure 9. Structure of the matrix scratchpad memory (matrix DMAs and an IO DMA feed four read/write ports, connected via a crossbar to banks Bank-00, Bank-01, Bank-10, and Bank-11).
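As an illustration of the bank decomposition (our sketch, with an assumed addressing granularity), the low-order two bits of an address select one of four banks, so up to four concurrent requests proceed in parallel whenever their addresses fall into pairwise distinct banks:

def bank_of(addr):
    return addr & 0b11          # low-order two bits select the bank

def conflict_free(addrs):
    """True if the (up to four) concurrent requests hit pairwise distinct banks."""
    banks = [bank_of(a) for a in addrs]
    return len(set(banks)) == len(banks)

# Three DMA requests plus one IO-DMA request, one per bank: no conflict.
assert conflict_free([0x1000, 0x2001, 0x3002, 0x4003])
# Two requests falling into bank 0 would have to be serialized.
assert not conflict_free([0x1000, 0x2000])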
V. EXPERIMENTAL EVALUATION

In this section, we first describe the evaluation methodology, and then present the experimental results.

A. Methodology

Design evaluation. We synthesize the prototype accelerator of Cambricon (Cambricon-ACC, see Section IV) with Synopsys Design Compiler using the TSMC 65nm GP standard VT library, place and route the synthesized design with the Synopsys ICC compiler, simulate and verify it with Synopsys VCS, and estimate the power consumption with Synopsys Prime-Time PX according to the simulated Value Change Dump (VCD) file. We are planning an MPW tape-out of the prototype accelerator with a small area budget of 60 mm^2 at a 65nm process, with a targeted operating frequency of 1 GHz. Therefore, we adopt moderate functional unit sizes and scratchpad memory capacities in order to fit the area budget. Table II shows the details of the design parameters.

Table II. Parameters of our prototype accelerator

issue width |
depth of issue queue |
depth of memory queue |
depth of reorder buffer |
capacity of vector scratchpad memory | 64KB
capacity of matrix scratchpad memory | 768KB (24KB x 32)
bank width of scratchpad memory | 512 bits (32 x 16-bit fixed point)
operators in matrix function unit | 1024 (32 x 32) multipliers and adders
operators in vector function unit | 32 multipliers/dividers/adders/transcendental function operators
Baselines. We compare the Cambricon-ACC with three baselines. The first two are based on a general-purpose CPU and a GPU, and the last one is a state-of-the-art NN hardware accelerator.

CPU. The CPU baseline is an x86 CPU with 256-bit SIMD support (Intel Xeon E5-2620, 2.10GHz, 64 GB memory). We use the Intel MKL library [19] to implement vector and matrix primitives for the CPU baseline, and GCC v4.7.2 to compile all benchmarks, with the options -O2 -lm -march=native to enable SIMD instructions.

GPU. The GPU baseline is a modern GPU card (NVIDIA K40M, 12GB GDDR5, 4.29 TFlops peak at a 28nm process); we implement all benchmarks (see below) with the NVIDIA cuBLAS library [35], a state-of-the-art linear algebra library for GPU.

NN Accelerator. The baseline accelerator is DaDianNao, a state-of-the-art NN accelerator exhibiting remarkable energy-efficiency improvement over a GPU [5]. We re-implement the DaDianNao architecture at a 65nm process, but replace all eDRAMs with SRAMs because we do not have a 65nm eDRAM library. In addition, we re-size DaDianNao such that it has a comparable amount of arithmetic operators and on-chip SRAM capacity as our design, which enables a fair comparison of the two accelerators under the area budget (<60 mm^2) mentioned in the previous paragraph. The re-implemented version of DaDianNao has a single central tile and a total of 32 leaf tiles. The central tile has 64KB SRAM, 32 16-bit adders and 32 16-bit multipliers. Each leaf tile has 24KB SRAM, 32 16-bit adders and 32 16-bit multipliers. In other words, the total numbers of adders and multipliers, as well as the total SRAM capacity, of the re-implemented DaDianNao are the same as those of our prototype accelerator. Although we are constrained to give up eDRAMs in both accelerators, this is still a fair and reasonable experimental setting, because the flexibility of an accelerator is mainly determined by its ISA, not the concrete devices it integrates. In this sense, the flexibility gained from Cambricon will still be there even when we resort to large eDRAMs to remove main memory accesses and improve the performance of both accelerators.

Benchmarks. We take 10 representative NN techniques as our benchmarks, see Table III. Each benchmark is translated manually into assembly to execute on Cambricon-ACC and DaDianNao. We evaluate their cycle-level performance with Synopsys VCS.

B. Experimental Results

We compare Cambricon and Cambricon-ACC with the baselines in terms of metrics such as performance and energy. We also provide the detailed layout characteristics of the prototype accelerator.

1) Flexibility: In view of the apparent flexibility provided by general-purpose ISAs (e.g., x86, MIPS and GPU-ISA), here we restrict our discussion to ISAs of NN accelerators. DaDianNao [5] and DianNao [3] are the two unique NN accelerators that have explicit ISAs (other ones are often hardwired). They share similar ISAs, and our discussion is exemplified by DaDianNao, the one with better performance and multicore scaling. To be specific, the ISA of this accelerator only contains four 512-bit VLIW instructions corresponding to four popular layer types of neural networks (fully-connected classifier layer, convolutional layer, pooling layer, and local response normalization layer), rendering it a rather incomplete ISA for the NN domain. Among the 10 representative benchmark networks listed in Table III, the DaDianNao ISA is only capable of expressing MLP, CNN, and RBM, but fails to implement the remaining 7 benchmarks (RNN, LSTM, AutoEncoder, Sparse AutoEncoder, BM, SOM and HNN). An observation well explaining the failure of DaDianNao on the 7 representative networks is that they cannot be characterized as aggregations of the four types of layers (thus aggregations of DaDianNao instructions). In contrast, Cambricon defines a total of 43 64-bit scalar/control/vector/matrix instructions, and is sufficiently flexible to express all 10 networks.

2) Code Density: Code density is a meaningful ISA metric only when the ISA is flexible enough to cover a broad range of applications in the target domain. Therefore, we only compare the code density of Cambricon with GPU, MIPS, and x86, with the 10 benchmarks implemented in Cambricon, CUDA-C, and C, respectively. We manually write the Cambricon programs; we compile the CUDA-C programs with nvcc, and count the length of the generated ptx files after removing initialization and system-call instructions; we compile the C programs with x86 and MIPS compilers, respectively (with the option -O2), and then count the lengths of the two kinds of assembly. We illustrate in Fig. 10 Cambricon's reduction in code length over the other ISAs. On average, the code length of Cambricon is about 6.41x, 9.86x, and 13.38x shorter than GPU, x86, and MIPS, respectively. The observations are not surprising, because Cambricon aggregates many scalar operations into vector instructions, and further aggregates vector operations into matrix instructions, which significantly reduces the code length.

Specifically, on MLP, Cambricon can improve the code density by 13.62x, 22.62x, and 32.92x against GPU, x86, and MIPS, respectively. The main reason is that there are very few scalar instructions in the Cambricon code of MLP. However, on CNN, Cambricon achieves only 1.09x, 5.90x, and 8.27x reduction of code length against GPU, x86, and MIPS, respectively. This is because the main body of CNN is a deeply nested loop requiring many individual scalar operations to manipulate the loop variables. Hence, the advantage of aggregating scalar operations into vector operations yields only a small gain in code density.

Moreover, we collect the percentage breakdown of Cambricon instruction types over the 10 benchmarks. On average, 38.0% of instructions are data transfer instructions, 4.8% are control instructions, 12.6% are matrix instructions, 33.8% are vector instructions, and 10.9% are scalar instructions. This observation clearly shows that vector/matrix instructions play a critical role in NN techniques; thus, efficient implementations of these instructions are essential to the performance of a Cambricon-based accelerator.

3) Performance: We compare Cambricon-ACC against the x86-CPU and the GPU on all 10 benchmarks listed in Table III. Fig. 12 illustrates the speedup of Cambricon-ACC against x86-CPU, GPU, and DaDianNao. On average, Cambricon-ACC is about 91.72x and 3.09x faster than the x86-CPU and the GPU, respectively. This is not surprising, because Cambricon-ACC integrates dedicated functional units and scratchpad memory optimized for NN techniques.

On the other hand, due to its incomplete and restricted ISA, DaDianNao can only accommodate 3 out of the 10 benchmarks (i.e., MLP, CNN and RBM); thus its flexibility is significantly worse than that of Cambricon-ACC. In the meantime, the better flexibility of Cambricon-ACC does not lead to a significant performance loss. We compare Cambricon-ACC against DaDianNao on the three benchmarks that DaDianNao can support, and observe that Cambricon-ACC is only 4.5% slower than DaDianNao on average. The reason for the small performance loss of Cambricon-ACC relative to DaDianNao is that Cambricon decomposes the complex high-level functional instructions of DaDianNao (e.g., an instruction for a convolutional layer) into shorter, lower-level computational instructions (e.g., MMV and dot product), which may bring in additional pipeline bubbles between instructions. With the high code density provided by Cambricon, however, the amount of additional bubbles is moderate, and the corresponding performance loss is therefore negligible.
Table III. Benchmarks (H stands for hidden layer, C for convolutional layer, K for kernel, P for pooling layer, F for classifier layer, V for visible layer)

Technique | Network structure | Description
MLP | input(64)-H1(150)-H2(150)-output(14) | Using a Multi-Layer Perceptron (MLP) to perform anchorperson detection. [2]
CNN | input(1@32x32)-C1(6@28x28, K: 6@5x5)-S1(6@14x14, K: 2x2)-C2(16@10x10, K: 16@5x5)-S2(16@5x5, K: 2x2)-F(120)-F(84)-output(10) | Convolutional neural network (LeNet-5) for hand-written character recognition. [28]
RNN | input(26)-H(93)-output(61) | Recurrent neural network (RNN) on the TIMIT database. [15]
LSTM | input(26)-H(93)-output(61) | Long short-term memory (LSTM) neural network on the TIMIT database. [15]
AutoEncoder | input(320)-H1(200)-H2(100)-H3(50)-output(10) | A neural network pretrained by auto-encoder on the MNIST data set. [49]
Sparse AutoEncoder | input(320)-H1(200)-H2(100)-H3(50)-output(10) | A neural network pretrained by sparse auto-encoder on the MNIST data set. [49]
BM | V(500)-H(500) | Boltzmann machines (BM) on the MNIST data set.
RBM | V(500)-H(500) | Restricted Boltzmann machine (RBM) on the MNIST data set.
SOM | input data(64)-neurons(36) | Self-organizing map (SOM) based data mining.
HNN | vector(5), vector component(100) | Hopfield neural network (HNN) on a hand-written digits data set. [36]

Figure 10. The reduction in code length against GPU, x86-CPU, and MIPS-CPU.

Figure 11. The percentages of instruction types (data transfer, control, matrix, vector, scalar) among all benchmarks.

Figure 12. The speedup of Cambricon-ACC against x86-CPU, GPU, and DaDianNao.
4) Energy Consumption: We also compare the energy consumption of Cambricon-ACC, the GPU, and DaDianNao, which can be estimated as the product of power consumption (in Watts) and execution time (in seconds). The power consumption of the GPU is reported by nvprof, and the power consumptions of DaDianNao and Cambricon-ACC are estimated with Synopsys Prime-Time PX according to the simulated Value Change Dump (VCD) file. We do not have an energy comparison against the CPU baseline, because of the lack of hardware support for estimating the actual power of the CPU. Yet, it has recently been reported that a SIMD-CPU is an order of magnitude less energy efficient than a GPU (NVIDIA K20M) on neural network applications [4], which well complements our experiments.

As shown in Fig. 13, the energy consumptions of the GPU and DaDianNao are 130.53x and 0.916x that of Cambricon-ACC, respectively, where the energy of DaDianNao is averaged over 3 benchmarks because it can only accommodate 3 out of the 10 benchmarks. Compared with Cambricon-ACC, the power consumption of the GPU is much higher, as the GPU spends excessive hardware resources to flexibly support various workloads. On the other hand, the energy consumption of Cambricon-ACC is only slightly higher than that of DaDianNao, because both accelerators integrate the same sizes of functional units and on-chip storage, and work at the same frequency. The additional energy consumed by Cambricon-ACC mainly comes from the instruction pipeline logic, the memory queue, and the vector transcendental functional unit. In contrast, DaDianNao uses a low-precision but lightweight lookup table instead of transcendental functional units.
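The estimation itself is just the power-time product; a short sketch (ours, with clearly hypothetical example numbers) of how such normalized ratios are formed:

def energy_joules(power_watts, time_seconds):
    # Energy estimated as the product of power consumption and execution time.
    return power_watts * time_seconds

# Hypothetical example values, only to show the normalization step:
e_acc = energy_joules(1.695, 0.010)   # Cambricon-ACC at its peak power
e_gpu = energy_joules(235.0, 0.008)   # an assumed GPU power/time pair
print(e_gpu / e_acc)                  # ratio reported relative to Cambricon-ACC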
5) Chip Layout: We show the layout of Cambricon-ACC in Fig. 14, and list the area and power breakdowns in Table IV. The overall area of Cambricon-ACC is 56.24 mm^2, which is about 1.6% larger than that of DaDianNao (55.34 mm^2, re-implemented version). The combinational logic (mainly the vector and matrix functional units) consumes 32.15% of the area of Cambricon-ACC, and the on-chip memory (mainly the vector and matrix scratchpad memories) consumes about 15.05% of the area.

The matrix part (including the matrix functional unit and the matrix scratchpad memory) accounts for 62.69% of the area of Cambricon-ACC, while the core part (including the instruction pipeline logic, scalar functional unit, memory queue, and so on) and the vector part (including the vector functional unit and the vector scratchpad memory) together account for only 9.00% of the area. The remaining 28.31% of the area is consumed by the channel part, including wires connecting the core/vector part and the matrix part, and wires connecting together different blocks of the matrix part.

We also estimate the power consumption of the prototype design with Synopsys PrimePower. The peak power consumption is 1.695 W (under a 100% toggle rate), which is only about one percent of that of the K40M GPU. More specifically, the core/vector part and the matrix part consume 8.20% and 59.26% of the power, respectively. Moreover, data movements in the channel part consume 32.54% of the power, which is several times higher than the power of the core/vector part. It can be expected that the power consumption of the channel part would be much higher if we did not divide the matrix part into multiple blocks.

Table IV. Layout characteristics of Cambricon-ACC (1 GHz), implemented in TSMC 65nm technology

Component | Area (um^2) | Area (%) | Power (mW) | Power (%)
Whole chip | 56,241,000 | 100% | 1,695.60 | 100%
Core + Vector | 5,062,500 | 9.00% | 139.04 | 8.20%
Matrix | 35,259,840 | 62.69% | 1,004.81 | 59.26%
Channel | 15,918,660 | 28.31% | 551.75 | 32.54%
Combinational | 18,081,482 | 32.15% | 476.97 | 28.13%
Memory | 8,461,445 | 15.05% | 174.14 | 10.27%
Registers | 5,612,859 | 9.98% | 300.29 | 17.71%
Clock network | | | 744.20 | 43.89%
Filler cell | 23,207,862 | 41.26% | |

Figure 14. The layout of Cambricon-ACC, implemented in TSMC 65nm technology (visible regions include the core, vector, matrix, and channel parts).
VI. POTENTIAL EXTENSION TO BROADER TECHNIQUES

Although Cambricon is designed for existing neural network techniques, it can also support future neural network techniques, or even some classic statistical techniques, as long as they can be decomposed into the scalar/vector/matrix instructions of Cambricon. Here we take logistic regression [21] as an example, and illustrate how it can be supported by Cambricon. Technically, logistic regression contains two phases: a training phase and a prediction phase. The training phase employs a gradient descent algorithm similar to the training phase of the MLP technique, which can be supported by Cambricon. In the prediction phase, the output can be computed as y = sigmoid(Σ_{i=0}^{n} θ_i x_i), where x = (x_0, x_1, ..., x_n) is the input vector, x_0 always equals 1, and θ = (θ_0, θ_1, ..., θ_n) is the vector of model parameters. We can leverage the dot product instruction, scalar elementary arithmetic instructions, and the scalar exponential instruction of Cambricon to perform the prediction phase of logistic regression. Moreover, given a batch of n different input vectors, the MMV instruction, vector elementary arithmetic instructions, and the vector exponential instruction in Cambricon collaboratively allow the prediction phases of the n inputs to be computed in parallel.
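A NumPy sketch (ours) of the batched prediction phase just described: stacking n input vectors as rows of a matrix lets one matrix-vector product (the MMV role) plus element-wise vector instructions produce all n predictions at once:

import numpy as np

def predict_batch(X, theta):
    # X: n x (d+1) matrix of inputs, first column all ones (x_0 = 1).
    # theta: (d+1)-dimensional vector of model parameters.
    z = X @ theta                 # role of MMV: n dot products in one shot
    e = np.exp(z)                 # vector exponential (VEXP)
    return e / (1.0 + e)          # vector add-scalar and divide (VAS, VDV)

X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8]])
theta = np.array([0.1, 2.0, -0.7])
print(predict_batch(X, theta))    # one probability per input row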