We have implemented a high-level intermediate representation that allows a compiler to reason about and optimize high-level constructs such as tensors and operations.
Glow is a retargetable compiler that supports a number of different backends. This means that the first few phases of the compiler are target-independent, but as you get closer to instruction selection the IR becomes more target-specific. This design is not unique to Glow. Many compilers and virtual machines use similar techniques to gradually canonicalize, optimize and lower programs into instruction streams. The first two levels of IR are shared between compilation targets. Compiler backends may implement additional levels of intermediate representations.
3.2 High-Level IR
The high-level IR is a dataflow node-based graph representation that is similar to the graph that you may find inside Caffe. When we load a neural network model from some file we construct this graph with a direct translation of one operator to one or more nodes. The high-level IR is a simple graph that allows basic transformations such as replacing all uses of some node with another node and modifying the content of variables. The graph is strongly typed, which means that inputs and outputs have a known tensor type (consisting of the tensor's shape and element type), and that the types of nodes are verified by the compiler. For example, the element-wise add instruction must operate on operands of the same type.

Figure 2: A lowered compute graph in Glow's high-level IR, representing the expression "A/B", automatically differentiated by Glow.
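The typing rule above can be made concrete with a small sketch. The following is illustrative only (hypothetical types, not Glow's actual classes): an element-wise add whose verifier rejects operands of mismatched tensor types.

  #include <cassert>
  #include <cstdio>
  #include <vector>

  // Hypothetical mini-IR: a tensor type is an element kind plus a shape.
  enum class ElemKind { Float, Int8, Index };

  struct TensorType {
    ElemKind elemKind;
    std::vector<int> dims;
    bool operator==(const TensorType &o) const {
      return elemKind == o.elemKind && dims == o.dims;
    }
  };

  struct Node {
    TensorType type;              // the result type of this node
    std::vector<Node *> operands; // dataflow inputs
  };

  // Element-wise add: the verifier rejects mismatched operand types,
  // mirroring the "operands of the same type" rule described above.
  Node *createAdd(Node *lhs, Node *rhs) {
    assert(lhs->type == rhs->type && "element-wise add requires equal types");
    return new Node{lhs->type, {lhs, rhs}};
  }

  int main() {
    Node a{{ElemKind::Float, {2, 1}}, {}};
    Node b{{ElemKind::Float, {2, 1}}, {}};
    Node *sum = createAdd(&a, &b); // verifies float<2 x 1> on both sides
    std::printf("add has %zu operands\n", sum->operands.size());
    delete sum;
    return 0;
  }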
Some strongly-typed programming languages represent dynamic types at runtime in a safe way. Swift [19] generics are an example of such a type system that allows compilation for unknown yet constrained types. We have considered the idea of developing some kind of parametric tensor types to support features such as varying batch sizes. However, we have decided to implement a simple strict type system instead and let the high-level machine learning framework specialize the computation before constructing the Glow graph. We evaluated the mechanisms that modern programming languages use to implement generics and concluded that most hardware accelerators do not support some of these mechanisms. Production systems that use Glow may generate multiple Glow graphs for different batch sizes, or recompute the graph just-in-time.

The Glow graph is structured as a module that contains multiple functions that contain multiple nodes. Variables, which are similar to global variables in C programs, are persistent tensors shared between the functions. Nodes inside functions are able to reference variables which are owned by the module. Glow functions contain nodes that represent the different operations of a neural network. The functions own the nodes and have access to the variables in the module. A module may have multiple functions. For example, one module could contain both an inference function and the gradient of that inference function. The gradient function could perform training of the weights variables, and the inference function could read from those same weights variables.

The compiler has a debug method for dumping textual and graphical representations of the graph. Figure 2 depicts the compute graph that represents the expression "A/B". The graph is automatically differentiated by Glow, and the value of variable A is updated with the gradient of the expression. Glow lowers the nodes that compute the gradient of the expression and the stochastic gradient descent (SGD) node into a sequence of low-level operators (Div, Mul, Add and Save). The different compiler backends do not need to implement support for the DivGrad, ReLUGrad or SGD nodes.
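A minimal sketch of the ownership structure just described, with hypothetical types standing in for Glow's actual classes: the module owns variables and functions, and each function owns its nodes.

  #include <memory>
  #include <string>
  #include <vector>

  // Hypothetical sketch of module/function/variable ownership; Glow's
  // real classes differ in detail.
  struct Variable { std::string name; bool isPublic; };
  struct Node { std::string opName; };

  struct Function {
    std::string name;
    std::vector<std::unique_ptr<Node>> nodes; // the function owns its nodes
  };

  struct Module {
    std::vector<std::unique_ptr<Variable>> vars; // module owns the variables
    std::vector<std::unique_ptr<Function>> funcs;

    Function *createFunction(const std::string &name) {
      funcs.push_back(std::make_unique<Function>());
      funcs.back()->name = name;
      return funcs.back().get();
    }
  };

  int main() {
    Module mod;
    // One module can hold both an inference function and its gradient;
    // both may reference the same weight variables owned by the module.
    mod.vars.push_back(std::make_unique<Variable>(Variable{"weights", false}));
    Function *infer = mod.createFunction("infer");
    Function *grad = mod.createFunction("infer_grad");
    (void)infer; (void)grad;
    return 0;
  }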
3.3 Variable Visibility

Glow variables are similar to PyTorch and TensorFlow variables. They are persistent tensors that live across different executions of the neural network. Variables are annotated with Public or Private labels. These labels specify whether the node is visible outside of the graph. If the node is public, then it means that C++ code from outside the graph may access the variable directly and change its content before or after the execution of the program. This means that the optimizer is not allowed to delete unused public variables or change their dimensions. However, in the case of private variables, the optimizer is allowed to delete unused variables, transpose, perform constant propagation, etc.
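As a sketch of how this visibility rule constrains the optimizer (hypothetical types, not Glow's implementation), dead-variable elimination must leave public variables alone:

  #include <cstdio>
  #include <vector>

  // Public variables may be read or written by external C++ code between
  // executions, so only unused *private* variables are removable.
  enum class Visibility { Public, Private };

  struct Variable {
    const char *name;
    Visibility vis;
    int numUsers; // how many nodes reference this variable
  };

  bool canDelete(const Variable &v) {
    return v.vis == Visibility::Private && v.numUsers == 0;
  }

  int main() {
    std::vector<Variable> vars = {
        {"weights", Visibility::Private, 0}, // unused + private: removable
        {"input", Visibility::Public, 0},    // unused but public: kept
    };
    for (const Variable &v : vars)
      std::printf("%s: %s\n", v.name, canDelete(v) ? "delete" : "keep");
    return 0;
  }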
3.4 Predication
Predication is a well-known technique to control the execution of some node or instruction by means of a boolean flag. If the value of the flag at runtime is set to 'false' then the predicated node or instructions may return any value. A correct program should know to ignore the output of the predicated instruction because it could be zeros or uninitialized memory. The type of the flag must be a boolean value or a vector of booleans that matches the batch size. Predicates could accelerate the performance of some networks by avoiding some computation. They can be particularly useful when applied to Recurrent Neural Networks [20], because different elements of the batch may have different lengths and do not need to perform the same amount of computation.
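A sketch of per-batch-element predication as described above (illustrative code, not Glow's implementation):

  #include <cstdio>
  #include <vector>

  // A vector of booleans, one per batch element, gates the computation.
  // Masked-off lanes may hold garbage, so a correct caller ignores them.
  void predicatedRelu(const std::vector<bool> &pred,
                      std::vector<float> &batch) {
    for (size_t i = 0; i < batch.size(); ++i)
      if (pred[i]) // skip the work for masked-off batch elements
        batch[i] = batch[i] > 0.0f ? batch[i] : 0.0f;
  }

  int main() {
    std::vector<float> batch = {-1.0f, -2.0f, 3.0f};
    predicatedRelu({true, false, true}, batch); // batch[1] is unspecified
    std::printf("%f %f\n", batch[0], batch[2]);
    return 0;
  }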
3.5 Node Lowering

The Glow compilation pipeline solves the problem of targeting a large number of opcodes to many different targets. Modern machine learning frameworks support hundreds of operators on many different hardware backends. The approach that is taken by classic machine learning frameworks is to implement each opcode for each hardware target. In such frameworks, ReLU would be implemented once for the GPU, once for the CPU, once for some mobile DSP accelerator, and so on. This approach does not scale as the number of opcodes and the number of hardware targets increase.

By contrast, classic machine learning frameworks that are not able to automatically generate fused kernels (Section 5.2) need to implement hundreds of CUDA and CPU compute kernels that represent the un-lowered operators. This limits their ability to support new kinds of hardware and ties them to one or two major hardware vendors.

Instead, Glow takes a different approach. Instead of compiling the high-level operators directly, Glow performs "node lowering". In this phase, the compiler breaks the high-level operator nodes into low-level linear algebra operator nodes. For example, the FullyConnected layer is represented as a matrix multiplication followed by a broadcasted add. Different compiler backends do not have to implement the FullyConnected layer and a dozen other high-level opcodes, just the low-level matrix multiplication.

This lowering phase drives many of the design decisions of the compiler. In Glow, lowering is performed as part of the high-level graph as described above, prior to moving to the low-level IR (Section 3.6). This is due to a number of reasons. First, the new lowered graph may allow for additional graph-level optimizations. Second, the new graph structure may affect the decisions of the instruction scheduler. And third, after lowering we allow the backends to perform additional target-specific optimizations on the lowered graph.

The lowering phase comes after the graph is differentiated. Because the lowering transformation does not preserve the semantics of the graph, it is not possible to differentiate the graph for certain operators. For example, the Regression node (which produces a gradient when optimizing total squared error) becomes a no-op for the inference case, but is translated into an element-wise subtract for the training case. Performing the lowering before differentiation would prevent us from performing the correct lowering of the Regression node.
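A sketch of the FullyConnected lowering described above (hypothetical node types; Glow's actual node classes differ):

  #include <string>
  #include <vector>

  // One high-level node is rewritten into two low-level linear algebra
  // nodes, so backends only need to implement MatMul and BroadcastAdd.
  struct Node {
    std::string op;
    std::vector<Node *> inputs;
  };

  Node *lowerFullyConnected(Node *fc) {
    // fc->inputs = {activation, weights, bias}
    Node *matmul = new Node{"MatMul", {fc->inputs[0], fc->inputs[1]}};
    Node *add = new Node{"BroadcastAdd", {matmul, fc->inputs[2]}};
    return add; // callers replace all uses of 'fc' with 'add'
  }

  int main() {
    Node in{"Input", {}}, w{"Weights", {}}, b{"Bias", {}};
    Node fc{"FullyConnected", {&in, &w, &b}};
    Node *lowered = lowerFullyConnected(&fc);
    bool ok = lowered->op == "BroadcastAdd" &&
              lowered->inputs[0]->op == "MatMul";
    delete lowered->inputs[0];
    delete lowered;
    return ok ? 0 : 1;
  }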
3.6 Low-Level IR

After optimizing the graph with target-independent optimizations, and lowering from high-level operator nodes to linear algebra operator nodes, the code is further lowered into the low-level IR, in a phase that is called "IRGen" (which stands for IR generation). This is a one-to-many translation where each high-level node is translated into one or more instructions.

The low-level IR enables a different kind of target-independent optimizations that are not possible with the high-level graph format. This is an instruction-based representation that operates on tensors that are referenced by address. This gives the compiler the ability to perform low-level memory optimizations that are not possible at the high level, because memory is not represented directly. An example of such a transformation is the optimization that allows certain operations to transform some buffers in-place, such as element-wise arithmetic.

In the context of hardware acceleration, the low-level instruction-based representation allows the compiler to represent device-specific operations such as asynchronous DMA operations. Hiding the latency of memory operations is important for utilizing the execution units of the hardware effectively, and the instruction-based representation allows the compiler to create a schedule that hides the latency of the memory operations.

The IR is strongly typed and each instruction operand kind has known parameter types. It is designed to be used as an in-memory form, though it can be dumped to a human-readable assembly-like format.

A function in IR form contains two sections: 'declare' and 'program'. In the first section of the IR we declare a number of memory regions that live throughout the lifetime of the program. This is similar to global variables in C. The second part of the IR is a list of instructions. Each variable is annotated with the kind of initialization that the program should do.

There are two kinds of memory regions which correspond to these two sections: global memory regions (found in 'declare') and locally allocated regions (found in 'program'). The locally allocated memory regions are similar to 'alloca' in LLVM IR (see http://llvm.org/docs/langref.html#alloca-instruction). Memory regions are strongly typed, which means that the kind of type of tensor that the region represents is known.
Instructions operate on either global variables or locally allocated buffers. Each operand is annotated with one of the qualifiers @in/@out/@inout. '@in' means that the buffer is read from, '@out' means that the buffer is written into, and '@inout' means that the instruction may read and write into the buffer. These operand qualifiers help the optimizer decide when it is legal to perform certain optimizations, such as copy elimination or buffer sharing. Instructions may have other attributes that specify the legality of some optimizations. For example, some instructions require that the data from the forward pass be kept around for the backward pass, so if the program is not optimized for inference-only mode then certain memory optimizations cannot happen.
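A sketch of how these qualifiers inform an optimization legality check (hypothetical types; Glow's analyses are more involved):

  #include <cstdio>
  #include <vector>

  // The qualifiers tell the optimizer how each instruction touches each
  // buffer, which is what makes transforms like buffer sharing legal.
  enum class Qual { In, Out, InOut };

  struct Operand { int bufferId; Qual qual; };
  struct Instr { const char *name; std::vector<Operand> operands; };

  // A buffer that is only ever read (@in) from 'start' onward is a
  // candidate for sharing with a later @out buffer of the same type.
  bool isReadOnlyAfter(const std::vector<Instr> &prog, size_t start, int buf) {
    for (size_t i = start; i < prog.size(); ++i)
      for (const Operand &op : prog[i].operands)
        if (op.bufferId == buf && op.qual != Qual::In)
          return false;
    return true;
  }

  int main() {
    std::vector<Instr> prog = {
        {"relu", {{1, Qual::Out}, {0, Qual::In}}},
        {"save", {{2, Qual::Out}, {1, Qual::In}}},
    };
    std::printf("buffer 1 read-only after relu: %d\n",
                isReadOnlyAfter(prog, 1, 1));
    return 0;
  }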
Figure 3 shows an example of unoptimized Glow IR. Note that the 'alloc' instruction does not allocate memory; it just marks the lifetime of the activation. The low-level memory allocator is responsible for allocating all of the buffers into a single coalesced region.

  declare {
    %input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
    %filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
    %filter0 = weight float<16>, broadcast, 0.100
    %weights = weight float<10 x 144>, xavier, 144.0
    %bias = weight float<10>, broadcast, 0.100
    %selected = weight index<8 x 1>
    %result = weight float<8 x 10>
  }

  program {
    %allo = alloc float<8 x 28 x 28 x 16>
    %conv = convolution [5 1 2 16] @out %allo, @in %input, @in %filter3, @in %bias
    %allo0 = alloc float<8 x 28 x 28 x 16>
    %relu = max0 @out %allo0, @in %allo
    %allo1 = alloc index<8 x 9 x 16 x 2>
    %allo2 = alloc float<8 x 9 x 16>
    %pool = pool max [3 3 0] @out %allo2, @in %allo0, @inout %allo1
    ...
    %deal = dealloc @out %allo6
    %deal7 = dealloc @out %allo7
    %deal8 = dealloc @out %allo8
    %deal9 = dealloc @out %allo9
  }

Figure 3: Unoptimized low-level Glow IR.

3.7 Summary: The Lifetime of a Glow Instruction

This section summarizes how instructions travel from the beginning of the compilation pipeline, through the different levels of IR, and to the backends. This is a high-level overview of the compilation process (a schematic driver sketch follows the list):

1. The graph is either loaded via the graph loader (from ONNX or Caffe2 format), or constructed via the C++ interface.

2. The graph is differentiated, if needed.

3. The graph is optimized.

4. Linear algebra node lowering takes place.

5. Additional rounds of optimizations occur, both target independent and target specific.

6. The graph is scheduled into a linear sequence of nodes that minimizes memory usage.

7. IRGen converts the low-level graph into instructions.

8. Low-level IR optimizations are performed.

9. Backend-specific optimizations and code generation are performed.
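The schematic driver below restates the nine stages as code. Every name is a hypothetical placeholder, not Glow's actual API:

  struct Graph {};
  struct IRFunction {};

  Graph loadOrBuildGraph() { return {}; }        // 1. loader or C++ interface
  void differentiate(Graph &) {}                 // 2. training only
  void optimizeGraph(Graph &) {}                 // 3. graph-level optimizations
  void lowerNodes(Graph &) {}                    // 4. linear algebra lowering
  void optimizeLowered(Graph &) {}               // 5. more optimization rounds
  void scheduleLinearOrder(Graph &) {}           // 6. minimize memory usage
  IRFunction irgen(const Graph &) { return {}; } // 7. nodes -> instructions
  void optimizeIR(IRFunction &) {}               // 8. low-level IR optimizations
  void backendCodegen(IRFunction &) {}           // 9. target-specific codegen

  int main() {
    Graph g = loadOrBuildGraph();
    bool training = false;
    if (training)
      differentiate(g);
    optimizeGraph(g);
    lowerNodes(g);
    optimizeLowered(g);
    scheduleLinearOrder(g);
    IRFunction f = irgen(g);
    optimizeIR(f);
    backendCodegen(f);
    return 0;
  }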
3.8 ClassGen

Glow uses automatic code generation techniques (class-gen) for defining instructions and nodes. The purpose of the automatic code generation tools in Glow is similar to the motivation behind LLVM's TableGen, which is to help a human develop and maintain records of domain-specific information. The current system is capable of generating two kinds of classes: Nodes for the high-level IR and Instructions for the low-level IR. Figure 4 shows an example of the code for generating the AvgPool instruction. ClassGen generates most of the methods that instructions need to have, such as instruction equality and hashing, cloning, printing, verification, etc.

  BB.newInstr("AvgPool")
      .addOperand("Dest", OperandKind::Out)
      .addOperand("Src", OperandKind::In)
      .addMember(MemberType::SizeT, "Kernel")
      .addMember(MemberType::SizeT, "Stride")
      .addMember(MemberType::SizeT, "Pad")
      .autoIRGen()
      .autoVerify(VerifyKind::SameElementType)
      .addGradientInstr({"Dest"}, {"Dest", "Src"});

Figure 4: Example of the class-gen for the AvgPool instruction.
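For illustration, a record like the one in Figure 4 expands into a C++ class roughly of the following shape. This is a hand-written approximation, not the code ClassGen actually emits:

  #include <cstddef>

  // Illustrative sketch of a generated instruction class: stored members,
  // accessors, and the equality boilerplate that ClassGen produces.
  class AvgPoolInst {
    std::size_t kernel_, stride_, pad_;

  public:
    AvgPoolInst(std::size_t kernel, std::size_t stride, std::size_t pad)
        : kernel_(kernel), stride_(stride), pad_(pad) {}

    std::size_t getKernel() const { return kernel_; }
    std::size_t getStride() const { return stride_; }
    std::size_t getPad() const { return pad_; }

    // Structural equality over the declared members.
    bool isEqual(const AvgPoolInst &o) const {
      return kernel_ == o.kernel_ && stride_ == o.stride_ && pad_ == o.pad_;
    }
  };

  int main() {
    AvgPoolInst a(3, 2, 0), b(3, 2, 0);
    return a.isEqual(b) ? 0 : 1;
  }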
4 Quantization

In the context of machine learning, quantization is the process of converting the neural network from floating-point arithmetic to integer arithmetic. Arithmetic using small integers is more efficient than the computation of full-width floating-point numbers, and additionally decreases memory usage.

Glow is able to convert floating-point-based networks into signed 8-bit integer networks. The canonical quantization representation uses signed integers, though it is possible to support other quantization formats. Glow uses profile-guided quantization, observing execution during inference to estimate the possible numeric range for each stage of the neural network. Training-based quantization is considered future work.
4.1 Tensor Representation

In Glow, tensors are typed and can represent floats, quantized non-floating-point values such as the currently supported Int8 (8-bit signed integers), and index types. A quantized tensor's type is made up of the underlying element type (Int8), as well as the possible range of the values in the tensor, using 'scale' and 'offset' fields. To convert from the 8-bit integer range of [-128..127] to the floating-point numbers that they represent, Glow uses the following conversion formula:

value = (input - offset) * scale

Activations, weights, and variables all use the same type-system and represent information in a uniform way.
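The formula and its inverse can be written directly in code. This is an illustrative implementation of the stated conversion, with a clamp keeping results inside the Int8 range:

  #include <algorithm>
  #include <cmath>
  #include <cstdint>
  #include <cstdio>

  // The conversion formula above: value = (input - offset) * scale.
  float dequantize(int8_t input, float scale, int32_t offset) {
    return (static_cast<int32_t>(input) - offset) * scale;
  }

  // Inverse direction, clamped to the Int8 range of [-128..127].
  int8_t quantize(float value, float scale, int32_t offset) {
    int32_t q = static_cast<int32_t>(std::round(value / scale)) + offset;
    return static_cast<int8_t>(std::min(127, std::max(-128, q)));
  }

  int main() {
    // Example: represent the range [0, 2.55] with scale 0.01, offset -128.
    float scale = 0.01f;
    int32_t offset = -128;
    int8_t q = quantize(1.27f, scale, offset);
    std::printf("1.27 -> %d -> %f\n", q, dequantize(q, scale, offset));
    return 0;
  }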
4.2 Profile-Guided Quantization

Different parts of the network contain floating-point values in different ranges. In some parts, the typical range of the numbers is between zero and one, while in other parts of the network the possible range is in the hundreds. Choosing a single conversion scale for the whole network would not work, because a single scale value could be imprecise for small values and truncate large values.

We use profile-guided information to estimate the possible numeric range for each stage of the neural network. Our quantization conversion works using a two-phase process. First, we statically instrument the network with special profiling nodes that record the ranges of activations that flow in the network, optimize the network including these profiling nodes, and then run inference. Then, we recompile the network using this profile information to convert the network into a quantized form, allowing for static optimization of the quantized graph. We convert portions of the network into islands of integer computation and aim to generate outputs in the range that the original floating-point network produces. Figure 5 shows a quantized subgraph from Resnet50.

Figure 5: A quantized subgraph from Resnet50.
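A sketch of the profiling half of this two-phase process (hypothetical types; Glow's profiling nodes and parameter selection are more elaborate): record the observed range, then derive a scale and offset that map it onto [-128..127].

  #include <algorithm>
  #include <cmath>
  #include <cstdio>
  #include <limits>
  #include <vector>

  // A profiling node records the observed range of an activation during
  // inference; the recompile step later derives scale/offset from it.
  struct RangeProfile {
    float min = std::numeric_limits<float>::max();
    float max = std::numeric_limits<float>::lowest();

    void observe(const std::vector<float> &activations) {
      for (float v : activations) {
        min = std::min(min, v);
        max = std::max(max, v);
      }
    }

    // Map the observed range onto Int8's 256 representable values, so
    // that (-128 - offset) * scale == min and (127 - offset) * scale == max.
    void toScaleOffset(float &scale, int &offset) const {
      scale = (max - min) / 255.0f;
      offset = static_cast<int>(std::lround(-128.0f - min / scale));
    }
  };

  int main() {
    RangeProfile p;
    p.observe({0.0f, 1.3f, 0.7f, 2.54f});
    float scale; int offset;
    p.toScaleOffset(scale, offset);
    std::printf("scale=%f offset=%d\n", scale, offset);
    return 0;
  }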
4.3 Compiler Optimizations for Quantization

Glow features a number of compiler optimizations that transform the compute graph and make it more efficient. There are a few classes of optimizations and parameters to optimize.

First, we attempt to minimize the number of conversions between floating-point tensors and integer tensors, in both directions. Some operations, such as 'transpose', operate on both types, and changing the representation can minimize conversions.

Second, the neural network contains 'rescale' nodes that change the range of the integers. These nodes are required to convert between numeric ranges that mimic the original floating-point network. However, in many cases it is possible to fold the rescale operations into numeric-producing operations, and eliminate them.

Third, it is possible to rescale the values in the network in order to allow fast hardware implementations of the quantized operations. For example, consider the 'max' operation. By converting both sides of the 'max' into the same scale we allow the hardware to perform a simple comparison. By normalizing both sides of the 'max' operation to the same scale we enable this efficient optimization.
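A sketch of this 'max' rescaling (illustrative code): once one operand is requantized into the other's scale and offset, max becomes a plain integer comparison.

  #include <algorithm>
  #include <cmath>
  #include <cstdint>
  #include <cstdio>

  // Bring a value from one quantization parameterization to another.
  int8_t requantize(int8_t v, float fromScale, int32_t fromOffset,
                    float toScale, int32_t toOffset) {
    float real = (v - fromOffset) * fromScale; // dequantize
    int32_t q = static_cast<int32_t>(std::lround(real / toScale)) + toOffset;
    return static_cast<int8_t>(std::min(127, std::max(-128, q)));
  }

  int main() {
    // Two operands with different quantization parameters.
    int8_t a = 40; // scale 0.1, offset 0 -> represents 4.0
    int8_t b = 25; // scale 0.2, offset 0 -> represents 5.0
    // Normalize b into a's scale, then compare raw integers.
    int8_t bInA = requantize(b, 0.2f, 0, 0.1f, 0); // -> 50
    int8_t m = std::max(a, bInA);                  // plain int comparison
    std::printf("max = %d (represents %.1f)\n", m, m * 0.1f);
    return 0;
  }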
5 CPU Backend

This section describes the implementation of the CPU backend. The Glow CPU backend compiles the low-level intermediate representation into an optimized stream of instructions. It uses LLVM to optimize and emit machine code and was tested on x86 and ARM64. The backend can emit a stand-alone object file to disk or execute code in just-in-time mode. The backend emits debug information, which makes it possible to debug Glow in a debugger and place a breakpoint in a specific operator, or to understand the performance of networks using a profiler.
5.1 Standard library

One interesting aspect of the Glow CPU backend is the use of a small target-independent standard library. The CPU backend needs to generate code for machine learning operators such as Convolution and SoftMax. One possibility is to call into some external library such as Eigen. This is easy to do, and many machine learning frameworks use this technique. The disadvantage is that calls into a pre-compiled library are opaque to the compiler and cannot be specialized for the program being compiled. Instead, Glow ships a small standard library of operator implementations compiled into LLVM bitcode. During the compilation process, Glow loads the bitcode from disk and specializes the operator implementations for the specific context. Glow replaces function arguments that represent the dimensions of some tensor or buffer addresses with constants that LLVM can optimize to generate efficient code. The compiler can decide on the kind and level of operator specialization to perform, and trade compile time and binary size for performance.

Most operators are very simple and the LLVM vectorizer [11] is able to generate very efficient code. Notice that by providing the exact tensor dimensions and loop trip count, the vectorizer is able to generate efficient code that does not contain a pre-header legality check and a scalar loop to handle the remainder odd iterations. The convolution and matrix multiplication operations are hand-optimized in C++ using the clang extended OpenCL vector syntax, and LLVM does a good job allocating registers and encoding the instructions, removing the need to use inline assembly.
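A sketch of what this specialization buys, using a generic element-wise kernel (illustrative code, not Glow's standard library): once the dimension argument is pinned to a constant, LLVM sees a constant trip count and can vectorize without a remainder loop.

  #include <cstddef>

  // The library kernel is generic over its dimensions.
  void elementAdd(float *dest, const float *lhs, const float *rhs,
                  std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      dest[i] = lhs[i] + rhs[i];
  }

  // What "specialization" produces conceptually: with n fixed, the loop
  // trip count is a compile-time constant, so no legality check or scalar
  // remainder loop is needed after inlining and unrolling.
  void elementAdd8x10(float *dest, const float *lhs, const float *rhs) {
    elementAdd(dest, lhs, rhs, 8 * 10);
  }

  int main() {
    float a[80], b[80], out[80];
    for (int i = 0; i < 80; ++i) { a[i] = static_cast<float>(i); b[i] = 1.0f; }
    elementAdd8x10(out, a, b);
    return out[79] == 80.0f ? 0 : 1;
  }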
5.2 Operator Stacking

One important optimization that the CPU backend implements is stacking of data-parallel operators. Consider a sequence of operators that operate one element at a time, for example ReLU, Add, Sub. Iterating over a large buffer multiple times is inefficient because it requires the CPU to load the memory multiple times, each time invalidating the whole cache. Instead, Glow stacks operators and performs a few data-parallel operators one after the other on the same memory location. Notice that, as described above, this is not an optimization that LLVM can perform by itself, and it requires a special high-level data structure.

Operator stacking is similar to operator fusion. However, when fusing multiple operators (e.g. Conv and ReLU fused together), all backends that want to support this fused operator must implement a specific kernel for each permutation of operators. In contrast, Glow's stacking automatically creates such kernels: all of the possible permutations of data-parallel nodes are automatically fused into a fast kernel.

The approach of stacking multiple operations has many advantages. First, there is an immediate performance gain for places in the graph where data-parallel operators are placed one on top of the other. Second, backends do not need to implement kernels for all possible permutations of consecutive data-parallel nodes. And lastly, it allows Glow to lower high-level operators knowing that the backend can fuse them and recover the performance.

For example, Glow lowers the SGD (stochastic gradient descent) operator into a sequence of low-level primitives that include addition, subtraction, and multiplication. Lowering the SGD node into low-level primitives simplifies the design of the compiler by reducing the operator-space that the backend needs to handle. Operator stacking can also accelerate computation on GPUs by reducing the kernel launch overhead.
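A sketch of a stacked kernel (illustrative code): the Add and ReLU are applied back-to-back per element, touching the buffer once; compare the fused vaddps/vmaxps loop shown later in Figure 8.

  #include <algorithm>
  #include <cstddef>

  // Instead of one pass for Add and another for ReLU (each streaming the
  // whole buffer through the cache), both data-parallel operators are
  // applied per element in a single pass.
  void addThenRelu(float *dest, const float *a, const float *b,
                   std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      float sum = a[i] + b[i];       // Add
      dest[i] = std::max(sum, 0.0f); // ReLU, same memory location
    }
  }

  int main() {
    float a[4] = {1.f, -2.f, 3.f, -4.f};
    float b[4] = {1.f, 1.f, -5.f, 1.f};
    float out[4];
    addThenRelu(out, a, b, 4);
    return out[1] == 0.0f ? 0 : 1; // -2 + 1 = -1 -> clamped to 0
  }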
5.3 Use Case: Optimizing Resnet50 for the CPU

In this section, we describe the way that Glow optimizes Resnet50 to generate an efficient stream of x86 instructions. Resnet50 is a residual convolutional neural network that contains 54 convolutions as well as other operators such as element-wise addition, ReLU, batch normalization, max and average pooling, FullyConnected, and SoftMax. Glow optimizes Resnet50 by performing both high-level and low-level optimizations.

First, high-level transformations eliminate redundant transpose operations and merge the batch normalization operation with a convolution node. Next, the CPU backend transforms the graph into a target-specific graph that allows device-specific optimization. The CPU backend identifies three kinds of convolutions: convolutions with a small number of channels, convolutions where the size of the input activation buffer is large, and convolutions where the filter weight buffer is large. Each one of these convolutions requires a different compilation strategy. Next, the target-specific optimizer mutates the graph and generates code that matches the selected convolution; each convolution kind uses a different filter memory layout and tile size. Figure 6 depicts the transformed filter memory layout:

  Filter layout before transformation:
  [depth, filter_x, filter_y, channel]

  Filter layout after transformation:
  [depth/N, filter_x, filter_y, channel, N]

Figure 6: Transformation of a convolution filter's memory layout to optimize for SIMD memory accesses. Depth refers to the output depth of the filter, and channel refers to the input channel.

This 5-dimensional tensor layout allows for consecutive SIMD memory access. The N parameter is selected based on the iteration order and the blocking strategy for the convolution. The CPU backend traverses the graph and replaces any convolutions it would like to optimize in this way with this specialized convolution. This can be seen in Figure 7.

Figure 7: A subgraph from Resnet50 optimized for the CPU backend. The CPUConvDKKC8 node has had its memory layout modified for efficient SIMD access (Figure 6). Note that CPUMaxSplat is also a CPU-backend-specific node that performs a Max operation with some scalar input.
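A sketch of the index arithmetic implied by Figure 6 (illustrative code; N stands in for the target-dependent block size): splitting the output depth into blocks of N makes the innermost index contiguous in memory, giving consecutive SIMD loads.

  #include <cstddef>

  constexpr std::size_t N = 8; // block size chosen per target SIMD width

  // Row-major index into [depth/N][filter_x][filter_y][channel][N].
  std::size_t blockedIndex(std::size_t d, std::size_t fx, std::size_t fy,
                           std::size_t c, std::size_t FX, std::size_t FY,
                           std::size_t C) {
    return ((((d / N) * FX + fx) * FY + fy) * C + c) * N + (d % N);
  }

  int main() {
    // Filter of depth 64, 3x3 kernel, 16 channels: d=9, fx=1, fy=2, c=5.
    std::size_t idx = blockedIndex(9, 1, 2, 5, 3, 3, 16);
    return idx < (64 / N) * 3 * 3 * 16 * N ? 0 : 1;
  }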
The second parameter that the compiler controls is the size of the convolution tile. Glow selects a processing tile that depends on the size of the first-level cache of the processor.

Next, the low-level optimizer optimizes the instruction stream by shrinking the lifetime of memory allocations for the activations, and then performs static memory allocation for the whole network into a single buffer. This reduces the mutable memory footprint of the network. From this point in the compilation pipeline the compiled code can refer to pointers in memory.

Finally, the compiler performs efficient code generation for the non-convolution parts of the network. For example, Figure 8 depicts the generated assembly for some part of the network. The compiler fused two unrelated element-wise operations into a single loop. The Add and Max operations are performed on the same memory buffer without reading the memory twice.
  .LBB1_41:
    vmovaps 3211264(%rcx,%rax,4), %ymm1
    vmovaps 3211296(%rcx,%rax,4), %ymm2
    vmovaps 3211328(%rcx,%rax,4), %ymm3
    vaddps 6422528(%rcx,%rax,4), %ymm1, %ymm1
    vaddps 6422560(%rcx,%rax,4), %ymm2, %ymm2
    vmovaps 3211360(%rcx,%rax,4), %ymm4
    vaddps 6422592(%rcx,%rax,4), %ymm3, %ymm3
    vaddps 6422624(%rcx,%rax,4), %ymm4, %ymm4
    vmaxps %ymm0, %ymm1, %ymm1
    vmaxps %ymm0, %ymm2, %ymm2
    vmaxps %ymm0, %ymm3, %ymm3
    vmovaps %ymm1, 6422528(%rcx,%rax,4)
    vmovaps %ymm2, 6422560(%rcx,%rax,4)
    vmaxps %ymm0, %ymm4, %ymm1
    vmovaps %ymm3, 6422592(%rcx,%rax,4)
    vmovaps %ymm1, 6422624(%rcx,%rax,4)
    addq $32, %rax

Figure 8: A loop with a fused element-wise addition and ReLU (max) operation.
6 Evaluation

We compare the performance of Glow and TensorFlow-1.7 on three popular convolutional neural networks, listed in Figure 9. The benchmarks were executed on a Kaby Lake Intel Core i7-7567U (which does not support AVX-512), running on a single CPU core. Both TensorFlow and Glow were compiled to support the native architecture. TensorFlow was compiled with XLA enabled. We used the Keras library [21] to supply and run pre-trained models for TensorFlow. Our benchmarks used a batch size of 8. Performance (in frames per second) did not depend on the batch size, i.e. total execution time scaled linearly with batch size.

Figure 9: Glow vs. TensorFlow-1.7 on an Intel Core i7-7567U; frames per second on a single core.

As seen in Figure 9, Glow is up to 2.5x faster than TensorFlow. This is due to the fact that TensorFlow calls into Eigen, which implements convolution using the classic im2col followed by matrix multiplication, while Glow compiles direct convolution (Section 5.3) and thus avoids the im2col overhead. In addition, Glow performs shape-aware code generation.
7 Conclusion

This paper presented the design of Glow, a machine learning compiler for heterogeneous hardware. Glow lowers the compute graph of neural networks to multi-level strongly-typed intermediate representations, enabling analyses and optimizations appropriate for each level, to efficiently and scalably target many backends. We hope our efforts will enable research in the area of machine learning acceleration.

8 Acknowledgements

In addition to the core Glow team, many fellow people at Facebook have made contributions to the project, including Andrew Adams, Michel Aoun, Sarah Bird, Evan Cheng, Soumith Chintala, Chris Dewan, Utku Diril, Marat Dukhan, Dmytro Dzhulgakov, Peter Goldsborough, Kim Hazelwood, Yangqing Jia, Daya S Khudia, Howard Mansell, Erik Meijer, Maxim Naumov, Pieter Noordhuis, Joe Pamer, Lin Qiao, Vijay Rao, Martin Schatz, Alexander Sidorov, Andrew Tulloch, Nicolas Vasilache, Adam Weis, and Hector Yuen. We also would like to thank Eli Bendersky, Chris Leary, Richard Wei, and Tianqi Chen for the development and release of their work to the open source community.
References

[1] John L. Hennessy and David A. Patterson. Computer Architecture, Sixth Edition: A Quantitative Approach, chapter 7.2. 6th edition, 2017.

[2] XLA: Domain-specific compiler for linear algebra to optimize TensorFlow computations. https://www.tensorflow.org/performance/xla

[3] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[4] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 675-678, New York, NY, USA, 2014. ACM.

[5] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 265-283, Berkeley, CA, USA, 2016. USENIX Association.

[6] ONNX. https://onnx.ai/

[7] Frank Seide and Amit Agarwal. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 2135-2135, New York, NY, USA, 2016. ACM.

[8] Eigen. http://eigen.tuxfamily.org

[9] NVIDIA cuDNN. https://developer.nvidia.com/cudnn

[10] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '04, pages 75-, Washington, DC, USA, 2004. IEEE Computer Society.

[11] Nadav Rotem and Arnold Schwaighofer. Vectorization in LLVM. 2013 LLVM Developers' Meeting. https://llvm.org/devmtg/2013-11/#talk10

[12] University of Washington Paul G. Allen School of Computer Science & Engineering, Amazon Web Service AI team, and DMLC open-source community. NNVM Compiler: Open Compiler for AI Frameworks. http://tvmlang.org/2017/10/06/nnvm-compiler-announcement.html

[13] T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy. TVM: End-to-End Optimization Stack for Deep Learning. ArXiv e-prints, February 2018.

[14] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519-530, New York, NY, USA, 2013. ACM.

[15] R. Wei, L. Schwartz, and V. Adve. DLVM: A modern compiler infrastructure for deep learning systems. ArXiv e-prints, November 2017.

[16] S. Cyphers, A. K. Bansal, A. Bhiwandiwalla, J. Bobba, M. Brookhart, A. Chakraborty, W. Constable, C. Convey, L. Cook, O. Kanawi, R. Kimball, J. Knight, N. Korovaiko, V. Kumar, Y. Lao, C. R. Lishka, J. Menon, J. Myers, S. Aswath Narayana, A. Procter, and T. J. Webb. Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning. ArXiv e-prints, January 2018.

[17] Intel MKL. https://software.intel.com/en-us/mkl

[18] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. ArXiv e-prints, February 2018.

[19] Swift. https://developer.apple.com/swift/

[20] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org

[21] Keras: The Python Deep Learning library. https://keras.io/