We have implemented a high-level intermediate representation that allows a compiler to reason about and optimize high-level constructs such as tensors and operations.
Glow is a retargetable compiler that supports a number of different backends. This means that the first few phases of the compiler are target-independent, but as you get closer to instruction selection the IR becomes more target-specific. This design is not unique to Glow. Many compilers and virtual machines use similar techniques to gradually canonicalize, optimize and lower programs into instruction streams. The first two levels of IR are shared between compilation targets. Compiler backends may implement additional levels of intermediate representations.
3.2 High-Level IR
The high-level IR is a dataflow node-based graph representation that is similar to the graph that you may find inside Caffe. When we load a neural network model from some file we construct this graph with a direct translation of one operator to one or more nodes. The high-level IR is a simple graph that allows basic transformations such as replacing all uses of some node with another node and modifying the content of variables. The graph is strongly typed, which means that inputs and outputs have a known tensor type (consisting of the tensor's shape and element type), and that the types of nodes are verified by the compiler. For example, the element-wise add instruction must operate on operands of the same type.

Figure 2: A lowered compute graph in Glow's high-level IR, representing the expression "A/B", automatically differentiated by Glow.
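The typing rule above can be made concrete with a small sketch. The following is illustrative only (hypothetical types, not Glow's actual classes): an element-wise add whose verifier rejects operands of mismatched tensor types.

  #include <cassert>
  #include <cstdio>
  #include <vector>

  // Hypothetical mini-IR: a tensor type is an element kind plus a shape.
  enum class ElemKind { Float, Int8, Index };

  struct TensorType {
    ElemKind elemKind;
    std::vector<int> dims;
    bool operator==(const TensorType &o) const {
      return elemKind == o.elemKind && dims == o.dims;
    }
  };

  struct Node {
    TensorType type;              // the result type of this node
    std::vector<Node *> operands; // dataflow inputs
  };

  // Element-wise add: the verifier rejects mismatched operand types,
  // mirroring the "operands of the same type" rule described above.
  Node *createAdd(Node *lhs, Node *rhs) {
    assert(lhs->type == rhs->type && "element-wise add requires equal types");
    return new Node{lhs->type, {lhs, rhs}};
  }

  int main() {
    Node a{{ElemKind::Float, {2, 1}}, {}};
    Node b{{ElemKind::Float, {2, 1}}, {}};
    Node *sum = createAdd(&a, &b); // verifies float<2 x 1> on both sides
    std::printf("add has %zu operands\n", sum->operands.size());
    delete sum;
    return 0;
  }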
Some strongly-typed programming languages represent dynamic types at runtime in a safe way. Swift [19] generics are an example of such a type system that allows compilation for unknown yet constrained types. We have considered the idea of developing some kind of parametric tensor types to support features such as varying batch sizes. However, we have decided to implement a simple strict type system instead and let the high-level machine learning framework specialize the computation before constructing the Glow graph. We evaluated the mechanisms that modern programming languages use to implement generics and concluded that most hardware accelerators do not support some of these mechanisms. Production systems that use Glow may generate multiple Glow graphs for different batch sizes, or recompute the graph just-in-time.

The Glow graph is structured as a module that contains multiple functions that contain multiple nodes. Variables, which are similar to global variables in C programs, are persistent tensors shared between the functions. Nodes inside functions are able to reference variables which are owned by the module. Glow functions contain nodes that represent the different operations of a neural network. The functions own the nodes and have access to the variables in the module. A module may have multiple functions. For example, one module could contain both an inference function and the gradient of that inference function. The gradient function could perform training of the weights variables, and the inference function could read from those same weights variables.

The compiler has a debug method for dumping textual and graphical representations of the graph. Figure 2 depicts the compute graph that represents the expression "A/B". The graph is automatically differentiated by Glow, and the value of variable A is updated with the gradient of the expression. Glow lowers the nodes that compute the gradient of the expression and the stochastic gradient descent (SGD) node into a sequence of low-level operators (Div, Mul, Add and Save). The different compiler backends do not need to implement support for the DivGrad, ReLUGrad or SGD nodes.
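A minimal sketch of the ownership structure just described, with hypothetical types standing in for Glow's actual classes: the module owns variables and functions, and each function owns its nodes.

  #include <memory>
  #include <string>
  #include <vector>

  // Hypothetical sketch of module/function/variable ownership; Glow's
  // real classes differ in detail.
  struct Variable { std::string name; bool isPublic; };
  struct Node { std::string opName; };

  struct Function {
    std::string name;
    std::vector<std::unique_ptr<Node>> nodes; // the function owns its nodes
  };

  struct Module {
    std::vector<std::unique_ptr<Variable>> vars; // module owns the variables
    std::vector<std::unique_ptr<Function>> funcs;

    Function *createFunction(const std::string &name) {
      funcs.push_back(std::make_unique<Function>());
      funcs.back()->name = name;
      return funcs.back().get();
    }
  };

  int main() {
    Module mod;
    // One module can hold both an inference function and its gradient;
    // both may reference the same weight variables owned by the module.
    mod.vars.push_back(std::make_unique<Variable>(Variable{"weights", false}));
    Function *infer = mod.createFunction("infer");
    Function *grad = mod.createFunction("infer_grad");
    (void)infer; (void)grad;
    return 0;
  }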
3.3 Variable Visibility

Glow variables are similar to PyTorch and TensorFlow variables. They are persistent tensors that live across different executions of the neural network. Variables are annotated with Public or Private labels. These labels specify whether the node is visible outside of the graph. If the node is public, then it means that C++ code from outside the graph may access the variable directly and change its content before or after the execution of the program. This means that the optimizer is not allowed to delete unused public variables or change their dimensions. However, in the case of private variables, the optimizer is allowed to delete unused variables, transpose, perform constant propagation, etc.
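As a sketch of how this visibility rule constrains the optimizer (hypothetical types, not Glow's implementation), dead-variable elimination must leave public variables alone:

  #include <cstdio>
  #include <vector>

  // Public variables may be read or written by external C++ code between
  // executions, so only unused *private* variables are removable.
  enum class Visibility { Public, Private };

  struct Variable {
    const char *name;
    Visibility vis;
    int numUsers; // how many nodes reference this variable
  };

  bool canDelete(const Variable &v) {
    return v.vis == Visibility::Private && v.numUsers == 0;
  }

  int main() {
    std::vector<Variable> vars = {
        {"weights", Visibility::Private, 0}, // unused + private: removable
        {"input", Visibility::Public, 0},    // unused but public: kept
    };
    for (const Variable &v : vars)
      std::printf("%s: %s\n", v.name, canDelete(v) ? "delete" : "keep");
    return 0;
  }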
3.4 Predication
Predication is a well-known technique to control the execution of some node or instruction by means of a boolean flag. If the value of the flag at runtime is set to 'false' then the predicated node or instructions may return any value. A correct program should know to ignore the output of the predicated instruction because it could be zeros or uninitialized memory. The type of the flag must be a boolean value or a vector of booleans that matches the batch size. Predicates could accelerate the performance of some networks by avoiding some computation. They can be particularly useful when applied to Recurrent Neural Networks [20], because different elements of the batch may have different lengths and do not need to perform the same amount of computation.
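A sketch of per-batch-element predication as described above (illustrative code, not Glow's implementation):

  #include <cstdio>
  #include <vector>

  // A vector of booleans, one per batch element, gates the computation.
  // Masked-off lanes may hold garbage, so a correct caller ignores them.
  void predicatedRelu(const std::vector<bool> &pred,
                      std::vector<float> &batch) {
    for (size_t i = 0; i < batch.size(); ++i)
      if (pred[i]) // skip the work for masked-off batch elements
        batch[i] = batch[i] > 0.0f ? batch[i] : 0.0f;
  }

  int main() {
    std::vector<float> batch = {-1.0f, -2.0f, 3.0f};
    predicatedRelu({true, false, true}, batch); // batch[1] is unspecified
    std::printf("%f %f\n", batch[0], batch[2]);
    return 0;
  }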
3.5 Node Lowering

The Glow compilation pipeline solves the problem of targeting a large number of opcodes to many different targets. Modern machine learning frameworks support hundreds of operators on many different hardware backends. The approach that is taken by classic machine learning frameworks is to implement each opcode for each hardware target. In such frameworks, ReLU would be implemented once for the GPU, once for the CPU, once for some mobile DSP accelerator, and so on. This approach does not scale as the number of opcodes and the number of hardware targets increase.

By contrast, classic machine learning frameworks that are not able to automatically generate fused kernels (Section 5.2) need to implement hundreds of CUDA and CPU compute kernels that represent the un-lowered operators. This limits their ability to support new kinds of hardware and ties them to one or two major hardware vendors.

Instead, Glow takes a different approach. Instead of compiling the high-level operators directly, Glow performs "node lowering". In this phase, the compiler breaks the high-level operator nodes into low-level linear algebra operator nodes. For example, the FullyConnected layer is represented as a matrix multiplication followed by a broadcasted add. Different compiler backends do not have to implement the FullyConnected layer and a dozen other high-level opcodes, just the low-level matrix multiplication.

This lowering phase drives many of the design decisions of the compiler. In Glow, lowering is performed as part of the high-level graph as described above, prior to moving to the low-level IR (Section 3.6). This is due to a number of reasons. First, the new lowered graph may allow for additional graph-level optimizations. Second, the new graph structure may affect the decisions of the instruction scheduler. And third, after lowering we allow the backends to perform additional target-specific optimizations on the lowered graph.

The lowering phase comes after the graph is differentiated. Because the lowering transformation does not preserve the semantics of the graph, it is not possible to differentiate the graph for certain operators. For example, the Regression node (which produces a gradient when optimizing total squared error) becomes a no-op for the inference case, but is translated into an element-wise subtract for the training case. Performing the lowering before differentiation would prevent us from performing the correct lowering of the Regression node.
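A sketch of the FullyConnected lowering described above (hypothetical node types; Glow's actual node classes differ):

  #include <string>
  #include <vector>

  // One high-level node is rewritten into two low-level linear algebra
  // nodes, so backends only need to implement MatMul and BroadcastAdd.
  struct Node {
    std::string op;
    std::vector<Node *> inputs;
  };

  Node *lowerFullyConnected(Node *fc) {
    // fc->inputs = {activation, weights, bias}
    Node *matmul = new Node{"MatMul", {fc->inputs[0], fc->inputs[1]}};
    Node *add = new Node{"BroadcastAdd", {matmul, fc->inputs[2]}};
    return add; // callers replace all uses of 'fc' with 'add'
  }

  int main() {
    Node in{"Input", {}}, w{"Weights", {}}, b{"Bias", {}};
    Node fc{"FullyConnected", {&in, &w, &b}};
    Node *lowered = lowerFullyConnected(&fc);
    bool ok = lowered->op == "BroadcastAdd" &&
              lowered->inputs[0]->op == "MatMul";
    delete lowered->inputs[0];
    delete lowered;
    return ok ? 0 : 1;
  }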
3.6 Low-Level IR

After optimizing the graph with target-independent optimizations, and lowering from high-level operator nodes to linear algebra operator nodes, the code is further lowered into the low-level IR, in a phase that is called "IRGen" (which stands for IR generation). This is a one-to-many translation where each high-level node is translated into one or more instructions.

The low-level IR enables a different kind of target-independent optimizations that are not possible with the high-level graph format. This is an instruction-based representation that operates on tensors that are referenced by address. This gives the compiler the ability to perform low-level memory optimizations that are not possible at the high level, because memory is not represented directly. An example of such a transformation is the optimization that allows certain operations to transform some buffers in-place, such as element-wise arithmetic.

In the context of hardware acceleration, the low-level instruction-based representation allows the compiler to represent device-specific operations such as asynchronous DMA operations. Hiding the latency of memory operations is important for utilizing the execution units of the hardware effectively, and the instruction-based representation allows the compiler to create a schedule that hides the latency of the memory operations.

The IR is strongly typed and each instruction operand kind has known parameter types. It is designed to be used as an in-memory form, though it can be dumped to a human-readable assembly-like format.

A function in IR form contains two sections: 'declare' and 'program'. In the first section of the IR we declare a number of memory regions that live throughout the lifetime of the program. This is similar to global variables in C. The second part of the IR is a list of instructions. Each variable is annotated with the kind of initialization that the program should do.

There are two kinds of memory regions which correspond to these two sections: global memory regions (found in 'declare') and locally allocated regions (found in 'program'). The locally allocated memory regions are similar to 'alloca' in LLVM IR (see http://llvm.org/docs/langref.html#alloca-instruction). Memory regions are strongly typed, which means that the kind of type of tensor that the region represents is known.
Instructions operate on either global variables or locally allocated buffers. Each operand is annotated with one of the qualifiers @in/@out/@inout. '@in' means that the buffer is read from, '@out' means that the buffer is written into, and '@inout' means that the instruction may read and write into the buffer. These operand qualifiers help the optimizer decide when it is legal to perform certain optimizations, such as copy elimination or buffer sharing. Instructions may have other attributes that specify the legality of some optimizations. For example, some instructions require that the data from the forward pass be kept around for the backward pass, so if the program is not optimized for inference-only mode then certain memory optimizations cannot happen.
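A sketch of how these qualifiers inform an optimization legality check (hypothetical types; Glow's analyses are more involved):

  #include <cstdio>
  #include <vector>

  // The qualifiers tell the optimizer how each instruction touches each
  // buffer, which is what makes transforms like buffer sharing legal.
  enum class Qual { In, Out, InOut };

  struct Operand { int bufferId; Qual qual; };
  struct Instr { const char *name; std::vector<Operand> operands; };

  // A buffer that is only ever read (@in) from 'start' onward is a
  // candidate for sharing with a later @out buffer of the same type.
  bool isReadOnlyAfter(const std::vector<Instr> &prog, size_t start, int buf) {
    for (size_t i = start; i < prog.size(); ++i)
      for (const Operand &op : prog[i].operands)
        if (op.bufferId == buf && op.qual != Qual::In)
          return false;
    return true;
  }

  int main() {
    std::vector<Instr> prog = {
        {"relu", {{1, Qual::Out}, {0, Qual::In}}},
        {"save", {{2, Qual::Out}, {1, Qual::In}}},
    };
    std::printf("buffer 1 read-only after relu: %d\n",
                isReadOnlyAfter(prog, 1, 1));
    return 0;
  }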
Figure 3 shows an example of unoptimized Glow IR. Note that the 'alloc' instruction does not allocate memory; it just marks the lifetime of the activation. The low-level memory allocator is responsible for allocating all of the buffers into a single coalesced region.

  declare {
    %input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
    %filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
    %filter0 = weight float<16>, broadcast, 0.100
    %weights = weight float<10 x 144>, xavier, 144.0
    %bias = weight float<10>, broadcast, 0.100
    %selected = weight index<8 x 1>
    %result = weight float<8 x 10>
  }

  program {
    %allo = alloc float<8 x 28 x 28 x 16>
    %conv = convolution [5 1 2 16] @out %allo, @in %input, @in %filter3, @in %bias
    %allo0 = alloc float<8 x 28 x 28 x 16>
    %relu = max0 @out %allo0, @in %allo
    %allo1 = alloc index<8 x 9 x 16 x 2>
    %allo2 = alloc float<8 x 9 x 16>
    %pool = pool max [3 3 0] @out %allo2, @in %allo0, @inout %allo1
    ...
    %deal = dealloc @out %allo6
    %deal7 = dealloc @out %allo7
    %deal8 = dealloc @out %allo8
    %deal9 = dealloc @out %allo9
  }

Figure 3: Unoptimized low-level Glow IR.

3.7 Summary: The Lifetime of a Glow Instruction

This section summarizes how instructions travel from the beginning of the compilation pipeline, through the different levels of IR, and to the backends. This is a high-level overview of the compilation process (a schematic driver sketch follows the list):

1. The graph is either loaded via the graph loader (from ONNX or Caffe2 format), or constructed via the C++ interface.

2. The graph is differentiated, if needed.

3. The graph is optimized.

4. Linear algebra node lowering takes place.

5. Additional rounds of optimizations occur, both target independent and target specific.

6. The graph is scheduled into a linear sequence of nodes that minimizes memory usage.

7. IRGen converts the low-level graph into instructions.

8. Low-level IR optimizations are performed.

9. Backend-specific optimizations and code generation are performed.
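The schematic driver below restates the nine stages as code. Every name is a hypothetical placeholder, not Glow's actual API:

  struct Graph {};
  struct IRFunction {};

  Graph loadOrBuildGraph() { return {}; }        // 1. loader or C++ interface
  void differentiate(Graph &) {}                 // 2. training only
  void optimizeGraph(Graph &) {}                 // 3. graph-level optimizations
  void lowerNodes(Graph &) {}                    // 4. linear algebra lowering
  void optimizeLowered(Graph &) {}               // 5. more optimization rounds
  void scheduleLinearOrder(Graph &) {}           // 6. minimize memory usage
  IRFunction irgen(const Graph &) { return {}; } // 7. nodes -> instructions
  void optimizeIR(IRFunction &) {}               // 8. low-level IR optimizations
  void backendCodegen(IRFunction &) {}           // 9. target-specific codegen

  int main() {
    Graph g = loadOrBuildGraph();
    bool training = false;
    if (training)
      differentiate(g);
    optimizeGraph(g);
    lowerNodes(g);
    optimizeLowered(g);
    scheduleLinearOrder(g);
    IRFunction f = irgen(g);
    optimizeIR(f);
    backendCodegen(f);
    return 0;
  }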
3.8 ClassGen

Glow uses automatic code generation techniques (class-gen) for defining instructions and nodes. The purpose of the automatic code generation tools in Glow is similar to the motivation behind LLVM's TableGen, which is to help a human develop and maintain records of domain-specific information. The current system is capable of generating two kinds of classes: Nodes for the high-level IR and Instructions for the low-level IR. Figure 4 shows an example of the code for generating the AvgPool instruction. ClassGen generates most of the methods that instructions need to have, such as instruction equality and hashing, cloning, printing, verification, etc.

  BB.newInstr("AvgPool")
      .addOperand("Dest", OperandKind::Out)
      .addOperand("Src", OperandKind::In)
      .addMember(MemberType::SizeT, "Kernel")
      .addMember(MemberType::SizeT, "Stride")
      .addMember(MemberType::SizeT, "Pad")
      .autoIRGen()
      .autoVerify(VerifyKind::SameElementType)
      .addGradientInstr({"Dest"}, {"Dest", "Src"});

Figure 4: Example of the class-gen for the AvgPool instruction.
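For illustration, a record like the one in Figure 4 expands into a C++ class roughly of the following shape. This is a hand-written approximation, not the code ClassGen actually emits:

  #include <cstddef>

  // Illustrative sketch of a generated instruction class: stored members,
  // accessors, and the equality boilerplate that ClassGen produces.
  class AvgPoolInst {
    std::size_t kernel_, stride_, pad_;

  public:
    AvgPoolInst(std::size_t kernel, std::size_t stride, std::size_t pad)
        : kernel_(kernel), stride_(stride), pad_(pad) {}

    std::size_t getKernel() const { return kernel_; }
    std::size_t getStride() const { return stride_; }
    std::size_t getPad() const { return pad_; }

    // Structural equality over the declared members.
    bool isEqual(const AvgPoolInst &o) const {
      return kernel_ == o.kernel_ && stride_ == o.stride_ && pad_ == o.pad_;
    }
  };

  int main() {
    AvgPoolInst a(3, 2, 0), b(3, 2, 0);
    return a.isEqual(b) ? 0 : 1;
  }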
4 Quantization

In the context of machine learning, quantization is the process of converting the neural network from floating-point arithmetic to integer arithmetic. Arithmetic using small integers is more efficient than the computation of full-width floating-point numbers, and additionally decreases memory usage.

Glow is able to convert floating-point-based networks into signed 8-bit integer networks. The canonical quantization representation uses signed integers, though it is possible to support other quantization formats. Glow uses profile-guided quantization, observing execution during inference to estimate the possible numeric range for each stage of the neural network. Training-based quantization is considered future work.
4.1 Tensor Representation

In Glow, tensors are typed and can represent floats, quantized non-floating-point values such as the currently supported Int8 (8-bit signed integers), and index types. A quantized tensor's type is made up of the underlying element type (Int8), as well as the possible range of the values in the tensor, using 'scale' and 'offset' fields. To convert from the 8-bit integer range of [-128..127] to the floating-point numbers that they represent, Glow uses the following conversion formula:

value = (input - offset) * scale

Activations, weights, and variables all use the same type-system and represent information in a uniform way.
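The formula and its inverse can be written directly in code. This is an illustrative implementation of the stated conversion, with a clamp keeping results inside the Int8 range:

  #include <algorithm>
  #include <cmath>
  #include <cstdint>
  #include <cstdio>

  // The conversion formula above: value = (input - offset) * scale.
  float dequantize(int8_t input, float scale, int32_t offset) {
    return (static_cast<int32_t>(input) - offset) * scale;
  }

  // Inverse direction, clamped to the Int8 range of [-128..127].
  int8_t quantize(float value, float scale, int32_t offset) {
    int32_t q = static_cast<int32_t>(std::round(value / scale)) + offset;
    return static_cast<int8_t>(std::min(127, std::max(-128, q)));
  }

  int main() {
    // Example: represent the range [0, 2.55] with scale 0.01, offset -128.
    float scale = 0.01f;
    int32_t offset = -128;
    int8_t q = quantize(1.27f, scale, offset);
    std::printf("1.27 -> %d -> %f\n", q, dequantize(q, scale, offset));
    return 0;
  }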
4.2 Profile-Guided Quantization

Different parts of the network contain floating-point values in different ranges. In some parts, the typical range of the numbers is between zero and one, while in other parts of the network the possible range is in the hundreds. Choosing a single conversion scale for the whole network would not work, because a single scale value could be imprecise for small values and truncate large values.

We use profile-guided information to estimate the possible numeric range for each stage of the neural network. Our quantization conversion works using a two-phase process. First, we statically instrument the network with special profiling nodes that record the ranges of activations that flow in the network, optimize the network including these profiling nodes, and then run inference. Then, we recompile the network using this profile information to convert the network into a quantized form, allowing for static optimization of the quantized graph. We convert portions of the network into islands of integer computation and aim to generate outputs in the range that the original floating-point network produces. Figure 5 shows a quantized subgraph from Resnet50.

Figure 5: A quantized subgraph from Resnet50.
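A sketch of the profiling half of this two-phase process (hypothetical types; Glow's profiling nodes and parameter selection are more elaborate): record the observed range, then derive a scale and offset that map it onto [-128..127].

  #include <algorithm>
  #include <cmath>
  #include <cstdio>
  #include <limits>
  #include <vector>

  // A profiling node records the observed range of an activation during
  // inference; the recompile step later derives scale/offset from it.
  struct RangeProfile {
    float min = std::numeric_limits<float>::max();
    float max = std::numeric_limits<float>::lowest();

    void observe(const std::vector<float> &activations) {
      for (float v : activations) {
        min = std::min(min, v);
        max = std::max(max, v);
      }
    }

    // Map the observed range onto Int8's 256 representable values, so
    // that (-128 - offset) * scale == min and (127 - offset) * scale == max.
    void toScaleOffset(float &scale, int &offset) const {
      scale = (max - min) / 255.0f;
      offset = static_cast<int>(std::lround(-128.0f - min / scale));
    }
  };

  int main() {
    RangeProfile p;
    p.observe({0.0f, 1.3f, 0.7f, 2.54f});
    float scale; int offset;
    p.toScaleOffset(scale, offset);
    std::printf("scale=%f offset=%d\n", scale, offset);
    return 0;
  }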
4.3 Compiler Optimizations for Quantization

Glow features a number of compiler optimizations that transform the compute graph and make it more efficient. There are a few classes of optimizations and parameters to optimize.

First, we attempt to minimize the number of conversions between floating-point tensors and integer tensors, in both directions. Some operations, such as 'transpose', operate on both types, and changing the representation can minimize conversions.

Second, the neural network contains 'rescale' nodes that change the range of the integers. These nodes are required to convert between numeric ranges that mimic the original floating-point network. However, in many cases it is possible to fold the rescale operations into numeric-producing operations, and eliminate them.

Third, it is possible to rescale the values in the network in order to allow fast hardware implementations of the quantized operations. For example, consider the 'max' operation. By converting both sides of the 'max' into the same scale we allow the hardware to perform a simple comparison. By normalizing both sides of the 'max' operation to the same scale we enable this efficient optimization.
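A sketch of this 'max' rescaling (illustrative code): once one operand is requantized into the other's scale and offset, max becomes a plain integer comparison.

  #include <algorithm>
  #include <cmath>
  #include <cstdint>
  #include <cstdio>

  // Bring a value from one quantization parameterization to another.
  int8_t requantize(int8_t v, float fromScale, int32_t fromOffset,
                    float toScale, int32_t toOffset) {
    float real = (v - fromOffset) * fromScale; // dequantize
    int32_t q = static_cast<int32_t>(std::lround(real / toScale)) + toOffset;
    return static_cast<int8_t>(std::min(127, std::max(-128, q)));
  }

  int main() {
    // Two operands with different quantization parameters.
    int8_t a = 40; // scale 0.1, offset 0 -> represents 4.0
    int8_t b = 25; // scale 0.2, offset 0 -> represents 5.0
    // Normalize b into a's scale, then compare raw integers.
    int8_t bInA = requantize(b, 0.2f, 0, 0.1f, 0); // -> 50
    int8_t m = std::max(a, bInA);                  // plain int comparison
    std::printf("max = %d (represents %.1f)\n", m, m * 0.1f);
    return 0;
  }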
5 CPU Backend

This section describes the implementation of the CPU backend. The Glow CPU backend compiles the low-level intermediate representation into an optimized stream of instructions. It uses LLVM to optimize and emit machine code and was tested on x86 and ARM64. The backend can emit a stand-alone object file to disk or execute code in just-in-time mode. The backend emits debug information, which makes it possible to debug Glow in a debugger and place a breakpoint in a specific operator, or to understand the performance of networks using a profiler.
5.1 Standard library

One interesting aspect of the Glow CPU backend is the use of a small target-independent standard library. The CPU backend needs to generate code for machine learning operators such as Convolution and SoftMax. One possibility is to call into some external library such as Eigen. This is easy to do, and many machine learning frameworks use this technique. The disadvantage is that calls into a pre-compiled library are opaque to the compiler and cannot be specialized for the program being compiled. Instead, Glow ships a small standard library of operator implementations compiled into LLVM bitcode. During the compilation process, Glow loads the bitcode from disk and specializes the operator implementations for the specific context. Glow replaces function arguments that represent the dimensions of some tensor or buffer addresses with constants that LLVM can optimize to generate efficient code. The compiler can decide on the kind and level of operator specialization to perform, and trade compile time and binary size for performance.

Most operators are very simple and the LLVM vectorizer [11] is able to generate very efficient code. Notice that by providing the exact tensor dimensions and loop trip count, the vectorizer is able to generate efficient code that does not contain a pre-header legality check and a scalar loop to handle the remainder odd iterations. The convolution and matrix multiplication operations are hand-optimized in C++ using the clang extended OpenCL vector syntax, and LLVM does a good job allocating registers and encoding the instructions, removing the need to use inline assembly.
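A sketch of what this specialization buys, using a generic element-wise kernel (illustrative code, not Glow's standard library): once the dimension argument is pinned to a constant, LLVM sees a constant trip count and can vectorize without a remainder loop.

  #include <cstddef>

  // The library kernel is generic over its dimensions.
  void elementAdd(float *dest, const float *lhs, const float *rhs,
                  std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      dest[i] = lhs[i] + rhs[i];
  }

  // What "specialization" produces conceptually: with n fixed, the loop
  // trip count is a compile-time constant, so no legality check or scalar
  // remainder loop is needed after inlining and unrolling.
  void elementAdd8x10(float *dest, const float *lhs, const float *rhs) {
    elementAdd(dest, lhs, rhs, 8 * 10);
  }

  int main() {
    float a[80], b[80], out[80];
    for (int i = 0; i < 80; ++i) { a[i] = static_cast<float>(i); b[i] = 1.0f; }
    elementAdd8x10(out, a, b);
    return out[79] == 80.0f ? 0 : 1;
  }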
5.2 Operator Stacking

One important optimization that the CPU backend implements is stacking of data-parallel operators. Consider a sequence of operators that operate one element at a time, for example ReLU, Add, Sub. Iterating over a large buffer multiple times is inefficient because it requires the CPU to load the memory multiple times, each time invalidating the whole cache. Instead, Glow stacks operators and performs a few data-parallel operators one after the other on the same memory location. Notice that, as described above, this is not an optimization that LLVM can perform by itself, and it requires a special high-level data structure.

Operator stacking is similar to operator fusion. However, when fusing multiple operators (e.g. Conv and ReLU fused together), all backends that want to support this fused operator must implement a specific kernel for each permutation of operators. In contrast, Glow's stacking automatically creates such kernels: all of the possible permutations of data-parallel nodes are automatically fused into a fast kernel.

The approach of stacking multiple operations has many advantages. First, there is an immediate performance gain for places in the graph where data-parallel operators are placed one on top of the other. Second, backends do not need to implement kernels for all possible permutations of consecutive data-parallel nodes. And lastly, it allows Glow to lower high-level operators knowing that the backend can fuse them and recover the performance.

For example, Glow lowers the SGD (stochastic gradient descent) operator into a sequence of low-level primitives that include addition, subtraction, and multiplication. Lowering the SGD node into low-level primitives simplifies the design of the compiler by reducing the operator-space that the backend needs to handle. Operator stacking can also accelerate computation on GPUs by reducing the kernel launch overhead.
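A sketch of a stacked kernel (illustrative code): the Add and ReLU are applied back-to-back per element, touching the buffer once; compare the fused vaddps/vmaxps loop shown later in Figure 8.

  #include <algorithm>
  #include <cstddef>

  // Instead of one pass for Add and another for ReLU (each streaming the
  // whole buffer through the cache), both data-parallel operators are
  // applied per element in a single pass.
  void addThenRelu(float *dest, const float *a, const float *b,
                   std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      float sum = a[i] + b[i];       // Add
      dest[i] = std::max(sum, 0.0f); // ReLU, same memory location
    }
  }

  int main() {
    float a[4] = {1.f, -2.f, 3.f, -4.f};
    float b[4] = {1.f, 1.f, -5.f, 1.f};
    float out[4];
    addThenRelu(out, a, b, 4);
    return out[1] == 0.0f ? 0 : 1; // -2 + 1 = -1 -> clamped to 0
  }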
5.3 Use Case: Optimizing Resnet50 for the CPU

In this section, we describe the way that Glow optimizes Resnet50 to generate an efficient stream of x86 instructions. Resnet50 is a residual convolutional neural network that contains 54 convolutions as well as other operators such as element-wise addition, ReLU, batch normalization, max and average pooling, FullyConnected, and SoftMax. Glow optimizes Resnet50 by performing both high-level and low-level optimizations.

First, high-level transformations eliminate redundant transpose operations and merge the batch normalization operation with a convolution node. Next, the CPU backend transforms the graph into a target-specific graph that allows device-specific optimization. The CPU backend identifies three kinds of convolutions: convolutions with a small number of channels, convolutions where the size of the input activation buffer is large, and convolutions where the filter weight buffer is large. Each one of these convolutions requires a different compilation strategy. Next, the target-specific optimizer mutates the graph and generates code that matches the selected convolution; each convolution kind uses a different filter memory layout and tile size. Figure 6 depicts the transformed filter memory layout:

  Filter layout before transformation:
  [depth, filter_x, filter_y, channel]

  Filter layout after transformation:
  [depth/N, filter_x, filter_y, channel, N]

Figure 6: Transformation of a convolution filter's memory layout to optimize for SIMD memory accesses. Depth refers to the output depth of the filter, and channel refers to the input channel.

This 5-dimensional tensor layout allows for consecutive SIMD memory access. The N parameter is selected based on the iteration order and the blocking strategy for the convolution. The CPU backend traverses the graph and replaces any convolutions it would like to optimize in this way with this specialized convolution. This can be seen in Figure 7.

Figure 7: A subgraph from Resnet50 optimized for the CPU backend. The CPUConvDKKC8 node has had its memory layout modified for efficient SIMD access (Figure 6). Note that CPUMaxSplat is also a CPU-backend-specific node that performs a Max operation with some scalar input.
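A sketch of the index arithmetic implied by Figure 6 (illustrative code; N stands in for the target-dependent block size): splitting the output depth into blocks of N makes the innermost index contiguous in memory, giving consecutive SIMD loads.

  #include <cstddef>

  constexpr std::size_t N = 8; // block size chosen per target SIMD width

  // Row-major index into [depth/N][filter_x][filter_y][channel][N].
  std::size_t blockedIndex(std::size_t d, std::size_t fx, std::size_t fy,
                           std::size_t c, std::size_t FX, std::size_t FY,
                           std::size_t C) {
    return ((((d / N) * FX + fx) * FY + fy) * C + c) * N + (d % N);
  }

  int main() {
    // Filter of depth 64, 3x3 kernel, 16 channels: d=9, fx=1, fy=2, c=5.
    std::size_t idx = blockedIndex(9, 1, 2, 5, 3, 3, 16);
    return idx < (64 / N) * 3 * 3 * 16 * N ? 0 : 1;
  }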
The second parameter that the compiler controls is the size of the convolution tile. Glow selects a processing tile that depends on the size of the first-level cache of the processor.

Next, the low-level optimizer optimizes the instruction stream by shrinking the lifetime of memory allocations for the activations, and then performs static memory allocation for the whole network into a single buffer. This reduces the mutable memory footprint of the network. From this point in the compilation pipeline the compiled code can refer to pointers in memory.

Finally, the compiler performs efficient code generation for the non-convolution parts of the network. For example, Figure 8 depicts the generated assembly for some part of the network. The compiler fused two unrelated element-wise operations into a single loop. The Add and Max operations are performed on the same memory buffer without reading the memory twice.
  .LBB1_41:
    vmovaps 3211264(%rcx,%rax,4), %ymm1
    vmovaps 3211296(%rcx,%rax,4), %ymm2
    vmovaps 3211328(%rcx,%rax,4), %ymm3
    vaddps 6422528(%rcx,%rax,4), %ymm1, %ymm1
    vaddps 6422560(%rcx,%rax,4), %ymm2, %ymm2
    vmovaps 3211360(%rcx,%rax,4), %ymm4
    vaddps 6422592(%rcx,%rax,4), %ymm3, %ymm3
    vaddps 6422624(%rcx,%rax,4), %ymm4, %ymm4
    vmaxps %ymm0, %ymm1, %ymm1
    vmaxps %ymm0, %ymm2, %ymm2
    vmaxps %ymm0, %ymm3, %ymm3
    vmovaps %ymm1, 6422528(%rcx,%rax,4)
    vmovaps %ymm2, 6422560(%rcx,%rax,4)
    vmaxps %ymm0, %ymm4, %ymm1
    vmovaps %ymm3, 6422592(%rcx,%rax,4)
    vmovaps %ymm1, 6422624(%rcx,%rax,4)
    addq $32, %rax

Figure 8: A loop with a fused element-wise addition and ReLU (max) operation.
6 Evaluation

We compare the performance of Glow and TensorFlow-1.7 on three popular convolutional neural networks, listed in Figure 9. The benchmarks were executed on a Kaby Lake Intel Core i7-7567U (which does not support AVX-512), running on a single CPU core. Both TensorFlow and Glow were compiled to support the native architecture. TensorFlow was compiled with XLA enabled. We used the Keras library [21] to supply and run pre-trained models for TensorFlow. Our benchmarks used a batch size of 8. Performance (in frames per second) did not depend on the batch size, i.e. total execution time scaled linearly with batch size.

Figure 9: Glow vs. TensorFlow-1.7 on an Intel Core i7-7567U; frames per second on a single core.

As seen in Figure 9, Glow is up to 2.5x faster than TensorFlow. This is due to the fact that TensorFlow calls into Eigen, which implements convolution using the classic im2col followed by matrix multiplication, while Glow compiles direct convolution (Section 5.3) and thus avoids the im2col overhead. In addition, Glow performs shape-aware code generation.
7 Conclusion

This paper presented the design of Glow, a machine learning compiler for heterogeneous hardware. Glow lowers the compute graph of neural networks to multi-level strongly-typed intermediate representations, enabling analyses and optimizations appropriate for each level, to efficiently and scalably target many backends. We hope our efforts will enable research in the area of machine learning acceleration.

8 Acknowledgements

In addition to the core Glow team, many fellow people at Facebook have made contributions to the project, including Andrew Adams, Michel Aoun, Sarah Bird, Evan Cheng, Soumith Chintala, Chris Dewan, Utku Diril, Marat Dukhan, Dmytro Dzhulgakov, Peter Goldsborough, Kim Hazelwood, Yangqing Jia, Daya S Khudia, Howard Mansell, Erik Meijer, Maxim Naumov, Pieter Noordhuis, Joe Pamer, Lin Qiao, Vijay Rao, Martin Schatz, Alexander Sidorov, Andrew Tulloch, Nicolas Vasilache, Adam Weis, and Hector Yuen. We also would like to thank Eli Bendersky, Chris Leary, Richard Wei, and Tianqi Chen for the development and release of their work to the open source community.
References

[1] John L. Hennessy and David A. Patterson. Computer Architecture, Sixth Edition: A Quantitative Approach, chapter 7.2. 6th edition, 2017.

[2] XLA: Domain-specific compiler for linear algebra to optimize TensorFlow computations. https://www.tensorflow.org/performance/xla

[3] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[4] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 675-678, New York, NY, USA, 2014. ACM.

[5] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 265-283, Berkeley, CA, USA, 2016. USENIX Association.

[6] ONNX. https://onnx.ai/

[7] Frank Seide and Amit Agarwal. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 2135-2135, New York, NY, USA, 2016. ACM.

[8] Eigen. http://eigen.tuxfamily.org

[9] NVIDIA cuDNN. https://developer.nvidia.com/cudnn

[10] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '04, pages 75-, Washington, DC, USA, 2004. IEEE Computer Society.

[11] Nadav Rotem and Arnold Schwaighofer. Vectorization in LLVM. 2013 LLVM Developers' Meeting. https://llvm.org/devmtg/2013-11/#talk10

[12] University of Washington Paul G. Allen School of Computer Science & Engineering, Amazon Web Service AI team, and DMLC open-source community. NNVM Compiler: Open Compiler for AI Frameworks. http://tvmlang.org/2017/10/06/nnvm-compiler-announcement.html

[13] T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy. TVM: End-to-End Optimization Stack for Deep Learning. ArXiv e-prints, February 2018.

[14] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519-530, New York, NY, USA, 2013. ACM.

[15] R. Wei, L. Schwartz, and V. Adve. DLVM: A modern compiler infrastructure for deep learning systems. ArXiv e-prints, November 2017.

[16] S. Cyphers, A. K. Bansal, A. Bhiwandiwalla, J. Bobba, M. Brookhart, A. Chakraborty, W. Constable, C. Convey, L. Cook, O. Kanawi, R. Kimball, J. Knight, N. Korovaiko, V. Kumar, Y. Lao, C. R. Lishka, J. Menon, J. Myers, S. Aswath Narayana, A. Procter, and T. J. Webb. Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning. ArXiv e-prints, January 2018.

[17] Intel MKL. https://software.intel.com/en-us/mkl

[18] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. ArXiv e-prints, February 2018.

[19] Swift. https://developer.apple.com/swift/

[20] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org

[21] Keras: The Python Deep Learning library. https://keras.io/