TensorFlow.js: Machine Learning for the Web and Beyond
acceleration, notably TensorFire (Kwok et al., 2017), Propel (built on top of TensorFlow.js) (Dahl, 2017), and Keras.js (Chen, 2016); however, they are no longer actively maintained.

WebDNN (Hidaka et al., 2017) is another deep learning library in JS that can execute pretrained models developed in TensorFlow, Keras, PyTorch, Chainer and Caffe. To accelerate computation, WebDNN uses WebGPU (Jackson, 2017), a technology initially proposed by Apple. WebGPU is in an early exploratory stage and currently only supported in Safari Technology Preview, an experimental version of the Safari browser. As a fallback for other browsers, WebDNN uses WebAssembly (Haas et al., 2017), which enables execution of compiled C and C++ code directly in the browser. While WebAssembly has support across all major browsers, it lacks SIMD instructions, a crucial component needed to make it as performant as WebGL and WebGPU.

Figure 1. Overview of the TensorFlow.js architecture. [Diagram: the API layer, with the Layers API on top of the Ops API (Eager), sits above the runtime layer, which comprises WebGL in the browser and the TF CPU, TF GPU and TF TPU bindings in Node.js.]
3 DESIGN AND API
The goals of TensorFlow.js differ from other popular ML libraries in a few important ways. Most notably, TensorFlow.js was designed to bring ML to the JS ecosystem, empowering a diverse group of JS developers with limited or no ML experience (Anonymous, 2018). At the same time, we wanted to enable experienced ML users and teaching enthusiasts to easily migrate their work to JS, which necessitated wide functionality and an API that spans multiple levels of abstraction. These two goals are often in conflict, requiring a fine balance between ease-of-use and functionality. Lastly, as a new library with a growing user base, missing functionality was prioritized over performance.

These goals differ from popular deep learning libraries (Abadi et al., 2016; Paszke et al., 2017), where performance is usually the number one goal, as well as other JS ML libraries (see Section 2.3), whose focus is on simplicity over completeness of functionality. For example, a major differentiator of TensorFlow.js is the ability to author and train models directly in JS, rather than simply being an execution environment for models authored in Python.

3.1 Overview

The API of TensorFlow.js is largely modeled after TensorFlow, with a few exceptions that are specific to the JS environment. Like TensorFlow, the core data structure is the Tensor. The TensorFlow.js API provides methods to create tensors from JS arrays, as well as mathematical functions that operate on tensors.

Figure 1 shows a high-level schematic view of the architecture. TensorFlow.js consists of two sets of APIs: the Ops API, which provides lower-level linear algebra operations (e.g. matrix multiplication, tensor addition, etc.), and the Layers API, which provides higher-level model building blocks and best practices with emphasis on neural networks. The Layers API is modeled after the tf.keras namespace in TensorFlow Python, which is based on the widely adopted Keras API (Chollet et al., 2015).

TensorFlow.js is designed to run in-browser and server-side, as shown in Figure 1. When running inside the browser, it utilizes the GPU of the device via WebGL to enable fast parallelized floating point computation. In Node.js, TensorFlow.js binds to the TensorFlow C library, enabling full access to TensorFlow. TensorFlow.js also provides a slower CPU implementation as a fallback (omitted in the figure for simplicity), implemented in plain JS. This fallback can run in any execution environment and is automatically used when the environment has no access to WebGL or the TensorFlow binary.

3.2 Layers API

Beginners and others who are not interested in the operation-level details of their model might find the low-level operations API complex and error prone. The widely adopted Keras library (Chollet et al., 2015), on the other hand, provides higher-level building blocks with emphasis on deep learning. With its carefully thought out API, Keras is popular among deep learning beginners and applied ML practitioners. At the heart of the API is the concept of a model and layers. Users can build a model by assembling a set of pre-defined layers, where each layer has reasonable default parameters to reduce cognitive load.

For these reasons, TensorFlow.js provides the Layers API, which mirrors the Keras API as closely as possible, including the serialization format. This enables a two-way door between Keras and TensorFlow.js: users can load a pretrained Keras model (see Section 5.1) in TensorFlow.js, modify it, serialize it, and load it back in Keras Python. Listing 1 shows an example of training a model using the Layers API.
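The Tensor abstraction described in Section 3.1, a shape plus a TypedArray of data with functional ops, can be illustrated with a small plain-JS sketch. The names (Tensor, tensor2d, add) mirror the public API but the code below is a simplified hypothetical model, not the actual TensorFlow.js implementation:

```javascript
// Simplified sketch of the core Tensor abstraction: an immutable
// shape plus a Float32Array backing store, with functional ops that
// return new tensors. Hypothetical code, not TensorFlow.js internals.
class Tensor {
  constructor(values, shape) {
    this.shape = shape;
    this.size = shape.reduce((a, b) => a * b, 1);
    this.data = Float32Array.from(values);
  }
}

// Create a tensor from a nested JS array, in the spirit of tf.tensor2d().
function tensor2d(values, shape) {
  return new Tensor(values.flat(), shape);
}

// An element-wise op: reads the backing data and returns a brand-new
// Tensor, leaving both inputs untouched.
function add(a, b) {
  const out = new Tensor(a.data, a.shape);
  for (let i = 0; i < out.size; i++) {
    out.data[i] = a.data[i] + b.data[i];
  }
  return out;
}

const a = tensor2d([[1, 2], [3, 4]], [2, 2]);
const b = tensor2d([[10, 20], [30, 40]], [2, 2]);
const c = add(a, b); // c.data holds [11, 22, 33, 44]
```
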
// A linear model with 1 dense layer.
const model = tf.sequential();
model.add(tf.layers.dense({
  units: 1, inputShape: [1]
}));

// Specify the loss and the optimizer.
model.compile({
  loss: 'meanSquaredError',
  optimizer: 'sgd'
});

// Generate synthetic data to train.
const xs = tf.tensor2d([1, 2, 3, 4], [4, 1]);
const ys = tf.tensor2d([1, 3, 5, 7], [4, 1]);

// Train the model using the data.
model.fit(xs, ys).then(() => {
  // Do inference on an unseen data point
  // and print the result.
  const x = tf.tensor2d([5], [1, 1]);
  model.predict(x).print();
});

Listing 1. An example TensorFlow.js program that shows how to build a single-layer linear model with the Layers API, train it with synthetic data, and make a prediction on an unseen data point.

3.3 Operations and Kernels

As in TensorFlow, an operation represents an abstract computation (e.g. matrix multiplication) that is independent of the physical device it runs on. Operations call into kernels, which are device-specific implementations of mathematical functions which we go over in Section 4.

3.4 Backends

To support device-specific kernel implementations, TensorFlow.js has a concept of a backend. A backend implements kernels as well as methods such as read() and write() which are used to store the TypedArray that backs the tensor. Tensors are decoupled from the data that backs them, so that operations like reshape and clone are effectively free. This is achieved by making shallow copies of tensors that point to the same data container (the TypedArray). When a tensor is disposed, we decrease the reference count to the underlying data container and when there are no remaining references, we dispose the data container itself.

3.5 Automatic differentiation

Since wide functionality was one of our primary design goals, TensorFlow.js supports automatic differentiation, providing an API to train a model and to compute gradients. The two most common styles of automatic differentiation are graph-based and eager. Graph-based engines provide an API to construct a computation graph and execute it later. When computing gradients, the engine statically analyzes the graph to create an additional gradient computation graph. This approach is better for performance and lends itself easily to serialization.

Eager differentiation engines, on the other hand, take a different approach (Paszke et al., 2017; Abadi et al., 2016; Maclaurin et al., 2015). In eager mode, the computation happens immediately when an operation is called, making it easier to inspect results by printing or using a debugger. Another benefit is that all the functionality of the host language is available while your model is executing; users can use native if and while loops instead of specialized control flow APIs that are hard to use and produce convoluted stack traces.

Due to these advantages, eager-style differentiation engines, like TensorFlow Eager (Shankar & Dobson, 2017) and PyTorch (Paszke et al., 2017), are rapidly gaining popularity. Since an important part of our design goals is to prioritize ease-of-use over performance, TensorFlow.js supports the eager style of differentiation.

3.6 Asynchronous execution

JS runs in a single thread, shared with tasks like page layout and event handling. This means that long-running JS functions can cause page slowdowns or delays for handling events. To mitigate this issue, JS users rely on event callbacks and promises, essential components of the modern JS language. A prominent example is Node.js, which relies on asynchronous I/O and event-driven programming, allowing the development of high-performance, concurrent programs.

However, callbacks and asynchronous functions can lead to complex code. In service of our design goal to provide intuitive APIs, TensorFlow.js aims to balance the simplicity of synchronous functions with the benefits of asynchronous functions. For example, operations like tf.matMul() are purposefully synchronous and return a tensor whose data might not be computed yet. This allows users to write regular synchronous code that is easy to debug. When the user needs to retrieve the data that is backing a tensor, we provide an asynchronous tensor.data() function which returns a promise that resolves when the operation is finished. Therefore, the use of asynchronous code can be localized to a single data() call. Users also have the option to call tensor.dataSync(), which is a blocking call. Figures 2 and 3 illustrate the timelines in the browser when calling tensor.dataSync() and tensor.data() respectively.
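The decoupling of tensors from their reference-counted data containers described in Section 3.4 can be sketched in plain JS. The class names and fields below are hypothetical illustrations; the real backends manage device buffers through read()/write() methods:

```javascript
// Sketch of tensors as shallow views over a reference-counted data
// container, which is what makes reshape/clone effectively free.
// Hypothetical illustration, not the actual TensorFlow.js backend code.
class DataContainer {
  constructor(values) {
    this.values = Float32Array.from(values);
    this.refCount = 0;
    this.disposed = false;
  }
}

class Tensor {
  constructor(container, shape) {
    this.container = container;
    this.shape = shape;
    container.refCount++;
  }
  // reshape() makes a shallow copy pointing at the same container:
  // no data is moved or copied.
  reshape(newShape) {
    return new Tensor(this.container, newShape);
  }
  // Disposing a tensor decrements the ref count; the underlying
  // buffer is only released once no tensor references it.
  dispose() {
    this.container.refCount--;
    if (this.container.refCount === 0) {
      this.container.disposed = true; // free the underlying buffer
    }
  }
}

const data = new DataContainer([1, 2, 3, 4, 5, 6]);
const t = new Tensor(data, [2, 3]);
const r = t.reshape([3, 2]); // free: same container, new shape
t.dispose(); // container survives, r still references it
r.dispose(); // last reference gone, container is freed
```
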
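The synchronous-op / asynchronous-read pattern of Section 3.6 can be sketched in plain JS: ops enqueue work and return a handle immediately, data() returns a promise, and dataSync() forces the pending work to run. The queue and class below are hypothetical stand-ins for the real scheduler:

```javascript
// Sketch of synchronous ops over an asynchronous execution queue.
// Hypothetical illustration, not the TensorFlow.js scheduler.
const queue = []; // simulated GPU command queue

class LazyTensor {
  constructor() { this.values = null; }
  // Blocking read: flush all pending programs, then return the data.
  dataSync() {
    while (queue.length > 0) queue.shift()();
    return this.values;
  }
  // Non-blocking read: resolve once the pending programs have run.
  data() {
    return Promise.resolve().then(() => this.dataSync());
  }
}

// A "synchronous" op: it enqueues the computation and immediately
// returns a handle whose data might not be computed yet.
function addOp(a, b) {
  const out = new LazyTensor();
  queue.push(() => {
    out.values = a.map((v, i) => v + b[i]);
  });
  return out;
}

const t = addOp([1, 2], [10, 20]);
// t.values is still null here: the computation has not run yet.
t.data().then((values) => console.log(values)); // logs [11, 22]
```
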
Figure 2. The timeline of a synchronous and blocking tensor.dataSync() in the browser. The main thread blocks until the GPU is done executing the operations. [Diagram: the CPU runs matmul, add, relu, then dataSync, blocking while the GPU executes matmul, add, relu and readPixels.]

Figure 3. The timeline of an asynchronous call to data() in the browser. The main thread is released while the GPU is executing the operations and the data() promise resolves when the tensor is ready and downloaded. [Diagram: the CPU runs matmul, add, relu, then data(), and sits idle responding to events and updating the UI while the GPU executes matmul, add, relu and readPixels.]

3.7 Memory management

JS provides automatic garbage collection. However, in the browser WebGL memory is not automatically garbage collected. Because of this, and the lack of finalization, we expose an API for all backends to explicitly manage memory. To dispose the memory allocated by a tensor, users can call tensor.dispose(). This approach is relatively straightforward, but the user has to have a reference to all tensor objects so they can be disposed. Often models are written as chained blocks of operations, so breaking up the chains for disposal can be cumbersome. Since tensors are immutable and operations are functional, a single op call can allocate a significant number of intermediate tensors. Forgetting to dispose these intermediate tensors results in memory leaks and slows down the application significantly.

TensorFlow.js offers an alternative approach. Since functions are first-order citizens in JS, and a large portion of the native JS API uses functions as arguments, we decided to provide a scoping mechanism where the user can wrap any synchronous function f by calling tf.tidy(() => f()). This results in calling f immediately, and disposing all intermediate tensors created inside once f finishes, except for the return result of f. We use this mechanism extensively in our library. Users of the Layers API do not need explicit memory management due to model-level APIs such as model.fit(), model.predict() and model.evaluate() which internally manage memory.

3.8 Debugging and profiling

TensorFlow.js provides a rich set of debugging tools to help developers understand common problems with performance and numerical stability, accessible either via a URL change or a feature flag. Users can profile every kernel that gets called, seeing the output shape, memory footprint, as well as device-specific timing information. In this mode, every tensor gets downloaded from the GPU and is checked for NaNs, throwing an exception at the first line a NaN is introduced, showing model developers which operation is the source of the numerical instability.

TensorFlow.js also provides tf.time(f) for timing a function that calls TensorFlow.js operations. When calling tf.time(f), the function f will be executed and timed. Each backend is responsible for timing functions, as timing may be device specific. For example, the WebGL backend measures the exact GPU time, excluding time for uploading and downloading the data.

A more generic API, tf.profile(f), similarly takes a function f and returns an object representing the function's effect on memory. The object contains the number of newly allocated tensors and bytes created by executing the function, as well as the peak tensors and bytes allocated inside the function. Understanding peak memory usage is especially important when running on devices with limited memory, such as mobile phones.

3.9 Performance

While performance was not the single most important goal, it was critical in enabling real-world ML in JS. In the browser, TensorFlow.js utilizes the GPU using the WebGL API to parallelize computations. By using WebGL for numerical computation, we were able to achieve 2 orders of magnitude speedup, which is what fundamentally enabled running real-world ML models in the browser. On the server side, TensorFlow.js binds directly to the TensorFlow C API, which takes full advantage of native hardware acceleration.

Table 1 shows the speedups of these implementations relative to the plain JS CPU counterpart. We measure a single inference of MobileNet v1 1.0 (Howard et al., 2017) with an input image of size 224x224x3, averaged over 100 runs. All measurements, other than those mentioning GTX 1080, are measured on a MacBook Pro 2014 laptop, while the GTX 1080 measurements are done on a desktop machine. Note that the WebGL and the Node.js CPU backends are two orders of magnitude faster than the plain JS backend, while those utilizing the capable GTX 1080 graphics card are three orders of magnitude faster.

Since the launch of TensorFlow.js, we have made significant progress on improving our WebGL utilization. One notable improvement is packing, where we store floating
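The tf.tidy() scoping mechanism described in Section 3.7 can be sketched in plain JS: track every tensor allocated while f runs, then dispose all intermediates except f's return value. The code below is a hypothetical illustration, not the actual tf.tidy() implementation:

```javascript
// Sketch of a tidy()-style scoping mechanism for disposing
// intermediate tensors. Hypothetical illustration only.
let trackedScope = null;

class Tensor {
  constructor(values) {
    this.values = values;
    this.disposed = false;
    // Register with the enclosing tidy() scope, if any.
    if (trackedScope !== null) trackedScope.push(this);
  }
  dispose() { this.disposed = true; }
}

function tidy(f) {
  const parent = trackedScope; // support nested scopes
  trackedScope = [];
  const result = f(); // call f immediately
  // Dispose every intermediate allocated inside, keeping the result.
  for (const t of trackedScope) {
    if (t !== result) t.dispose();
  }
  trackedScope = parent;
  return result;
}

// Chained ops allocate intermediates the user never names; tidy()
// cleans them up while keeping only the returned tensor alive.
const y = tidy(() => {
  const a = new Tensor([1, 2]);
  const b = new Tensor([3, 4]);
  return new Tensor([a.values[0] + b.values[0],
                     a.values[1] + b.values[1]]);
});
// y survives; a and b were disposed automatically.
```
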
point values in all 4 channels of a texel (instead of using only 1 channel). Packing resulted in a 1.3-1.4x speedup of models such as PoseNet (Oved, 2018) across both mobile and desktop devices.

Backend                    Time (ms)   Speedup
Plain JS                   3426        1x
WebGL (Intel Iris Pro)     49          71x
WebGL (GTX 1080)           5           685x
Node.js CPU w/ AVX2        87          39x
Node.js CUDA (GTX 1080)    3           1105x

Table 1. Speedups of the WebGL and Node.js backends over the plain JS implementation. The time shows a single inference of MobileNet v1 1.0 (Howard et al., 2017), averaged over 100 runs.

While we will continue to work on our WebGL implementation, we observed a 3-10x gap in performance between WebGL and CUDA. We believe the gap to be due to WebGL's lack of work groups and shared memory access, benefits provided by general-purpose computing (GPGPU) frameworks like CUDA (Nickolls et al., 2008) and OpenGL Compute shaders (Shreiner et al., 2013). As we discuss below in Section 4.3, we believe that the upcoming WebGPU (Jackson, 2017) standard is a promising avenue for bridging the gap in performance.

4 IMPLEMENTATION

This section describes the specific constraints and implementations of the various backends that are supported by TensorFlow.js.

4.1 Browser and WebGL

With the advent of deep learning and scientific computing in general, and advances in modern GPU architectures, the use of GPGPU has grown tremendously. While modern JS virtual machines can optimize plain JS extensively, its performance is far below the computational power that GPUs provide (see Table 1). In order to utilize the GPU, TensorFlow.js uses WebGL, a cross-platform web standard providing low-level 3D graphics APIs. Unlike OpenCL and CUDA, the WebGL API is based on the OpenGL ES Specification (Shreiner et al., 2013), which has no explicit support for GPGPU.

Among the three TensorFlow.js backends, the WebGL backend has the highest complexity. This complexity is justified by the fact that it is two orders of magnitude faster than our CPU backend written in plain JS. The realization that WebGL can be re-purposed for numerical computation is what fundamentally enabled running real-world ML models in the browser.

To work around the limitations and the complexities of WebGL, we wrote a layer of abstraction called the GPGPUContext which executes WebGL fragment shaders representing computation. In a graphics program, fragment shaders are typically used to generate the colors for the pixels to be rendered on the screen. Fragment shaders run for each pixel independently and in parallel; TensorFlow.js takes advantage of this parallelization to accelerate ML computation.

In the WebGL backend, the draw pipeline is set up such that the scene geometry represents a unit square. When we execute a fragment shader program, we bind the texture that backs the output tensor to the frame buffer and execute the fragment shader program. This means that the fragment shader main() function is executed in parallel for each output value, as shown in Figure 4. For simplicity, we only use the red channel of the texture that backs the tensor (shown as 'R' in the figure). On WebGL 2.0 devices, we use the gl.R32F texture type which allows us to avoid allocating memory for the green, blue, and alpha channels (shown as 'G', 'B', and 'A' respectively). In future work, TensorFlow.js will take advantage of all channels for WebGL 1.0 devices, which will better utilize the GPU's sampler cache.

void main() {
  ivec2 coords = getOutputCoords();
  float a = getA(coords[0], coords[1]);
  float b = getB(coords[0], coords[1]);
  float result = a + b;
  setOutput(result);
}

Figure 4. The addition of two equally shaped matrices as executed by the WebGL backend, and the GLSL code of the fragment shader that represents the element-wise addition computation. The GLSL function, main(), runs in the context of each output value and in parallel, with no shared memory.

Writing OpenGL Shading Language (GLSL) code can be error prone and difficult. To make it significantly easier to write and debug GPGPU programs, we wrote a shader compiler. The shader compiler provides high-level GLSL functions that the shader author can call. Listing 2 shows the GLSL source code for matrix multiplication where the
shared dimension N is assumed to be a multiple of 4 for simplicity. The functions marked with bolded font are provided by our shader compiler.

void main() {
  ivec2 coords = getOutputCoords();
  int aRow = coords.x;
  int bCol = coords.y;
  float result = 0.0;
  for (int i = 0; i < N; i += 4) {
    vec4 a = vec4(getA(aRow, i), getA(aRow, i + 1),
                  getA(aRow, i + 2), getA(aRow, i + 3));
    vec4 b = vec4(getB(i, bCol), getB(i + 1, bCol),
                  getB(i + 2, bCol), getB(i + 3, bCol));
    result += dot(a, b);
  }
  setOutput(result);
}

Using the higher level functions generated by the shader compiler has multiple benefits. First, the user-defined GLSL code operates in high-dimensional 'logical' space instead of the physical 2D texture space. For example, the GLSL implementation of tf.conv2d() uses the auto-generated getA(batch, row, column, depth) method to sample from a 4D tensor. This makes the user code simpler, more readable and less error-prone.

Second, the separation of logical and physical shape allows the framework to make intelligent decisions about memory layout, avoiding device-specific size limits of WebGL textures.

Third, we can optimize the mapping from high-dimensional space to the 2D space. For example, assume the logical shape of tensor A is 4D with shape 1x3x1x2. When A gets uploaded to the GPU, the backend will allocate a physical 3x2 texture and the compiler will generate a getA(a, b, c, d) method whose implementation ignores a and c and directly maps b and d into the 2D texture space. We observed that this optimization leads to a 1.3x speedup on average.

Last, there is a single GLSL implementation of tf.matMul() regardless of the browser's WebGL capabilities. In Chrome we render to a 32-bit single-channel floating point texture, while in iOS Safari we render to a 16-bit single-channel floating point texture. In both cases, the user code is the same, using the high-level setOutput(value) GLSL method with the browser-specific implementation generated by the compiler.

4.1.1 Asynchronous execution

WebGL programs execute asynchronously with respect to the main JS thread. This means that while programs are running on the GPU, the CPU is free to respond to events and run other JS code.

When the user calls an operation, we enqueue a program onto the GPU command queue, which typically takes sub-millisecond time, and immediately return a handle to the resulting tensor despite the computation not being done. Users can later retrieve the actual data by calling tensor.dataSync() or tensor.data(), which returns a TypedArray.

As mentioned in Section 3.6, we encourage the use of the asynchronous tensor.data() method, which avoids blocking the main thread, and returns a promise that resolves when the computation is done (see Figures 2 and 3). However, to retrieve the underlying data of a texture, the WebGL API only provides a blocking gl.readPixels() method. To get around this limitation, we approximate when the GPU is done executing the operations, postponing the call to gl.readPixels(), which releases the main thread in the meantime.

Approximating when the GPU has finished executing programs can be done in a couple of ways. The first approach, taken in TensorFlow.js for WebGL 1.0 devices, uses the EXT_disjoint_timer_query WebGL extension. This extension can be used to accurately measure the GPU time of programs, but also implicitly has a bit that gets flipped when a program is done executing. The second approach, for WebGL 2.0 devices, uses the gl.fenceSync() API by inserting a fence into the GPU command queue and polling a query which returns true when the fence has flipped.

4.1.2 Memory management

Disposing and re-allocating WebGL textures is relatively expensive, so we don't release memory when a tensor gets disposed. Instead, we mark the texture for reuse. If another tensor gets allocated with the same physical texture shape, we simply recycle the texture. The texture recycler gives us
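The texture-recycling strategy described above can be sketched in plain JS: instead of freeing a texture when its tensor is disposed, keep it in a free list keyed by physical shape and hand it back out when a same-shaped allocation arrives. The class below is a hypothetical illustration, not the actual WebGL backend code:

```javascript
// Sketch of a texture recycler keyed by physical texture shape.
// Hypothetical illustration, not TensorFlow.js internals.
class TextureRecycler {
  constructor() {
    this.free = new Map(); // "rows,cols" -> list of reusable textures
    this.created = 0;      // counts expensive fresh allocations
  }
  acquire(rows, cols) {
    const key = rows + ',' + cols;
    const list = this.free.get(key);
    if (list && list.length > 0) {
      return list.pop(); // recycle a texture of the same shape
    }
    this.created++; // expensive path: allocate a fresh texture
    return { rows, cols, id: this.created };
  }
  release(texture) { // mark for reuse instead of deleting
    const key = texture.rows + ',' + texture.cols;
    if (!this.free.has(key)) this.free.set(key, []);
    this.free.get(key).push(texture);
  }
}

const recycler = new TextureRecycler();
const t1 = recycler.acquire(3, 2);
recycler.release(t1);              // disposed: kept for reuse
const t2 = recycler.acquire(3, 2); // same shape: no new allocation
const t3 = recycler.acquire(4, 4); // new shape: fresh allocation
// recycler.created is 2: t2 reused t1's texture
```

This keyed free-list design is why repeated passes through a model with fixed tensor shapes avoid the texture allocation cost after the first pass.
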