文件名称:
LISTEN ATTEND AND SPELL A NEURAL NETWORK FOR SPEECH RECOGNITION.pdf
开发工具:
文件大小: 632kb
下载次数: 0
上传时间: 2019-07-13
详细说明:语音识别LAS结构where d and y, are MLP networks. After training, the a; distribution Table 1: WER comparison on the clean and noisy Google voice
is typically very sharp and focuses on only a few frames of h; ci car
search task. The CLDNN-hMM system is the state-of-the-art, the
be seen as a continuous bay of weighted Teatures of h. Figure 1
Listen, Attend and Spell (LAS)models are decoded with a beam
shows the las archilecture
size of 32. Language Model (LM)rescoring can be beneficial
Model
Clean Wer Noisy WER
2. 3. Learning
CLDNN-HMM[22]8.0
8.9
We train the parameters of our model to maximize the log probability
LAS
14.1
16.5
of the correct sequences. Specifically
LAS LM Rescoring103
12.0
0=max∑logP(mx,的;)
(12)
t, however we note that the cldnn uses unidirectional lstms
where vi-1 is the ground
previous character or a charac
and would certainly benefit from the use of a blstm architecture
ter randomly sampled(w
probability) from the model,i.e
Additionally the las model does not use convolutional filters which
Character Distribution(si-1, Ci-l)using the procedure from [20]. have been reported to yield 5-7% WEr relative improvement [221
For the listen function we used 3 layers of 512 pBLSTM nodes
2.4. Decoding and rescorin
( e, 256 nodes per direction) on top of a BLSTM that operates on
the input. This reduced the time resolution by 8=2 times.The
During inference we want to find the most likely character sequence Spell function used a two layer LSTM with 512 nodes each.The
given the input acoustics
weights were initialized with a uniform distribution Z(0.1,0.1
Asynchronous Stochastic Grader
y=arg max log P(yx
(13)
training our model [23]. A learning rate of 0. 2 was used with a ge-
ometric decay of 0.98 per 3M utterances(i.e 1/20-th of an epoch)
We use a simple left-to-right beam search similar to 8]. We can also We used the DistBelief framework [23] with 32 replicas, each with
apply language models trained on large external text corpora alone, a minibatch of 32 utterances. In order to further speed up train
similar to conventional speech systems [21]. We simply rescore our ing, the sequences were grouped into buckets based on their frame
beams with the language model. We find that our model has a small length [8]. The model was trained until the results on the validation
bias for shorter utterances so we normalize our probabilities by the set stopped improving, taking approximately two weeks. The model
number of characters y c in the hypothesis and combine it with a was decoded using N-best list decoding with beam size of N-32
language model probability plm(y
s(yx)、0gP(yx)
4. RESULTS AND DISCUSSION
Alog Plm(y)
(14)
We achieved 14.1%o Wer on the clean test set and 16.5%C wer on
where A is our language model weight and can be determined by a the noisy test set without any dictionary or language model. We
held-out validation set
found that constraining the beam search with a dictionary had no
ER. Rescoring the top 32 beams with the same n-
3. EXPERIMENTS
gram language model that was used by the ClDnn system using
a language model weight of A= 0.008 improved the results for
We used a dataset with three million Google Voice Search utterances the clean and noisy test sets to 10.3 %o and 12.0% respectively. Note
(representing 2000 hours of data) for our experiments. Approxi- that for convenience, we did not decode with a language model, but
mately 10 hours of utterances were randomly selected as a held-out rather only rescored the top 32 beams. It is possible that further
validation set. Data augmentation was performed using a room sim- gains could have been achieved by using the language model during
ulator,adding different types of noise and reverberations; the noise decoding. Table I summarizes the Wer results
sources were obtained from You tube and environmental recordings
The content-based attention mechanism creates an explicit
of daily events [22]. This increased the amount of audio data by alignment between the characters and audio signal. We can visual
20 times with a Snr between 5dB and 30dB [22]. We used 40- ize the attention mechanism by recording the attention distribution
dimensional log-mel filter bank features computed every 10ms as the on the acoustic sequence at every character output timestep. Fig-
acoustic inputs to the listener. A separate set of 22K utterances repre
ure 2 visualizes the attention alignment between the characters and
senting approximately 16 hours of data were used as the test data. a the filterbanks for the utterance "how much would a woodchuck
noisy test set was also created using the same corruption strategy that chuck". For this particular utterance, the model learnt a monotonic
was applied to the training data. All training sets are anony mized and distribution without any location priors. The words"woodchuck?"
hand-transcribed, and are representative of Google's speech traffic
and"chuck" have acoustic similarities. the attention mechanism was
The text was normalized by converting all characters to lower slightly confused when emitting "woodchuck "with a dilution in the
case English alphanumerics (including digits). The punctuations: distribution. The attention model was also able to identify the start
space, comma, period and apostrophe were kept, while all other to- and end of the utterance properly
kens were converted to the unknown (unk) token. As mentioned
We observed that LAS can learn multiple spelling variants given
earlier, all utterances were padded with the start of- sentence(sos) the same acoustics. Table 2 shows top beams for the utterance that
and the end-of-sentence(cos tokens
includes "triple a. As can be seen, the model produces both"triple
The state-of-the-art model on this dataset is a CLDNN-HMM a"and"aaa "within the top four beams. The decoder is able to gener
system that was described in [22]. The Cldnn system achieves ate such varied parses, because the next step prediction model makes
a WER of 8.0% on the clean test set and 8.9% on the noisy test no assumptions on the probability distribution by using the chain rule
4962
Alignment between the Characters and audio
CTC has also been applied to end-to-end training with phoneme
题
targets and n-gram language models using FSTs in [26, 27, 6]. How
ever, unlike the methods above, these methods use pronunciation
dictionaries and language models within Fsts. End-lo-end training
here implies training of the acoustic nodels with fixed dictionaries
and language nodels, instead of training Models that recognize char-
≤ space
acer sequences directly. In this respect these Models are end-Lo-end
trained systeIns, rather than end-Lo-end nodels
space
While ctc has shown tremendous promise in end-LO-end
peech recognition, it is limited by the assumptions of indepen-
dence between frames-the output at one frame has no influence
e
at the outputs at the other frames- much like the unary potential
of Conditional Random Fields. The only way to ameliorate this
problem is through the use of a strong language model [2]
The model proposed here is based on the sequence-to-sequence
architecture [8, 10] and does not suffer from the above shortcomin
LAS models the output sequence given the input sequence using the
hain rule decomposition, starting at the first character As such this
model makes no assumptions about the probability distribution and
is only limited by the capacity of the recurrent neural network in
modeling such a complicated distribution. Further, this single model
Fig. 2: Alignments between character outputs and audio signal pro- encompasses all aspects of a speech recognition system -the acous-
duced by the Listen, Attend and Spell (las) model for the utterance tic, pronunciation and language models are all encoded within its pa-
how much would a woodchuck chuck". The content based allen- rameters. We argue that this makes it not only an end-to-end trained
Lion mechanisIn was able Lo idenLify the starl position in the audio system, but an end-to-end model. This makes it a very powerful
sequence for the first character correctly. The alignment produced is model for end-to-end speech recognition. Future work is likely to
generally monotonic without a need for any location based priors
plore how to use increasingly more complicated models for im-
proved performance over what was achieved in this paper. Further,
Table 2: Example 1: triple a"vs "aaa" spelling variants
these models are likely to benefit from even larger datasets since the
decoder is able to overfit the small number of transcripts
Beam Text
logp wer
The model described in [14] is the closest to our model, with
some slight differences. We use a pyramidal encoder while they use
Truth call dad roadside assislance
call aaa roadside assistance
0570.00
an encoder in which the higher layers subsample the hidden states of
call triple a roadside assistance
1.54 50.00/ the layers below. In addition they use an FST to incorporate a lan-
all trip way roadside assistance
-3.5050.00
guage model, while we use language model rescoring and a length
4
call xxx roadside assistance
dependent language model blending(see section 2. 4). We note that
4.44 25.00 these two works were performed concurrently and independently
decomposition. It would be difficult to produce such differing tran
6. CONCLUSIONS
scripts using CTC due to the conditional independence assumptions, We have presented Listen, Attend and Spell (LAS), a neural speech
where the distribution of the output y/i at time i is conditionally inde
recognizer that can transcribe acoustic signals to characters directl.
pendent of distribution yi+1 at time i+ 1. Conventional DNN-HMM
systems would require both spellings to be in the pronunciation dic
without using any of the traditional components of a speech recogni-
tionary to generate both transcriptions
tion system, such as HMMs, language models and pronunciation dic-
tionaries. We submit that it is not only an end-to-end trained system,
but an end-to-end model. Las accomplishes this goal by making no
5. RELATED WORK
conditional independence assumptions about the output sequence uS-
ing the sequence-to-sequence framework. This distinguishes it from
There has recently been an explosion in methods for end-to-end models like CTC, DNN-HMM and other models that can be trained
trained speech models because of their inherent simplicity compared end-to-end but make various conditional independence assumptions
to current speech recognition systems [2, 24, 6, 25, 14]. However to accomplish this. We showed how this model learns an implicit
these methods have inherent shortcomings that our model attempts language model that can generate multiple spelling variants given the
to address. here we describe in more detail the relationship between same acoustics. We also showed how an external language model,
our work and prior approaches
trained on additional text, can be used to re-rank the top hypotheses
nitially. [2] showed that CTC could perform end-to-end speech We demonstrated that such an end-to-end model can be trained and
recognition on WSJ, going straight from audio to character se
be competitive with state-of-the-art CLDNN-HMM systems. We are
quences. [24, 25] subsequently showed strong results with CTc optimistic that this approach will pave the way to new neural speech
on larger datasets and Switchboard. However it was noted in [24, recognizers that are simpler to train and achieve even better accura-
2] that good accuracy could only be achieved through the use of a cies than the best current speech recognition systems
trong language model during beam search decoding; the language
models use are themselves fixed and trained independently of the
I We note that we used three million utterances for training but that is d
CtC objective
very small corpus for an RNn language model
4963
7. REFERENCES
[15 S Hochreiter and J Schmidhuber, " Long Short-Term Mem
ory, Neural computati
l.9,no.8,pp.1735-1780.
[1] A Graves, A Mohamed, and G. Hinton, ""Speech Recogni-
1997.
tion with Deep Recurrent Neural Networks, in IEEE Inter-
national Conference on Acoustics, Speech and Signal Pro
[16] A. Graves, N. Jaitly, and A. Mohamed, " Hybrid Speech
cessing, 2013
Recognition with Bidirectional LSTM, in Automatic Speech
Recognition and Understanding Workshop, 2013
[2] A. Graves and N. Jaitly, Towards End-to-End Speech
Recognition with Recurrent Neural Networks, in Inter. [17] S. Hihi and Y Bengio, "Hierarchical Recurrent Neural Net-
R
national Conference on Machine learning, 2014
works for Long- Term Dependencies, "in Neural Information
Processing Systems, 1996
[3 G. Hinton, L Deng, D. Yu, G. Dahl, A. Mohamed N. Jaitly
A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B.
[8 J. Koutnik, K. Greff, F. Gomez, and j. Schmidhuber, "A
Clockwork rnn, in International conference on Machine
Kingsbury, Deep Neural Networks for Acoustic Modelin
in Speech Recognition: The Shared Views of Four Research
Learning, 2014
Groups, IEEE Signal Processing Magazine, Nov 2012
[19] Y Le Cun, L Botou, Y. Bengio, and P. Hafner, "Gradient-
[4 K. Vesely, A Ghoshal, L. Burgel, and D. Povey, " Sequence
based learning applied to document recognilion, Proceed
discriminative training of deep neural networks, in INTER-
ings of the IEEE, vul. 86, pp. 2278-2324, 11 Nov. 1998
SPEECH. 2013
[20] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Sched
5 H Sak, O. Vinyals, G. Heigold, A Senior, E. McDermott, R
led Sampling for Sequence Prediction with Recurrent Neu-
Monga, and M. Mao, "Sequence Discriminative Distributed
al Networks, in Neural Information Processing systems
2015
Training of Long Short-Term Memory Recurrent Neural Net-
works in INTERSPEECH. 201 4
[21] D. Povey, A Ghoshal, G. Boulianne, L Burget, O. Glembek,
[6] Y Miao, M. Gowayyed, and F Metze, " EESEN: End-to-End
N. Goel, M. Hannenmann, P. Motlicek, Y Qian, P. Schwarz
Speech Recognition using Deep Rnn Models and WEST
Silovsky, G. Stemmer, and K. Vesely, The Kaldi Speech
based Decoding in Http : arxiv. org/abs/1507.08240, 2015
Recognition Toolkit in Automatic Speech Recognition and
Understanding workshop 2011
[7 Y. Kubo, T Hori, and A Nakamura, Integrating Deep Neu-
ral Networks into Structured Classification Approach based [22] T.N. Sainath,O. Vinyals, A. Senior, and H. Sak, Convo
lutional, l ong short-Term Memory, Fully connected Dee
on weighted finite-State transducers in INTerspeech
2012.
Neural Networks, in /EFF International Conference on
Acoustics, Speech and Signal Processing, 2015
「8. Sutskever,O.Ⅴ inyals,andQ.Le,“ Sequence to Sequence
Learning with Neural Networks, in Neural Information Pro
[23 J. Dean, G.S. Corrado, R Monga, K Chen, M. Devin, Q. v
cessing systems, 2014.
Le M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K.Yan
and A.Y. Ng, "Large scale distributed Deep Networks, in
[9 K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F.
Neural Information Processing systems, 2012
Bougares, H. Schwen, and Y Bengio, "Learning Phrase Rep
resentations using RNN Encoder-Decoder for Statistical Ma
[24] A. Hannun, C Case, J. Casper, B Catanzaro, G. Diamos, E.
chine translation’i
ference on Empirical Methods in
Elsen R. Prenger, S. Satheesh, S Sengupta, A Coates, and
Natural Language Processing, 2014
A Ng, " Deep Speech: Scaling up end-to-end speech recog
nition, in Http: arxiv. org/abs/1412.5567, 2014
[10 D. Bahdanau, k. Cho, and Y. Bengio, "Neural Machine
Translation by Jointly Learning to Align and Translate,
[25] A. Maas, Z. Xie, D. Jurafsky, and A. Ng,
exicon-tree
conversational speech recognition with neural networks, in
in International Conference on Learning representations,
2015
North American Chapter of the Association for computa
tional linguistics, 2015
[I J. Chorowski, D. Bahdanau, K. Cho, and Y Bengio, "End-to-
end Continuous Speech Recognition using Attention-based
[26]H. Sak, A. Senior, K. Rao, O. Irsoy, A. Gr
Recurrent NN: First Results, in Neural Information Process
fays and Schalkwyk,"Learning acoustic frame laheling
ing System.: Work shop Deep learning and Representation
for speech recognition with recurrent neural networks,in
Learning Workshop, 2014
TEEE International Conference on Acoustics, Speech, and
Signal Processing, 2015
[12] J. Chorowski, D. Bahdanau, D Serdyuk, k. Cho, and Y. Ben
gio, "Attention-Based Models for Speech Recognition, "in /271 H. Sak, A Senior, K. Rao, and F Beaufays, Fast and Accu
Neural Information Processing SystemS, 2015
rate Recurrent Neural Network Acoustic Models for Speech
Recognition in INTERSPEECH. 2015
[13 K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdi-
nov, R. Zemel, and Y Bengio, Show, Attend and Tell: Neu-
ral Image Caption Generation with Visual Attention, inIn-
ternational Conference on Machine learning, 2015
[14] D. Bahdanau, J. Chorowskl, D. Serdyuk, P. Brakel, and
Y. Bengio, "End-to-end attention-based large vocabulary
speech recognition in Http: arxiv. org/abs/1508.04395,
2015
4964
(系统自动生成,下载前可以参看下载内容)
下载文件列表
相关说明
- 本站资源为会员上传分享交流与学习,如有侵犯您的权益,请联系我们删除.
- 本站是交换下载平台,提供交流渠道,下载内容来自于网络,除下载问题外,其它问题请自行百度。
- 本站已设置防盗链,请勿用迅雷、QQ旋风等多线程下载软件下载资源,下载后用WinRAR最新版进行解压.
- 如果您发现内容无法下载,请稍后再次尝试;或者到消费记录里找到下载记录反馈给我们.
- 下载后发现下载的内容跟说明不相乎,请到消费记录里找到下载记录反馈给我们,经确认后退回积分.
- 如下载前有疑问,可以通过点击"提供者"的名字,查看对方的联系方式,联系对方咨询.