File name: LISTEN ATTEND AND SPELL A NEURAL NETWORK FOR SPEECH RECOGNITION.pdf
Description: speech recognition, LAS (Listen, Attend and Spell) architecture.

where φ and ψ are MLP networks. After training, the α_i distribution is typically very sharp and focuses on only a few frames of h; c_i can be seen as a continuous bag of weighted features of h. Figure 1 shows the LAS architecture.

Table 1: WER comparison on the clean and noisy Google voice search task. The CLDNN-HMM system is the state-of-the-art; the Listen, Attend and Spell (LAS) models are decoded with a beam size of 32. Language Model (LM) rescoring can be beneficial.

    Model                  Clean WER   Noisy WER
    CLDNN-HMM [22]         8.0         8.9
    LAS                    14.1        16.5
    LAS + LM Rescoring     10.3        12.0

2.3. Learning

We train the parameters of our model to maximize the log probability of the correct sequences. Specifically:

    \hat{\theta} = \max_{\theta} \sum_i \log P(y_i \mid x, \tilde{y}_{<i}; \theta)    (12)

where \tilde{y}_{i-1} is the ground-truth previous character or a character randomly sampled (with 10% probability) from the model, i.e. from CharacterDistribution(s_{i-1}, c_{i-1}), using the procedure from [20].
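A minimal sketch of this training procedure follows, with numpy standing in for the real network: char_distribution below is a hypothetical placeholder for CharacterDistribution(s_{i-1}, c_{i-1}), the <sos> index of 0 is assumed, and the 10% sampling probability is the value stated in the paper.

    import numpy as np

    VOCAB = 30          # assumed character-vocabulary size
    SAMPLE_PROB = 0.10  # probability of feeding back a sampled character, as in the paper

    def char_distribution(prev_char, rng):
        """Placeholder for CharacterDistribution(s_{i-1}, c_{i-1}); it ignores its
        conditioning input and just returns a random, normalized P(y_i | ...)."""
        logits = rng.normal(size=VOCAB)
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()

    def sequence_log_prob(target_chars, rng):
        """Accumulate sum_i log P(y_i | x, y~_{<i}) for one utterance, as in Eq. (12)."""
        log_prob = 0.0
        prev_char = 0                               # <sos> index, assumed to be 0
        for y_true in target_chars:
            probs = char_distribution(prev_char, rng)
            log_prob += np.log(probs[y_true])
            if rng.random() < SAMPLE_PROB:          # occasionally feed the model's own sample
                prev_char = rng.choice(VOCAB, p=probs)
            else:                                   # otherwise feed the ground-truth character
                prev_char = y_true
        return log_prob

    rng = np.random.default_rng(0)
    print(sequence_log_prob([3, 7, 7, 12], rng))    # toy target sequence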
2.4. Decoding and Rescoring

During inference we want to find the most likely character sequence given the input acoustics:

    \hat{y} = \arg\max_{y} \log P(y \mid x)    (13)

We use a simple left-to-right beam search similar to [8]. We can also apply language models trained on large external text corpora alone, similar to conventional speech systems [21]. We simply rescore our beams with the language model. We find that our model has a small bias towards shorter utterances, so we normalize our probabilities by the number of characters |y|_c in the hypothesis and combine it with a language model probability P_LM(y):

    s(y \mid x) = \frac{\log P(y \mid x)}{|y|_c} + \lambda \log P_{LM}(y)    (14)

where λ is our language model weight and can be determined by a held-out validation set.
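The rescoring rule in Eq. (14) amounts to a small post-processing step over the N-best list. The sketch below is a minimal illustration rather than the authors' code: the hypotheses and log probabilities are made up, and lm_log_prob is a hypothetical stand-in for an external n-gram language model.

    # Rescore beam-search hypotheses with length normalization and an LM weight (Eq. 14).
    def rescore(nbest, lm_log_prob, lam=0.008):
        """nbest: list of (text, log_p_acoustic). Returns hypotheses sorted by s(y|x)."""
        scored = []
        for text, log_p in nbest:
            num_chars = len(text)                   # |y|_c, character count of the hypothesis
            s = log_p / num_chars + lam * lm_log_prob(text)
            scored.append((s, text))
        return sorted(scored, reverse=True)

    # Toy usage with invented numbers; lam = 0.008 is the weight reported in Section 4.
    nbest = [("call aaa roadside assistance", -0.57),
             ("call triple a roadside assistance", -1.54)]
    best = rescore(nbest, lm_log_prob=lambda t: -0.1 * len(t.split()))
    print(best[0][1])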
3. EXPERIMENTS

We used a dataset with three million Google Voice Search utterances (representing 2000 hours of data) for our experiments. Approximately 10 hours of utterances were randomly selected as a held-out validation set. Data augmentation was performed using a room simulator, adding different types of noise and reverberation; the noise sources were obtained from YouTube and environmental recordings of daily events [22]. This increased the amount of audio data by 20 times, with an SNR between 5 dB and 30 dB [22]. We used 40-dimensional log-mel filter bank features computed every 10 ms as the acoustic inputs to the listener. A separate set of 22K utterances, representing approximately 16 hours of data, was used as the test data. A noisy test set was also created using the same corruption strategy that was applied to the training data. All training sets are anonymized and hand-transcribed, and are representative of Google's speech traffic.

The text was normalized by converting all characters to lower-case English alphanumerics (including digits). The punctuation marks space, comma, period and apostrophe were kept, while all other tokens were converted to the unknown <unk> token. As mentioned earlier, all utterances were padded with the start-of-sentence <sos> and end-of-sentence <eos> tokens.
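A minimal sketch of this normalization, assuming the token spellings <unk>, <sos> and <eos>:

    import string

    # Characters the paper keeps: lower-case letters, digits, space, comma, period, apostrophe.
    KEPT = set(string.ascii_lowercase + string.digits + " ,.'")

    def normalize(transcript):
        """Lower-case, map unknown characters to <unk>, and pad with <sos>/<eos>."""
        chars = [ch if ch in KEPT else "<unk>" for ch in transcript.lower()]
        return ["<sos>"] + chars + ["<eos>"]

    print(normalize("Call AAA roadside assistance!"))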
The state-of-the-art model on this dataset is a CLDNN-HMM system that was described in [22]. The CLDNN system achieves a WER of 8.0% on the clean test set and 8.9% on the noisy test set; however, we note that the CLDNN uses unidirectional LSTMs and would certainly benefit from the use of a BLSTM architecture. Additionally, the LAS model does not use convolutional filters, which have been reported to yield a 5-7% relative WER improvement [22].

For the Listen function we used 3 layers of 512 pBLSTM nodes (i.e. 256 nodes per direction) on top of a BLSTM that operates on the input. This reduced the time resolution by 2^3 = 8 times. The Spell function used a two-layer LSTM with 512 nodes each. The weights were initialized with a uniform distribution U(-0.1, 0.1).
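The pyramidal structure of the Listen function can be made concrete with a small shape calculation. The following is a sketch of only the frame-pairing step between pBLSTM layers (the BLSTMs themselves are omitted); the even-length truncation is one possible convention, not necessarily the one used in the paper.

    import numpy as np

    def reduce_time(h):
        """Concatenate consecutive frame pairs: (T, d) -> (T // 2, 2 * d)."""
        T, d = h.shape
        T = T - (T % 2)                  # drop a trailing odd frame (assumed convention)
        return h[:T].reshape(T // 2, 2 * d)

    x = np.random.randn(160, 40)         # 160 frames of 40-dim log-mel features
    h = x
    for _ in range(3):                   # three pyramidal layers
        h = reduce_time(h)               # in the real model a BLSTM follows each reduction
    print(x.shape, "->", h.shape)        # (160, 40) -> (20, 320): time resolution reduced 8x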
and"chuck" have acoustic similarities. the attention mechanism was
The text was normalized by converting all characters to lower slightly confused when emitting "woodchuck "with a dilution in the
case English alphanumerics (including digits). The punctuations: distribution. The attention model was also able to identify the start
space, comma, period and apostrophe were kept, while all other to- and end of the utterance properly
kens were converted to the unknown (unk) token. As mentioned
We observed that LAS can learn multiple spelling variants given
earlier, all utterances were padded with the start of- sentence(sos) the same acoustics. Table 2 shows top beams for the utterance that
and the end-of-sentence(cos tokens
includes "triple a. As can be seen, the model produces both"triple
The state-of-the-art model on this dataset is a CLDNN-HMM a"and"aaa "within the top four beams. The decoder is able to gener
system that was described in [22]. The Cldnn system achieves ate such varied parses, because the next step prediction model makes
a WER of 8.0% on the clean test set and 8.9% on the noisy test no assumptions on the probability distribution by using the chain rule
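As a minimal sketch of this visualization (not the paper's plotting code): if the attention weights over the T acoustic frames are stacked for every character output step, the resulting matrix can be rendered as an alignment image like Figure 2. The attention matrix here is randomly generated purely to show the mechanics.

    import numpy as np
    import matplotlib.pyplot as plt

    num_chars, num_frames = 30, 80
    # Each row is one character step's attention distribution over frames (rows sum to 1).
    attention = np.random.dirichlet(np.ones(num_frames), size=num_chars)

    plt.imshow(attention, aspect="auto", origin="lower", cmap="viridis")
    plt.xlabel("acoustic frame (time)")
    plt.ylabel("character output step")
    plt.title("Attention alignment between characters and audio")
    plt.savefig("attention_alignment.png")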
We observed that LAS can learn multiple spelling variants given the same acoustics. Table 2 shows the top beams for an utterance that includes "triple a". As can be seen, the model produces both "triple a" and "aaa" within the top four beams. The decoder is able to generate such varied parses because the next step prediction model makes no assumptions on the probability distribution, by using the chain rule decomposition. It would be difficult to produce such differing transcripts using CTC, due to its conditional independence assumptions, where the distribution of the output y_i at time i is conditionally independent of the distribution of y_{i+1} at time i+1. Conventional DNN-HMM systems would require both spellings to be in the pronunciation dictionary to generate both transcriptions.

Table 2: Example 1: "triple a" vs. "aaa" spelling variants.

    Beam    Text                                  log P    WER
    Truth   call aaa roadside assistance          -        -
    1       call aaa roadside assistance          -0.57    0.00
    2       call triple a roadside assistance     -1.54    50.00
    3       call trip way roadside assistance     -3.50    50.00
    4       call xxx roadside assistance          -4.44    25.00
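The contrast with CTC drawn above can be written out explicitly. The factorizations below are the standard formulations of the two models, given as a reference rather than quoted from the paper; \mathcal{B} denotes the usual CTC collapsing function that removes blanks and repeated labels.

    P_{\mathrm{LAS}}(y \mid x) = \prod_{i} P(y_i \mid x, y_{<i})

    P_{\mathrm{CTC}}(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t} P(\pi_t \mid x)

Because every frame label π_t in the CTC factorization is conditionally independent of the others given x, an early choice such as "aaa" versus "triple a" cannot influence the posteriors at later frames, which is why an external language model is usually needed to arbitrate between such variants.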
5. RELATED WORK

There has recently been an explosion in methods for end-to-end trained speech models, because of their inherent simplicity compared to current speech recognition systems [2, 24, 6, 25, 14]. However, these methods have inherent shortcomings that our model attempts to address. Here we describe in more detail the relationship between our work and prior approaches.

Initially, [2] showed that CTC could perform end-to-end speech recognition on WSJ, going straight from audio to character sequences. [24, 25] subsequently showed strong results with CTC on larger datasets and Switchboard. However, it was noted in [24, 2] that good accuracy could only be achieved through the use of a strong language model during beam search decoding; the language models used are themselves fixed and trained independently of the CTC objective.

CTC has also been applied to end-to-end training with phoneme targets and n-gram language models using FSTs in [26, 27, 6]. However, unlike the methods above, these methods use pronunciation dictionaries and language models within FSTs. End-to-end training here implies training of the acoustic models with fixed dictionaries and language models, instead of training models that recognize character sequences directly. In this respect these models are end-to-end trained systems, rather than end-to-end models.

While CTC has shown tremendous promise in end-to-end speech recognition, it is limited by its independence assumptions between frames (the output at one frame has no influence on the outputs at the other frames), much like the unary potential of Conditional Random Fields. The only way to ameliorate this problem is through the use of a strong language model [2].

The model proposed here is based on the sequence-to-sequence architecture [8, 10] and does not suffer from the above shortcomings. LAS models the output sequence given the input sequence using the chain rule decomposition, starting at the first character. As such, this model makes no assumptions about the probability distribution and is only limited by the capacity of the recurrent neural network in modeling such a complicated distribution. Further, this single model encompasses all aspects of a speech recognition system: the acoustic, pronunciation and language models are all encoded within its parameters. We argue that this makes it not only an end-to-end trained system, but an end-to-end model. This makes it a very powerful model for end-to-end speech recognition. Future work is likely to explore how to use increasingly more complicated models for improved performance over what was achieved in this paper. Further, these models are likely to benefit from even larger datasets, since the decoder is able to overfit the small number of transcripts. (We note that we used three million utterances for training, but that is a very small corpus for an RNN language model.)

The model described in [14] is the closest to our model, with some slight differences. We use a pyramidal encoder, while they use an encoder in which the higher layers subsample the hidden states of the layers below. In addition, they use an FST to incorporate a language model, while we use language model rescoring and a length-dependent language model blending (see Section 2.4). We note that these two works were performed concurrently and independently.

6. CONCLUSIONS

We have presented Listen, Attend and Spell (LAS), a neural speech recognizer that can transcribe acoustic signals to characters directly, without using any of the traditional components of a speech recognition system, such as HMMs, language models and pronunciation dictionaries. We submit that it is not only an end-to-end trained system, but an end-to-end model. LAS accomplishes this goal by making no conditional independence assumptions about the output sequence, using the sequence-to-sequence framework. This distinguishes it from models like CTC, DNN-HMM and other models that can be trained end-to-end but make various conditional independence assumptions to accomplish this. We showed how this model learns an implicit language model that can generate multiple spelling variants given the same acoustics. We also showed how an external language model, trained on additional text, can be used to re-rank the top hypotheses. We demonstrated that such an end-to-end model can be trained and be competitive with state-of-the-art CLDNN-HMM systems. We are optimistic that this approach will pave the way to new neural speech recognizers that are simpler to train and achieve even better accuracies than the best current speech recognition systems.
7. REFERENCES

[1] A. Graves, A. Mohamed, and G. Hinton, "Speech Recognition with Deep Recurrent Neural Networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[2] A. Graves and N. Jaitly, "Towards End-to-End Speech Recognition with Recurrent Neural Networks," in International Conference on Machine Learning, 2014.
[3] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups," IEEE Signal Processing Magazine, Nov. 2012.
[4] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in INTERSPEECH, 2013.
[5] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, "Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks," in INTERSPEECH, 2014.
[6] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding," in http://arxiv.org/abs/1507.08240, 2015.
[7] Y. Kubo, T. Hori, and A. Nakamura, "Integrating Deep Neural Networks into Structured Classification Approach based on Weighted Finite-State Transducers," in INTERSPEECH, 2012.
[8] I. Sutskever, O. Vinyals, and Q. Le, "Sequence to Sequence Learning with Neural Networks," in Neural Information Processing Systems, 2014.
[9] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," in Conference on Empirical Methods in Natural Language Processing, 2014.
[10] D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," in International Conference on Learning Representations, 2015.
[11] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results," in Neural Information Processing Systems: Deep Learning and Representation Learning Workshop, 2014.
[12] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-Based Models for Speech Recognition," in Neural Information Processing Systems, 2015.
[13] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," in International Conference on Machine Learning, 2015.
[14] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in http://arxiv.org/abs/1508.04395, 2015.
[15] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[16] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid Speech Recognition with Bidirectional LSTM," in Automatic Speech Recognition and Understanding Workshop, 2013.
[17] S. Hihi and Y. Bengio, "Hierarchical Recurrent Neural Networks for Long-Term Dependencies," in Neural Information Processing Systems, 1996.
[18] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, "A Clockwork RNN," in International Conference on Machine Learning, 2014.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[20] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks," in Neural Information Processing Systems, 2015.
[21] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi Speech Recognition Toolkit," in Automatic Speech Recognition and Understanding Workshop, 2011.
[22] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
[23] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large Scale Distributed Deep Networks," in Neural Information Processing Systems, 2012.
[24] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," in http://arxiv.org/abs/1412.5567, 2014.
[25] A. Maas, Z. Xie, D. Jurafsky, and A. Ng, "Lexicon-free conversational speech recognition with neural networks," in North American Chapter of the Association for Computational Linguistics, 2015.
[26] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, "Learning acoustic frame labeling for speech recognition with recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
[27] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition," in INTERSPEECH, 2015.