Filename:
The Application of Hidden Markov Models in Speech Recognition
Development tool:
File size: 617kb
Downloads: 0
Upload time: 2019-03-17
Description: Hidden Markov Models (HMMs) provide a simple and effective framework
for modelling time-varying spectral vector sequences. As a consequence,
almost all present day large vocabulary continuous speech
recognition (LVCSR) systems are based on HMMs.
Whereas the basic principles underlying HMM-based LVCSR are
rather straightforward, the approximations and simplifying assumptions
involved in a direct implementation of these principles would
result in a system which has poor accuracy and unacceptable sensitivity
to changes in operating environment. Thus, the practical
application of HMMs in modern systems involves considerable
sophistication.
The aim of this review is first to present the core architecture of
a HMM-based LVCSR system and then describe the various refinements
which are needed to achieve state-of-the-art performance. These
refinements include feature projection, improved covariance modelling,
discriminative parameter estimation, adaptation and normalisation,
noise compensation and multi-pass system combination. The review
concludes with a case study of LVCSR for Broadcast News and
Conversation transcription in order to illustrate the techniques
described.

1
Introduction
Automatic continuous speech recognition (CSR) has many potential
applications including command and control, dictation, transcription
of recorded speech, searching audio documents and interactive spoken
dialogues. The core of all speech recognition systems consists of a set
of statistical models representing the various sounds of the language to
be recognised. Since speech has temporal structure and can be encoded
as a sequence of spectral vectors spanning the audio frequency range,
the hidden Markov model (HMM) provides a natural framework for
constructing such models [13].
HMMs lie at the heart of virtually all modern speech recognition
systems and although the basic framework has not changed significantly
in the last decade or more, the detailed modelling techniques developed
within this framework have evolved to a state of considerable sophisti-
cation (e.g. [40, 117, 163]). The result has been steady and significant
progress and it is the aim of this review to describe the main techniques
by which this has been achieved.
The foundations of modern HMM-based continuous speech recog-
nition technology were laid down in the 1970s by groups at Carnegie
Mellon and IBM who introduced the use of discrete density HMMs
[11, 77, 108], and then later at Bell Labs [80, 81, 99] where continu-
ous density HMMs were introduced.¹ An excellent tutorial covering the
basic HMM technologies developed in this period is given in [141].
Reflecting the computational power of the time, initial develop-
ment in the 1980s focussed on either discrete word speaker dependent
large vocabulary systems (e.g. [78]) or whole word small vocabulary
speaker independent applications (e.g. [142]). In the early 90s, atten-
tion switched to continuous speaker-independent recognition. Start-
ing with the artificial 1000 word Resource Management task [140],
the technology developed rapidly and by the mid-1990s, reasonable
accuracy was being achieved for unrestricted speaker independent dic-
tation. Much of this development was driven by a series of DARPA
and NSA programmes [188] which set ever more challenging tasks,
culminating most recently in systems for multilingual transcription
of broadcast news programmes [134] and for spontaneous telephone
conversations [62].
Many research groups have contributed to this progress, and each
will typically have its own architectural perspective. For the sake of log-
ical coherence, the presentation given here is somewhat biased towards
the architecture developed at Cambridge University and supported by
the HTK Software Toolkit [189].²
The review is organised as follows. Firstly, in Architecture of a
HMM-Based Recogniser the key architectural ideas of a typical HMM-
based recogniser are described. The intention here is to present an over-
all system design using very basic acoustic models. In particular, simple
single Gaussian diagonal covariance HMMs are assumed. The following
section HMM Structure Refinements then describes the various ways in
which the limitations of these basic HMMs can be overcome, for exam-
ple by transforming features and using more complex HMM output
distributions. A key benefit of the statistical approach to speech recog-
nition is that the required models are trained automatically on data.
¹ This very brief historical perspective is far from complete and out of necessity omits many
other important contributions to the early years of HMM-based speech recognition.
² Available for free download at htk.eng.cam.ac.uk. This includes a recipe for building a
state-of-the-art recogniser for the Resource Management task which illustrates a number
of the approaches described in this review.
The section Parameter Estimation discusses the different objective
functions that can be optimised in training and their effects on perfor-
mance. Any system designed to work reliably in real-world applications
must be robust to changes in speaker and the environment. The section
on Adaptation and Normalisation presents a variety of generic tech-
niques for achieving robustness. The following section Noise Robust-
ness then discusses more specialised techniques for specifically handling
additive and convolutional noise. The section Multi-Pass Recognition
Architectures returns to the topic of the overall system architecture
and explains how multiple passes over the speech signal using differ-
ent model combinations can be exploited to further improve perfor-
mance. This final section also describes some actual systems built for
transcribing English, Mandarin and Arabic in order to illustrate the
various techniques discussed in the review. The review concludes in
Conclusions with some general observations and conclusions.
2
Architecture of an HMM-Based Recogniser
The principal components of a large vocabulary continuous speech
recogniser are illustrated in Figure 2.1. The input audio waveform from
a microphone is converted into a sequence of fixed size acoustic vectors
Y_{1:T} = y_1, ..., y_T in a process called feature extraction. The decoder
then attempts to find the sequence of words w_{1:L} = w_1, ..., w_L which
is most likely to have generated Y, i.e. the decoder tries to find

    w_hat = arg max_w {P(w|Y)}.                                (2.1)

However, since P(w|Y) is difficult to model directly,¹ Bayes' Rule is
used to transform (2.1) into the equivalent problem of finding

    w_hat = arg max_w {p(Y|w) P(w)}.                           (2.2)
The likelihood p(Y|w) is determined by an acoustic model and the
prior P(w) is determined by a language model.² The basic unit of sound
¹ There are some systems that are based on discriminative models [54] where P(w|Y) is
modelled directly, rather than using generative models, such as HMMs, where the obser-
vation sequence is modelled, p(Y|w).
² In practice, the acoustic model is not normalised and the language model is often scaled
by an empirically determined constant and a word insertion penalty is added, i.e., in the
log domain the total likelihood is calculated as log p(Y|w) + α log(P(w)) + βL, where L is
the length of the word sequence, α is typically in the range 8-20 and β is typically in the
range 0 to -20.
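The scaled log-domain combination described in the footnote above can be sketched as follows. The scale and penalty values used here are hypothetical (chosen from within the quoted ranges), and `total_log_likelihood` is an illustrative helper, not part of any real decoder:

```python
import math

def total_log_likelihood(acoustic_ll, lm_prob, n_words, alpha=12.0, beta=-5.0):
    """Combine acoustic and language model scores in the log domain.

    acoustic_ll : log p(Y|w) from the (unnormalised) acoustic model
    lm_prob     : P(w) from the language model
    n_words     : word count L, used for the insertion penalty
    alpha, beta : empirically tuned scale and penalty (hypothetical values)
    """
    return acoustic_ll + alpha * math.log(lm_prob) + beta * n_words

# A hypothesis with a much better LM probability can overtake one with a
# slightly better acoustic score once the scaled scores are combined.
h1 = total_log_likelihood(-1000.0, 1e-4, 3)
h2 = total_log_likelihood(-1005.0, 1e-2, 3)
```

Because the acoustic model is unnormalised, the grammar scale α and insertion penalty β are tuned empirically on held-out data rather than derived from first principles.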
[Figure: speech input passes through Feature Extraction to produce Feature
Vectors, which the Decoder converts into Words (e.g. "Stop that.") using
Acoustic Models, a Pronunciation Dictionary and a Language Model.]
Fig. 2.1 Architecture of an HMM-based Recogniser.
represented by the acoustic model is the phone. For example, the word
"bat" is composed of three phones /b/ /ae/ /t/. About 40 such phones
are required for English.
For any given w, the corresponding acoustic model is synthe-
sised by concatenating phone models to make words as defined by
a pronunciation dictionary. The parameters of these phone models
are estimated from training data consisting of speech waveforms and
their orthographic transcriptions. The language model is typically
an N-gram model in which the probability of each word is condi-
tioned only on its N-1 predecessors. The N-gram parameters are
estimated by counting N-tuples in appropriate text corpora. The
decoder operates by searching through all possible word sequences
using pruning to remove unlikely hypotheses thereby keeping the search
tractable. When the end of the utterance is reached, the most likely
word sequence is output. Alternatively, modern decoders can gener-
ate lattices containing a compact representation of the most likely
hypotheses.
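The estimation of N-gram parameters by counting N-tuples can be sketched for the bigram case (N = 2). The corpus and the `bigram_prob` helper below are purely illustrative; real systems train on large text corpora and smooth these maximum likelihood estimates to handle unseen N-tuples:

```python
from collections import Counter

# Hypothetical training text; a real system would use large corpora.
corpus = [["stop", "that"], ["stop", "now"], ["stop", "that"]]

# Count N-tuples (here N = 2) and their (N-1)-word histories.
bigrams = Counter()
histories = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigrams[(w1, w2)] += 1
        histories[w1] += 1

def bigram_prob(w1, w2):
    """Maximum likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / histories[w1]
```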
The following sections describe these processes and components in more detail.
2.1 Feature Extraction
The feature extraction stage seeks to provide a compact representa-
tion of the speech waveform. This form should minimise the loss of
information that discriminates between words, and provide a good
match with the distributional assumptions made by the acoustic mod-
els. For example, if diagonal covariance Gaussian distributions are used
for the state-output distributions then the features should be designed
to be Gaussian and uncorrelated.
Feature vectors are typically computed every 10 ms using an over-
lapping analysis window of around 25 ms. One of the simplest and
most widely used encoding schemes is based on mel-frequency cep-
stral coefficients (MFCCs) [32]. These are generated by applying a
truncated discrete cosine transformation (DCT) to a log spectral esti-
mate computed by smoothing an FFT with around 20 frequency
bins distributed non-linearly across the speech spectrum. The non-
linear frequency scale used is called a mel scale and it approxi-
mates the response of the human ear. The DCT is applied in order
to smooth the spectral estimate and approximately decorrelate the
feature elements. After the cosine transform the first element rep-
resents the average of the log-energy of the frequency bins. This
is sometimes replaced by the log-energy of the frame, or removed
completely.
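The pipeline just described (windowed FFT, mel-spaced triangular filterbank, log compression, truncated DCT) can be sketched for a single frame as follows. The FFT size, filter count and number of cepstra are illustrative choices, not prescribed values, and `mfcc_frame` is a simplified stand-in for a real front-end:

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=20, n_ceps=13):
    """Sketch of MFCC computation for one 25 ms analysis frame."""
    # Hamming window, then power spectrum via the FFT.
    windowed = frame * np.hamming(len(frame))
    n_fft = 512
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2

    # Triangular filters spaced linearly on the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)

    # Log filterbank energies, then a truncated DCT to decorrelate them.
    log_energies = np.log(fbank @ power + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_energies

# 25 ms of a synthetic 440 Hz tone sampled at 16 kHz.
t = np.arange(400) / 16000.0
ceps = mfcc_frame(np.sin(2 * np.pi * 440.0 * t))
```

Note that row 0 of the truncated DCT is a constant, so the first coefficient is proportional to the average log-energy of the bins, as described above.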
Further psychoacoustic constraints are incorporated into a related
encoding called perceptual linear prediction (PLP) [74]. PLP com-
putes linear prediction coefficients from a perceptually weighted
non-linearly compressed power spectrum and then transforms the
linear prediction coefficients to cepstral coefficients. In practice,
PLP can give small improvements over MFCCs, especially in noisy
environments and hence it is the preferred encoding for many
systems [185].
In addition to the spectral coefficients, first order (delta) and
second-order (delta-delta) regression coefficients are often appended
in a heuristic attempt to compensate for the conditional independence
assumption made by the HMM-based acoustic models [47]. If the orig-
inal (static) feature vector is y_t^s, then the delta parameter, Δy_t, is
given by

    Δy_t = ( Σ_{i=1}^{n} w_i (y_{t+i}^s - y_{t-i}^s) ) / ( 2 Σ_{i=1}^{n} w_i^2 )   (2.3)
where n is the window width and w_i are the regression coefficients.³
The delta-delta parameters, Δ²y_t, are derived in the same fashion, but
using differences of the delta parameters. When concatenated together
these form the feature vector y_t,

    y_t = [ y_t^{sT}  Δy_t^T  Δ²y_t^T ]^T                      (2.4)
The final result is a feature vector whose dimensionality is typically
around 40 and which has been partially but not fully decorrelated.
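The delta regression of (2.3) can be sketched as follows, assuming the common choice w_i = i and HTK-style replication of the first and last frames at the edges (see footnote 3). The `delta` helper is illustrative, not HTK's implementation:

```python
import numpy as np

def delta(features, n=2):
    """Delta coefficients via the regression formula
    delta_y_t = sum_i w_i (y_{t+i} - y_{t-i}) / (2 sum_i w_i^2), with w_i = i.

    features : array of shape (T, D) of static coefficients.
    Edge frames are handled by replicating the first and last frames.
    """
    T = len(features)
    padded = np.concatenate([features[:1].repeat(n, axis=0),
                             features,
                             features[-1:].repeat(n, axis=0)])
    denom = 2.0 * sum(i * i for i in range(1, n + 1))
    deltas = np.zeros_like(features, dtype=float)
    for t in range(T):
        for i in range(1, n + 1):
            deltas[t] += i * (padded[t + n + i] - padded[t + n - i])
    return deltas / denom

# On a linear ramp, interior frames get a delta equal to the slope.
static = np.arange(10.0).reshape(-1, 1)
d = delta(static)
```

Delta-delta parameters would be obtained by applying the same function to the output, and the three blocks concatenated as in (2.4).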
2.2 HMM Acoustic Models (Basic-Single Component)
As noted above, each spoken word w is decomposed into a sequence
of K_w basic sounds called base phones. This sequence is called its pro-
nunciation q_{1:K_w}^{(w)} = q_1, ..., q_{K_w}. To allow for the possibility of multiple
pronunciations, the likelihood p(Y|w) can be computed over multiple
pronunciations⁴

    p(Y|w) = Σ_Q p(Y|Q) P(Q|w),                                (2.5)

where the summation is over all valid pronunciation sequences for w,
Q is a particular sequence of pronunciations,

    P(Q|w) = Π_{l=1}^{L} P(q^{(w_l)} | w_l),                   (2.6)

and where each q^{(w_l)} is a valid pronunciation for word w_l. In practice,
there will only be a very small number of alternative pronunciations
for each w_l making the summation in (2.5) easily tractable.
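Equation (2.5) can be illustrated with a toy example. The lexicon entries and the stand-in acoustic scores below are made up purely for illustration; in a real system p(Y|Q) would come from the concatenated HMMs described next:

```python
# Hypothetical pronunciation dictionary: each word maps to alternative
# pronunciations with their prior probabilities P(q | w).
lexicon = {
    "the": [(("dh", "ah"), 0.7), (("dh", "iy"), 0.3)],
}

# Toy stand-in for the acoustic model score p(Y | Q) of each pronunciation.
acoustic = {("dh", "ah"): 1e-4, ("dh", "iy"): 5e-4}

def word_likelihood(word):
    """p(Y|w) = sum over pronunciations Q of p(Y|Q) P(Q|w), as in (2.5)."""
    return sum(acoustic[q] * prob for q, prob in lexicon[word])
```

With only a handful of alternatives per word, this summation is cheap, which is why the exact form (2.5) remains tractable in practice.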
Each base phone q is represented by a continuous density HMM of
the form illustrated in Figure 2.2 with transition probability param-
eters {a_ij} and output observation distributions {b_j()}. In operation,
an HMM makes a transition from its current state to one of its con-
nected states every time step. The probability of making a particular
³ In HTK, to ensure that the same number of frames is maintained after adding delta and
delta-delta parameters, the start and end elements are replicated to fill the regression
window.
⁴ Recognisers often approximate this by a max operation so that alternative pronunciations
[Figure: a left-to-right HMM phone model whose states are connected by
transition probabilities a_12, a_23, ..., generating the acoustic vector
sequence y_1, y_2, ... from the state output distributions b_2(y_1),
b_2(y_2), b_3(y_3), b_4(y_4), b_4(y_5).]
Fig. 2.2 HMM-based phone model.
transition from state s_i to state s_j is given by the transition probabil-
ity a_ij. On entering a state, a feature vector is generated using the
distribution associated with the state being entered, b_j(). This form
of process yields the standard conditional independence assumptions
for an HMM:
- states are conditionally independent of all other states given
the previous state;
- observations are conditionally independent of all other obser-
vations given the state that generated it.
For a more detailed discussion of the operation of an HMM see [141].
For now, single multivariate Gaussians will be assumed for the out-
put distribution
    b_j(y) = N(y; μ^{(j)}, Σ^{(j)}),                           (2.7)

where μ^{(j)} is the mean of state s_j and Σ^{(j)} is its covariance. Since the
dimensionality of the acoustic vector y is relatively high, the covariances
are often constrained to be diagonal. Later in HMM Structure Refine-
ments, the benefits of using mixtures of Gaussians will be discussed.
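A diagonal-covariance output distribution of the form (2.7) is usually evaluated in the log domain. The sketch below is generic and not tied to any particular toolkit; with a diagonal covariance the density factorises over dimensions, so the log-likelihood reduces to a simple sum, which is why high-dimensional acoustic vectors are modelled this way:

```python
import numpy as np

def log_gaussian_diag(y, mean, var):
    """Log of N(y; mu, Sigma) for diagonal Sigma = diag(var).

    Factorising over dimensions gives
    log b_j(y) = -0.5 * sum_d [ log(2 pi var_d) + (y_d - mu_d)^2 / var_d ].
    """
    y, mean, var = map(np.asarray, (y, mean, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)
```

Evaluating this costs O(D) per state per frame, versus O(D^2) for a full covariance, on top of the large reduction in parameters to estimate.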
Given the composite HMM Q formed by concatenating all of the
constituent base phones q^{(w_1)}, ..., q^{(w_L)}, the acoustic likelihood is

    p(Y|Q) = Σ_θ p(θ, Y|Q),                                    (2.8)
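The sum over state sequences θ in (2.8) is never enumerated explicitly; it can be computed efficiently with the forward recursion. The toy model below is illustrative, and a real implementation would work in the log domain (or with scaling) to avoid numerical underflow over long utterances:

```python
import numpy as np

def forward_likelihood(log_b, A, entry):
    """Compute p(Y|Q) = sum over state sequences theta of p(theta, Y|Q)
    with the forward recursion rather than explicit enumeration.

    log_b : (T, N) array of log output probabilities log b_j(y_t)
    A     : (N, N) transition matrix {a_ij}
    entry : (N,) probabilities of starting in each emitting state
    """
    T, N = log_b.shape
    alpha = entry * np.exp(log_b[0])          # alpha_1(j)
    for t in range(1, T):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(y_t)
        alpha = (alpha @ A) * np.exp(log_b[t])
    return alpha.sum()

# Toy 2-state model over 3 frames.
log_b = np.log(np.array([[0.5, 0.2], [0.3, 0.4], [0.1, 0.6]]))
A = np.array([[0.6, 0.4], [0.0, 1.0]])
entry = np.array([1.0, 0.0])
p = forward_likelihood(log_b, A, entry)  # equals the explicit sum over all
                                         # state paths (0.075 here)
```

The recursion costs O(T N^2), whereas the explicit sum in (2.8) has N^T terms.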