Filename:
The Application of Hidden Markov Models in Speech Recognition
Development tool:
File size: 617kb
Downloads: 0
Upload time: 2019-03-17
Description: Hidden Markov Models (HMMs) provide a simple and effective framework
for modelling time-varying spectral vector sequences. As a consequence,
almost all present day large vocabulary continuous speech
recognition (LVCSR) systems are based on HMMs.
Whereas the basic principles underlying HMM-based LVCSR are
rather straightforward, the approximations and simplifying assumptions
involved in a direct implementation of these principles would
result in a system which has poor accuracy and unacceptable sensitivity
to changes in operating environment. Thus, the practical
application of HMMs in modern systems involves considerable
sophistication.
The aim of this review is first to present the core architecture of
a HMM-based LVCSR system and then describe the various refinements
which are needed to achieve state-of-the-art performance. These
refinements include feature projection, improved covariance modelling,
discriminative parameter estimation, adaptation and normalisation,
noise compensation and multi-pass system combination. The review
concludes with a case study of LVCSR for Broadcast News and
Conversation transcription in order to illustrate the techniques
described.

1
Introduction
Automatic continuous speech recognition (CSR) has many potential
applications including command and control, dictation, transcription
of recorded speech, searching audio documents and interactive spoken
dialogues. The core of all speech recognition systems consists of a set
of statistical models representing the various sounds of the language to
be recognised. Since speech has temporal structure and can be encoded
as a sequence of spectral vectors spanning the audio frequency range,
the hidden Markov model (HMM) provides a natural framework for
constructing such models [13].
HMMs lie at the heart of virtually all modern speech recognition
systems and although the basic framework has not changed significantly
in the last decade or more, the detailed modelling techniques developed
within this framework have evolved to a state of considerable sophisti-
cation (e.g. [40, 117, 163]). The result has been steady and significant
progress and it is the aim of this review to describe the main techniques
by which this has been achieved.
The foundations of modern HMM-based continuous speech recog-
nition technology were laid down in the 1970s by groups at Carnegie
Mellon and IBM who introduced the use of discrete density HMMs
[11, 77, 108], and then later at Bell Labs [80, 81, 99] where continu-
ous density HMMs were introduced.¹ An excellent tutorial covering the
basic HMM technologies developed in this period is given in [141].
Reflecting the computational power of the time, initial develop-
ment in the 1980s focussed on either discrete word speaker dependent
large vocabulary systems (e.g. [78]) or whole word small vocabulary
speaker independent applications (e.g. [142]). In the early 90s, atten-
tion switched to continuous speaker-independent recognition. Start-
ing with the artificial 1000 word Resource Management task [140],
the technology developed rapidly and by the mid-1990s, reasonable
accuracy was being achieved for unrestricted speaker independent dic-
tation. Much of this development was driven by a series of DARPA
and NSA programmes [188] which set ever more challenging tasks,
culminating most recently in systems for multilingual transcription
of broadcast news programmes [134] and for spontaneous telephone
conversations [62].
Many research groups have contributed to this progress, and each
will typically have its own architectural perspective. For the sake of log-
ical coherence, the presentation given here is somewhat biased towards
the architecture developed at Cambridge University and supported by
the HTK Software Toolkit [189].²
The review is organised as follows. Firstly, in Architecture of a
HMM-Based Recogniser the key architectural ideas of a typical HMM-
based recogniser are described. The intention here is to present an over-
all system design using very basic acoustic models. In particular, simple
single Gaussian diagonal covariance HMMs are assumed. The following
section HMM Structure Refinements then describes the various ways in
which the limitations of these basic HMMs can be overcome, for exam-
ple by transforming features and using more complex HMM output
distributions. A key benefit of the statistical approach to speech recog-
nition is that the required models are trained automatically on data.
¹ This very brief historical perspective is far from complete and out of necessity omits many
other important contributions to the early years of HMM-based speech recognition.
² Available for free download at htk.eng.cam.ac.uk. This includes a recipe for building a
state-of-the-art recogniser for the Resource Management task which illustrates a number
of the approaches described in this review.
The section Parameter Estimation discusses the different objective
functions that can be optimised in training and their effects on perfor-
mance. Any system designed to work reliably in real-world applications
must be robust to changes in speaker and the environment. The section
on Adaptation and Normalisation presents a variety of generic tech-
niques for achieving robustness. The following section Noise Robust-
ness then discusses more specialised techniques for specifically handling
additive and convolutional noise. The section Multi-Pass Recognition
Architectures returns to the topic of the overall system architecture
and explains how multiple passes over the speech signal using differ-
ent model combinations can be exploited to further improve perfor-
mance. This final section also describes some actual systems built for
transcribing English, Mandarin and Arabic in order to illustrate the
various techniques discussed in the review. The review concludes in
Conclusions with some general observations and conclusions.
2
Architecture of an HMM-Based Recogniser
The principal components of a large vocabulary continuous speech
recogniser are illustrated in Figure 2.1. The input audio waveform from
a microphone is converted into a sequence of fixed size acoustic vectors
Y_{1:T} = y_1, ..., y_T in a process called feature extraction. The decoder
then attempts to find the sequence of words w_{1:L} = w_1, ..., w_L which
is most likely to have generated Y, i.e. the decoder tries to find

    w_hat = arg max_w {P(w|Y)}.                                (2.1)

However, since P(w|Y) is difficult to model directly,¹ Bayes' Rule is
used to transform (2.1) into the equivalent problem of finding

    w_hat = arg max_w {p(Y|w) P(w)}.                           (2.2)
The likelihood p(Y|w) is determined by an acoustic model and the
prior P(w) is determined by a language model.² The basic unit of sound
¹ There are some systems that are based on discriminative models [54] where P(w|Y) is
modelled directly, rather than using generative models, such as HMMs, where the obser-
vation sequence is modelled, p(Y|w).
² In practice, the acoustic model is not normalised and the language model is often scaled
by an empirically determined constant and a word insertion penalty is added, i.e., in the
log domain the total likelihood is calculated as log p(Y|w) + α log(P(w)) + βL, where L is
the length of the word sequence, α is typically in the range 8-20 and β is typically in the
range 0 to -20.
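The scaled log-domain combination described in the footnote above can be sketched as follows. The scale and penalty values used here are hypothetical (chosen from within the quoted ranges), and `total_log_likelihood` is an illustrative helper, not part of any real decoder:

```python
import math

def total_log_likelihood(acoustic_ll, lm_prob, n_words, alpha=12.0, beta=-5.0):
    """Combine acoustic and language model scores in the log domain.

    acoustic_ll : log p(Y|w) from the (unnormalised) acoustic model
    lm_prob     : P(w) from the language model
    n_words     : word count L, used for the insertion penalty
    alpha, beta : empirically tuned scale and penalty (hypothetical values)
    """
    return acoustic_ll + alpha * math.log(lm_prob) + beta * n_words

# A hypothesis with a much better LM probability can overtake one with a
# slightly better acoustic score once the scaled scores are combined.
h1 = total_log_likelihood(-1000.0, 1e-4, 3)
h2 = total_log_likelihood(-1005.0, 1e-2, 3)
```

Because the acoustic model is unnormalised, the grammar scale α and insertion penalty β are tuned empirically on held-out data rather than derived from first principles.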
[Figure: speech input passes through Feature Extraction to produce Feature
Vectors, which the Decoder converts into Words (e.g. "Stop that.") using
Acoustic Models, a Pronunciation Dictionary and a Language Model.]
Fig. 2.1 Architecture of an HMM-based Recogniser.
represented by the acoustic model is the phone. For example, the word
"bat" is composed of three phones /b/ /ae/ /t/. About 40 such phones
are required for English.
For any given w, the corresponding acoustic model is synthe-
sised by concatenating phone models to make words as defined by
a pronunciation dictionary. The parameters of these phone models
are estimated from training data consisting of speech waveforms and
their orthographic transcriptions. The language model is typically
an N-gram model in which the probability of each word is condi-
tioned only on its N-1 predecessors. The N-gram parameters are
estimated by counting N-tuples in appropriate text corpora. The
decoder operates by searching through all possible word sequences
using pruning to remove unlikely hypotheses thereby keeping the search
tractable. When the end of the utterance is reached, the most likely
word sequence is output. Alternatively, modern decoders can gener-
ate lattices containing a compact representation of the most likely
hypotheses.
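The estimation of N-gram parameters by counting N-tuples can be sketched for the bigram case (N = 2). The corpus and the `bigram_prob` helper below are purely illustrative; real systems train on large text corpora and smooth these maximum likelihood estimates to handle unseen N-tuples:

```python
from collections import Counter

# Hypothetical training text; a real system would use large corpora.
corpus = [["stop", "that"], ["stop", "now"], ["stop", "that"]]

# Count N-tuples (here N = 2) and their (N-1)-word histories.
bigrams = Counter()
histories = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigrams[(w1, w2)] += 1
        histories[w1] += 1

def bigram_prob(w1, w2):
    """Maximum likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / histories[w1]
```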
The following sections describe these processes and components in more detail.
2.1 Feature Extraction
The feature extraction stage seeks to provide a compact representa-
tion of the speech waveform. This form should minimise the loss of
information that discriminates between words, and provide a good
match with the distributional assumptions made by the acoustic mod-
els. For example, if diagonal covariance Gaussian distributions are used
for the state-output distributions then the features should be designed
to be Gaussian and uncorrelated.
Feature vectors are typically computed every 10 ms using an over-
lapping analysis window of around 25 ms. One of the simplest and
most widely used encoding schemes is based on mel-frequency cep-
stral coefficients (MFCCs) [32]. These are generated by applying a
truncated discrete cosine transformation (DCT) to a log spectral esti-
mate computed by smoothing an FFT with around 20 frequency
bins distributed non-linearly across the speech spectrum. The non-
linear frequency scale used is called a mel scale and it approxi-
mates the response of the human ear. The DCT is applied in order
to smooth the spectral estimate and approximately decorrelate the
feature elements. After the cosine transform the first element rep-
resents the average of the log-energy of the frequency bins. This
is sometimes replaced by the log-energy of the frame, or removed
completely.
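The pipeline just described (windowed FFT, mel-spaced triangular filterbank, log compression, truncated DCT) can be sketched for a single frame as follows. The FFT size, filter count and number of cepstra are illustrative choices, not prescribed values, and `mfcc_frame` is a simplified stand-in for a real front-end:

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=20, n_ceps=13):
    """Sketch of MFCC computation for one 25 ms analysis frame."""
    # Hamming window, then power spectrum via the FFT.
    windowed = frame * np.hamming(len(frame))
    n_fft = 512
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2

    # Triangular filters spaced linearly on the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)

    # Log filterbank energies, then a truncated DCT to decorrelate them.
    log_energies = np.log(fbank @ power + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_energies

# 25 ms of a synthetic 440 Hz tone sampled at 16 kHz.
t = np.arange(400) / 16000.0
ceps = mfcc_frame(np.sin(2 * np.pi * 440.0 * t))
```

Note that row 0 of the truncated DCT is a constant, so the first coefficient is proportional to the average log-energy of the bins, as described above.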
Further psychoacoustic constraints are incorporated into a related
encoding called perceptual linear prediction (PLP) [74]. PLP com-
putes linear prediction coefficients from a perceptually weighted
non-linearly compressed power spectrum and then transforms the
linear prediction coefficients to cepstral coefficients. In practice,
PLP can give small improvements over MFCCs, especially in noisy
environments and hence it is the preferred encoding for many
systems [185].
In addition to the spectral coefficients, first order (delta) and
second-order (delta-delta) regression coefficients are often appended
in a heuristic attempt to compensate for the conditional independence
assumption made by the HMM-based acoustic models [47]. If the orig-
inal (static) feature vector is y_t^s, then the delta parameter, Δy_t, is
given by

    Δy_t = ( Σ_{i=1}^{n} w_i (y_{t+i}^s - y_{t-i}^s) ) / ( 2 Σ_{i=1}^{n} w_i^2 )   (2.3)
where n is the window width and w_i are the regression coefficients.³
The delta-delta parameters, Δ²y_t, are derived in the same fashion, but
using differences of the delta parameters. When concatenated together
these form the feature vector y_t,

    y_t = [ y_t^{sT}  Δy_t^T  Δ²y_t^T ]^T                      (2.4)
The final result is a feature vector whose dimensionality is typically
around 40 and which has been partially but not fully decorrelated.
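The delta regression of (2.3) can be sketched as follows, assuming the common choice w_i = i and HTK-style replication of the first and last frames at the edges (see footnote 3). The `delta` helper is illustrative, not HTK's implementation:

```python
import numpy as np

def delta(features, n=2):
    """Delta coefficients via the regression formula
    delta_y_t = sum_i w_i (y_{t+i} - y_{t-i}) / (2 sum_i w_i^2), with w_i = i.

    features : array of shape (T, D) of static coefficients.
    Edge frames are handled by replicating the first and last frames.
    """
    T = len(features)
    padded = np.concatenate([features[:1].repeat(n, axis=0),
                             features,
                             features[-1:].repeat(n, axis=0)])
    denom = 2.0 * sum(i * i for i in range(1, n + 1))
    deltas = np.zeros_like(features, dtype=float)
    for t in range(T):
        for i in range(1, n + 1):
            deltas[t] += i * (padded[t + n + i] - padded[t + n - i])
    return deltas / denom

# On a linear ramp, interior frames get a delta equal to the slope.
static = np.arange(10.0).reshape(-1, 1)
d = delta(static)
```

Delta-delta parameters would be obtained by applying the same function to the output, and the three blocks concatenated as in (2.4).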
2.2 HMM Acoustic Models (Basic-Single Component)
As noted above, each spoken word w is decomposed into a sequence
of K_w basic sounds called base phones. This sequence is called its pro-
nunciation q_{1:K_w}^{(w)} = q_1, ..., q_{K_w}. To allow for the possibility of multiple
pronunciations, the likelihood p(Y|w) can be computed over multiple
pronunciations⁴

    p(Y|w) = Σ_Q p(Y|Q) P(Q|w),                                (2.5)

where the summation is over all valid pronunciation sequences for w,
Q is a particular sequence of pronunciations,

    P(Q|w) = Π_{l=1}^{L} P(q^{(w_l)} | w_l),                   (2.6)

and where each q^{(w_l)} is a valid pronunciation for word w_l. In practice,
there will only be a very small number of alternative pronunciations
for each w_l making the summation in (2.5) easily tractable.
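Equation (2.5) can be illustrated with a toy example. The lexicon entries and the stand-in acoustic scores below are made up purely for illustration; in a real system p(Y|Q) would come from the concatenated HMMs described next:

```python
# Hypothetical pronunciation dictionary: each word maps to alternative
# pronunciations with their prior probabilities P(q | w).
lexicon = {
    "the": [(("dh", "ah"), 0.7), (("dh", "iy"), 0.3)],
}

# Toy stand-in for the acoustic model score p(Y | Q) of each pronunciation.
acoustic = {("dh", "ah"): 1e-4, ("dh", "iy"): 5e-4}

def word_likelihood(word):
    """p(Y|w) = sum over pronunciations Q of p(Y|Q) P(Q|w), as in (2.5)."""
    return sum(acoustic[q] * prob for q, prob in lexicon[word])
```

With only a handful of alternatives per word, this summation is cheap, which is why the exact form (2.5) remains tractable in practice.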
Each base phone q is represented by a continuous density HMM of
the form illustrated in Figure 2.2 with transition probability param-
eters {a_ij} and output observation distributions {b_j()}. In operation,
an HMM makes a transition from its current state to one of its con-
nected states every time step. The probability of making a particular
³ In HTK, to ensure that the same number of frames is maintained after adding delta and
delta-delta parameters, the start and end elements are replicated to fill the regression
window.
⁴ Recognisers often approximate this by a max operation so that alternative pronunciations
[Figure: a left-to-right HMM phone model whose states are connected by
transition probabilities a_12, a_23, ..., generating the acoustic vector
sequence y_1, y_2, ... from the state output distributions b_2(y_1),
b_2(y_2), b_3(y_3), b_4(y_4), b_4(y_5).]
Fig. 2.2 HMM-based phone model.
transition from state s_i to state s_j is given by the transition probabil-
ity a_ij. On entering a state, a feature vector is generated using the
distribution associated with the state being entered, b_j(). This form
of process yields the standard conditional independence assumptions
for an HMM:
- states are conditionally independent of all other states given
the previous state;
- observations are conditionally independent of all other obser-
vations given the state that generated it.
For a more detailed discussion of the operation of an HMM see [141].
For now, single multivariate Gaussians will be assumed for the out-
put distribution
    b_j(y) = N(y; μ^{(j)}, Σ^{(j)}),                           (2.7)

where μ^{(j)} is the mean of state s_j and Σ^{(j)} is its covariance. Since the
dimensionality of the acoustic vector y is relatively high, the covariances
are often constrained to be diagonal. Later in HMM Structure Refine-
ments, the benefits of using mixtures of Gaussians will be discussed.
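A diagonal-covariance output distribution of the form (2.7) is usually evaluated in the log domain. The sketch below is generic and not tied to any particular toolkit; with a diagonal covariance the density factorises over dimensions, so the log-likelihood reduces to a simple sum, which is why high-dimensional acoustic vectors are modelled this way:

```python
import numpy as np

def log_gaussian_diag(y, mean, var):
    """Log of N(y; mu, Sigma) for diagonal Sigma = diag(var).

    Factorising over dimensions gives
    log b_j(y) = -0.5 * sum_d [ log(2 pi var_d) + (y_d - mu_d)^2 / var_d ].
    """
    y, mean, var = map(np.asarray, (y, mean, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)
```

Evaluating this costs O(D) per state per frame, versus O(D^2) for a full covariance, on top of the large reduction in parameters to estimate.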
Given the composite HMM Q formed by concatenating all of the
constituent base phones q^{(w_1)}, ..., q^{(w_L)}, the acoustic likelihood is

    p(Y|Q) = Σ_θ p(θ, Y|Q),                                    (2.8)
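The sum over state sequences θ in (2.8) is never enumerated explicitly; it can be computed efficiently with the forward recursion. The toy model below is illustrative, and a real implementation would work in the log domain (or with scaling) to avoid numerical underflow over long utterances:

```python
import numpy as np

def forward_likelihood(log_b, A, entry):
    """Compute p(Y|Q) = sum over state sequences theta of p(theta, Y|Q)
    with the forward recursion rather than explicit enumeration.

    log_b : (T, N) array of log output probabilities log b_j(y_t)
    A     : (N, N) transition matrix {a_ij}
    entry : (N,) probabilities of starting in each emitting state
    """
    T, N = log_b.shape
    alpha = entry * np.exp(log_b[0])          # alpha_1(j)
    for t in range(1, T):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(y_t)
        alpha = (alpha @ A) * np.exp(log_b[t])
    return alpha.sum()

# Toy 2-state model over 3 frames.
log_b = np.log(np.array([[0.5, 0.2], [0.3, 0.4], [0.1, 0.6]]))
A = np.array([[0.6, 0.4], [0.0, 1.0]])
entry = np.array([1.0, 0.0])
p = forward_likelihood(log_b, A, entry)  # equals the explicit sum over all
                                         # state paths (0.075 here)
```

The recursion costs O(T N^2), whereas the explicit sum in (2.8) has N^T terms.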