1
Speech Recognition
Victor Zue, Ron Cole, & Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
1
Defining the Problem
Speech recognition is the process of converting an acoustic signal, captured by a
microphone or a telephone, to a set of words. The recognized words can be the final results, as
for applications such as commands & control, data entry, and document preparation. They can
also serve as the input to further linguistic processing in order to achieve speech understanding, a
subject covered in section.
Speech recognition systems can be characterized by many parameters, some of the more
important of which are shown in Figure. An isolated-word speech recognition system requires
that the speaker pause briefly between words, whereas a continuous speech recognition system
does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much
more difficult to recognize than speech read from script. Some systems require speaker
enrollment, in which a user must provide samples of his or her speech before using them, whereas other
systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other
parameters depend on the specific task. Recognition is generally more difficult when vocabularies
are large or have many similar-sounding words. When speech is produced in a
sequence of words, language models or artificial grammars are used to restrict the combination
of words.
The simplest language model can be specified as a finite-state network, where the permissible
words following each word are given explicitly. More general language models
approximating natural language are specified in terms of a context-sensitive grammar.
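
As an illustration of the finite-state network described above, the following is a minimal sketch
in which the words permitted after each word are listed explicitly. The toy vocabulary, the
boundary markers "<s>" and "</s>", and the names NETWORK, is_permitted, and allowed_next are
assumptions made for this example only; they do not come from any particular recognizer.

```python
# Hypothetical toy network: each key maps a word to the set of words
# that may follow it. "<s>" and "</s>" are assumed markers for the
# start and end of an utterance.
START, END = "<s>", "</s>"

NETWORK = {
    START: {"call", "dial"},
    "call": {"home", "office"},
    "dial": {"home", "office"},
    "home": {END},
    "office": {END},
}


def is_permitted(words, network=NETWORK):
    """Return True if the word sequence follows a path through the network."""
    previous = START
    for word in list(words) + [END]:
        # A transition is legal only if the network lists it explicitly.
        if word not in network.get(previous, set()):
            return False
        previous = word
    return True


def allowed_next(word, network=NETWORK):
    """Return the words the network allows after `word`, sorted for display."""
    return sorted(network.get(word, set()) - {END})
```

With this network, is_permitted(["call", "home"]) is accepted while is_permitted(["call", "dial"])
is rejected, because "dial" is not listed among the successors of "call". A recognizer could use
such a table to restrict its search to the words returned by allowed_next at each point in the
utterance.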