[Colloq] PhD Thesis Defense, Bing Zhang
Rachel Kalweit
rachelb at ccs.neu.edu
Mon Apr 9 16:39:26 EDT 2007
College of Computer and Information Science
PhD Thesis Defense:
Bing Zhang
Thesis Title:
Discriminative Feature Optimization for Speech Recognition
Thursday, April 12, 2007
10:00am
164 West Village H
Abstract
Feature extraction, whose goal is to obtain a compact and discriminative
representation of speech data, is an important step in acoustic modeling
of speech recognition systems. The extraction usually occurs in two
stages. In the first stage, signal processing methods are used to
transform raw speech signals into cepstral coefficients. Then, in the
second stage, various feature transforms can be employed to select
features that better fit the particular acoustic model.
In traditional feature transform techniques, the optimization criteria
are usually not closely related to recognition errors, hence the derived
feature transforms are suboptimal in terms of improving the accuracy of
the whole system. To solve this problem, a discriminative feature
optimization method is developed in this thesis, based on the Minimum
Phoneme Error (MPE) criterion, which has been shown to be well
correlated with the word error rate (WER).
In addition to the discriminative criterion, we also want to use
nonlinear feature transforms that are more powerful than traditionally
used linear transforms. However, the computational cost can be very
high when a discriminative criterion is used to train a general
nonlinear transform (e.g., a neural network). For this reason,
the concept of a region-dependent transform (RDT) is developed in this
thesis. The central idea behind it is to divide the acoustic space into
multiple regions, and to use different transform functions for different
regions. This effectively produces a powerful piecewise transform that
can be estimated more efficiently than general nonlinear transforms.
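
The region-dependent idea described above can be sketched roughly as
follows. This is only an illustrative sketch, not the thesis
implementation: it assumes hard assignment of each feature vector to
the nearest of a few hypothetical centroids and one affine map per
region, and all names (nearest_region, region_dependent_transform, the
centroids and matrices) are made up for the example. The abstract does
not say how regions are defined or how the transforms are combined.

```python
# Hypothetical sketch of a region-dependent transform (not the thesis code).
# Assumption: regions are nearest-centroid cells, each with its own affine map.

def nearest_region(x, centroids):
    """Return the index of the centroid closest to x (squared Euclidean)."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

def affine(x, A, b):
    """Apply y = A x + b, with A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def region_dependent_transform(x, centroids, transforms):
    """Find the region containing x and apply that region's affine map."""
    r = nearest_region(x, centroids)
    A, b = transforms[r]
    return affine(x, A, b)

# Two made-up regions in a 2-dimensional feature space.
centroids = [[0.0, 0.0], [10.0, 10.0]]
transforms = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # region 0: identity
    ([[0.5, 0.0], [0.0, 0.5]], [1.0, 1.0]),  # region 1: scale, then shift
]

y0 = region_dependent_transform([1.0, 2.0], centroids, transforms)
y1 = region_dependent_transform([8.0, 12.0], centroids, transforms)
```

Each region's map is linear in its own parameters, so the overall
function is piecewise while every piece stays cheap to apply.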
At the software infrastructure level, the method is implemented in terms
of a generic feature transform framework. Under this framework, various
feature transforms can be trained uniformly through a generalized
back-propagation algorithm.
The method has been developed in the context of a state-of-the-art
speech recognition system, which raises various questions about how the
method interacts with the rest of the system. These issues include, for
instance, the generalization of the feature transform across
different acoustic models, and the problem of integrating
discriminatively trained feature transforms with maximum-likelihood-based
speaker adaptation. Experimental approaches are developed in this
thesis in order to address these issues.
Finally, the thesis shows that, using the discriminatively trained RDT,
we are able to obtain up to a 7% relative WER reduction over
state-of-the-art systems.
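
A relative WER reduction is measured as a fraction of the baseline
error rate rather than in absolute percentage points. The short sketch
below illustrates the arithmetic. The function name and the baseline
figures are purely hypothetical and are not results from the thesis.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers for illustration only (not reported in the thesis).
baseline = 20.0  # baseline WER, in percent
improved = 18.6  # WER after a 1.4-point absolute drop, in percent
reduction = relative_wer_reduction(baseline, improved)
```

With these made-up figures, a 1.4-point absolute drop from a 20%
baseline corresponds to a relative reduction of about 7%.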
Co-advisors: Dr. John Makhoul, BBN Technologies
Dr. Harriet Fell, Northeastern University