[Colloq] PhD Thesis Defense, Bing Zhang

Mon Apr 9 16:39:26 EDT 2007

College of Computer and Information Science
PhD Thesis Defense:
Bing Zhang

Thesis Title:
Discriminative Feature Optimization for Speech Recognition

Thursday, April 12, 2007
10:00am
164 West Village H

Abstract
Feature extraction, whose goal is to obtain a compact and discriminative 
representation of speech data, is an important step in acoustic modeling 
of speech recognition systems. The extraction usually occurs in two
stages. In the first stage, signal processing methods are used to
transform raw speech signals into cepstral coefficients. Then in the
second stage, various feature transforms can be employed to select 
features that better fit the particular acoustic model.

In traditional feature transform techniques, the optimization criteria 
are usually not closely related to recognition errors, hence the derived 
feature transforms are suboptimal in terms of improving the accuracy of 
the whole system. To solve this problem, a discriminative feature 
optimization method is developed in this thesis, based on the Minimum 
Phoneme Error (MPE) criterion, which has been shown to be well 
correlated with the word error rate (WER).

In addition to the discriminative criterion, we also want to use 
nonlinear feature transforms that are more powerful than traditionally 
used linear transforms. However, the computational cost can be very
high when a discriminative criterion is used to train a general
nonlinear transform (e.g., a neural network). For this reason, the
concept of the region-dependent transform (RDT) is developed in this
thesis. The central idea is to divide the acoustic space into multiple
regions and to use a different transform function for each region.
This effectively produces a powerful piecewise transform that can be
estimated more efficiently than general nonlinear transforms.
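
As a rough illustration of the region-dependent idea, the sketch below
assigns each feature vector to a region and applies that region's own
affine transform. It is not the thesis implementation: the hard
nearest-centroid assignment, the function names, and the toy centroids
and matrices are all assumptions made for this example.

```python
import math

def nearest_region(x, centroids):
    """Return the index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda r: math.dist(x, centroids[r]))

def affine(x, A, b):
    """Compute A @ x + b for a list-of-rows matrix A and offset vector b."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]

def region_dependent_transform(x, centroids, transforms):
    """Apply the affine transform of the region that x falls into."""
    r = nearest_region(x, centroids)
    A, b = transforms[r]
    return affine(x, A, b)

# Hypothetical toy setup: two regions in a 2-D feature space.
centroids = [[0.0, 0.0], [10.0, 10.0]]
transforms = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # region 0: identity
    ([[2.0, 0.0], [0.0, 2.0]], [1.0, -1.0]),  # region 1: scale and shift
]

y0 = region_dependent_transform([1.0, 2.0], centroids, transforms)
y1 = region_dependent_transform([9.0, 8.0], centroids, transforms)
```

Each region's transform is linear, but the overall mapping is piecewise
and therefore nonlinear, while only one small set of parameters is
touched per input vector.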

At the software infrastructure level, the method is implemented within
a generic feature transform framework. Under this framework, various
feature transforms can be trained uniformly through a generalized 
back-propagation algorithm.
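
One way such a uniform training interface might look is sketched below.
Every transform exposes a forward pass and a backward pass, so a chain
of transforms can be trained by pushing the gradient of the objective
back through each stage in turn. The class names (Scale, Shift, Chain)
are hypothetical, and a simple squared-error objective stands in for
the discriminative criterion, which is not reproduced here.

```python
class Scale:
    """Element-wise scaling y_i = w_i * x_i with trainable weights w."""
    def __init__(self, w):
        self.w = list(w)

    def forward(self, x):
        self.x = list(x)  # cache the input for the backward pass
        return [wi * xi for wi, xi in zip(self.w, x)]

    def backward(self, grad_y):
        # dL/dw_i = grad_y_i * x_i ; dL/dx_i = grad_y_i * w_i
        self.grad_w = [g * xi for g, xi in zip(grad_y, self.x)]
        return [g * wi for g, wi in zip(grad_y, self.w)]

class Shift:
    """Element-wise offset y_i = x_i + b_i with trainable offsets b."""
    def __init__(self, b):
        self.b = list(b)

    def forward(self, x):
        return [xi + bi for xi, bi in zip(x, self.b)]

    def backward(self, grad_y):
        # dL/db_i = grad_y_i ; the gradient passes through to x unchanged
        self.grad_b = list(grad_y)
        return list(grad_y)

class Chain:
    """Compose transforms; backward visits the stages in reverse order."""
    def __init__(self, stages):
        self.stages = list(stages)

    def forward(self, x):
        for stage in self.stages:
            x = stage.forward(x)
        return x

    def backward(self, grad):
        for stage in reversed(self.stages):
            grad = stage.backward(grad)
        return grad

chain = Chain([Scale([2.0, 3.0]), Shift([1.0, -1.0])])
y = chain.forward([1.0, 2.0])
# Stand-in objective L = 0.5 * sum(y_i ** 2), so dL/dy = y.
grad_x = chain.backward(y)
```

Because every stage implements the same two methods, a new kind of
transform can be added without changing the training loop.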

The method has been developed in the context of a state-of-the-art
speech recognition system, which raises various questions about how the
method interacts with the rest of the system. These issues include, for
instance, how well the feature transform generalizes across different
acoustic models, and how to integrate discriminatively trained feature
transforms with maximum-likelihood-based speaker adaptation.
Experimental approaches are developed in this thesis to address these
issues.

Finally, the thesis shows that, using the discriminatively trained RDT,
we are able to obtain up to 7% relative WER reduction over
state-of-the-art systems.

Co-advisors: Dr. John Makhoul, BBN Technologies
              Dr. Harriet Fell, Northeastern University