Speaker Normalization

Speaker normalization aims to reduce inter-speaker differences in acoustic features. I formulate the speaker normalization problem as a frequency domain correspondence problem: the frequency axes of different speakers need to be aligned before matching, and I call this alignment between frequency axes a frequency domain correspondence. In the literature, Vocal Tract Length Normalization (VTLN) is a well-known method for estimating certain restricted frequency domain alignments (linear, piece-wise linear, or nonlinear). However, the target model of VTLN is a speaker-averaged model, which is too smooth to yield an alignment that truly normalizes different speakers; VTLN typically achieves a relative improvement of 5-10% over the baseline.

In my method, the alignment is formulated as a path through a 2D grid network. After assigning a similarity measure between the frequency bins of the two frequency axes, I solve for the best path via dynamic programming (DP). To define the similarity between two frequency bins, a vision-based feature -- the Histogram of Oriented Gradients (HOG) -- is used to describe the textural structure of local patches in the speech spectrogram. Each frequency bin is represented by the set of HOG descriptors accumulated along the time axis for that bin, and the similarity between two bins is then simply a similarity between two sets of vectors. Combining the HOG feature with dynamic programming, the frequency domain correspondence can be established effectively, and after alignment the spectral patterns of different speakers can be matched. The novel points of the method are the localized spectral-temporal representation (the HOG feature) and the unsupervised correspondence estimation (dynamic programming). Experiments on the TIDIGITS corpus confirm that this method significantly reduces speaker differences, yielding a large error rate reduction (more than 90% relative).
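
A minimal sketch of the alignment step is given below (in Python with numpy), under simplifying assumptions: each frequency bin is already summarized by a single descriptor vector standing in for the accumulated HOG set, and cosine similarity stands in for the actual set-to-set measure. The function names and toy dimensions are illustrative, not part of the actual system.

    import numpy as np

    def bin_similarity(desc_a, desc_b):
        """Cosine similarity between two per-bin descriptor vectors."""
        denom = np.linalg.norm(desc_a) * np.linalg.norm(desc_b) + 1e-12
        return float(np.dot(desc_a, desc_b) / denom)

    def align_frequency_axes(descs_a, descs_b):
        """Find a monotonic path through the 2D grid of frequency bins
        (rows: speaker A bins, cols: speaker B bins) that maximizes the
        accumulated bin-to-bin similarity, via dynamic programming."""
        n_a, n_b = len(descs_a), len(descs_b)
        sim = np.array([[bin_similarity(a, b) for b in descs_b] for a in descs_a])

        # DP table: best accumulated similarity reaching cell (i, j).
        acc = np.full((n_a, n_b), -np.inf)
        back = np.zeros((n_a, n_b), dtype=int)   # 0: diagonal, 1: up, 2: left
        acc[0, 0] = sim[0, 0]
        for i in range(n_a):
            for j in range(n_b):
                if i == 0 and j == 0:
                    continue
                candidates = []
                if i > 0 and j > 0:
                    candidates.append((acc[i - 1, j - 1], 0))
                if i > 0:
                    candidates.append((acc[i - 1, j], 1))
                if j > 0:
                    candidates.append((acc[i, j - 1], 2))
                best_prev, move = max(candidates)
                acc[i, j] = best_prev + sim[i, j]
                back[i, j] = move

        # Backtrack from the last cell to recover the alignment path.
        path, i, j = [], n_a - 1, n_b - 1
        while True:
            path.append((i, j))
            if i == 0 and j == 0:
                break
            move = back[i, j]
            if move == 0:
                i, j = i - 1, j - 1
            elif move == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    # Toy usage: two speakers' 64 bins, each described by a 36-dim vector.
    rng = np.random.default_rng(0)
    path = align_frequency_axes(rng.random((64, 36)), rng.random((64, 36)))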

Speaker Modeling

Speaker modeling aims to capture speaker characteristics in the acoustic feature space. The Universal Background Model (UBM) combined with Maximum A Posteriori (MAP) adaptation is a very efficient framework for text-independent speaker recognition. Recently, I have been trying to combine generative probabilistic models with discriminative learning methods to achieve better modeling efficiency. In the literature, Fisher mapping and augmented probabilistic modeling are two other ways of combining discriminative learning with generative probabilistic modeling. In my proposal, an Iterative Cohort Model (ICM) was proposed to learn a better metric in the mapped supervector space. It turns out that the sufficient statistics of the conventional EM algorithm are exactly the mapped supervectors. Moreover, the Fisher mapping can be written as a function of these sufficient statistics, which connects the Fisher mapping with our utterance transform.
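
As an illustration of the supervector mapping discussed above, the following sketch computes the zeroth- and first-order sufficient statistics of an utterance against a diagonal-covariance GMM-UBM and stacks the MAP-adapted means into a supervector. The UBM parameters, relevance factor, and toy dimensions are assumptions for the example, not the exact configuration used in this work.

    import numpy as np

    def ubm_posteriors(frames, weights, means, variances):
        """Per-frame posterior probability of each UBM component."""
        # Log-likelihood of every frame under every diagonal Gaussian.
        diff = frames[:, None, :] - means[None, :, :]            # (T, C, D)
        log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                            + np.sum(np.log(2 * np.pi * variances), axis=1))
        log_post = np.log(weights) + log_gauss                   # (T, C)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    def map_adapted_supervector(frames, weights, means, variances, relevance=16.0):
        """Map one utterance to a supervector by MAP-adapting the UBM means."""
        post = ubm_posteriors(frames, weights, means, variances)
        n_c = post.sum(axis=0)                                   # zeroth-order stats
        f_c = post.T @ frames                                    # first-order stats
        alpha = (n_c / (n_c + relevance))[:, None]
        adapted_means = alpha * (f_c / np.maximum(n_c, 1e-8)[:, None]) \
            + (1.0 - alpha) * means
        return adapted_means.reshape(-1)                         # stack into one vector

    # Toy usage: a 4-component, 13-dim UBM and a 200-frame utterance.
    rng = np.random.default_rng(0)
    C, D, T = 4, 13, 200
    w = np.full(C, 1.0 / C)
    mu = rng.normal(size=(C, D))
    var = np.ones((C, D))
    sv = map_adapted_supervector(rng.normal(size=(T, D)), w, mu, var)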

Face Recognition

The key of any pose-invariant approach for face recognition is how to define a suitable similarity measure that is invariant, or at least robust, to different poses. To this end, one possibility is to find a \emph{pose-invariant representation} for face images. The 3D morphable model is a representative method of this kind and has been shown to be very promising for face recognition across different poses and illuminations. However, fitting a 3D morphable model is rather time consuming and requires a good initialization, which usually can only be obtained manually; this considerably limits its application. Another possibility is to derive a \emph{pose-invariant similarity} from the face images directly. Along this line, component-based (sometimes called "feature-based") approaches for face recognition have been proposed. The underlying idea is to compute face similarity only from local patches defined at certain facial components, such as the eyes, instead of the whole face image. I proposed a 2D warping based approach to tackle the pose-invariant face recognition problem. The basic idea is to warp each pair of corresponding components before matching. Instead of applying an affine or perspective warp, I adopt dynamic programming to learn a nonparametric warping function, because most facial structures are not truly planar. The nonparametric warp therefore makes the warped component fit the stored templates as closely as possible.
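
The sketch below illustrates the dynamic-programming warping idea in a heavily simplified form: the warp is restricted to a monotonic column-to-column alignment between a probe component patch and a gallery template, with squared pixel differences as the local cost. The actual 2D warping model is richer; this only shows the DP machinery and uses made-up patch sizes.

    import numpy as np

    def warp_cost(probe_patch, gallery_patch):
        """Minimum accumulated column-matching cost between two gray patches."""
        cols_p = probe_patch.T          # each row is one image column
        cols_g = gallery_patch.T
        n_p, n_g = len(cols_p), len(cols_g)

        # Local cost: squared difference between column pixel profiles.
        local = np.array([[np.sum((p - g) ** 2) for g in cols_g] for p in cols_p])

        # Standard DP recursion over the column grid (DTW-style).
        acc = np.full((n_p, n_g), np.inf)
        acc[0, 0] = local[0, 0]
        for i in range(n_p):
            for j in range(n_g):
                if i == 0 and j == 0:
                    continue
                prev = min(acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                           acc[i - 1, j] if i > 0 else np.inf,
                           acc[i, j - 1] if j > 0 else np.inf)
                acc[i, j] = prev + local[i, j]
        return acc[-1, -1]

    # Toy usage: compare an eye-region patch against two stored templates and
    # keep the template with the smallest warped matching cost.
    rng = np.random.default_rng(0)
    probe = rng.random((24, 32))
    templates = [rng.random((24, 32)) for _ in range(2)]
    best = min(range(len(templates)), key=lambda k: warp_cost(probe, templates[k]))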

Audio/Visual Fusion

Fusion of multimodal information is an important area for modern pattern recognition systems. With the increasing availability of multimodal data, more and more pattern recognition systems fuse different modalities to achieve better performance and robustness, such as audio/visual speech recognition, a/v speaker recognition, and a/v person tracking. As a special type of fusion, audio/visual fusion is particularly interesting because these two modalities are the most important ones for human-computer communication; audio/visual speech recognition has been shown to outperform conventional speech recognizers. The fusion schemes proposed in the literature fall mainly into three types. The first is early fusion, also called feature-level fusion, which simply concatenates the features of the different modalities; this type of fusion often performs worse than the other methods. The second is late fusion, also called model-based fusion, which combines the two modalities by fusing two single-modality statistical models into a hybrid (multi-modal) statistical model whose two types of observations are fused at the model level. The third is decision-level fusion, which fuses the outputs of two single-modality statistical models.

In all three fusion methods, the audio and visual features are each treated as a whole vector. However, the individual dimensions of the audio and visual features are not equally correlated with each other. Therefore, I proposed a joint dimension reduction scheme to find the most correlated directions inside the original audio/visual feature spaces. It is a joint dimension reduction because two sets of bases are found simultaneously, one for the audio feature space and one for the visual feature space, and the optimization criterion is to preserve the correlation between the audio and visual features of the original spaces. After dimension reduction, a joint distribution is learned in the more compact audio/visual feature space to capture the correlation between the two modalities, while the remaining uncorrelated dimensions of the original spaces are modeled by marginal distributions within each feature space. With the proposed method, better modeling efficiency is achieved in the compact feature space, i.e., fewer model parameters than a joint a/v distribution in the original feature space requires. I formulate the joint dimension reduction problem in the framework of canonical correlation analysis (CCA), which has an efficient and stable existing solution. Experimental results show that this method successfully captures the correlation between the a/v modalities and achieves a larger fusion benefit than conventional fusion methods.
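
For concreteness, here is a small numpy sketch of CCA-based joint dimension reduction in the regularized classical formulation: it finds paired audio and visual projection bases that maximize the correlation between the projected features. The feature dimensions, number of retained canonical pairs, and ridge value are illustrative assumptions.

    import numpy as np

    def cca_projections(audio, visual, n_components=8, reg=1e-3):
        """Find paired projection bases that maximize audio/visual correlation."""
        A = audio - audio.mean(axis=0)
        V = visual - visual.mean(axis=0)
        n = len(A)

        # Covariance and cross-covariance with a small ridge for stability.
        Saa = A.T @ A / n + reg * np.eye(A.shape[1])
        Svv = V.T @ V / n + reg * np.eye(V.shape[1])
        Sav = A.T @ V / n

        def inv_sqrt(S):
            vals, vecs = np.linalg.eigh(S)
            return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

        Saa_is, Svv_is = inv_sqrt(Saa), inv_sqrt(Svv)
        U, corrs, Wt = np.linalg.svd(Saa_is @ Sav @ Svv_is)

        # Paired bases: project audio with Wa and visual with Wv; the k-th
        # projected coordinates are maximally correlated across modalities.
        Wa = Saa_is @ U[:, :n_components]
        Wv = Svv_is @ Wt.T[:, :n_components]
        return Wa, Wv, corrs[:n_components]

    # Toy usage: 500 synchronized audio (39-dim) / visual (30-dim) feature pairs
    # driven by a hidden common factor, so the first canonical correlations are high.
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(500, 5))
    audio = shared @ rng.normal(size=(5, 39)) + 0.1 * rng.normal(size=(500, 39))
    visual = shared @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(500, 30))
    Wa, Wv, corrs = cca_projections(audio, visual)
    za, zv = (audio - audio.mean(0)) @ Wa, (visual - visual.mean(0)) @ Wv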

02/08/2007
© Ming Liu