
Rapid Speaker Normalization and Adaptation with Applications to Automatic Evaluation of Children's Language Learning Skills

What
  • PhD Defenses
When Mar 04, 2010
from 02:00 PM to 03:00 PM
Where 4549 Boelter Hall

Shizhen Wang
Advisor: Abeer Alwan

Thursday, March 4, 2010 at 2:00pm
Boelter Hall Conference Room 4549

Abstract:
This dissertation investigates speaker variation in automatic speech recognition (ASR), with a focus on rapid speaker normalization and adaptation methods that use limited enrollment data from the speaker. The work concentrates on reducing spectral variation through frequency warping.

Two methods are developed, one based on the supraglottal (vocal tract) resonances (formants), and the other on resonances of the subglottal airways. The first method reshapes (warps) the spectrum by aligning corresponding formant peaks. Because formant structure varies at several levels, regression-tree based phoneme- and state-level spectral peak alignment is studied for rapid speaker adaptation, using a linearization of the vocal tract length normalization (VTLN) technique. This method is investigated in a maximum likelihood linear regression (MLLR)-like framework, taking advantage of both the efficiency of frequency warping (VTLN) and the reliability of statistical estimation (MLLR). Two different regression classes are investigated: one based on phonetic classes (using combined knowledge and data-driven techniques) and the other based on Gaussian mixture classes.
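To make the idea of warping the spectrum by aligning formant peaks concrete, here is a minimal sketch of a generic piecewise-linear VTLN-style warp. It is not the regression-tree method developed in the dissertation: the function names, the single-peak warp factor, the 8 kHz Nyquist frequency, and the knee ratio of 0.85 are all assumptions chosen for illustration.

```python
def warp_factor(speaker_peak_hz, reference_peak_hz):
    """Scale factor that moves the speaker's formant peak onto the reference peak."""
    if speaker_peak_hz <= 0 or reference_peak_hz <= 0:
        raise ValueError("peak frequencies must be positive")
    return reference_peak_hz / speaker_peak_hz


def piecewise_linear_warp(freq_hz, alpha, nyquist_hz=8000.0, knee_ratio=0.85):
    """Warp one frequency: scale by alpha below a knee, then interpolate
    linearly so that the Nyquist frequency maps onto itself."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= freq_hz <= nyquist_hz:
        raise ValueError("frequency must lie in [0, nyquist_hz]")
    # Pull the knee down when alpha > 1 so the warped knee stays below Nyquist.
    knee_hz = knee_ratio * nyquist_hz / max(alpha, 1.0)
    if freq_hz <= knee_hz:
        return alpha * freq_hz
    warped_knee_hz = alpha * knee_hz
    slope = (nyquist_hz - warped_knee_hz) / (nyquist_hz - knee_hz)
    return warped_knee_hz + slope * (freq_hz - knee_hz)


# Hypothetical values: a speaker's third formant at 3000 Hz, a reference at 2700 Hz.
alpha = warp_factor(speaker_peak_hz=3000.0, reference_peak_hz=2700.0)
warped = [piecewise_linear_warp(f, alpha) for f in (0.0, 3000.0, 8000.0)]
```

The linear segment above the knee keeps the warp monotonic and keeps the signal bandwidth unchanged, which is why this shape is a common choice for VTLN-style warping.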

The second approach utilizes subglottal resonances, which have been shown to affect spectral properties of speech sounds. A reliable algorithm is developed to automatically estimate the second subglottal resonance (Sg2) from speech signals. The algorithm is calibrated on children's speech data with simultaneous accelerometer recordings from which Sg2 frequencies can be directly measured. A cross-language study with bilingual Spanish-English children is performed to investigate whether Sg2 frequencies are independent of speech content and language. The study verifies that Sg2 is approximately constant for a given speaker and thus can be a good candidate for limited-data speaker normalization and cross-language adaptation. A speaker normalization method is then presented using Sg2.
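Because Sg2 is described as approximately constant for a given speaker, a handful of per-utterance estimates could in principle be pooled into one speaker-level value and turned into a warp factor. The abstract does not state the dissertation's actual normalization rule, so the sketch below is only one plausible reading: the median pooling, the ratio-based factor, the function names, and all frequency values are assumptions.

```python
from statistics import median


def speaker_sg2(estimates_hz):
    """Pool per-utterance Sg2 estimates into one value per speaker.
    The median is used here because it is robust to an occasional bad estimate."""
    if not estimates_hz:
        raise ValueError("at least one Sg2 estimate is required")
    return median(estimates_hz)


def sg2_warp_factor(estimates_hz, reference_sg2_hz):
    """Assumed rule: scale frequencies by reference Sg2 over speaker Sg2."""
    if reference_sg2_hz <= 0:
        raise ValueError("reference Sg2 must be positive")
    return reference_sg2_hz / speaker_sg2(estimates_hz)


# Hypothetical estimates (Hz) from three short utterances by one child.
child_estimates_hz = [1810.0, 1795.0, 1802.0]
alpha = sg2_warp_factor(child_estimates_hz, reference_sg2_hz=1600.0)
```

Since the estimate does not depend on what was said or in which language, the same factor could be reused across utterances, which is what makes such a quantity attractive when enrollment data are scarce.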

As an application, ASR techniques are applied to automatically evaluate children's phonemic awareness through three blending tasks (phoneme blending, onset-rhyme blending and syllable blending). The system incorporates speaker normalization, disfluency detection and Spanish accent detection, together with speech recognition to assess the overall quality of children's speech productions.

Biography:
Shizhen Wang received the B.S. degree from Shandong University, Jinan, China, in 2002, and the M.S. degree from Tsinghua University, Beijing, China, in 2005, both in electrical engineering. He is currently working towards the Ph.D. degree in electrical engineering at the University of California, Los Angeles (UCLA). His research interests include speech recognition, speaker normalization and adaptation, computer-aided language learning, and statistical signal processing.
