Voice Source Project


Project Summary

In voiced speech, the vocal folds open and close quasi-periodically, converting the glottal airflow (air volume velocity) into a train of flow pulses referred to as the voice source excitation signal.

Early models of the source signal represented voiced excitation with a simple impulse train; later parametric models refined the shape of the glottal pulse. None of these models, however, has been calibrated against direct observations of glottal area changes, which are the proximal cause of the air pressure changes that we hear as sound. The effective study of the voice source thus requires both more accurate source models and a comprehensive set of underlying observations on which to base them. The primary goal of the proposed research is to develop and evaluate a new, more powerful source model based on direct observations of vocal fold vibrations.
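To make the contrast concrete, here is a minimal sketch (in Python with NumPy/SciPy; the project's own tools are written in Matlab) that synthesizes a vowel-like sound two ways: once with the simple impulse-train excitation of early models, and once with a smoother parametric glottal pulse. The pulse is a classic Rosenberg-style approximation used purely for illustration; it is not the model proposed in this project, and all parameter values (F0, open quotient, formants, bandwidths) are arbitrary choices for the example.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                       # sampling rate (Hz)
    f0 = 120                         # fundamental frequency (Hz), arbitrary
    period = int(fs / f0)            # samples per glottal cycle
    n = int(0.5 * fs)                # 0.5 s of signal

    # Excitation 1: the simple impulse train used by early source models.
    impulses = np.zeros(n)
    impulses[::period] = 1.0

    # Excitation 2: a Rosenberg-style glottal flow pulse (illustrative only):
    # a half-cosine rise while the glottis opens, a quarter-cosine fall while
    # it closes, and zero flow during the closed phase.
    def rosenberg_cycle(period, open_quotient=0.6, rise_fraction=0.67):
        n_open = int(period * open_quotient)
        n_rise = int(n_open * rise_fraction)
        n_fall = n_open - n_rise
        cycle = np.zeros(period)
        cycle[:n_rise] = 0.5 * (1 - np.cos(np.pi * np.arange(n_rise) / n_rise))
        cycle[n_rise:n_open] = np.cos(0.5 * np.pi * np.arange(n_fall) / n_fall)
        return cycle

    glottal = np.tile(rosenberg_cycle(period), n // period + 1)[:n]

    # A crude vocal tract: a cascade of two-pole resonators at the first three
    # formants of an /a/-like vowel (textbook approximations, not measured values).
    def vocal_tract(x, fs, formants=(700, 1220, 2600), bandwidths=(80, 120, 160)):
        y = x
        for f, bw in zip(formants, bandwidths):
            r = np.exp(-np.pi * bw / fs)
            y = lfilter([1.0], [1.0, -2 * r * np.cos(2 * np.pi * f / fs), r * r], y)
        return y

    vowel_impulse = vocal_tract(impulses, fs)   # buzzy, early-model quality
    vowel_glottal = vocal_tract(glottal, fs)    # smoother spectral slope

The impulse train excites all harmonics at equal amplitude, while the smoother pulse attenuates the higher harmonics; that difference in spectral slope is a large part of what listeners hear as voice quality.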

Besides the critical need to calibrate source models against underlying physiological data, we also need to better understand the link between model parameters and perceived voice quality. None of the previous source models has been systematically validated perceptually; that is, we cannot presently predict how a given change in a model parameter will affect what listeners hear.

The voice source contains important lexical and non-lexical information. The non-lexical information can convey, for example, prosodic events, emotional state, and cues to the uniqueness of the speaker's voice. In engineering applications, there is a need for a more accurate source model that can capture different voice qualities. Such a model could improve the naturalness of text-to-speech (TTS) systems. In addition, understanding which aspects of the source signal, if any, are speaker-specific should aid in developing better speaker identification algorithms.

We propose to build on our preliminary work in developing a new source model by (1) recording high-speed images of vocal fold vibrations with simultaneous audio recordings, (2) analyzing the corpus to better parameterize the new voice source model and to study speaker variability, (3) performing perception experiments to uncover which aspects of the glottal model are perceptually salient, and (4) using the model in TTS and speaker identification algorithms.

The project fosters interdisciplinary activities spanning engineering, linguistics, and voice science.

This work is supported in part by NSF Grant No. IIS-1018863 and by NIH/NIDCD Grant Nos. DC01797 and DC011300.


Keywords

Voice source, high-speed recording, vocal folds, speech synthesis, speech production model, perceptual validation.


Shareware

Glottaltopogram (GTG) analysis tool: a toolkit for analyzing high-speed laryngeal videos.

Glottaltopography is a method for analyzing high-speed laryngeal videos, described in G. Chen, J. Kreiman, and A. Alwan, "The glottaltopogram: a method of analyzing high-speed images of the vocal folds," Computer Speech and Language, 2014, in press. Briefly, the glottaltopogram is based on principal component analysis of the pixels' light-intensity time sequences across consecutive video images. It reveals the overall synchronization of the vibrational patterns of the vocal folds over the entire laryngeal area and is effective in visualizing both normal and pathological vocal fold vibratory patterns. The GTG toolkit is available for download here.
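For readers who want the gist in code, below is a minimal sketch of that idea in Python/NumPy, written from the description above; the preprocessing, normalization, and display choices of the published method are detailed in the paper. Each pixel's light-intensity time series feeds the PCA, and the component loadings are mapped back onto the image plane:

    import numpy as np

    def glottaltopogram_sketch(frames):
        """PCA over per-pixel intensity time series of a high-speed video.

        frames: array of shape (T, H, W) holding T consecutive grayscale
        images. Returns the first two principal-component loading maps,
        each of shape (H, W), which visualize how strongly each pixel
        follows the dominant vibratory patterns across the larynx.
        """
        T, H, W = frames.shape
        X = frames.reshape(T, H * W).astype(float)
        X -= X.mean(axis=0)              # center each pixel's time series
        # SVD of the (time x pixel) matrix: rows of Vt are spatial maps,
        # columns of U are the corresponding temporal patterns.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return Vt[0].reshape(H, W), Vt[1].reshape(H, W)

    # Usage with synthetic frames (a real analysis would load video data):
    rng = np.random.default_rng(0)
    pc1_map, pc2_map = glottaltopogram_sketch(rng.random((200, 64, 64)))

Pixels whose brightness varies in step with the dominant vibratory pattern receive large loadings of the same sign, so regions of synchronized (or desynchronized) vibration stand out in the resulting maps.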

VoiceSauce: A Program for Voice Analysis

VoiceSauce is a Matlab application that provides automated voice measurements over time from audio recordings. Inputs are standard wave (*.wav) files, and the measures currently computed are: F0, formants F1-F4, H1(*), H2(*), H4(*), A1(*), A2(*), A3(*), H1(*)-H2(*), H2(*)-H4(*), H1(*)-A1(*), H1(*)-A2(*), H1(*)-A3(*), Energy, and Cepstral Peak Prominence ... (details)
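As a small illustration of one of these measures, the sketch below (Python/NumPy, not VoiceSauce's actual Matlab code) computes an uncorrected H1-H2 value, the level difference in dB between the first and second harmonics, for a single frame whose F0 is supplied by the caller. VoiceSauce itself tracks F0 automatically, and the starred measures additionally correct the harmonic amplitudes for the influence of the formants.

    import numpy as np

    def h1_h2(frame, fs, f0):
        """Uncorrected H1-H2 (dB) for one speech frame with known F0.

        The amplitude of harmonic k is taken as the largest spectral peak
        within a couple of FFT bins of k * f0.
        """
        windowed = frame * np.hanning(len(frame))
        spectrum = 20 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
        freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

        def harmonic_amplitude(k):
            idx = int(np.argmin(np.abs(freqs - k * f0)))
            return spectrum[max(idx - 2, 0):idx + 3].max()

        return harmonic_amplitude(1) - harmonic_amplitude(2)

    # Usage on a synthetic two-harmonic signal with F0 = 120 Hz:
    fs, f0 = 16000, 120.0
    t = np.arange(int(0.04 * fs)) / fs                     # one 40-ms frame
    frame = np.sin(2 * np.pi * f0 * t) + 0.3 * np.sin(2 * np.pi * 2 * f0 * t)
    print(h1_h2(frame, fs, f0))   # ~ +10 dB: H1 is stronger than H2

Measures such as H1(*)-H2(*) matter here because they correlate with glottal parameters like open quotient, a relationship examined in several of the project references below.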


Project References

G. Chen, J. Kreiman, and A. Alwan, "The glottaltopogram: a method of analyzing high-speed images of the vocal folds," Computer Speech and Language, 2014, in press. [link to the journal article]

G. Chen, M. Garellek, J. Kreiman, B. R. Gerratt, and A. Alwan, "A perceptually and physiologically motivated voice source model," Interspeech 2013, pp. 2001-2005. [Best Student Paper Award finalist] [slides and audio samples]

G. Chen, R. A. Samlan, J. Kreiman, and A. Alwan, "Investigating the relationship between glottal area waveform shape and harmonic magnitudes through computational modeling and laryngeal high-speed videoendoscopy," Interspeech 2013, pp. 3216-3220. [poster]

M. Garellek, C. M. Esposito, P. Keating, and J. Kreiman, "Voice quality and tone identification in White Hmong," Journal of the Acoustical Society of America, vol. 133, no. 2, pp. 1078-1089, 2013. [Link to the journal article]

M. Garellek, "Production and perception of glottal stops," Ph.D. dissertation, Dept. of Linguistics, UCLA, May 2013.

M. Garellek, P. Keating, and C. M. Esposito, "Relative importance of phonation cues in White Hmong tone perception," Proceedings of BLS 38 (Berkeley Linguistics Society), 2012.

G. Chen, J. Kreiman, B. R. Gerratt, J. Neubauer, Y.-L. Shue, and A. Alwan, "Development of a glottal area index that integrates glottal gap size and open quotient," Journal of the Acoustical Society of America, vol. 133, no. 3, pp. 1656-1666, March 2013. [link to the journal article]

J. Kreiman, Y.-L. Shue, G. Chen, M. Iseli, B. R. Gerratt, J. Neubauer, and A. Alwan, "Variability in the relationships among voice quality, harmonic amplitudes, open quotient, and glottal area waveform shape in sustained phonation," Journal of the Acoustical Society of America, vol. 132, no. 4, pp. 2625-2632, 2012. [link to the journal article]

G. Chen, Y.-L. Shue, J. Kreiman, and A. Alwan, "Estimating the voice source in noise," Interspeech 2012.

G. Chen, J. Kreiman, and A. Alwan, "The Glottaltopograph: A Method of Analyzing High-Speed Images of the Vocal Folds," ICASSP 2012, pp. 3985-3988.

G. Chen, J. Kreiman, Y.-L. Shue, and A. Alwan, "Acoustic Correlates of Glottal Gaps," Interspeech 2011, pp. 2673-2676.

Y.-L. Shue, G. Chen, and A. Alwan, "On the Interdependencies between Voice Quality, Glottal Gaps, and Voice-Source related Acoustic Measures," Interspeech 2010, pp. 34-37.

G. Chen, X. Feng, Y.-L. Shue, and A. Alwan, "On Using Voice Source Measures in Automatic Gender Classification of Children's Speech," Interspeech 2010, pp. 673-676.

Y.-L. Shue and A. Alwan, "A new voice source model based on high-speed imaging and its application to voice source estimation," ICASSP 2010, pp. 5134-5137.

Y.-L. Shue, J. Kreiman, and A. Alwan, "A Novel Codebook Search Technique for Estimating the Open Quotient," Interspeech 2009, pp. 2895-2898.

Y. Shue, S. Shattuck-Hufnagel, M. Iseli, S. Jun, N. Veilleux, and A. Alwan, "On the acoustic correlates of high and low nuclear pitch accents in American English,'' Speech Communication, 2010, Vol 52, No. 2, pp. 106-122.

Y. Shue, S. Shattuck-Hufnagel, M. Iseli, S. Jun, N. Veilleux, and A. Alwan, " Effects of Intonational Phrase Boundaries on Pitch-Accented Syllables in American English ," Interspeech 2008, pp. 873-876. The Best Student Paper Award.

Y.-L. Shue, M. Iseli, N. Veilleux, and A. Alwan, "Pitch Accent versus Lexical Stress: Quantifying Acoustic Measures Related to the Voice Source," Interspeech 2007, pp. 2625-2628.

M. Iseli, Y.-L. Shue, and A. Alwan, "Age, sex, and vowel dependencies of acoustical measures related to the voice source," Journal of the Acoustical Society of America, vol. 121, no. 4, pp. 2283-2295, April 2007.

M. Iseli, Y.-L. Shue, M. Epstein, P. Keating, and A. Alwan, "Voice Source Correlates of Prosodic Features in American English: A Pilot Study," Proceedings of ICSLP 2006, pp. 2226-2229.

M. Iseli, Y.-L. Shue, and A. Alwan, "Age- and Gender-Dependent Analysis of Voice Source Characteristics," ICASSP 2006, pp. I-389-392, May 2006.


Abeer Alwan (alwan@ee.ucla.edu)