Modeling Speech Perception in Noise


Project Summary

We almost always listen to speech which is degraded by the addition of competing speech and non-speech signals. Fortunately, we are remarkably adept at isolating a specific speech signal from the background noise and understanding what is said. The purpose of this study is to contribute to a broad research program whose aim is to understand and model human perception of speech in noise.

Developing quantitative models of speech perception in noise is important for providing insights into our cognitive abilities and into the perceptual mechanisms of the hearing impaired. People suffering from hearing loss often have the greatest difficulty understanding speech in noisy environments. Quantitative models describing why healthy hearing manages so well in noisy environments are essential for designing hearing aids that begin to recover the noise robustness lost with hearing impairment.

The study will also be useful in the development of robust automatic speech recognition algorithms. Currently, the performance of automatic speech recognizers deteriorates significantly at signal-to-noise ratios high enough for humans to hear and understand perfectly.

Work progresses on two fronts: [1] developing fully-parameterized models which predict human perception in noisy environments, and [2] incorporating these models, or aspects of them, into automatic speech recognition systems and speech coders, to improve the systems' performance in noise.

In a recent study, perceptual experiments were conducted to derive a fully-parameterized model of dynamic auditory perception. The dynamic model predicts the saliency of different parts of changing sounds, providing one possible key to understanding the perception of dynamic speech within static noise backgrounds. Initial evaluation of this model, incorporated into a simple speech recognition system, shows promise for improving the noise robustness of recognition.

In another set of perceptual experiments, we attempt to quantify the masked thresholds of signals in noise as a function of signal center frequency, duration, bandwidth, and signal type. These data emphasize that perceptually-based analysis of speech in noise must account for durational, bandwidth, and signal-type effects.

Work supported by NIH-NIDCD 5 R29 DC 02033-02 and NSF.


Keywords

Masking, Speech-in-Noise, Auditory Models.


Project References

H.A. Gupta, A. Raju and A. Alwan, "Non-Linear Dimension Reduction of Gabor Features for Noise-Robust ASR", ICASSP 2014, accepted.

L. N. Tan and A. Alwan, "Feature Enhancement using Sparse Reference and Estimated Soft-Mask Exemplar-Pairs for Noisy Speech Recognition", ICASSP 2014, accepted.

Anirudh Raju and A. Alwan, "The effect of speaking rate, vowel context, and speaker intelligibility on the perception of consonant vowel consonants in noise", J. Acoust. Soc. Am., 134, 4031, 2013. [link to the abstract]

M. Graciarena, A. Alwan, D. Ellis, H. Franco, L. Ferrer, J. Hansen, A. Janin, B.-S. Lee, Y. Lei, V. Mitra, N. Morgan, S. O. Sadjadi, T.J. Tsai, N. Scheffer, L. N. Tan, B. Williams, "All for One: Feature Combination for Highly Channel-Degraded Speech Activity Detection", Interspeech, Lyon, 2013, pp. 709-713.

Kantapon Kaewtip, Lee Ngee Tan, Abeer Alwan, "A Pitch-Based Spectral Enhancement Technique for Robust Speech Processing", Interspeech 2013, pp. 3284-3288.

L. N. Tan and Abeer Alwan, "Multi-Band Summary Correlogram-based Pitch Detection for Noisy Speech", Speech Communication, Volume 55, Issues 7–8, September 2013, pp. 841-856. [link to the journal article] [Matlab code of MBSC pitch detector]

Julien van Hout and Abeer Alwan, "A Novel Approach to Soft-Mask Estimation and Log-Spectral Enhancement For Robust Speech Recognition", ICASSP 2012, pp. 4105-4108.

W. Chu and Abeer Alwan, "SAFE: A Statistical Approach to F0 Estimation under Clean and Noisy Conditions," IEEE Trans. on Audio, Speech, and Language Processing, Vol. 20, No. 3, pp. 933-944, March 2012.

B. J. Borgstrom and A. Alwan, "A Unified Framework for Designing Optimal STSA Estimators Assuming Additive Superposition of Speech and Noise", IEEE Trans. on Audio, Speech, and Language Processing, Vol. 19, No. 8, pp. 2579-2590, Nov. 2011.

T. Drugman and A. Alwan, "Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics," Interspeech 2011, pp. 1973-1976.

Lee Ngee Tan and Abeer Alwan, "Noise-Robust F0 Estimation Using SNR-Weighted Summary Correlograms From Multi-Band Comb Filters," ICASSP 2011, pp. 4464-4467.

Bengt Borgstrom and Abeer Alwan, "Log-Spectral Amplitude Estimation With Generalized Gamma Distributions For Speech Enhancement," ICASSP 2011, pp. 4756-4759.

A. Alwan, J. Jiang and W. Chen, "Perception of place of articulation for plosives and fricatives in noise," Speech Communication, Vol. 53, Issue 2, pp. 195-209, Feb. 2011.

B. J. Borgstrom, P. H. Borgstrom, and A. Alwan, "Efficient HMM-Based Estimation of Missing Features, with Applications to Packet Loss Concealment," Interspeech 2010, pp. 2394-2397.

W. Chu and A. Alwan, "SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech," Interspeech 2010, pp. 2590-2593. [slides]

B. J. Borgstrom and A. Alwan, "A Statistical Approach to Mel-Domain Mask Estimation for Missing-Feature ASR", IEEE Signal Processing Letters, Vol. 17, No. 11, pp. 941-944, Nov. 2010.

L. N. Tan, B. J. Borgstrom and A. Alwan, "Voice Activity Detection using Harmonic Frequency Components in Likelihood Ratio Test," ICASSP 2010, pp. 4466-4469.

B. J. Borgstrom and A. Alwan, "HMM-Based Reconstruction of Unreliable Spectrographic Data for Noise Robust Speech Recognition", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 5, July 2010.

B. J. Borgstrom and A. Alwan, "Improved Speech Presence Probabilities Using HMM-Based Inference, with Applications to Speech Enhancement and ASR," Journal of Selected Topics in Signal Processing, to appear.

H. You and A. Alwan, "Temporal Modulation Processing of Speech Signals for Noise Robust ASR," Interspeech 2009, pp. 36-39.

W. Chu and A. Alwan, "A Correlation-Maximization Denoising Filter Used as an Enhancement Frontend for Noise Robust Bird Call Classification," Interspeech 2009, pp. 2831-2834. [slides]

V. Mitra, B. Borgstrom, C. Espy-Wilson, and A. Alwan, "A Noise-type and level-dependent MPO-based speech enhancement architecture," Interspeech 2009, pp. 2751-2754.

B. J. Borgstrom and A. Alwan, "Missing Feature Imputation of Log-Spectral Data For Noise Robust ASR," to appear, Workshop on DSP in Mobile and Vehicular Systems, 2009.

B. J. Borgstrom and A. Alwan, "Utilizing Compressibility in Reconstructing Spectrographic Data, with Applications to Noise Robust ASR," IEEE Signal Processing Letters, Vol. 16, Issue 5, pp. 398-401, 2009.

W. Chu and A. Alwan, "Reducing F0 Frame Error of F0 Tracking Algorithms Under Noisy Conditions with an Unvoiced/Voiced Classification Frontend," ICASSP 2009, pp. 3969-3972. [slides]

A. Alwan, "Dealing with Limited and Noisy Data in ASR: A Hybrid Knowledge-Based and Statistical Approach," Keynote Speech at Interspeech 2008, pp. 11-15.

B. J. Borgstrom and A. Alwan, "HMM-Based Estimation of Unreliable Spectral Components for Noise Robust Speech Recognition," Interspeech 2008, pp. 1769-1772.

B. J. Borgstrom, A. Bernard, and A. Alwan, "Error Recovery - Channel Coding and Packetization," Chapter 8 in Automatic Speech Recognition on Mobile Devices and over Communication Networks, Springer-Verlag. Editors: Z.-H. Tan and B. Lindberg, pp. 163-185, 2008.

B. J. Borgstrom and A. Alwan, "An Efficient Approximation of the Forward-Backward Algorithm to Deal With Packet Loss, With Applications to Remote Speech Recognition," ICASSP 2008, pp. 4425-4428.

B. J. Borgstrom and A. Alwan, "A Packetization and Variable Bitrate Interframe Compression Scheme For Vector Quantizer-Based Distributed Speech Recognition," Proceedings of Interspeech 2007, pp. 578-581, Belgium.

Jintao Jiang, Marcia Chen, Abeer Alwan, "On the perception of voicing in syllable-initial plosives in noise", accepted for publication in the Journal of the Acoustical Society of America, November, 2005.

Xiaodong Cui, "Environmental and Speaker Robustness in Automatic Speech Recognition with Limited Learning Data," unpublished Ph.D. dissertation, Dept. of Electrical Engineering, UCLA, 8/05.

X. Cui and A. Alwan, "Noise Robust Speech Recognition Using Feature Compensation Based on Polynomial Regression of Utterance SNR," IEEE Transactions on Speech and Audio Processing, Vol. 13, Number 6, pp. 1161-1172, November 2005.

H. You, Q. Zhu, and A. Alwan, "Entropy-based Variable Frame Rate Analysis of Speech Signals and Its Application to ASR," in Proc. ICASSP, pp. 549-552, Montreal, Canada, May 2004.

X. Cui and A. Alwan, "Combining Feature Compensation and Weighted Viterbi Decoding for Noise Robust Speech Recognition With Limited Adaptation Data," in Proc. ICASSP, pp. 969-972, Montreal, Canada, May 2004.

Q. Zhu and A. Alwan, "Non-linear feature extraction for robust recognition in stationary and non-stationary noise," Computer, Speech, and Language, 17(4): 381-402, Oct. 2003.

X. Cui, Al. Bernard, and A. Alwan, "A Noise-Robust ASR Back-end Technique Based on Weighted Viterbi Recognition," in Proc. EUROSPEECH, Switzerland, pp. 2169-2172, Sept. 2003.

W. Chen and A. Alwan, "Perception of the Place of Articulation Feature for Plosives and Fricatives in Noise," in Proc. ICPhS, Barcelona, August 2003.

James J. Hant and Abeer Alwan, "A Psychoacoustic-Masking Model to Predict the Perception of Speech-Like Stimuli in Noise," Speech Communication, Vol. 40, May 2003, pp. 291-313.

Qifeng Zhu and Abeer Alwan, "The Effect of Additive Noise on Speech Amplitude Spectra: a Quantitative Approach," the IEEE Signal Processing Letters, Vol. 9, Issue 9, Sept. 2002, pp. 275-277

A. Bernard and A. Alwan, "Low-bitrate Distributed Speech Recognition for Packet-based and Wireless Communication", IEEE Transactions on Speech and Audio Processing, Vol. 10, Number 8, pp. 570-580, Nov. 2002.

X. Cui, M. Iseli, Q. Zhu, and A. Alwan, "Evaluation of Noise Robust Features on the Aurora Databases," ICSLP Proceedings, Denver, Colorado, Sep. 2002, Vol. 1, pp. 481-484.

A. Bernard and A. Alwan, "Channel Noise Robustness for Low-Bitrate Remote Speech Recognition," ICSLP Proceedings, Denver, Colorado, Sep. 2002, Vol. 3, pp. 2213-2216.

A. Alwan, Q. Zhu, and J. Lo, "Human and Machine Recognition of Speech Sounds and Noise," Invited paper, Proc. of the World Multiconference on Systems, Cybernetics, and Information, Vol. XIII, pp. 218-223, Florida, Aug. 2001.

Brian Strope and Abeer Alwan, "Modeling the Perception of Pitch-Rate Amplitude Modulation in Noise", in "Computational Models of Auditory Function", a book edited by Steve Greenberg and Malcolm Slaney, pp. 315-327, IOS Press, NATO Science Series, Netherlands, 2001.

Q. Zhu, X. Cui, M. Iseli and A. Alwan, "Noise Robust Feature Extraction for ASR using the Aurora 2 Database," Proc. EUROSPEECH 2001, Aalborg, Denmark, Vol. 1, pp. 185-188.

A. Bernard and A. Alwan, "Joint channel decoding - Viterbi recognition for wireless applications," Proc. EUROSPEECH 2001, Aalborg, Denmark, Vol. 4, pp. 2703-2706.

M. Chen and A. Alwan, "On the Perception of Voicing for Plosives in Noise," Proc. EUROSPEECH 2001, Aalborg, Denmark, Vol. 1, pp. 175-178.

Qifeng Zhu, "Noise Robust Front-End Processing for Automatic Speech Recognition," unpublished Ph.D. dissertation, Dept. of Electrical Engineering, UCLA, 12/01.

J. Hant and A. Alwan, "Predicting the Perceptual Confusion of Synthetic Plosive Consonants in Noise," to appear in the Proceedings of ICSLP 2000.

Q. Zhu and A. Alwan, "On the use of variable frame rate analysis in speech recognition," Proc. IEEE ICASSP, Istanbul, Turkey, Vol. III, pp. 1783-1786, June 2000.

Q. Zhu and A. Alwan, "Amplitude Demodulation of Speech Spectra and its Application to Noise Robust Speech Recognition," 6th International Conference on Spoken Language Processing, ICSLP 2000. Vol. 1, pp. 341-344

J. Hant and A. Alwan, "Modeling the masking of formant transitions in noise," Proc. of Eurospeech 99, Budapest, Hungary, Vol. 4, pp. 1895-1898, September 1999.

A. Bernard and A. Alwan, "Perceptually-based and embedded wideband CELP coding of speech," Proc. of Eurospeech 99, Budapest, Hungary, Vol. 4, pp. 1543-1546, September 1999.

A. Alwan, J. Lo, and Q. Zhu, "Human and machine recognition of nasal consonants in quiet and in noise," Proceedings of the 14th International Congress of Phonetic Sciences, Vol. 1, pp. 167-170, August 1999.

J. Hant, B. Strope, and A. Alwan, "Variable duration notched-noise experiments in a broadband-noise context," Journal of the Acoustical Society of America (JASA), Vol. 104, No. 4, pp. 2451-2456, October 1998.

Brian Strope, "Modeling auditory perception for robust speech recognition," unpublished Ph.D. dissertation, Dept. of Electrical Engineering, UCLA, August 1998.

B. Strope and A. Alwan, "Modeling auditory perception to improve robust speech recognition," Proc. of the 31st Asilomar Conf. on Signals, Systems, and Computers (Invited), IEEE Comput. Soc., Vol. 2, pp. 1056-1060, 1998.

B. Strope and A. Alwan, "Modeling the perception of pitch-rate amplitude modulation in noise," Proc. of the NATO ASI on Computational Hearing, pp. 117-122, Italy, July 1998.

J. Hant, B. Strope, and A. Alwan, "Variable-duration notched-noise experiments in a broadband-noise context," Proc. of the ICA/ASA, pp. 869-870, Seattle, June 1998.

B. Strope and A. Alwan, "Amplitude modulation cues for perceptual voicing distinction," Proc. of the ICA/ASA, pp. 209-210, Seattle, June 1998.

Jeff Lo, "Perception and recognition of nasal consonants in quiet and in noise," M.S. thesis, Dept. of Electrical Engineering, UCLA, 1998.

B. Strope and A. Alwan, "Robust word recognition using threaded spectral peaks," Proc. IEEE ICASSP, Vol. 2, pp. 625-628, Seattle, May 1998.

B. Strope and A. Alwan, "A model of dynamic auditory perception and its application to robust word recognition," IEEE Transactions on Speech and Audio Processing (SAP), Vol. 5, No. 2, pp. 451-464, September 1997.

Vaggelis Petsalis, "Automatic speech recognition of isolated digits in noise," M.S. thesis, Dept. of Electrical Engineering, UCLA, 1997.

J. Hant, B. Strope, and A. Alwan, "A psychoacoustic model for the noise masking of plosive bursts," JASA, Vol. 101, No. 5, pp. 2789-2802, May 1997.

B. Tang, A. Shen, A. Alwan, and G. Pottie, "A perceptually-based embedded subband speech coder," IEEE Transactions on SAP, Vol. 5, No. 2, pp. 131-140, March 1997.

B. Strope and A. Alwan, "Dynamic auditory representations and statistical speech recognition: Threading spectral peaks for robust recognition," Proc. of the Acous. Soc. of Amer. Vol. 100, No. 4, 2788, Dec. 1996.

J. Hant, B. Strope, and A. Alwan, "A Psychoacoustic Model for the Noise Masking of Voiceless Plosive Bursts," Proc. of Int. Conf. of Spoken Language Processing (ICSLP), Philadelphia, October 1996, pp. 570-573.

Jim Hant, "A psychoacoustic model to predict the noise masking of plosive bursts," M.S. thesis, Department of Electrical Engineering, UCLA, June 1996.

J. Hant, B. Strope, and A. Alwan, "Predicting noise-masked thresholds of plosive bursts," 4th Lake Arrowhead Conference on Issues in Advanced Hearing Aid Research, May 1996.

B. Strope and A. Alwan, "A Model of Dynamic Auditory Perception and its Application to Robust Speech Recognition," Proc. of the IEEE Int. Conf. Acous. Speech Sig. Proc. (ICASSP), Vol. I, 37-40, Atlanta, May 1996.

J. Hant, B. Strope, and A. Alwan, "Durational Effects on Masked Thresholds in Noise as a Function of Signal Frequency, Bandwidth, and Type," Proc. of the Acous. Soc. Amer. (ASA), Nov. 1995.

A. Alwan, S. Narayanan, B. Strope, and A. Shen, "Speech Production and Perception Models and their Applications to Synthesis, Recognition, and Coding," Proc. of the Int. Symp. Sig. Sys. and Elec. (ISSSE), Oct. 1995.

B. Strope and A. Alwan, "A First-Order Model of Dynamic Auditory Perception," Proc. NIH Hearing Aid Research and Development Workshop, Sep. 1995.

Brian Strope, "A Model of Dynamic Auditory Perception and its Application to Robust Speech Recognition," M.S. thesis, Department of Electrical Engineering, UCLA, June 1995.

B. Strope and A. Alwan, "A Novel Structure to Compensate for Frequency-Dependent Loudness Recruitment of Sensorineural Hearing Loss," Proc. Int. Con. Acous. Speech Sig. (ICASSP), May 1995, Vol. V, 3539-3542.

A. Shen, B. Tang, A. Alwan, and G. Pottie, "A Robust and Variable-Rate Speech Coder," Proc. ICASSP, May 1995, Vol. I, 249-252.

B. Strope and A. Alwan, "Mapping of Constant Loudness Contours with Filter Mixtures in Digital Hearing Aids," 3rd Lake Arrowhead Conference on Hearing Aid Research, June 1994.

A. Shen, "Perceptually-Based Subband Coding," M.S. thesis, Department of Electrical Engineering, UCLA, June 1994.

A. Alwan, "A Perceptual Metric for Masking," Proc. IEEE ICASSP, Vol. 2, 712-715, April 1993.

A. Alwan, "Modeling speech perception in noise: a case study of the place of articulation feature," the XII Int. Con. of Phon. Sci., Vol. 2, 78-81, August 1991, France.


Abeer Alwan (alwan@icsl.ucla.edu)