Towards Understanding Voice Discrimination Abilities of Humans and Machines

Speaker: Soo Jin Park
Affiliation: UCLA Ph.D. Candidate

Abstract: An individual’s voice can vary dramatically depending on word choice, affect, and other factors. Such intrinsic within-talker variability makes it considerably harder to distinguish talkers by their voices, for both humans and machines. For machines, phonetic content variability substantially degrades performance when utterances are short. Humans, by contrast, are less influenced by content variability and outperform machines under such conditions. Hence, understanding which acoustic features are related to human responses, and how, might provide insights for improving machine performance. Yet little is known about human and machine voice discrimination ability under various kinds of intrinsic within-talker variability.

This dissertation presents studies of the voice discrimination abilities of humans and machines under text, affect, and speaking-style variability. The main focus is on developing a feature set, based on a psychoacoustic model of voice quality, that can be used both to improve machine performance and to find acoustic correlates of human responses. Preliminary experiments indicated that the voice quality feature set (VQual1) was promising for predicting human responses and for improving automatic speaker verification (ASV) performance, which degraded significantly under text, affect, and/or speaking-style variability. VQual1 was then modified into a second set (VQual2) to better differentiate talkers, leading to further improvements in short-utterance text-independent ASV tasks.

The voice discrimination abilities of humans and machines for very short utterances (2 sec) under high text and style variability were analyzed using read sentences and pet-directed speech. Humans were more accurate than machines for read sentence pairs, but the performance difference became small for style-mismatched pairs and for perceptually marked talkers. Humans’ and machines’ decision spaces were weakly correlated, indicating a weak or non-linear relationship between human and machine talker representations. However, for different-talker pairs, the VQual2-based system responses were highly correlated with human responses. Results also suggested that machines could supplement human decisions for perceptually marked talkers. Additionally, VQual2 was effective in perceived affect recognition, suggesting another application in which voice quality features can help predict human decisions.

Biography: Soo Jin Park received her B.S. and M.S. degrees in Electrical and Electronic Engineering from Yonsei University, Seoul, Korea, in 2011 and 2013, respectively. She is currently a Ph.D. candidate in the Electrical and Computer Engineering Department at UCLA under the supervision of Prof. Abeer Alwan. Soo Jin’s doctoral research has focused on applying knowledge from perception studies to automatically retrieving information from a speech signal, such as the identity and emotional state of the speaker. Much of her research has been conducted in collaboration with the UCLA medical school and linguistics department.

For more information, contact Prof. Abeer Alwan.

Date(s) - Mar 06, 2019
2:00 pm - 4:00 pm

E-IV Tesla Room #53-125
420 Westwood Plaza - 5th Flr., Los Angeles CA 90095