Security

Vulnerabilities in Speech Emotion Recognition Models Exposed by Researchers

10 August 2024 | Zaker Adham

Summary

Recent advancements in speech emotion recognition have showcased the potential of these technologies across various applications. However, the underlying models are vulnerable to adversarial attacks. Researchers at the University of Milan have systematically evaluated the impact of white-box and black-box attacks on speech emotion recognition across different languages and genders. Their findings were published on May 27 in Intelligent Computing.


The study highlights the significant vulnerability of convolutional neural network long short-term memory (CNN-LSTM) models to adversarial examples: carefully crafted inputs designed to mislead the models into making incorrect predictions. The research indicates that all tested adversarial attacks can substantially degrade the performance of speech emotion recognition models. According to the authors, the susceptibility of these models to adversarial attacks "could trigger serious consequences."
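To make the attacked architecture concrete, here is a minimal CNN-LSTM classifier sketch in PyTorch. The layer sizes and the seven-emotion output are illustrative assumptions, not the paper's exact configuration: convolutions extract local time-frequency patterns from a log-Mel spectrogram, and an LSTM models how those patterns evolve over time.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Hypothetical CNN-LSTM emotion classifier; sizes are illustrative."""

    def __init__(self, n_mels=64, n_emotions=7):
        super().__init__()
        # Convolutions extract local time-frequency patterns from the
        # input log-Mel spectrogram of shape (batch, 1, n_mels, time).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # The LSTM models how those patterns evolve over time.
        self.lstm = nn.LSTM(input_size=64 * (n_mels // 4),
                            hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, n_emotions)

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        z = self.conv(x)                       # (batch, 64, n_mels/4, time/4)
        z = z.permute(0, 3, 1, 2).flatten(2)   # (batch, time/4, features)
        out, _ = self.lstm(z)
        return self.head(out[:, -1])           # logits from the last step
```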


The researchers proposed a methodology for audio data processing and feature extraction tailored to the CNN-LSTM architecture. They examined three datasets: EmoDB for German, EMOVO for Italian, and RAVDESS for English. For white-box attacks they used the Fast Gradient Sign Method, the Basic Iterative Method, DeepFool, the Jacobian-based Saliency Map Attack, and the Carlini and Wagner attack; for black-box scenarios they used the One-Pixel Attack and the Boundary Attack.
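As an illustration of the simplest white-box method listed above, the following is a minimal Fast Gradient Sign Method sketch in PyTorch. The names `model`, `spec`, and `label` are hypothetical placeholders for a trained classifier, a batch of input spectrograms, and their true labels, not the authors' code.

```python
import torch
import torch.nn.functional as F

def fgsm(model, spec, label, epsilon=0.01):
    """Perturb `spec` one step along the sign of the loss gradient."""
    spec = spec.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(spec), label)
    loss.backward()
    # Push each input value in the direction that increases the loss.
    return (spec + epsilon * spec.grad.sign()).detach()
```

White-box methods like this one require the gradient, i.e., full access to the model's parameters; the black-box attacks discussed next assume no such access.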


The black-box attacks, particularly the Boundary Attack, achieved impressive results despite having only limited access to the models' internal workings. In some cases, black-box attacks outperformed white-box attacks, generating adversarial examples that were both more effective and less perturbed. The authors noted, "These observations are alarming as they imply that attackers can potentially achieve remarkable results without any understanding of the model's internal operation, simply by scrutinizing its output."
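The following sketch shows the core idea of a decision-based attack of this kind, assuming only hard-label access through a hypothetical `predict(x) -> int` function. It is a simplified random-walk variant of the Boundary Attack (Brendel et al.), not the implementation used in the study: starting from an already-misclassified input, it repeatedly perturbs the sample and nudges it back toward the original, keeping only moves that remain misclassified.

```python
import numpy as np

def boundary_attack_sketch(predict, x_orig, x_adv_start, steps=1000,
                           delta=0.1, epsilon=0.1, rng=None):
    """Walk an adversarial point toward x_orig while staying misclassified."""
    rng = rng or np.random.default_rng(0)
    y_orig = predict(x_orig)
    x_adv = x_adv_start.copy()
    assert predict(x_adv) != y_orig, "starting point must be adversarial"
    for _ in range(steps):
        # 1) Orthogonal step: random noise scaled to the current distance.
        diff = x_orig - x_adv
        noise = rng.standard_normal(x_adv.shape)
        noise *= delta * np.linalg.norm(diff) / (np.linalg.norm(noise) + 1e-12)
        candidate = x_adv + noise
        # 2) Contraction step: move slightly toward the original input.
        candidate = candidate + epsilon * (x_orig - candidate)
        # Accept the move only if the model still mislabels the sample.
        if predict(candidate) != y_orig:
            x_adv = candidate
    return x_adv
```

Note that the attacker only ever reads the model's output label, which is exactly why the authors find these results alarming.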


The research also incorporated a gender-based perspective, investigating how adversarial attacks affect male and female speech differently, as well as speech in different languages. The study found only minor performance differences across the three languages, with English being the most susceptible and Italian showing the highest resistance. Male samples exhibited slightly lower accuracy under attack and slightly smaller perturbations, particularly in white-box scenarios, but the variations between male and female samples were negligible.


"We devised a pipeline to standardize samples across the three languages and extract log-Mel spectrograms. Our methodology involved augmenting datasets using pitch shifting and time stretching techniques while maintaining a maximum sample duration of three seconds," the authors explained. Additionally, the team used the same CNN-LSTM architecture for all experiments to ensure methodological consistency.


While publishing research on vulnerabilities in speech emotion recognition models might seem risky, withholding these findings could be more detrimental. Transparency in research allows both attackers and defenders to understand the weaknesses in these systems. By making these vulnerabilities known, researchers and practitioners can better prepare and fortify their systems against potential threats, ultimately contributing to a more secure technological landscape.