Deep learning framework for speech emotion classification.
Date
2024
Abstract
This research proposes a robust deep learning approach for the recognition and classification of emotion in speech. Emotion recognition and classification occupy a prominent position in human-computer interaction (HCI): emotion helps explain and justify human action and plays a critical role in decision-making. Distinguishing among the emotions expressed in speech signals (anger, sadness, happiness, neutrality, disgust, fear, and surprise) has, however, been a long-standing challenge. Existing deep learning techniques face limitations arising from the complexity of features in human speech (sequential data): insufficient labelled datasets, noise and environmental factors, cross-cultural and linguistic differences, speaker variability, and temporal dynamics. These techniques also rely heavily on tuning large numbers of parameters, often millions, before a model can learn the emotional features needed for classification, which frequently results in computational complexity, overfitting, and poor generalization. This thesis presents an innovative deep learning framework for the recognition and classification of speech emotions, and exhaustively and analytically reviews the deep learning techniques currently in use for speech emotion classification.
This research models various deep learning approaches and architectures to build a dependable and efficient framework for classifying emotions from speech signals. It proposes a deep transfer learning model that addresses the shortcoming of inadequate training datasets for speech emotion classification, and it combines advanced deep transfer learning with a feature selection algorithm to obtain more accurate classification results. Speech emotion classification is further enhanced by combining regularized feature selection (RFS) techniques with attention-based networks, yielding a significant improvement in emotion recognition results. The problem of emotion misclassification is alleviated by selecting salient features from the speech signal that are relevant to emotion classification. By combining regularized feature selection with attention-based mechanisms, the model captures emotional complexity more effectively and outperforms conventional machine learning emotion detection algorithms. The proposed approach is resilient to background noise and cultural differences, which makes it suitable for real-world applications.

Having investigated the enormous computing resources that many deep learning methods require, the research proposes a lightweight deep learning approach that can be deployed on low-memory devices for speech emotion classification. A redesigned VGGNet with an overall model size of 7.94 MB is combined with the best-performing classifier (Random Forest). Extensive experiments and comparisons with other deep learning models (DenseNet, MobileNet, InceptionNet, and ResNet) over three publicly available speech emotion datasets show that the proposed lightweight model improves emotion classification performance with a minimal parameter count.

The research further devises a method that minimizes computational complexity by using a vision transformer (ViT) network for speech emotion classification. The ViT accepts mel-spectrogram input, capturing spatial dependencies and high-level features of the speech signal that are suitable indicators of emotional state. Finally, the research proposes a novel shifted-window-based transformer model for efficient speech emotion classification on bilingual datasets. Because this method promotes feature reuse, it needs fewer parameters and works well with smaller datasets. The proposed models were evaluated on over 3,000 speech emotion samples from the publicly available TESS, EMODB, EMOVO, and bilingual TESS-EMOVO datasets.
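The idea behind regularized feature selection can be illustrated with a minimal sketch, not the thesis' actual RFS algorithm: an L1 penalty ("lasso") drives the weights of irrelevant features to exactly zero, and the indices of the surviving weights act as the selected salient features. All data and hyper-parameters below are synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 30
X = rng.standard_normal((n_samples, n_features))
# The target depends only on features 0, 1, and 2; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] \
    + 0.01 * rng.standard_normal(n_samples)

def lasso_ista(X, y, lam=0.1, lr=0.01, steps=2000):
    # ISTA: a gradient step on the squared error, then a soft-threshold
    # step that shrinks small weights to exactly zero (the L1 penalty).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

w = lasso_ista(X, y)
selected = np.flatnonzero(np.abs(w) > 1e-3)
print(selected)  # indices of the retained (salient) features
```

On this synthetic task the informative features survive the penalty while the noise features are zeroed out, which is the same mechanism that suppresses emotion-irrelevant speech features before classification.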
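How a 1-D speech waveform becomes the 2-D input a ViT-style model consumes can be sketched as follows. This is a generic log-mel-spectrogram computation with assumed parameters (16 kHz sampling, 512-point FFT, 64 mel bands), not the thesis' exact pipeline, and a synthetic tone stands in for a speech recording.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Short-time Fourier transform via framed, Hann-windowed FFTs.
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames).T                     # (n_fft//2+1, n_frames)

    # Triangular mel filterbank spanning 0 Hz .. Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)

    mel = fbank @ power
    return 10.0 * np.log10(np.maximum(mel, 1e-10))  # dB scale

# One second of a synthetic 440 Hz tone as a stand-in for speech.
sr = 16000
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(spec.shape)  # (n_mels, n_frames): a 2-D "image" for the ViT
```

The resulting mel-frequency-by-time matrix is what gets split into patches and embedded by a vision transformer, letting an image-oriented architecture model speech.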
The results showed 98.0% accuracy, 98.7% F1-score, and 97.0% precision across the seven emotion classes.
Description
Doctoral Degree. University of KwaZulu-Natal, Durban.