Details
ISBN:
(Print) 9783031780134; 9783031780141
Voice signals convey hidden, valuable information about speakers, such as age, gender, and emotional state. Extracting this kind of information from human speech is significant in human-computer interaction (HCI). It enables computers to understand human behaviors and to develop interactive systems with customized responses, raising the importance of advances in speech emotion recognition (SER), especially for languages with large numbers of speakers. Although the Egyptian dialect is spoken by over 100 million people, SER studies that address it are extremely scarce and predominantly rely on traditional machine learning models and convolutional neural networks (CNNs) for classification. In this context, we propose an enhanced compact convolutional transformer (CCT) that detects a speaker's age, gender, and emotional state, leveraging the strengths of CNNs for capturing spatial features and of transformers for modeling long-range dependencies. The proposed approach combines the best of both architectures, marking a novel architecture for Egyptian speech emotion recognition. To the best of our knowledge, this is the first work to address age detection from Egyptian speech, as well as the first to propose a unified model for the recognition of age, gender, and emotion from Egyptian speech. In the context of HCI improvements, the proposed model was applied in a real-world setting by integrating it into a custom-developed Egyptian chatbot, enhancing the chatbot's ability to provide emotionally aware responses based on the user's emotional state.
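The abstract does not give the exact layer configuration, so the sketch below only illustrates the general CCT pattern it describes: a convolutional front-end that tokenizes a spectrogram, a transformer encoder over the resulting tokens, and a single shared backbone with three classification heads (emotion, gender, age). All layer sizes, depths, class counts, and the attention-based pooling are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a multi-task compact convolutional transformer (CCT)
# for age/gender/emotion recognition from spectrograms. Dimensions and class
# counts are placeholder assumptions, not values from the paper.
import torch
import torch.nn as nn


class ConvTokenizer(nn.Module):
    """CNN front-end: turns a log-mel spectrogram into a sequence of tokens."""
    def __init__(self, in_ch=1, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(64, embed_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                     # x: (B, 1, n_mels, frames)
        x = self.conv(x)                      # (B, D, H', W')
        return x.flatten(2).transpose(1, 2)   # (B, H'*W', D) token sequence


class MultiTaskCCT(nn.Module):
    """Conv tokenizer + transformer encoder + sequence pooling + three heads."""
    def __init__(self, embed_dim=128, depth=4, heads=4,
                 n_emotions=4, n_genders=2, n_age_groups=3):
        super().__init__()
        self.tokenizer = ConvTokenizer(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=2 * embed_dim,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.attn_pool = nn.Linear(embed_dim, 1)   # learned sequence pooling
        self.emotion_head = nn.Linear(embed_dim, n_emotions)
        self.gender_head = nn.Linear(embed_dim, n_genders)
        self.age_head = nn.Linear(embed_dim, n_age_groups)

    def forward(self, spec):
        tokens = self.encoder(self.tokenizer(spec))            # (B, T, D)
        weights = torch.softmax(self.attn_pool(tokens), dim=1)  # (B, T, 1)
        pooled = (weights * tokens).sum(dim=1)                  # (B, D)
        return (self.emotion_head(pooled),
                self.gender_head(pooled),
                self.age_head(pooled))


if __name__ == "__main__":
    model = MultiTaskCCT()
    dummy = torch.randn(2, 1, 64, 128)       # batch of 2 log-mel spectrograms
    emo, gen, age = model(dummy)
    print(emo.shape, gen.shape, age.shape)    # (2, 4) (2, 2) (2, 3)
```

In such a multi-task setup, the three heads would typically be trained jointly with a (possibly weighted) sum of cross-entropy losses, so that a single shared backbone serves all three prediction tasks, which matches the unified-model claim in the abstract.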