Abstract
Human emotion recognition is a rapidly evolving field in artificial intelligence, crucial for improving human-computer interaction. This paper introduces the MIST (Motion, Image, Speech, and Text) framework, a novel multimodal approach to emotion recognition that integrates diverse data modalities. Unlike existing models that focus on unimodal analysis, MIST leverages the complementary strengths of text (using DeBERTa), speech (using Semi-CNN), facial expression (using ResNet-50), and motion (using 3D-CNN) data to enhance accuracy and reliability. Our evaluation on the BAUM-1 and SAVEE datasets demonstrates that MIST significantly outperforms traditional unimodal approaches and several multimodal baselines in emotion recognition tasks. This research advances the field by providing a richer understanding of emotional states, with potential applications in social robots, personal assistants, and educational technologies.
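The abstract does not specify how the four data streams are combined, so the following is a minimal illustrative sketch of one common design: concatenation-based late fusion of per-modality embeddings. The `MISTFusion` name, the embedding dimensions, the hidden size, and the seven-class output are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class MISTFusion(nn.Module):
    """Hypothetical late-fusion head over four modality embeddings.

    The input sizes are illustrative assumptions only: a DeBERTa-base
    pooled text output (768-d), a Semi-CNN speech embedding (256-d),
    ResNet-50 pooled face features (2048-d), and a 3D-CNN motion
    embedding (512-d). The 7-way output assumes a basic-emotion label
    set such as SAVEE's; the paper's configuration may differ.
    """

    def __init__(self, num_emotions: int = 7,
                 dims=(768, 256, 2048, 512), hidden: int = 512):
        super().__init__()
        # Project each modality into a shared space before fusing.
        self.projections = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims
        )
        # Concatenate the projected embeddings and classify.
        self.classifier = nn.Sequential(
            nn.Linear(hidden * len(dims), hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, num_emotions),
        )

    def forward(self, text, speech, face, motion):
        feats = [proj(x) for proj, x in
                 zip(self.projections, (text, speech, face, motion))]
        return self.classifier(torch.cat(feats, dim=-1))


# Usage with random placeholder embeddings (batch of 4).
model = MISTFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 256),
               torch.randn(4, 2048), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 7])
```

Concatenation is only one plausible reading of the "data stream integration" keyword below; attention-based or gated fusion would be equally consistent with the abstract.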
Original language | English
---|---
Article number | 126236
Number of pages | 12
Journal | Expert Systems with Applications
Volume | 270
Early online date | 14 Jan 2025
DOIs |
Publication status | Published - 22 Jan 2025
Keywords
- Multimodal Emotion Recognition
- MIST Framework
- Deep Learning Models
- BAUM-1 Dataset
- SAVEE Dataset
- Text Emotion Recognition (TER)
- Speech Emotion Recognition (SER)
- Face Emotion Recognition (FER)
- Motion Emotion Recognition (MER)
- Data Stream Integration