Abstract
Human emotion recognition is a rapidly evolving field in artificial intelligence, crucial for improving human-computer interaction. This paper introduces the MIST (Motion, Image, Speech, and Text) framework, a novel multimodal approach to emotion recognition that integrates diverse data modalities. Unlike existing models that focus on unimodal analysis, MIST leverages the complementary strengths of text (using DeBERTa), speech (using Semi-CNN), facial expression (using ResNet-50), and motion (using 3D-CNN) data to enhance accuracy and reliability. Our evaluation, conducted on the BAUM-1 and SAVEE datasets, demonstrates that MIST significantly outperforms traditional unimodal and several multimodal approaches in emotion recognition tasks. This research advances the field by providing a better understanding of emotional states, with potential applications in social robots, personal assistants, and educational technologies.
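The abstract describes combining four modality-specific encoders into one multimodal classifier. A minimal sketch of one common way to do this is late fusion: concatenate the per-modality feature vectors and feed them to a linear softmax head. Note this is illustrative only; the abstract does not state MIST's actual fusion strategy, and the embedding dimensions and class count below are assumptions (768 is typical for DeBERTa-base, 2048 for ResNet-50's pooled features; the speech and motion sizes and the seven-emotion label set are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embedding sizes (not taken from the paper).
DIMS = {"text": 768, "speech": 256, "face": 2048, "motion": 512}
NUM_EMOTIONS = 7  # e.g. six basic emotions plus neutral; varies by dataset


def fuse_and_classify(features, weights, bias):
    """Late fusion: concatenate modality embeddings, apply a linear softmax head."""
    fused = np.concatenate([features[m] for m in sorted(DIMS)])
    logits = weights @ fused + bias
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()


# Dummy embeddings standing in for the four encoder outputs.
features = {m: rng.standard_normal(d) for m, d in DIMS.items()}
total_dim = sum(DIMS.values())
W = rng.standard_normal((NUM_EMOTIONS, total_dim)) * 0.01
b = np.zeros(NUM_EMOTIONS)

probs = fuse_and_classify(features, W, b)
```

In practice the fusion head would be trained jointly with (or on top of frozen) modality encoders; concatenation is only the simplest of several fusion choices, alongside weighted averaging of per-modality predictions or attention-based fusion.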
| Original language | English |
|---|---|
| Article number | 126236 |
| Number of pages | 12 |
| Journal | Expert Systems with Applications |
| Volume | 270 |
| Early online date | 14 Jan 2025 |
| DOIs | |
| Publication status | Published - 22 Jan 2025 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 4 Quality Education
Keywords
- Multimodal Emotion Recognition
- MIST Framework
- Deep Learning Models
- BAUM-1 Dataset
- SAVEE Dataset
- Text Emotion Recognition (TER)
- Speech Emotion Recognition (SER)
- Face Emotion Recognition (FER)
- Motion Emotion Recognition (MER)
- Data Stream Integration
Fingerprint
Dive into the research topics of 'MIST: Multimodal emotion recognition using DeBERTa for text, Semi-CNN for speech, ResNet-50 for facial, and 3D-CNN for motion analysis'. Together they form a unique fingerprint.