MIST: Multimodal emotion recognition using DeBERTa for text, Semi-CNN for speech, ResNet-50 for facial, and 3D-CNN for motion analysis

Enguerrand Jean-Claude Patrick Boitel*, Alaa Mohasseb, Ella Haig

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Human emotion recognition is a rapidly evolving field in artificial intelligence, crucial for improving human-computer interaction. This paper introduces the MIST (Motion, Image, Speech, and Text) framework, a novel multimodal approach to emotion recognition that integrates diverse data modalities. Unlike existing models that focus on unimodal analysis, MIST leverages the complementary strengths of text (using DeBERTa), speech (using Semi-CNN), facial expression (using ResNet-50), and motion (using 3D-CNN) data to enhance accuracy and reliability. Our evaluation, conducted on the BAUM-1 and SAVEE datasets, demonstrates that MIST significantly outperforms traditional unimodal and some multimodal approaches in emotion recognition tasks. This research advances the field by providing a better understanding of emotional states, with potential applications in social robots, personal assistants, and educational technologies.
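The abstract names the four modality backbones but does not specify how their outputs are combined. The sketch below is a minimal, hypothetical reading of one common option, feature-level (late) fusion: each modality's embedding is projected to a shared size, concatenated, and passed to a classification head. The class name `MISTFusion`, all embedding dimensions, and the seven-class output (as in SAVEE) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical late-fusion head for four modality embeddings
# (text, speech, face, motion). Dimensions and layer choices are
# assumptions; the paper's actual fusion design may differ.
import torch
import torch.nn as nn

class MISTFusion(nn.Module):
    def __init__(self, text_dim=768, speech_dim=512, face_dim=2048,
                 motion_dim=512, hidden_dim=256, num_emotions=7):
        super().__init__()
        # Project each modality embedding to a shared hidden size.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, hidden_dim),      # e.g. DeBERTa [CLS] vector
            "speech": nn.Linear(speech_dim, hidden_dim),  # e.g. Semi-CNN features
            "face": nn.Linear(face_dim, hidden_dim),      # e.g. ResNet-50 pooled output
            "motion": nn.Linear(motion_dim, hidden_dim),  # e.g. 3D-CNN features
        })
        # Simple head over the concatenated projections;
        # num_emotions=7 matches SAVEE's seven classes (assumption).
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(4 * hidden_dim, num_emotions),
        )

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, dim) embedding tensor
        fused = torch.cat(
            [self.proj[m](feats[m]) for m in ("text", "speech", "face", "motion")],
            dim=-1,
        )
        return self.classifier(fused)

# Usage with random stand-in embeddings:
model = MISTFusion()
batch = {
    "text": torch.randn(2, 768),
    "speech": torch.randn(2, 512),
    "face": torch.randn(2, 2048),
    "motion": torch.randn(2, 512),
}
logits = model(batch)  # shape: (2, 7) emotion logits
```

Concatenation-based fusion is only one possibility; score-level fusion or attention-based weighting of modalities would slot into the same interface by replacing the classifier head.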
Original language: English
Article number: 126236
Number of pages: 12
Journal: Expert Systems with Applications
Volume: 270
Early online date: 14 Jan 2025
DOIs
Publication status: Published - 22 Jan 2025

Keywords

  • Multimodal Emotion Recognition
  • MIST Framework
  • Deep Learning Models
  • BAUM-1 Dataset
  • SAVEE Dataset
  • Text Emotion Recognition (TER)
  • Speech Emotion Recognition (SER)
  • Face Emotion Recognition (FER)
  • Motion Emotion Recognition (MER)
  • Data Stream Integration
