Abstract
Human interactions are profoundly influenced by the ability to perceive and interpret emotions, a capability that shapes our relationships, decisions, and communication. Emotions are inherently complex, involving subtle expressions across multiple modalities: text, speech, facial expressions, and motion. Despite advancements, current systems face significant challenges in accurately capturing emotions, particularly in real-world scenarios with varying conditions and noisy inputs. This thesis addresses these limitations by proposing the MIST (Motion-Image-Speech-Text) framework, a multimodal approach to improving the robustness and accuracy of emotion recognition systems.

To enhance the Text Emotion Recognition (TER) component of the framework, two advanced deep learning models, GPT-3.5 and BERT, are evaluated. A comparative analysis highlights the strengths and limitations of these models, with BERT emerging as the more precise and reliable option for most textual emotion recognition tasks. Its strong performance on benchmark datasets justifies its integration into the MIST framework, ensuring robust handling of textual inputs.
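For illustration, the sketch below shows one way a BERT-based TER classifier can be set up. It is a minimal example assuming the Hugging Face `transformers` library; the checkpoint name, six-label emotion set, and fine-tuning details are placeholders, not the thesis's actual configuration.

```python
# Minimal sketch of a BERT-based text emotion classifier (illustrative only).
# The checkpoint and label set below are assumptions for the example.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(EMOTIONS)
)  # in practice, this classification head would be fine-tuned on an emotion dataset

def predict_emotion(text: str) -> str:
    """Return the most likely emotion label for a piece of text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return EMOTIONS[int(logits.argmax(dim=-1))]

print(predict_emotion("I can't believe how well this turned out!"))
```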
The MIST framework combines emotion recognition from multiple modalities, including text, speech, facial expressions, and motion. Each modality is processed through tailored architectures, leveraging the complementary strengths of unimodal approaches. The fusion of modalities is achieved using a novel adaptive strategy, enabling effective integration and improved recognition accuracy.
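As one illustration of how adaptive fusion can be realized, per-modality features can be weighted by learned attention scores before classification. This is a sketch under assumed feature dimensions and label count; the thesis defines the actual MIST fusion architecture.

```python
# Sketch of attention-weighted multimodal fusion (illustrative; the actual
# MIST fusion strategy is defined in the thesis). Dimensions are assumed.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim: int = 256, num_emotions: int = 6):
        super().__init__()
        # One scalar attention score per modality, computed from its features.
        self.score = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modalities, dim) — text, speech, face, motion.
        weights = torch.softmax(self.score(feats), dim=1)   # (batch, M, 1)
        fused = (weights * feats).sum(dim=1)                # (batch, dim)
        return self.classifier(fused)

# Example: fuse projected features from the four MIST modalities.
feats = torch.randn(8, 4, 256)    # batch of 8, 4 modalities, 256-d features
logits = AdaptiveFusion()(feats)  # (8, 6) emotion logits
```

The softmax weights let the model downweight a degraded modality (e.g., an occluded face) and rely more on the others, which is the intuition behind adaptive fusion.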
The robustness of MIST is further tested under challenging conditions, particularly for Face Emotion Recognition (FER). Scenarios such as occlusion, low lighting, and degraded image quality are examined, and preprocessing techniques are proposed to mitigate these challenges.
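A minimal sketch of the kind of preprocessing that can help under such conditions is shown below, using standard OpenCV operations (denoising and local contrast equalization). The specific operations and parameters here are assumptions for illustration; the thesis proposes its own preprocessing techniques.

```python
# Sketch of face-image preprocessing for difficult capture conditions
# (illustrative; parameter values are assumed, not taken from the thesis).
import cv2
import numpy as np

def preprocess_face(image_bgr: np.ndarray) -> np.ndarray:
    """Denoise and normalize contrast before feeding a FER model."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Non-local means denoising reduces sensor noise common in low-light frames.
    denoised = cv2.fastNlMeansDenoising(gray, None, 10)
    # CLAHE boosts local contrast without over-amplifying noise.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(denoised)
```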
To demonstrate the practical applicability of the framework, a real-time emotion recognition application is developed based on MIST. This software enables dynamic and efficient emotion detection, suitable for applications in interactive systems, gaming, and mental health monitoring. It is designed to provide low-latency responses while maintaining high accuracy, even under unpredictable environmental conditions.
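The core of such an application is a per-frame capture-predict loop with latency tracking. The sketch below is a hypothetical outline only: `predict` stands in for the full MIST pipeline, and the application's actual architecture is described in the thesis.

```python
# Minimal sketch of a real-time recognition loop (illustrative only).
# `predict` is a hypothetical stand-in for the full MIST pipeline.
import time
import cv2

def run(predict, camera_index: int = 0) -> None:
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            start = time.perf_counter()
            ok, frame = cap.read()
            if not ok:
                break
            label = predict(frame)  # per-frame emotion estimate
            latency_ms = (time.perf_counter() - start) * 1000
            print(f"{label} ({latency_ms:.1f} ms)")
    finally:
        cap.release()
```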
Experimental results show that the MIST framework outperforms unimodal and baseline multimodal approaches, achieving higher accuracy and robustness across multiple datasets and testing scenarios. The key contributions of this thesis include the development of the MIST framework, the evaluation and integration of BERT and GPT-3.5 for TER, and the creation of a real-time software application for emotion recognition based on multiple modalities. This work not only advances the field of multimodal emotion recognition but also bridges the gap between theoretical innovations and real-world applications.
Date of Award | 17 Jan 2025
---|---
Original language | English
Awarding Institution |
Supervisors | Alaa Mohasseb & Ella Haig