AURA: Audio-Visual Emotion Recognition Assistant
Aaryan Singh, Bisum Singh Tiwana, Ratish Sharma
School of Engineering
Faculty Supervisor: Sanchita Ghose
Abstract — This paper introduces AURA (Audio-Visual Emotion Recognition Assistant), a real-time multimodal system that analyzes human emotional states and attention levels from synchronized facial and audio inputs. The system combines deep learning models for facial emotion detection, built with MediaPipe and OpenCV and trained on the FER-2013 dataset, with audio emotion recognition performed by spectrogram-based convolutional neural networks trained on RAVDESS. AURA also incorporates an attention-tracking module that evaluates gaze direction, engagement, and behavioral patterns, particularly in remote learning environments. Unlike conventional unimodal approaches, AURA's fusion of visual and auditory cues yields higher emotion recognition accuracy and responsiveness. Designed for portability, the system operates with or without NVIDIA Jetson hardware, ensuring accessibility across platforms. Real-time inference and visualization are supported through a web-based Django-Flask interface with WebSocket communication, delivering low-latency feedback and dynamic emotional analytics. Evaluation results show clear improvements over unimodal baselines in recognition performance and engagement assessment, highlighting AURA's potential applications in education, healthcare, and human-computer interaction.
Index Terms — Emotion Recognition, Multimodal AI, Attention Tracking, Audio-Visual Analysis, Deep Learning, Behavioral Analytics