AURORA: Audio-Visual Emotion Recognition Assistant
Aaryan Singh, Bisum Singh Tiwana, Ratish Sharma
School of Engineering
Faculty Supervisor: Sanchita Ghose
Abstract— This paper introduces AURORA (Audio-Visual Emotion Recognition Assistant for Real-Time Attention Tracking and Behavioral Insight), a multimodal AI system that analyzes human emotional states and attentional engagement from synchronized facial and vocal data. Unlike systems that rely solely on prebuilt libraries, AURORA integrates custom-designed convolutional neural network (CNN) architectures for both facial expression analysis and audio emotion classification, trained on benchmark datasets such as FER-2013, RAVDESS, and TESS. While OpenCV and MediaPipe handle initial face detection and preprocessing, the core emotion recognition engine is developed in-house for greater accuracy, adaptability, and scalability. AURORA also features an attention-tracking module that estimates gaze direction and engagement patterns, enabling deeper behavioral insight in real-time settings such as online education, therapy sessions, and customer service environments. The system is optimized for cross-platform deployment, supporting both Jetson-based embedded systems and standard computing hardware. A fully interactive Streamlit-based web interface provides real-time feedback from webcam and microphone input, as well as offline analysis of uploaded media. Evaluation results demonstrate strong classification accuracy, fast inference, and actionable emotion insights, making AURORA a promising tool for emotion-aware AI in education, healthcare, and human-computer interaction.
Index Terms—Emotion Recognition, Multimodal AI, Attention Tracking, Audio-Visual Analysis, Deep Learning, Behavioral Analytics
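To make the visual half of the pipeline described in the abstract concrete, the sketch below shows one plausible arrangement of the stated components: MediaPipe face detection on OpenCV webcam frames, with each cropped face passed to a FER-2013-style CNN. This is an illustrative sketch, not AURORA's actual implementation; the model file name fer_cnn.h5, the 48x48 grayscale input size, and the seven-class label order are assumptions based on common FER-2013 conventions and may differ from the in-house engine.

import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

# Canonical FER-2013 label order (assumed; AURORA's ordering may differ)
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

# Hypothetical trained CNN saved in Keras format
model = tf.keras.models.load_model("fer_cnn.h5")

# MediaPipe's lightweight face detector for initial localization
face_detector = mp.solutions.face_detection.FaceDetection(
    model_selection=0, min_detection_confidence=0.5)

cap = cv2.VideoCapture(0)  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV captures BGR
    results = face_detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    for det in results.detections or []:
        # Convert the relative bounding box to pixel coordinates
        box = det.location_data.relative_bounding_box
        h, w, _ = frame.shape
        x, y = int(box.xmin * w), int(box.ymin * h)
        bw, bh = int(box.width * w), int(box.height * h)
        face = frame[max(y, 0):y + bh, max(x, 0):x + bw]
        if face.size == 0:
            continue
        # FER-2013 models conventionally take 48x48 grayscale input in [0, 1]
        gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
        inp = cv2.resize(gray, (48, 48)).astype("float32") / 255.0
        probs = model.predict(inp[None, ..., None], verbose=0)[0]
        print(EMOTIONS[int(np.argmax(probs))])
cap.release()

The audio branch would presumably follow the same pattern: spectro-temporal features (e.g., MFCCs or mel spectrograms) extracted from microphone input and fed to a parallel CNN trained on RAVDESS and TESS, with the two modalities fused downstream.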