2025-CSEE-301

A Multimodal Transformer Framework to Forecast Engagement in Health Videos

Andrew Dahlstrom

Department of Computer Science

Faculty Supervisor: Anagha Kulkarni

Digital platforms are inundated with video content, and vital health information often struggles to reach its intended audience. In this work, we introduce a multimodal transformer framework designed to forecast engagement in health videos by analyzing both the transcript text and video frame images. Our approach leverages state-of-the-art transformer models for textual data to capture nuanced language patterns and sentiment, while a vision transformer processes key video frames to extract salient visual cues and context. These modality-specific features are then fused and passed through a series of feedforward layers to predict an engagement score, a continuous metric of how well a video performs in terms of viewer likes. By identifying which elements contribute to higher engagement, our research provides actionable insights for health professionals and researchers aiming to optimize digital health communication. Moreover, the resulting dataset serves as an open resource for further exploration in multimodal learning and digital media analytics, ultimately empowering more effective dissemination of critical health information.
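
The abstract does not specify the backbones or fusion strategy; the sketch below is one plausible instantiation under assumed choices: a BERT-style text encoder for transcripts, a ViT encoder for sampled frames, simple concatenation fusion, and a feedforward regression head producing the continuous engagement score. The model checkpoints, pooling scheme, and layer sizes are illustrative assumptions, not the project's actual configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, ViTModel


class MultimodalEngagementModel(nn.Module):
    """Late-fusion sketch: text transformer + vision transformer -> engagement score.

    Assumed backbones (not confirmed by the abstract): bert-base-uncased and
    google/vit-base-patch16-224. The head sizes are arbitrary examples.
    """

    def __init__(self,
                 text_model: str = "bert-base-uncased",
                 vision_model: str = "google/vit-base-patch16-224"):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.vision_encoder = ViTModel.from_pretrained(vision_model)
        text_dim = self.text_encoder.config.hidden_size
        vis_dim = self.vision_encoder.config.hidden_size
        # Feedforward layers over the fused representation -> scalar score.
        self.head = nn.Sequential(
            nn.Linear(text_dim + vis_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, 1),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        # Transcript representation: [CLS] token embedding.
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # Frame representations: encode each sampled frame, then mean-pool.
        # pixel_values: (batch, num_frames, 3, 224, 224)
        b, f = pixel_values.shape[:2]
        frame_feat = self.vision_encoder(
            pixel_values=pixel_values.flatten(0, 1)
        ).pooler_output.view(b, f, -1).mean(dim=1)
        # Concatenation fusion followed by the regression head.
        fused = torch.cat([text_feat, frame_feat], dim=-1)
        return self.head(fused).squeeze(-1)


# Illustrative usage; the transcript text and frame tensor are placeholders.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultimodalEngagementModel()
enc = tokenizer(["stay hydrated and get enough sleep"],
                return_tensors="pt", padding=True, truncation=True)
frames = torch.rand(1, 4, 3, 224, 224)  # e.g., 4 key frames per video
score = model(enc["input_ids"], enc["attention_mask"], frames)
```

In a setup like this, the head would typically be trained with a regression loss (e.g., mean squared error) against a like-derived engagement target; the abstract leaves the exact target definition and loss unspecified.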