Temporal Visual Embeddings for Few-Shot Video Comparison
Jason Avina
Department of Computer Science
Faculty Supervisor: Xuhui Chen
This project addresses a gap in video understanding. Current video understanding models remain inaccurate, their hosted APIs are expensive, and locally run alternatives tend to be even less accurate and precise. These limitations already hold when a model analyzes a single video, and they compound when the same vision-language model (VLM) is asked to compare two videos. This project compares three approaches to the video comparison problem. First, Gemini Live, a state-of-the-art video comparison platform with an API, serves as the baseline for comparing two video sequences. Second, a pipeline combining MediaPipe landmark extraction with dynamic time warping (DTW) is evaluated as an alternative. Third, an embedding-based approach uses VideoMAE, an open-source video masked autoencoder: DTW and related techniques are applied directly to its embeddings, along with some customization of the accompanying language model. All three approaches are benchmarked, with the goal of improving on the Gemini API baseline.
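To make the second approach concrete, the following is a minimal sketch of a MediaPipe + DTW pipeline. It assumes details the abstract does not specify: that pose landmarks are the per-frame features, that videos are local files readable by OpenCV, and that plain Euclidean distance is the per-frame cost. All function and file names here are illustrative.

```python
import cv2
import numpy as np
import mediapipe as mp

def extract_pose_sequence(video_path: str) -> np.ndarray:
    """Return an (n_frames, 33*3) array of MediaPipe pose landmarks per frame."""
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:  # skip frames where no person is detected
            coords = [(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark]
            frames.append(np.asarray(coords).ravel())
    cap.release()
    pose.close()
    return np.vstack(frames)

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with Euclidean cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

score = dtw_distance(extract_pose_sequence("video_a.mp4"),
                     extract_pose_sequence("video_b.mp4"))
print(f"DTW alignment cost: {score:.2f}")
```

A lower alignment cost indicates that the two landmark sequences follow more similar trajectories, which is the signal this pipeline uses for comparison.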
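For the third approach, the sketch below shows one plausible way to apply DTW directly to VideoMAE embeddings, using the Hugging Face `transformers` library. The checkpoint name, the 16-frame clip length, and spatial mean-pooling of patch tokens are assumptions made for illustration, not the project's confirmed configuration.

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

def embed_clip(frames: list[np.ndarray]) -> np.ndarray:
    """frames: 16 RGB frames (H, W, 3). Returns an (8, 768) temporal embedding sequence."""
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        tokens = model(**inputs).last_hidden_state  # shape (1, 1568, 768)
    # VideoMAE-base tokenizes 16 frames into 8 temporal steps of 14x14 patches;
    # mean-pooling the spatial patches leaves one embedding per temporal step,
    # giving DTW a short sequence to align (pooling choice is an assumption).
    return tokens.reshape(8, 14 * 14, -1).mean(dim=1).numpy()
```

The resulting per-clip sequences can then be compared with the same `dtw_distance` function from the previous sketch, substituting learned embeddings for hand-crafted landmarks.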