Temporal Visual Embeddings for Few-Shot Video Comparison
Jason Avina
Department of Computer Science
Faculty Supervisor: Xuhui Chen
This project addresses a gap in video understanding. Current video understanding models remain inaccurate, their hosted APIs are expensive, and locally run alternatives tend to be even less accurate and precise. These limitations already hold when a model analyzes a single video, and they compound when the same vision-language model (VLM) is asked to compare two videos. This project compares three approaches to the video comparison problem. First, Gemini Live, a state-of-the-art video comparison platform with an API, serves as the baseline for comparing two video sequences. Second, a pipeline combining MediaPipe landmark extraction with dynamic time warping (DTW) is evaluated as an alternative. Third, an embedding-based approach uses VideoMAE, an open-source video masked autoencoder: DTW and related techniques are applied directly to its embeddings, along with some customization of the accompanying language model. All three approaches are benchmarked, with the goal of improving on the Gemini API baseline.
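To make the second approach concrete, the following is a minimal sketch of a MediaPipe + DTW pipeline. It assumes details the abstract does not specify: that pose landmarks are the per-frame features, that videos are local files readable by OpenCV, and that plain Euclidean distance is the per-frame cost. All function and file names here are illustrative.

```python
import cv2
import numpy as np
import mediapipe as mp

def extract_pose_sequence(video_path: str) -> np.ndarray:
    """Return an (n_frames, 33*3) array of MediaPipe pose landmarks per frame."""
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:  # skip frames where no person is detected
            coords = [(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark]
            frames.append(np.asarray(coords).ravel())
    cap.release()
    pose.close()
    return np.vstack(frames)

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with Euclidean cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

score = dtw_distance(extract_pose_sequence("video_a.mp4"),
                     extract_pose_sequence("video_b.mp4"))
print(f"DTW alignment cost: {score:.2f}")
```

A lower alignment cost indicates that the two landmark sequences follow more similar trajectories, which is the signal this pipeline uses for comparison.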
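For the third approach, the sketch below shows one plausible way to apply DTW directly to VideoMAE embeddings, using the Hugging Face `transformers` library. The checkpoint name, the 16-frame clip length, and spatial mean-pooling of patch tokens are assumptions made for illustration, not the project's confirmed configuration.

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

def embed_clip(frames: list[np.ndarray]) -> np.ndarray:
    """frames: 16 RGB frames (H, W, 3). Returns an (8, 768) temporal embedding sequence."""
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        tokens = model(**inputs).last_hidden_state  # shape (1, 1568, 768)
    # VideoMAE-base tokenizes 16 frames into 8 temporal steps of 14x14 patches;
    # mean-pooling the spatial patches leaves one embedding per temporal step,
    # giving DTW a short sequence to align (pooling choice is an assumption).
    return tokens.reshape(8, 14 * 14, -1).mean(dim=1).numpy()
```

The resulting per-clip sequences can then be compared with the same `dtw_distance` function from the previous sketch, substituting learned embeddings for hand-crafted landmarks.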