Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan
We introduce Vote-in-Context (ViC), a training-free framework that reframes list-wise reranking and multi-modal rank fusion as a zero-shot reasoning task for Vision-Language Models (VLMs). Paired with our S-Grid serialization for list-wise video reasoning, ViC achieves state-of-the-art zero-shot retrieval performance on MSR-VTT and VATEX.
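To make the idea of "rank fusion as a reasoning task" concrete, the sketch below shows one way that ranked lists from several retrievers could be placed in a single prompt, and how a model's reply could be turned back into a full ranking. It is only an illustration under stated assumptions: the function names `build_vote_context` and `parse_ranking`, the prompt wording, and the text-only candidate format are hypothetical and are not taken from ViC. The S-Grid serialization and the VLM call itself are not shown.

```python
import re


def build_vote_context(query, ranked_lists):
    """Serialize several ranked lists into one text prompt (hypothetical format).

    ranked_lists maps a retriever name to its candidate ids, best first.
    """
    # Record each candidate's 1-based rank under every retriever,
    # keeping candidates in first-seen order.
    ranks = {}
    for retriever, candidates in ranked_lists.items():
        for position, cand in enumerate(candidates, start=1):
            ranks.setdefault(cand, {})[retriever] = position

    lines = [
        f"Query: {query}",
        "Candidates (rank per retriever, '-' if not retrieved):",
    ]
    for cand, per_retriever in ranks.items():
        cells = ", ".join(
            f"{name}={per_retriever.get(name, '-')}" for name in ranked_lists
        )
        lines.append(f"[{cand}] {cells}")
    lines.append("Return all candidate ids, most relevant first.")
    return "\n".join(lines)


def parse_ranking(reply, candidates):
    """Turn a free-form model reply into a complete ranking of `candidates`."""
    ordered = []
    for token in re.findall(r"[A-Za-z0-9_]+", reply):
        # Keep only known ids, and only their first mention.
        if token in candidates and token not in ordered:
            ordered.append(token)
    # Candidates the model did not mention keep their original relative order.
    return ordered + [c for c in candidates if c not in ordered]
```

In this sketch the fused ranking comes entirely from the model's reply; `parse_ranking` only guards against unknown, repeated, or missing ids so that the output is always a permutation of the candidate pool.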