VideoNIAH

Needle In A Video Haystack: A Scalable Synthetic
Framework for Benchmarking Video MLLMs

Zijia Zhao1, Haoyu Lu2, Yuqi Huo3, Yifan Du2, Tongtian Yue1, Longteng Guo1, Bingning Wang3, Weipeng Chen3, Jing Liu1
1CASIA, 2RUC, 3Baichuan

Introduction

We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation. VideoNIAH decouples video content from query-response pairs by inserting unrelated image or text 'needles' into original videos, and it generates annotations solely from these needles. This ensures diversity in video sources and variety in the query-response pairs. Moreover, by inserting multiple needles, VideoNIAH rigorously evaluates a model's temporal understanding. VideoNIAH is a simple yet highly scalable benchmark construction framework, and we hope it will inspire future work on video benchmarks! Using VideoNIAH, we compile a video benchmark, VNBench, covering tasks such as retrieval, ordering, and counting, with 1,350 samples in total. VNBench efficiently evaluates a video model's fine-grained understanding and spatio-temporal modeling abilities, while also supporting long-context evaluation.
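To make this concrete, here is a minimal sketch of the needle-insertion idea in Python. The function name, frame representation, and needle duration are illustrative assumptions, not the actual VideoNIAH implementation; the key property it shows is that the annotation is derived solely from where the needles are placed, independent of the haystack content.

    import random

    def insert_needles(frames, needles, needle_duration=8):
        # frames:  list of decoded video frames (the haystack).
        # needles: list of unrelated images to splice in as short static clips.
        # Sample sorted temporal positions for the needles.
        positions = sorted(random.sample(range(len(frames)), len(needles)))
        edited, annotation, offset = list(frames), [], 0
        for order, (pos, needle) in enumerate(zip(positions, needles)):
            clip = [needle] * needle_duration         # hold each needle for a few frames
            edited[pos + offset:pos + offset] = clip  # splice the needle into the haystack
            annotation.append({"order": order, "frame": pos + offset})
            offset += needle_duration
        # The query-response pair is generated from `annotation` alone,
        # never from the surrounding video content.
        return edited, annotation

Because the needles and their positions are fully controlled, retrieval queries ask which needle appeared, ordering queries ask for their temporal order, and counting queries ask how often a needle occurs.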

Leaderboard

VNBench contains three tasks: Retrieval, Ordering, and Counting. Each task is divided into three sub-tasks according to needle type and task difficulty.

Video MLLMs         | Retrieval (E / I-1 / I-2 / Avg.) | Ordering (E / I-1 / I-2 / Avg.) | Counting (E-1 / E-2 / I / Avg.) | Overall
Gemini 1.5 Pro *    | 100.0 / 96.0 / 76.0 / 90.7       | 90.7 / 95.3 / 32.7 / 72.9       | 60.7 / 7.3 / 42.0 / 36.7        | 66.7
Aria                | 100.0 / 100.0 / 49.3 / 83.1      | 88.7 / 96.0 / 58.0 / 80.9       | 54.7 / 11.3 / 38.7 / 34.9       | 66.3
GPT-4o *            | 100.0 / 98.0 / 87.3 / 95.3       | 88.4 / 86.6 / 45.2 / 73.4       | 36.8 / 0.0 / 36.1 / 24.5        | 64.4
Video-XL-7B         | 98.0 / 93.3 / 48.7 / 80.0        | 89.3 / 77.3 / 75.3 / 80.6       | 38.7 / 7.3 / 26.0 / 24.0        | 61.6
LongLLaVA-A13       | 100.0 / 100.0 / 73.3 / 91.1      | 37.5 / 35.3 / 34.8 / 35.9       | 36.0 / 23.7 / 28.0 / 29.2       | 52.1
LLaVA-OneVision-7B  | 88.7 / 87.3 / 55.3 / 77.1        | 70.0 / 50.0 / 37.3 / 52.4       | 41.3 / 8.7 / 27.3 / 25.8        | 51.8
GPT-4-Turbo *       | 100.0 / 99.3 / 82.0 / 93.7       | 42.6 / 22.8 / 23.0 / 29.5       | 37.6 / 0.0 / 32.4 / 23.3        | 48.9
Qwen2-VL-7B         | 98.0 / 76.0 / 33.3 / 69.1        | 16.0 / 12.7 / 8.7 / 12.4        | 26.0 / 9.3 / 24.7 / 20.0        | 33.9
ST-LLM              | 58.0 / 64.7 / 31.3 / 51.3        | 0.0 / 0.0 / 0.0 / 0.0           | 21.3 / 1.3 / 27.3 / 16.7        | 22.7
LLaVA-NeXT-Video    | 56.7 / 56.7 / 19.3 / 44.2        | 0.7 / 0.0 / 0.7 / 0.4           | 6.7 / 14.6 / 25.3 / 15.5        | 20.1
VideoChat2          | 43.4 / 40.0 / 14.6 / 32.7        | 0.0 / 0.0 / 1.3 / 0.4           | 3.3 / 0.7 / 8.0 / 4.0           | 12.4
Video-LLaVA         | 26.0 / 28.0 / 17.3 / 23.8        | 0.7 / 0.7 / 2.0 / 1.1           | 16.7 / 0.7 / 20.0 / 12.4        | 12.4
LLaMA-VID           | 28.0 / 28.0 / 19.3 / 25.1        | 0.7 / 0.0 / 0.0 / 0.2           | 4.0 / 2.7 / 14.7 / 7.1          | 10.8
Video-LLaMA2        | 1.2 / 26.0 / 6.0 / 11.1          | 0.0 / 0.0 / 0.0 / 0.0           | 2.0 / 4.7 / 0.7 / 2.4           | 4.5
VideoChatGPT        | 4.7 / 4.7 / 0.7 / 3.3            | 2.7 / 11.3 / 0.0 / 4.7          | 2.0 / 4.0 / 6.7 / 4.2           | 4.1

* indicates proprietary models
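For reference, the aggregate columns can be reproduced with simple means: each task average is the mean of its three sub-task accuracies, and the Overall column is consistent with the mean over all nine sub-task scores. This is an observation from the rows above, not an official scoring script, and the dictionary keys below are illustrative.

    def aggregate(scores):
        # scores: dict mapping task name -> list of sub-task accuracies.
        averages = {task: sum(v) / len(v) for task, v in scores.items()}
        overall = (sum(x for v in scores.values() for x in v)
                   / sum(len(v) for v in scores.values()))
        return averages, overall

    # Example: Aria's sub-task accuracies from the leaderboard.
    aria = {"Retrieval": [100.0, 100.0, 49.3],
            "Ordering":  [88.7, 96.0, 58.0],
            "Counting":  [54.7, 11.3, 38.7]}
    averages, overall = aggregate(aria)  # averages: 83.1 / 80.9 / 34.9; overall: 66.3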

Benchmark

Data Examples

Analysis Results

Different Haystack Lengths


Task performance across different video durations. We divide all VNBench videos into three splits: short (10-30s), medium (30-60s), and long (60-180s).

NIAH Visualization on Different Models

In this position test on the Retrieval-I-1 task, we fix the video haystack and the query-response pair, varying only the haystack length and the needle position.
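The grid of probe videos for such a test can be generated as sketched below. The specific haystack lengths and depth fractions are illustrative placeholders, not the values used in our visualization; the sweep is over the same two axes.

    import itertools

    def position_test_configs(lengths=(10, 30, 60, 120, 180),
                              depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
        # One probe video per (haystack length, needle depth) cell:
        # `length` is the haystack duration in seconds, `depth` is the needle
        # position as a fraction of that duration (0.0 = start, 1.0 = end).
        for length, depth in itertools.product(lengths, depths):
            yield {"task": "Retrieval-I-1",
                   "haystack_seconds": length,
                   "needle_second": round(length * depth, 1)}

    for cfg in position_test_configs():
        print(cfg)  # each config corresponds to one cell of the NIAH heatmap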

Citation

@article{zhao2024videoniah,
  title={Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs},
  author={Zhao, Zijia and Lu, Haoyu and Huo, Yuqi and Du, Yifan and Yue, Tongtian and Guo, Longteng and Wang, Bingning and Chen, Weipeng and Liu, Jing},
  journal={arXiv preprint},
  year={2024}
}