VideoNIAH

Introduction

We propose VideoNIAH (Video Needle In A Haystack), a framework for constructing video benchmarks through synthetic video generation. VideoNIAH decouples the content of a test video from its query-response pairs by inserting unrelated image or text 'needles' into original videos. Annotations are generated solely from these needles, which allows diverse video sources and a wide variety of query-response pairs. Inserting multiple needles also lets VideoNIAH rigorously evaluate a model's temporal understanding. The framework is simple yet highly scalable, and we believe it will inspire future work on video benchmarks.

We use VideoNIAH to compile VNBench, a video benchmark with retrieval, ordering, and counting tasks. VNBench contains 1350 samples in total. It efficiently evaluates a video model's fine-grained understanding and spatio-temporal modeling abilities, and it also supports long-context evaluation.
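The needle-insertion idea can be illustrated with a small sketch. This is not the authors' pipeline: the `Needle` class, `insert_needles`, `ordering_annotation`, the string stand-ins for frames, and the random placement policy are all hypothetical and chosen only to show that the annotation is derived from the needles and never from the haystack.

```python
import random
from dataclasses import dataclass


@dataclass
class Needle:
    """Hypothetical needle record: unrelated content plus the span it occupies."""
    content: str   # e.g. a subtitle string or an image identifier
    start: int     # index of the needle's first frame in the output video
    length: int    # number of frames the needle lasts


def insert_needles(haystack, needle_contents, needle_len=2, seed=0):
    """Insert needles between haystack frames at random positions.

    Needles keep the order given in `needle_contents`.
    Returns the new frame list and the needle records in temporal order.
    """
    rng = random.Random(seed)
    # Pick distinct cut points in the original video and sort them in time.
    cut_points = sorted(rng.sample(range(len(haystack) + 1), len(needle_contents)))
    frames, needles, prev = [], [], 0
    for cut, content in zip(cut_points, needle_contents):
        frames.extend(haystack[prev:cut])
        needles.append(Needle(content, start=len(frames), length=needle_len))
        frames.extend([f"needle:{content}"] * needle_len)
        prev = cut
    frames.extend(haystack[prev:])
    return frames, needles


def ordering_annotation(needles):
    """Build a query-response pair from the needles alone (haystack is never read)."""
    return {
        "query": "In what order do the inserted items appear?",
        "response": [n.content for n in needles],
    }


haystack = [f"frame_{i}" for i in range(20)]  # stand-in for decoded video frames
frames, needles = insert_needles(haystack, ["apple", "clock", "boat"])
annotation = ordering_annotation(needles)
```

Because the annotation depends only on the needle list, the same query-response pair can be reused with any haystack video, which is the decoupling described above.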

Leaderboard

VNBench contains 3 tasks: Retrieval, Ordering, and Counting. Each task is divided into 3 sub-tasks according to needle type and task difficulty.

Video MLLMs	Retrieval-E	Retrieval-I-1	Retrieval-I-2	Retrieval Avg.	Ordering-E	Ordering-I-1	Ordering-I-2	Ordering Avg.	Counting-E-1	Counting-E-2	Counting-I	Counting Avg.	Overall
Gemini 1.5 Pro *	100.0	96.0	76.0	90.7	90.7	95.3	32.7	72.9	60.7	7.3	42.0	36.7	66.7
Aria	100.0	100.0	49.3	83.1	88.7	96.0	58.0	80.9	54.7	11.3	38.7	34.9	66.3
GPT-4o *	100.0	98.0	87.3	95.3	88.4	86.6	45.2	73.4	36.8	0.0	36.1	24.5	64.4
Video-XL-7B	98.0	93.3	48.7	80.0	89.3	77.3	75.3	80.6	38.7	7.3	26.0	24.0	61.6
LongLLaVA-A13	100.0	100.0	73.3	91.1	37.5	35.3	34.8	35.9	36.0	23.7	28.0	29.2	52.1
LLaVA-OneVision-7B	88.7	87.3	55.3	77.1	70.0	50.0	37.3	52.4	41.3	8.7	27.3	25.8	51.8
GPT-4-Turbo *	100.0	99.3	82.0	93.7	42.6	22.8	23.0	29.5	37.6	0.0	32.4	23.3	48.9
Qwen2-VL-7B	98.0	76.0	33.3	69.1	16.0	12.7	8.7	12.4	26.0	9.3	24.7	20.0	33.9
ST-LLM	58.0	64.7	31.3	51.3	0.0	0.0	0.0	0.0	21.3	1.3	27.3	16.7	22.7
LLaVA-NeXT-Video	56.7	56.7	19.3	44.2	0.7	0.0	0.7	0.4	6.7	14.6	25.3	15.5	20.1
VideoChat2	43.4	40.0	14.6	32.7	0.0	0.0	1.3	0.4	3.3	0.7	8.0	4.0	12.4
Video-LLaVA	26.0	28.0	17.3	23.8	0.7	0.7	2.0	1.1	16.7	0.7	20.0	12.4	12.4
LLaMA-VID	28.0	28.0	19.3	25.1	0.7	0.0	0.0	0.2	4.0	2.7	14.7	7.1	10.8
Video-LLaMA2	1.2	26.0	6.0	11.1	0.0	0.0	0.0	0.0	2.0	4.7	0.7	2.4	4.5
VideoChatGPT	4.7	4.7	0.7	3.3	2.7	11.3	0.0	4.7	2.0	4.0	6.7	4.2	4.1

* indicates proprietary models
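As a reading aid for the table, the sketch below recomputes the Avg. and Overall columns for the Gemini 1.5 Pro row as unweighted means of the sub-task scores. The aggregation rule is an assumption inferred from the numbers, not something the page states. It reproduces this row, but a few rows (for example GPT-4o) differ from it slightly in the last digit.

```python
def task_average(scores):
    """Unweighted mean of sub-task accuracies, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# Sub-task accuracies copied from the Gemini 1.5 Pro row of the leaderboard.
gemini = {
    "Retrieval": [100.0, 96.0, 76.0],
    "Ordering": [90.7, 95.3, 32.7],
    "Counting": [60.7, 7.3, 42.0],
}

task_avgs = {task: task_average(scores) for task, scores in gemini.items()}

# Assumed rule for Overall: mean over all nine sub-task scores.
all_scores = [x for scores in gemini.values() for x in scores]
overall = round(sum(all_scores) / len(all_scores), 1)
```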

Data Examples

Different Haystack Length

Task performance on videos of different durations. We divide all VNBench videos into 3 splits: short (10-30s), medium (30-60s), and long (60-180s).
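A minimal sketch of how a video could be assigned to one of these splits. The function name `duration_split` is hypothetical, and the boundary handling (a 30s video counted as medium, a 60s video as long) is an assumption, since the page does not say which split the boundary values belong to.

```python
def duration_split(seconds):
    """Map a video duration in seconds to a VNBench-style split name.

    Assumed intervals: short [10, 30), medium [30, 60), long [60, 180].
    """
    if 10 <= seconds < 30:
        return "short"
    if 30 <= seconds < 60:
        return "medium"
    if 60 <= seconds <= 180:
        return "long"
    raise ValueError(f"duration {seconds}s is outside the 10-180s range")


splits = [duration_split(d) for d in (12, 30, 59.5, 60, 180)]
```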

NIAH Visualization on Different Models

In this position test on the Retrieval-I-1 task, we fix the video haystack and the query-response pair, and vary only the haystack length and the needle position.
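The test grid for such a position sweep could be enumerated as follows. This is a sketch only: the function `position_grid`, the haystack lengths, and the relative depths are illustrative values, not the settings used for the visualization.

```python
from itertools import product


def position_grid(haystack_lengths, depths):
    """Return one test case per (haystack length, relative needle depth).

    `depths` are fractions in [0, 1]: 0.0 puts the needle at the start of
    the haystack and 1.0 at the end.
    """
    cases = []
    for length, depth in product(haystack_lengths, depths):
        cases.append({
            "haystack_seconds": length,
            "depth": depth,
            "needle_time": round(length * depth, 2),  # insertion time in seconds
        })
    return cases


# Illustrative values; the needle and query-response pair stay fixed across cases.
cases = position_grid([10, 30, 60], [0.0, 0.25, 0.5, 0.75, 1.0])
```

Each case would then be scored as correct or incorrect, giving one cell of a length-by-position heatmap per model.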

Citation

@article{zhao2024videoniah,
  title={Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs},
  author={Zhao, Zijia and Lu, Haoyu and Huo, Yuqi and Du, Yifan and Yue, Tongtian and Guo, Longteng and Wang, Bingning and Chen, Weipeng and Liu, Jing},
  journal={arXiv preprint},
  year={2024}
}
