M3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Jie Huang1,* Ruixun Liu1,* Sirui Sun1 Xinyi Yang1 Yin Li2 Yixin Zhu1 Yiwu Zhong1,†
1Peking University 2University of Wisconsin-Madison
* Equal contribution. † Corresponding author.
M3Eval benchmark overview figure
M3Eval evaluates multi-modal memory through video tasks grounded in cognitive psychology.

Abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial effort in developing video datasets and benchmarks, existing work primarily focuses on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference.

To address this gap, we introduce M3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models.

Grounded in cognitive psychology, our design features carefully constructed tasks isolating key aspects of memory. Leveraging M3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors.

We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory.

Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models.

Video Memory Tasks

Task 1

Divided Attention: encoding concurrent information

Divided Attention task figure

Simultaneous memory for two side-by-side videos.
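
To make the setup concrete, the split-screen composition can be sketched in a few lines. This is a minimal illustration under assumed inputs (frame-aligned numpy arrays of equal size), not the benchmark's actual construction code; the `swap_at` argument is a hypothetical knob for the swapping condition evaluated below.

```python
import numpy as np

def split_screen(frames_a, frames_b, swap_at=()):
    """Compose a split-screen clip from two frame-aligned videos.

    frames_a, frames_b: equal-length lists of HxWx3 uint8 arrays.
    swap_at: frame indices at which the left/right assignment flips,
             mimicking the left/right swapping condition.
    """
    composed, swapped = [], False
    for t, (fa, fb) in enumerate(zip(frames_a, frames_b)):
        if t in swap_at:
            swapped = not swapped
        left, right = (fb, fa) if swapped else (fa, fb)
        # Concatenate along the width axis to form one side-by-side frame.
        composed.append(np.concatenate([left, right], axis=1))
    return composed
```

Swapping at chosen indices probes whether a model's source memory is tied to screen position rather than to content.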

Task 2

Memory Interference: robustness to distraction

Memory Interference task figure

Interference between sequentially presented videos.

Task 3

Interleaved Events: temporal organization

Interleaved Events task figure

Memory reconstruction from temporally interleaved clips.
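
As an illustration, interleaved sequences of this kind can be produced by a round-robin merge of per-event clip lists; the merge also yields the ground-truth labels for source identification. A hypothetical sketch (event ids and clips are placeholders):

```python
def interleave(events):
    """Round-robin merge of per-event clip lists.

    events: dict mapping an event id to its ordered list of clips.
    Returns the interleaved clip sequence and, for each clip, the
    event it came from: the ground truth for source identification
    and order understanding.
    """
    iters = {eid: iter(clips) for eid, clips in events.items()}
    clips, sources = [], []
    while iters:
        for eid in list(iters):
            clip = next(iters[eid], None)
            if clip is None:
                del iters[eid]  # event exhausted
            else:
                clips.append(clip)
                sources.append(eid)
    return clips, sources
```

For example, `interleave({"A": ["a1", "a2"], "B": ["b1", "b2", "b3"]})` returns clips `[a1, b1, a2, b2, b3]` with sources `[A, B, A, B, B]`.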

Task 4

N-Back: symbol grounding and memory capacity

N-Back task figure

Judge whether the final clip matches the clip N positions earlier.
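
Abstracting each clip as a symbol (as in the further experiment below), a trial and its ground-truth label can be sampled as in this minimal sketch, assuming `length > n` and at least two distinct symbols:

```python
import random

def make_nback_trial(symbols, length, n, match_prob=0.5):
    """Sample an N-back trial over K abstract symbols.

    Each symbol stands in for a video clip. Returns (sequence,
    is_match), where is_match is True iff the final symbol equals
    the one n positions earlier.
    """
    assert length > n and len(symbols) >= 2
    seq = [random.choice(symbols) for _ in range(length)]
    if random.random() < match_prob:
        seq[-1] = seq[-1 - n]             # force a match
    elif seq[-1] == seq[-1 - n]:          # break an accidental match
        seq[-1] = random.choice([s for s in symbols if s != seq[-1 - n]])
    return seq, seq[-1] == seq[-1 - n]
```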

Results and Findings

Divided Attention

Main Result

Accuracy (%) on the three divided-attention metrics, Source Identification (SI), Order Understanding (OU), and Content Retention (CR), under the split-screen setting without swaps and with frequent left/right swaps. Parenthesized values denote the change from the no-swapping condition.

| Model | SI (no swap) | OU (no swap) | CR (no swap) | SI (swap) | OU (swap) | CR (swap) |
|---|---|---|---|---|---|---|
| Human | 89.58 | 90.00 | 92.16 | 81.25 (-8.33) | 85.00 (-5.00) | 86.27 (-5.89) |
| Random | 25.00 | 25.00 | 25.00 | 25.00 (0.00) | 25.00 (0.00) | 25.00 (0.00) |
| Closed-Source Models | | | | | | |
| Gemini-3.1-Pro-Preview | 62.50 | 52.50 | 49.02 | 37.50 (-25.00) | 52.50 (0.00) | 56.86 (+7.84) |
| GPT-5.4 | 27.08 | 35.00 | 47.06 | 35.42 (+8.34) | 30.00 (-5.00) | 49.02 (+1.96) |
| Open-Source Agents | | | | | | |
| VideoLucy | 16.67 | 42.50 | 37.25 | 14.58 (-2.09) | 25.00 (-17.50) | 39.22 (+1.97) |
| M3-Agent | 27.08 | 30.00 | 23.53 | 31.25 (+4.17) | 35.00 (+5.00) | 23.53 (0.00) |
| Open-Source Models | | | | | | |
| Qwen3.5-4B | 18.75 | 25.00 | 31.37 | 14.58 (-4.17) | 22.50 (-2.50) | 33.33 (+1.96) |
| Qwen3-VL-8B-Instruct | 16.67 | 25.00 | 37.25 | 12.50 (-4.17) | 30.00 (+5.00) | 35.29 (-1.96) |
| InternVL3.5-8B | 29.17 | 37.50 | 33.33 | 25.00 (-4.17) | 40.00 (+2.50) | 27.45 (-5.88) |
| Qwen3.5-9B | 35.42 | 25.00 | 25.49 | 18.75 (-16.67) | 30.00 (+5.00) | 13.73 (-11.76) |
| Qwen3.5-27B | 41.67 | 25.00 | 35.29 | 27.08 (-14.59) | 32.50 (+7.50) | 35.29 (0.00) |

Further Experiment

Divided Attention further experiment
Attention shifts induced by split-screen interference. For each case, the left panel shows the single-video condition, whereas the right panel shows the split-screen condition. Under the split-screen setting, the model's attention is disrupted, resulting in erroneous responses.
Finding 1: Existing multi-modal models lack robust memory for parallel tasks, probably due to attention confusion across concurrent visual streams.

Memory Interference

Main Result

Proactive: the first video (V1) interferes with recall of the second video (V2); retroactive: the second video (V2) interferes with recall of the first video (V1). Delta denotes proactive minus retroactive.

| Model | Proactive Acc. (%) | Retroactive Acc. (%) | Acc. Delta | Proactive Intr. (%) | Retroactive Intr. (%) | Intr. Delta |
|---|---|---|---|---|---|---|
| Human | 94.55 | 74.55 | 20.00 | 3.64 | 20.00 | -16.36 |
| Random | 25.00 | 25.00 | 0.00 | 50.00 | 50.00 | 0.00 |
| Closed-Source Models | | | | | | |
| Gemini-3.1-Pro-Preview | 63.64 | 54.55 | 9.09 | 23.64 | 30.91 | -7.27 |
| GPT-5.4 | 43.64 | 40.00 | 3.64 | 43.64 | 34.55 | 9.09 |
| Open-Source Agents | | | | | | |
| VideoLucy | 29.09 | 43.64 | -14.55 | 43.64 | 34.55 | 9.09 |
| M3-Agent | 43.64 | 36.36 | 7.28 | 40.00 | 34.55 | 5.45 |
| Open-Source Models | | | | | | |
| Qwen3.5-4B | 29.09 | 38.18 | -9.09 | 45.45 | 38.18 | 7.27 |
| Qwen3-VL-8B-Instruct | 25.45 | 29.09 | -3.64 | 54.55 | 52.73 | 1.82 |
| InternVL3.5-8B | 52.73 | 49.09 | 3.64 | 32.73 | 41.82 | -9.09 |
| Qwen3.5-9B | 29.09 | 38.18 | -9.09 | 50.91 | 41.82 | 9.09 |
| Qwen3.5-27B | 45.45 | 40.00 | 5.45 | 40.00 | 43.64 | -3.64 |
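
For reference, the metrics above can be computed from per-trial records as in the following sketch; the record fields are hypothetical, assuming each multiple-choice trial logs the chosen option, the correct option, and the distractor drawn from the interfering video:

```python
def interference_metrics(trials):
    """Score proactive and retroactive interference trials.

    trials: dicts with keys 'condition' ('proactive'/'retroactive'),
    'chosen', 'correct', and 'intruder' (the option taken from the
    interfering video). Returns accuracy and intrusion rate in %.
    """
    out = {}
    for cond in ("proactive", "retroactive"):
        subset = [t for t in trials if t["condition"] == cond]
        acc = sum(t["chosen"] == t["correct"] for t in subset) / len(subset)
        intr = sum(t["chosen"] == t["intruder"] for t in subset) / len(subset)
        out[cond] = {"accuracy": 100 * acc, "intrusion_rate": 100 * intr}
    # Delta = proactive minus retroactive, per the caption above.
    out["delta"] = {k: out["proactive"][k] - out["retroactive"][k]
                    for k in ("accuracy", "intrusion_rate")}
    return out
```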

Further Experiment

Memory Interference further experiment
Accuracy changes from video repetition. Repeating either the target or interfering video improves accuracy for most models, suggesting repetition as a promising way to enhance model memory.
Finding 2: Retroactive interference exceeds proactive interference in humans, whereas the two occur comparably in multi-modal models. Further experiments surprisingly find that repeating either the target or the interfering segments enhances performance under interference.

Interleaved Events

Main Result

Accuracy (%) on four interleaved reconstruction metrics.

| Model | Source Identification | Order Understanding | Content Retention | False Memory Discrimination |
|---|---|---|---|---|
| Human | 75.95 | 80.00 | 83.64 | 82.11 |
| Random | 25.00 | 25.00 | 25.00 | 25.00 |
| Closed-Source Models | | | | |
| Gemini-3.1-Pro-Preview | 43.04 | 50.00 | 49.09 | 26.32 |
| GPT-5.4 | 43.04 | 40.00 | 47.27 | 7.37 |
| Open-Source Agents | | | | |
| VideoLucy | 30.38 | 23.33 | 43.64 | 40.00 |
| M3-Agent | 27.85 | 40.00 | 21.82 | 15.79 |
| Open-Source Models | | | | |
| Qwen3.5-4B | 30.38 | 20.00 | 41.82 | 23.16 |
| Qwen3-VL-8B-Instruct | 21.52 | 23.33 | 30.91 | 3.16 |
| InternVL3.5-8B | 25.32 | 26.67 | 41.82 | 1.05 |
| Qwen3.5-9B | 26.58 | 40.00 | 25.45 | 7.37 |
| Qwen3.5-27B | 39.24 | 33.33 | 34.55 | 3.16 |

Further Experiment

Interleaved Events further experiment
Accuracy of source grounding. Spatial source uses the split-screen format; temporal source uses the interleaved format. Models perform notably better on spatial than temporal source grounding.
Finding 3: Multi-modal memory is less capable than human memory at structuring temporally interleaved information. Further experiments show that memory source grounding is consistently stronger along the spatial dimension than along the temporal dimension.

N-Back

Main Result

N-Back main results
Average accuracy on the N-Back task: overall performance of each model and of humans on two symbolic attributes (scene and action), averaged over all K and N configurations.

Further Experiment

N-Back further experiment
Effect of N and K on accuracy. Each video clip is abstracted as a symbol. Sample points from different models are shown with fitted curves and confidence intervals.
Finding 4: Multi-modal models lag far behind humans on symbolic memory tasks. Unlike humans, models show no decay with increasing temporal distance, yet degrade sharply as the number of symbols grows, revealing a fundamental inability to filter out irrelevant memory.

BibTeX

@article{m3eval2026,
  title   = {M3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks},
  author  = {Huang, Jie and Liu, Ruixun and Sun, Sirui and Yang, Xinyi and Li, Yin and Zhu, Yixin and Zhong, Yiwu},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026},
}