MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

MemEye is a vision-centric long-term memory benchmark designed to evaluate how agents remember and reason over long-running, image-grounded interactions. The benchmark assesses agents' abilities to retain and use visual information across multi-session conversations, including memory of long-tail visual details, visual state updates, and evolving user-centric contexts.

The dataset consists of user-centric multi-session dialogues paired with associated images and human-annotated questions. Each task is provided in both multiple-choice and open-ended formats, enabling evaluation under both constrained-choice and generative settings.

Posted by Zeru Shi (DarkBluee); last updated 2026-05-15.
