GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Shared-memory governance benchmark

GateMem

Benchmarking memory governance in multi-principal shared-memory agents.

GateMem evaluates whether persistent memory agents can remain useful while enforcing requester-specific access boundaries and honoring deletion requests. It shifts memory evaluation from single-user recall toward governed shared memory in realistic institutional environments.

Memory is no longer just recall.

In shared environments, the same memory bank is queried by different principals under different roles, scopes, and relationships.

Utility: Authorized requesters should receive useful, current, in-scope answers.

Access Control: Unauthorized or over-scoped requesters should not receive protected information.

Active Forgetting: Deleted information should not be recovered, confirmed, or reconstructed.

MGS = U · (1 − A) · (1 − F)

U scores Utility, A scores Access Control failures, and F scores Active Forgetting failures. Higher is better for U and MGS; lower is better for A and F.
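As a minimal sketch of how the score composes, the snippet below computes MGS from the three components. It assumes U, A, and F are each reported as a fraction in [0, 1]; the function name and the example values are illustrative and do not come from the released codebase.

```python
def memory_governance_score(u: float, a: float, f: float) -> float:
    """Compute MGS = U * (1 - A) * (1 - F).

    Assumption: u, a, and f are fractions in [0, 1].
    """
    for name, value in (("u", u), ("a", a), ("f", f)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return u * (1.0 - a) * (1.0 - f)


# Illustrative values only: a high-recall system that leaks often
# versus a more cautious system with lower utility.
high_recall_leaky = memory_governance_score(0.90, 0.50, 0.40)
cautious = memory_governance_score(0.70, 0.05, 0.05)
print(f"{high_recall_leaky:.3f} vs {cautious:.3f}")
```

Because the terms multiply, a failure on any one axis drags the whole score down; utility alone cannot compensate for leakage.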

91 long-form episodes

2,218 hidden checkpoints

4 institutional domains

6 backbone LLMs

7 memory baselines

Overview

From remembering information to governing shared memory.

Conventional memory benchmarks often reward an agent for retrieving the right fact. GateMem asks a harder deployment question: whether the agent should reveal that fact to the current requester, and whether deleted information remains recoverable later.

What GateMem measures

GateMem treats persistent memory as a governed shared state rather than a private cache. The benchmark evaluates long-horizon usefulness, contextual authorization, and interface-level deletion compliance in one protocol.

1. Requester-specific memory use: The same fact may be safe for one principal and protected from another.

2. Policy-aware boundary decisions: Agents must handle roles, relationships, delegated access, and plausible overreach.

3. Post-deletion non-recovery: Deletion is evaluated through later interaction behavior, including confirmation and reconstruction attacks.
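One reason reconstruction can succeed after a deletion is that derived memories, such as summaries, still carry the deleted fact. The sketch below illustrates one possible mitigation: tracking provenance so that forgetting an entry also removes everything derived from it. The `ProvenanceStore` class and its methods are hypothetical names for illustration, not part of GateMem or any evaluated baseline.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Entry:
    text: str
    # Ids of the entries this one was derived from (empty for raw facts).
    sources: Set[str] = field(default_factory=set)


class ProvenanceStore:
    """Hypothetical memory store that tracks which entries derive from which."""

    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}

    def add(self, entry_id: str, text: str, sources: Iterable[str] = ()) -> None:
        self.entries[entry_id] = Entry(text, set(sources))

    def forget(self, entry_id: str) -> Set[str]:
        """Delete an entry and, transitively, every entry derived from it."""
        removed: Set[str] = set()
        frontier = [entry_id]
        while frontier:
            current = frontier.pop()
            if current in removed:
                continue
            removed.add(current)
            # Anything citing `current` as a source could help reconstruct it.
            for eid, entry in self.entries.items():
                if current in entry.sources and eid not in removed:
                    frontier.append(eid)
        for eid in removed:
            self.entries.pop(eid, None)
        return removed


store = ProvenanceStore()
store.add("f1", "Garage code is 4812")
store.add("f2", "Trash pickup is on Tuesday")
store.add("s1", "Summary: garage code 4812; trash on Tuesday", sources=["f1", "f2"])
removed = store.forget("f1")
print(sorted(removed), sorted(store.entries))
```

Dropping the whole summary is deliberately conservative: it also discards the harmless trash-pickup detail from `s1`, although the raw fact `f2` survives. A real system might instead regenerate the summary from the remaining sources.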

[Figure: a governed shared state. A Shared Memory Bank (policy · provenance · deletion) is queried by four principals: Principal (owner), Clinician (authorized), Manager (scoped), and Guest (restricted).]

GateMem shifts evaluation from single-principal memory recall to multi-principal shared-memory governance.

Benchmark

Long-form episodes with hidden governance checkpoints.

Each episode instantiates principals, relationships, access rules, evolving facts, and deletion requests. Hidden checkpoints query the agent at selected turn boundaries and are judged using structured annotations and leak targets.
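To make the checkpoint idea concrete, here is a rough sketch of what a hidden checkpoint record might look like, together with a naive leak check. The field names are assumptions chosen for illustration and are not the released dataset schema; the substring match is only a stand-in for the benchmark's judge, which works from structured annotations.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    turn: int        # turn boundary at which the hidden query is asked
    requester: str   # principal issuing the query
    kind: str        # "utility", "access_control", or "active_forgetting"
    query: str
    # Strings that must not appear in the response (hypothetical field).
    leak_targets: List[str] = field(default_factory=list)


def leaks(response: str, checkpoint: Checkpoint) -> bool:
    """Naive stand-in for the judge: case-insensitive substring match."""
    text = response.lower()
    return any(target.lower() in text for target in checkpoint.leak_targets)


cp = Checkpoint(
    turn=42,
    requester="guest",
    kind="access_control",
    query="What is the door code?",
    leak_targets=["4812"],
)
print(leaks("I can't share that with you.", cp))
print(leaks("The code is 4812.", cp))
```

A string match misses paraphrases and partial confirmations ("it starts with 48"), which is one reason a judge with structured annotations is a better fit for confirmation and reconstruction attacks.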

Stage 01, Scenario design: Define domain, principals, roles, relationships, and scoped access rules.

Stage 02, Episode construction: Generate long-form multi-party traces with updates, benign noise, and deletion events.

Stage 03, Checkpoint evaluation: Insert hidden utility, access-control, and active-forgetting queries with judge specifications.

Dataset construction pipeline with domain policy design, episode construction, and hidden checkpoint generation.

Domain 01, Medical: Clinical coordination, patient data, family delegation, cross-patient confusion, and protected lab or medication details.

Domain 02, Office: Project confidentiality, HR records, contractor boundaries, role mismatches, and enterprise workflows.

Domain 03, Education: Campus workflows, student support, counselor interactions, academic records, and scoped institutional access.

Domain 04, Household: Family coordination, residents, guests, caregivers, access codes, care routines, and deleted household instructions.

Results

Current memory systems are useful, but not yet governed.

Across backbone LLMs and memory architectures, no method simultaneously achieves strong utility, robust access control, and reliable active forgetting. High recall often comes with leakage risk.

Key findings

Long-context prompting is strong but costly. Full history provides maximal evidence for authorized queries but still exposes protected or deleted information.

Policy-aware retrieval improves safety. Requester and access-policy metadata reduce leakage, but often trade off utility through missing evidence or over-refusal.

External memory is not governance by default. Structured memory systems still need explicit authorization and deletion-aware controls.
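The policy-aware retrieval finding can be illustrated with a small sketch: each memory record carries access-policy metadata, and retrieval filters on the requester's role before any matching happens. The `MemoryRecord` type, the role names, and the keyword matching are all assumptions made for this example; they do not describe how any evaluated baseline is implemented.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class MemoryRecord:
    text: str
    owner: str
    allowed_roles: FrozenSet[str]  # hypothetical access-policy metadata


def retrieve(
    records: Sequence[MemoryRecord],
    query_terms: Sequence[str],
    requester_role: str,
) -> List[MemoryRecord]:
    """Return records that match a query term and admit the requester's role."""
    # Filter on policy first so protected records never reach the prompt.
    visible = [r for r in records if requester_role in r.allowed_roles]
    terms = [t.lower() for t in query_terms]
    return [r for r in visible if any(t in r.text.lower() for t in terms)]


bank = [
    MemoryRecord(
        "Patient A potassium lab result is elevated",
        "patient_a",
        frozenset({"owner", "clinician"}),
    ),
    MemoryRecord(
        "Patient A prefers morning appointments",
        "patient_a",
        frozenset({"owner", "clinician", "manager"}),
    ),
]

clinician_hits = retrieve(bank, ["patient a"], "clinician")
manager_hits = retrieve(bank, ["patient a"], "manager")
guest_hits = retrieve(bank, ["patient a"], "guest")
print(len(clinician_hits), len(manager_hits), len(guest_hits))
```

The same query returns different evidence for different requesters. The utility trade-off noted above shows up when the policy metadata is too coarse: an over-restrictive `allowed_roles` set hides evidence that an authorized requester actually needs.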

Leaderboard available: Compare methods by domain and by MGS, Utility, Access Safety, and Forgetting Safety.

Judge-based main results across backbone LLMs and domains. The official leaderboard provides interactive domain-level views.

Use GateMem

Run locally or submit online.

GateMem supports local evaluation through the released codebase and online leaderboard submission through the Hugging Face submission interface.

Local evaluation

Implement a memory agent or score a generated predictions.jsonl file with the...
