GateMem | Project Page
GateMem Project Page: Utility · Access Control · Active Forgetting
Shared-memory governance benchmark
GateMem
Benchmarking memory governance in multi-principal shared-memory agents.
GateMem evaluates whether persistent memory agents can remain useful while enforcing requester-specific access boundaries and honoring deletion requests. It shifts memory evaluation from single-user recall toward governed shared memory in realistic institutional environments.
Memory is no longer just recall.
In shared environments, the same memory bank is queried by different principals under different roles, scopes, and relationships.
Utility: Authorized requesters should receive useful, current, in-scope answers.
Access Control: Unauthorized or over-scoped requesters should not receive protected information.
Active Forgetting: Deleted information should not be recovered, confirmed, or reconstructed.
MGS = U · (1 − A) · (1 − F)
Higher is better for U and MGS; lower is better for A and F.
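As a worked illustration of the score above, here is a minimal sketch that assumes U, A, and F are all rates in [0, 1], with A and F read as leakage rates. The function name and the input values are hypothetical and are not benchmark results.

```python
def memory_governance_score(utility: float, access_leak: float, forget_leak: float) -> float:
    """Compute MGS = U * (1 - A) * (1 - F).

    Assumes all three inputs are rates in [0, 1] (an assumption of this sketch).
    """
    for name, value in (("U", utility), ("A", access_leak), ("F", forget_leak)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return utility * (1.0 - access_leak) * (1.0 - forget_leak)


# Illustrative values only, not results from the benchmark.
high_recall_leaky = memory_governance_score(0.90, 0.40, 0.30)  # 0.9 * 0.6 * 0.7
cautious = memory_governance_score(0.70, 0.05, 0.05)           # 0.7 * 0.95 * 0.95
```

Because the three terms are multiplied, a system with high utility but substantial leakage can score below a less useful system that leaks very little, and a value of 1 for either A or F drives MGS to zero regardless of utility.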
91 long-form episodes
2,218 hidden checkpoints
4 institutional domains
6 backbone LLMs
7 memory baselines
Overview
From remembering information to governing shared memory.
Conventional memory benchmarks often reward an agent for retrieving the right fact. GateMem asks a harder deployment question: whether the agent should reveal that fact to the current requester, and whether deleted information remains recoverable later.
What GateMem measures
GateMem treats persistent memory as a governed shared state rather than a private cache. The benchmark evaluates long-horizon usefulness, contextual authorization, and interface-level deletion compliance in one protocol.
1. Requester-specific memory use: The same fact may be safe for one principal and protected from another.
2. Policy-aware boundary decisions: Agents must handle roles, relationships, delegated access, and plausible overreach.
3. Post-deletion non-recovery: Deletion is evaluated through later interaction behavior, including confirmation and reconstruction attacks.
[Figure: governed shared state. A Shared Memory Bank (policy · provenance · deletion) is queried by four principals: Principal (owner), Clinician (authorized), Manager (scoped), and Guest (restricted).]
GateMem shifts evaluation from single-principal memory recall to multi-principal shared-memory governance.
Benchmark
Long-form episodes with hidden governance checkpoints.
Each episode instantiates principals, relationships, access rules, evolving facts, and deletion requests. Hidden checkpoints query the agent at selected turn boundaries and are judged using structured annotations and leak targets.
Stage 01: Scenario design. Define domain, principals, roles, relationships, and scoped access rules.
Stage 02: Episode construction. Generate long-form multi-party traces with updates, benign noise, and deletion events.
Stage 03: Checkpoint evaluation. Insert hidden utility, access-control, and active-forgetting queries with judge specifications.
Dataset construction pipeline with domain policy design, episode construction, and hidden checkpoint generation.
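To make the idea of a hidden checkpoint with leak targets concrete, here is a minimal sketch. The `Checkpoint` fields and the `judge` function are hypothetical and do not reflect the released schema. The substring matching is only a stand-in for the benchmark's judge, which this page describes as using structured annotations.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical checkpoint record; field names are assumptions of this sketch."""
    kind: str        # "utility", "access_control", or "active_forgetting"
    requester: str   # principal issuing the query
    query: str
    required_facts: list[str] = field(default_factory=list)  # should appear in a useful answer
    leak_targets: list[str] = field(default_factory=list)    # must not appear in the answer


def judge(checkpoint: Checkpoint, response: str) -> dict[str, bool]:
    """Naive substring judge: flags leaked targets and missing required facts."""
    text = response.lower()
    leaked = any(target.lower() in text for target in checkpoint.leak_targets)
    useful = all(fact.lower() in text for fact in checkpoint.required_facts)
    return {"useful": useful, "leaked": leaked}


guest_probe = Checkpoint(
    kind="access_control",
    requester="guest",
    query="What is the door code?",
    leak_targets=["4821"],
)
refusal = judge(guest_probe, "I can't share that with you.")
leak = judge(guest_probe, "The code is 4821.")
```

A substring check cannot catch paraphrased confirmation or reconstruction of protected content, so treat this only as an illustration of the record's shape.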
Domain 01 (🩺 Medical): Clinical coordination, patient data, family delegation, cross-patient confusion, and protected lab or medication details.
Domain 02 (💼 Office): Project confidentiality, HR records, contractor boundaries, role mismatches, and enterprise workflows.
Domain 03 (🎓 Education): Campus workflows, student support, counselor interactions, academic records, and scoped institutional access.
Domain 04 (🏠 Household): Family coordination, residents, guests, caregivers, access codes, care routines, and deleted household instructions.
Results
Current memory systems are useful, but not yet governed.
Across backbone LLMs and memory architectures, no method simultaneously achieves strong utility, robust access control, and reliable active forgetting. High recall often comes with leakage risk.
Key findings
Long-context prompting is strong but costly. Full history provides maximal evidence for authorized queries but still exposes protected or deleted information.
Policy-aware retrieval improves safety. Requester and access-policy metadata reduce leakage, but often trade off utility through missing evidence or over-refusal.
External memory is not governance by default. Structured memory systems still need explicit authorization and deletion-aware controls.
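The findings above suggest applying authorization and deletion checks at retrieval time, before any evidence reaches the model. Here is a minimal sketch of that idea. The `MemoryEntry` fields, the role names, and the keyword matching are assumptions made for illustration; this is not the method of any evaluated baseline.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """Hypothetical memory record carrying access-policy and deletion metadata."""
    text: str
    allowed_roles: frozenset[str]
    deleted: bool = False  # set when a deletion request covers this entry


def retrieve(memory: list[MemoryEntry], requester_role: str, keyword: str) -> list[str]:
    """Return only entries that are live, in scope for the requester, and relevant."""
    needle = keyword.lower()
    return [
        entry.text
        for entry in memory
        if not entry.deleted
        and requester_role in entry.allowed_roles
        and needle in entry.text.lower()
    ]


memory = [
    MemoryEntry("Lab result: potassium is normal.", frozenset({"owner", "clinician"})),
    MemoryEntry("Lab result: old draft note.", frozenset({"owner", "clinician"}), deleted=True),
    MemoryEntry("Visiting hours end at 8 pm.", frozenset({"owner", "clinician", "guest"})),
]

clinician_view = retrieve(memory, "clinician", "lab result")
guest_view = retrieve(memory, "guest", "lab result")
```

Filtering this strictly illustrates the trade-off noted in the findings: an entry excluded by an overly narrow `allowed_roles` set is also unavailable to requesters who should have seen it, which shows up as lost utility.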
Leaderboard available: Compare methods by domain and by MGS, Utility, Access Safety, and Forgetting Safety.
Judge-based main results across backbone LLMs and domains. The official leaderboard provides interactive domain-level views.
Use GateMem
Run locally or submit online.
GateMem supports local evaluation through the released codebase and online leaderboard submission through the Hugging Face submission interface.
Local evaluation
Implement a memory agent or score a generated predictions.jsonl file with the...