Why Your AI Agent Needs Better Temporal Reasoning–and How We Fixed It

Vektor Memory | Jun, 2026 | 10 min read

Most agent memory systems treat stored facts as a flat list. There’s no sense of when a fact was true, whether it’s been superseded, or how to reason about time at all. This is the story of how we diagnosed the problem, found the research, and built a production fix in Node.js and SQLite: no Python, no subprocess, no academic overhead.

We received a great question from a reader along these lines: “Temporal reasoning is the one I’d love to see more research on. Most retrieval systems treat memory as a flat bag of facts and the agent has no way to know that a fact from yesterday supersedes one from last month.”

That question sent us down a rabbit hole through arXiv late one evening before dinner. We found a handful of papers attacking the problem from different angles: temporal knowledge graphs, bi-temporal storage models, neuro-symbolic reasoning pipelines. Most of them were Python. All of them were academic. None of them were something you could drop into a production Node.js agent without a rewrite.

Which brought us back to a decision we made early in VEKTOR’s life and have never regretted: Node.js over Python. Not because Python isn’t excellent: it’s the flavour of the month for good reason, and the ML ecosystem built on top of it is genuinely world-class. We chose Node.js for concurrent execution, for file I/O speed, and for the event loop model that makes agent tooling feel snappy rather than sluggish. That choice closes certain doors. It also opens others.

The most interesting paper we found was TReMu: Temporal Reasoning for LLM-Agents in Multi-Session Dialogues, out of UIUC and AWS. Their framework takes GPT-4o from about 30% to about 78% accuracy on temporal questions. The mechanism: resolve relative time expressions at ingest, then use Python to execute date arithmetic at query time. The Python part we had to throw away. What we built instead is arguably cleaner. We think so, anyway.

The Problem Nobody Talks About

Ask your AI agent “what database are we using?” and it will confidently answer with whatever it last stored, even if that fact is six months stale and three decisions out of date. This isn’t a hallucination problem. The fact is real. It was true. It’s just not true anymore.

Most retrieval-augmented memory systems treat stored memories as a flat collection ranked by semantic similarity and recency. There’s no explicit model of when a fact was true in the world versus when the agent learned it. There’s no mechanism to mark a fact as superseded by a newer one. And there’s certainly no way to answer questions like “how long between when we decided on Redis and when we migrated away from it?”

The agent doesn’t know what it doesn’t know about time. It lives in an eternal present, every fact equally valid, forever.
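To make the missing pieces concrete, here is a minimal sketch of a memory record that separates the date a fact became true from the date the agent learned it, and that marks older facts as superseded. It is an illustration only, not VEKTOR's schema: the in-memory array, the field names (`eventDate`, `ingestedAt`, `supersededBy`), the helper names, and the example values are all assumptions made for this example.

```javascript
// Hypothetical in-memory store; a real system would persist these rows.
const memories = [];

// Store a fact with two separate dates (ISO 'YYYY-MM-DD' strings):
//   eventDate  = when the fact became true in the world
//   ingestedAt = when the agent learned it
function remember({ key, value, eventDate, ingestedAt }) {
  const record = {
    id: memories.length + 1,
    key,
    value,
    eventDate,
    ingestedAt,
    supersededBy: null, // id of the newer fact that replaces this one
  };

  // Find the fact currently considered valid for this key, if any.
  const active = memories.find(
    (m) => m.key === key && m.supersededBy === null
  );
  if (active) {
    // ISO date strings sort chronologically, so string comparison is enough.
    if (eventDate >= active.eventDate) {
      active.supersededBy = record.id; // new fact wins
    } else {
      record.supersededBy = active.id; // late-arriving older fact loses
    }
  }

  memories.push(record);
  return record;
}

// Return the one fact for this key that has not been superseded.
function currentFact(key) {
  return memories.find((m) => m.key === key && m.supersededBy === null) ?? null;
}

// Example values are made up for illustration.
remember({ key: 'database', value: 'Redis', eventDate: '2025-01-10', ingestedAt: '2025-01-10' });
remember({ key: 'database', value: 'PostgreSQL', eventDate: '2025-06-02', ingestedAt: '2025-06-03' });

console.log(currentFact('database').value); // → PostgreSQL
```

The old Redis record is not deleted. It stays in the store with a pointer to its replacement, so questions about history can still be answered while "what database are we using?" returns only the current fact.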

The Research: TReMu

A team from the University of Illinois and AWS published TReMu: Temporal Reasoning for LLM-Agents in Multi-Session Dialogues (revised 24 Sep 2025). The paper is worth reading in full, but the headline number is striking: standard GPT-4o prompting on temporal reasoning questions scores 29.83%. Their framework scores 77.67%. That’s a roughly 48-point jump on questions humans find trivially easy.

What were those questions? Three types:

Temporal Anchoring: “When exactly did this happen?” A user says “I went to the seminar last Monday.” When was that? Most systems store the ingestion timestamp, not the event date. They’re different.

Temporal Precedence: “Which of these two things happened first?” This requires knowing the actual order of events across multiple sessions, not just which was stored most recently.

Temporal Interval: “How long between these two events?” This needs real date arithmetic. Not vibes. Not “a while ago.” Actual days.

The paper’s solution has two parts:

Time-aware memorization: at ingest time, resolve relative time expressions (“last Friday,” “two weeks ago”) into concrete calendar dates. Store the event date separately from the ingestion date.

Neuro-symbolic temporal reasoning: at query time, generate Python code to perform the date arithmetic, execute it, and use the output to answer the question.

Part 1 is clearly right. Part 2 is a good idea wrapped in academic scaffolding, which we had to rework before it would fit our production architecture.
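The three question types and the ingest-time resolution step can be sketched in plain JavaScript with no dependencies. This is a rough illustration under stated assumptions, not the paper's code or VEKTOR's implementation: the function names are hypothetical, only a handful of phrasings are handled, all dates are treated as UTC calendar days, and "last Monday" is assumed to mean the most recent Monday strictly before the ingestion date.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;
const WEEKDAYS = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];
const NUMBER_WORDS = new Map([
  ['a', 1], ['one', 1], ['two', 2], ['three', 3], ['four', 4], ['five', 5], ['six', 6],
]);

// Work in UTC so day arithmetic is not affected by daylight saving time.
const toUtcDate = (iso) => new Date(`${iso}T00:00:00Z`);
const toIso = (date) => date.toISOString().slice(0, 10);
const shiftDays = (date, days) => new Date(date.getTime() + days * DAY_MS);

// Temporal anchoring: turn a relative expression into a calendar date,
// relative to the date the statement was ingested. Returns null if unsupported.
function resolveRelative(expression, ingestedIso) {
  const anchor = toUtcDate(ingestedIso);
  const text = expression.toLowerCase().trim();

  if (text === 'yesterday') return toIso(shiftDays(anchor, -1));

  const ago = text.match(/^(\d+|[a-z]+) (day|week)s? ago$/);
  if (ago) {
    const n = /^\d+$/.test(ago[1]) ? Number(ago[1]) : NUMBER_WORDS.get(ago[1]);
    if (n === undefined) return null;
    return toIso(shiftDays(anchor, -n * (ago[2] === 'week' ? 7 : 1)));
  }

  const last = text.match(/^last ([a-z]+)$/);
  if (last) {
    const target = WEEKDAYS.indexOf(last[1]);
    if (target === -1) return null;
    // Assumption: most recent such weekday strictly before the anchor (1 to 7 days back).
    const back = ((anchor.getUTCDay() - target + 6) % 7) + 1;
    return toIso(shiftDays(anchor, -back));
  }

  return null;
}

// Temporal precedence: which event happened first, by event date (not ingestion date).
function happenedFirst(a, b) {
  return a.eventDate <= b.eventDate ? a : b;
}

// Temporal interval: whole days from one ISO date to another.
function daysBetween(fromIso, toIsoDate) {
  return Math.round((toUtcDate(toIsoDate) - toUtcDate(fromIso)) / DAY_MS);
}

// 2025-09-24 is a Wednesday.
console.log(resolveRelative('last Monday', '2025-09-24'));   // → 2025-09-22
console.log(resolveRelative('two weeks ago', '2025-09-24')); // → 2025-09-10
console.log(daysBetween('2025-09-10', '2025-09-22'));        // → 12
```

Resolving the expression once at ingest means the stored event date stays correct no matter when the memory is read back, and the precedence and interval questions reduce to a comparison and a subtraction on stored dates.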

What’s Wrong With Python Subprocess

The paper uses Python because it’s running in a research notebook. dateutil.relativedelta is genuinely excellent for date math. But "generate Python, exec it, parse stdout" as a production pattern has real problems:

Latency: cold subprocess startup on every temporal query

Security: LLM-generated code execution is...
