memory retrieval with code? - Piyussh
memory retrieval with code?<br>why this might be better than semantic search for your use-case.
Piyussh<br>Jun 23, 2026
In the past year, I’ve implemented two types of memory for LLMs, each exposed as a tool the model reaches through tool calls.<br>The first, used in Colyap.com, is a multi-layered semantic search. At the end of each call, a post-call analysis of the transcript stores its output in 4 types of columns, and each column is embedded. During a live call, the LLM decides which type of memory bank to look in based on the user's query, and a cosine-similarity search runs against the past call transcripts in that bank. Sometimes it looks in all 4 memory banks, which can be optimized with evals. It works well, and for some reason I feel it’s better than a single semantic search over one big chunk of all call transcripts. However, it works for a set of areas that fit Colyap’s case and our assumptions about what people might call in and talk about. It is also a little expensive, since some of the embedding calls are redundant, but it delivers a better (essential) user experience. The jury’s still out on how this will scale. Right now it works well for 6 months of user call transcripts, but imagine a decade. At that point recency bias has to be factored in.<br>Recency bias becomes important once memory spans years, because cosine similarity alone treats a highly relevant memory from 2020 and a highly relevant memory from 2025 almost the same, even though in a live call the newer one is usually more likely to reflect the user’s current reality. The way this can work is that you don’t replace semantic similarity; you modify the final retrieval score with a time-decay weight. For example, every memory chunk still gets its normal semantic score, say cosine_score = 0.82, but then you multiply or blend it with a recency weight based on how old that memory is. A simple version is exponential decay: recency_weight = e^(-lambda * age_in_days), where lambda controls how aggressively older memories lose priority.
So the final score may become something like final_score = cosine_score * recency_weight, or more practically final_score = alpha * cosine_score + beta * recency_weight, because sometimes older memories are still important and you don’t want age alone to wipe out older context. In the multiplied form a very old memory scores close to zero no matter how relevant it is, while the blended form always keeps at least alpha * cosine_score. If the user mentioned tooth removal in 2020 and again in 2025, both can still show up, but the 2025 memory gets a higher final score unless the 2020 one is semantically much stronger or has a special “durable fact” tag. You can also make the decay different by memory type: preferences and current life context should decay faster, while identity facts, medical history, relationships, business details, or major life events decay slower. This lets the memory system behave less like a dumb vector database and more like a useful assistant that understands that “what matters now” is not always the same as “what was once said.”
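To make the blended score concrete, here is a minimal Python sketch. It is not Colyap's implementation: the `Memory` class, the memory type names, the decay rates in `DECAY_PER_DAY`, and the default `alpha` and `beta` are all hypothetical values chosen for illustration, and the cosine scores are assumed to come from an existing semantic search.

```python
import math
from dataclasses import dataclass

# Hypothetical decay rates per day, one per memory type (illustrative, not tuned).
# A larger lambda means the memory loses priority faster.
DECAY_PER_DAY = {
    "preference": 0.01,
    "life_context": 0.005,
    "durable_fact": 0.0002,
}


@dataclass
class Memory:
    text: str
    cosine_score: float  # assumed to come from the existing semantic search
    age_in_days: float
    memory_type: str


def recency_weight(age_in_days: float, lam: float) -> float:
    """Exponential decay: e^(-lambda * age_in_days)."""
    return math.exp(-lam * age_in_days)


def final_score(m: Memory, alpha: float = 0.7, beta: float = 0.3) -> float:
    """Blend semantic similarity with recency; alpha and beta are assumptions."""
    lam = DECAY_PER_DAY[m.memory_type]
    return alpha * m.cosine_score + beta * recency_weight(m.age_in_days, lam)


def rank(memories: list[Memory], alpha: float = 0.7, beta: float = 0.3) -> list[Memory]:
    """Return memories sorted by final score, highest first."""
    return sorted(memories, key=lambda m: final_score(m, alpha, beta), reverse=True)


if __name__ == "__main__":
    hits = [
        Memory("tooth removed (2020 call)", 0.82, 5 * 365, "life_context"),
        Memory("tooth removed (2025 call)", 0.82, 30, "life_context"),
    ]
    for m in rank(hits):
        print(round(final_score(m), 3), m.text)
```

With equal cosine scores, the 2025 memory ranks first, but the 2020 one keeps its `alpha * cosine_score` share and can still be returned. Tagging a memory as `durable_fact` gives it a much smaller lambda, so its recency weight stays high for years.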
What about multiple instances of the same event at different timestamps? Say the user shared that they got a tooth removed in 2020 and mentions it again in 2025. Semantic search will return both instances with timestamps, and that’s fine, but what about other context that might be relevant before or after each mention? Maybe you bring in the whole raw call transcript once semantic search has matched the query. That might work, but what about more complicated n-degree connections? Maybe you write your tool so the LLM does multiple reads, for example when you mention on a call, “my dog, my cat, my car, my last month’s trip”. Note that if you bring in 500 tokens per chunk and pass in the top 4 matches, you’re bringing in 2,000 tokens for each topic. That might not sound very expensive, and for normal chat sessions it isn’t ($0.006 for Claude Sonnet 4.6), but for voice agents it gets expensive and increases latency. And if you get your LLM through a third-party voice provider, I’m not sure they pass on the prompt-caching discount the way the frontier labs do.
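The arithmetic above is easy to re-run for your own numbers. The sketch below is a hypothetical helper, not part of any provider's API, and the $3 per million input tokens rate is an assumption: it is simply what the $0.006 for 2,000 tokens figure implies, so check it against your provider's actual pricing.

```python
def retrieval_cost(
    tokens_per_chunk: int,
    top_k: int,
    topics: int,
    usd_per_million_input_tokens: float,
) -> tuple[int, float]:
    """Return (tokens added to the prompt, cost in USD) for one retrieval round."""
    tokens = tokens_per_chunk * top_k * topics
    cost = tokens * usd_per_million_input_tokens / 1_000_000
    return tokens, cost


if __name__ == "__main__":
    # One topic: 500 tokens per chunk, top 4 matches.
    print(retrieval_cost(500, 4, 1, 3.0))
    # Four topics in one utterance ("my dog, my cat, my car, my trip").
    print(retrieval_cost(500, 4, 4, 3.0))
```

One topic adds 2,000 tokens; the four-topic utterance adds 8,000 tokens per turn, and in a voice agent those extra tokens also have to be processed before the first audio comes back, which is where the latency concern comes from.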
Look at the health entities diagram above: if a different chunk of conversation holds one key piece of information that the LLM needs to form a whole picture of the problem, how will retrieval work here? This kind of retrieval doesn’t need to be time-constrained, since there are so many degrees of connection. If I don’t think in terms of the Bitter Lesson, and assume it’s our job to help the LLM traverse
The other type of memory is one we implemented to solve issues for a finance agent we built. The problem: you pull in 100,000 rows of financial transaction data from Plaid, but you can’t pass all of it into the LLM’s context, so what do you do? Give the LLM your memory as structured JSON, and let it write code to query it.<br>This works because our financial data, from Plaid in particular, is canonical and unambiguous.<br>"id": "txn_8f3a2c",<br>"account_id": "acc_001",<br>"date":...