Agentic search - retrieval, harness, or model?
June 8th, 2026
Agentic search gets interesting when agents do not know how to find the right answer.
Oh, the agent might think it knows. It might confidently BS us. But the agent’s poor domain intuition steers it astray.
Agents make false assumptions about what our users think is relevant. Our fashionista users think “red shoes” should return high heels. At one company where I worked, ABE wasn’t a president, it was an A/B testing tool. Agents need context to know these things, and context engineering needs agentic search.
Confusingly, depending on who you talk to, agentic search can mean one of three distinct implementation patterns:
Retrieval-centric - just build good search so agents can use it to fill in missing context
Harness-centric - steer the agent towards needed context, even with bad search
Model-centric - fine-tune an LLM to know how to search our data
In this post, I’ll walk through each. With a bit more breadth, you might appreciate which flavor your colleagues seem to mean when they say “agentic search”.
Retrieval-Centric Implementation
When frontier models don’t know, they search. Ask ChatGPT a question about news, it’ll search. Ask it about a very specific technical problem, it’ll search. During training, LLMs see search examples as a technique for learning what they don’t know. For finding what they need.
Therefore, we just need to build good search to solve any conceivable query from an agent.
Let’s assume we run an e-commerce catalog. For us, when users search “red shoes” they mean “red high heels”. But the agent doesn’t know that. Luckily, it asks search, and search returns the right results.
Initial retrieval from the lexical / vector backends pulls back some reasonable, if naive, “red shoe” candidates. Still, that’s not quite right. Luckily, the reranker shapes the results towards our understanding of that intent.
Of course we might have other components here - query understanding, diversity, custom embedding models, and more.
The important point is that search leads the agent by the nose towards what’s relevant. We assume search can define what a good “red shoe” is, overriding the agent’s perspective.
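To make the retrieve-then-rerank idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy catalog, the `INTENT_CATEGORY` lookup standing in for a trained reranker, and the function names. A real system would sit on lexical / vector backends and a learned ranking model, not a dictionary.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    category: str


# Hypothetical toy catalog, for illustration only.
CATALOG = [
    Product("Red canvas sneakers", "sneakers"),
    Product("Red leather high heels", "high heels"),
    Product("Blue suede high heels", "high heels"),
    Product("Red patent stiletto high heels", "high heels"),
    Product("Red rubber rain boots", "boots"),
]

# Assumed domain knowledge: the category our users mean by a query.
# In practice this would come from a reranker trained on user behavior.
INTENT_CATEGORY = {"red shoes": "high heels"}


def retrieve(query: str, catalog: list[Product]) -> list[Product]:
    """Naive first pass: keep any product sharing a term with the query."""
    terms = set(query.lower().split())
    return [p for p in catalog if terms & set(p.name.lower().split())]


def rerank(query: str, candidates: list[Product]) -> list[Product]:
    """Move products in the preferred category to the top (stable sort)."""
    preferred = INTENT_CATEGORY.get(query.lower())
    return sorted(candidates, key=lambda p: p.category != preferred)


def search(query: str) -> list[Product]:
    """The only thing the agent sees: retrieval followed by reranking."""
    return rerank(query, retrieve(query, CATALOG))


results = search("red shoes")
for product in results:
    print(product.name)
```

The agent never learns why heels come first. It just trusts the ordering, which is the whole bet of the retrieval-centric approach.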
When RAG answers don’t look like answers
Most teams build retrieval-centric approaches with classic RAG. Chunks of answers, and embeddings trained to recognize them as answers.
Unfortunately, answers don’t always look tied to the question. For example:
Question:
Synopsis of the book Ubik
Answer:
By the year 1992, humanity has colonized the Moon and psychic powers are common. The protagonist, Joe Chip, is a debt-ridden technician working for Runciter Associates, a “prudence organization” employing “inertials”—people with the ability to negate the powers of telepaths and “precogs”—to enforce the privacy of clients.
If you don’t know the book Ubik, it’s not clear that this answers the question. The agent says “cool story bro” and ignores the info.
Search like this is actually divorced from the web search trusted by frontier models. The web contains titles, headings, and other elements placing the answer in context.
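One way to close that gap, sketched below, is to hand the agent each chunk together with the title and heading it came from. The `Chunk` fields, the `render_for_agent` helper, and the "Synopsis" heading are hypothetical names I made up for this example, not part of any particular RAG library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str  # title of the source document (assumed to be stored with the chunk)
    heading: str    # nearest section heading (also an assumption)
    text: str       # the chunk body itself


def render_for_agent(chunk: Chunk) -> str:
    """Prefix the chunk with its title and heading so it reads like an answer."""
    return f"Title: {chunk.doc_title}\nSection: {chunk.heading}\n\n{chunk.text}"


chunk = Chunk(
    doc_title="Ubik",
    heading="Synopsis",
    text="By the year 1992, humanity has colonized the Moon and psychic powers are common.",
)
print(render_for_agent(chunk))
```

With the title and heading attached, the same paragraph now visibly answers "Synopsis of the book Ubik", much as a web result would.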
The big downside? Search remains hard. We’re not Google.
Most importantly, we don’t have perfect search like Google or Bing. Even with good search, almost nobody builds Google-quality results. And since agents trust search, simple distractors in retrieval can, as Lester Solbakken says, easily confuse reasoning.
Harness-centric implementations
Who should be in charge? Should the agent manage the search process? Or should search drive the agent?
Moving the emphasis to the harness, we put the agent in charge.
Imagine stripping search tools down to core retrieval primitives. Just a BM25 backend. Or a filesystem with CLI tools. We tell the agent to find what it needs with these untuned tools. The agent might struggle more, but hey, it’s smart, it can figure it out. Right?
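For a sense of how bare such a primitive can be, here is a rough sketch of a BM25 index in plain Python. It assumes whitespace tokenization, the common defaults `k1=1.2` and `b=0.75`, and a Lucene-style IDF. The three documents are made up, and the class is a teaching toy, not a replacement for a real search engine.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Deliberately untuned: lowercase and split on whitespace."""
    return text.lower().split()


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.2, b: float = 0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        # Number of documents containing each term.
        self.doc_freq = Counter(term for d in self.docs for term in set(d))

    def idf(self, term: str) -> float:
        n = self.doc_freq.get(term, 0)
        return math.log(1 + (len(self.docs) - n + 0.5) / (n + 0.5))

    def score(self, query: str, doc_id: int) -> float:
        doc = self.docs[doc_id]
        tf = Counter(doc)
        # Length normalization: longer-than-average docs are penalized.
        norm = self.k1 * (1 - self.b + self.b * len(doc) / self.avg_len)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def search(self, query: str, k: int = 10) -> list[tuple[float, int]]:
        """Return up to k (score, doc_id) pairs, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        hits = [(s, i) for s, i in scored if s > 0]
        return sorted(hits, key=lambda h: -h[0])[:k]


index = BM25([
    "red canvas sneakers",
    "red leather high heels",
    "blue suede high heels",
])
print(index.search("red heels"))
```

Exposed as a tool, `search` gives the agent nothing but term matching. Any notion of what "red shoes" means to our users has to come from somewhere else.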
Still, the agent might find what it thinks is relevant, as it does in the image below. Sadly that’s not what’s actually relevant. To help the agent, we inject external knowledge. We let a judge direct the agent, correcting its mistakes, and guiding it towards better search strategies.
So when the agent returns results, it’s not the “user” that receives the candidates. Instead a judge labels results as relevant or not. The agent gets the hint, finding results similar to those labeled relevant. Avoiding those labeled irrelevant. To steal from Jo Kristian Bergum, this is relevance feedback on steroids.
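The loop described above can be sketched roughly as follows. To keep it runnable, the LLM agent is replaced by a crude stand-in (`refine`) that adds terms seen only in results labeled relevant, and the judge is a dictionary of made-up labels. The catalog, the labels, and all function names are assumptions for illustration.

```python
# Hypothetical toy catalog: doc id -> text.
CATALOG = {
    "p1": "red canvas sneakers",
    "p2": "red leather high heels",
    "p3": "red patent stiletto heels",
    "p4": "red rubber rain boots",
    "p5": "black stiletto heels",
}

# Made-up oracle judgments for the query "red shoes".
JUDGMENTS = {"p1": False, "p2": True, "p3": True, "p4": False}


def judge(doc_id: str) -> bool:
    """Oracle judge: unlabeled documents count as not relevant."""
    return JUDGMENTS.get(doc_id, False)


def search(query_terms: set[str], k: int = 2) -> list[str]:
    """Untuned primitive: rank by number of overlapping terms, ties by id."""
    scored = []
    for doc_id, text in CATALOG.items():
        overlap = len(query_terms & set(text.split()))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def refine(query_terms: set[str], labeled: dict[str, bool]) -> set[str]:
    """Stand-in for the agent: add terms seen only in relevant results."""
    relevant, irrelevant = set(), set()
    for doc_id, is_relevant in labeled.items():
        (relevant if is_relevant else irrelevant).update(CATALOG[doc_id].split())
    return query_terms | (relevant - irrelevant)


def feedback_loop(query: str, rounds: int = 3, k: int = 2) -> list[str]:
    terms = set(query.split())
    labeled: dict[str, bool] = {}
    results: list[str] = []
    for _ in range(rounds):
        results = search(terms, k)
        labeled.update({doc_id: judge(doc_id) for doc_id in results})
        if all(labeled[doc_id] for doc_id in results):
            break  # the judge is satisfied with every returned result
        terms = refine(terms, labeled)
    return results


print(feedback_loop("red shoes"))
```

On the first round the search returns a sneaker and a heel. The judge's labels push the query towards "leather", "high", and "heels", and the second round returns only heels. A real agent would reformulate far more cleverly, but the shape of the loop is the same.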
This works. Look at the ESCI dataset. If I have an oracle labeling results with judgments from this dataset, I get quite significant improvements.
Variant | NDCG@10 | Description
ESCI BM25 | 0.2895 | Simple BM25 weighing name / description
ESCI...