Model Provider Variance in Structured Extraction · Guilherme Costa
TL;DR
We spent several hours debugging what looked like a prompt-engineering issue in a structured extraction pipeline. It turned out not to be a prompt issue.
The same OpenRouter model (google/gemini-3-flash-preview) was being routed to different upstream providers, and one provider consistently failed to extract certain fields when assistant history contained list-shaped content.
Practical lessons:
log the routed provider
treat provider identity as part of the request fingerprint
be careful with latency-based routing on structured-output workloads
For structured-output workloads, provider choice can be part of correctness, not just latency or cost.
We run an LLM workflow that extracts structured constraints from conversation history and feeds them into a planning state machine.
One of the user flows kept failing. The conversation looked roughly like this:
User: "I want to go to Brazil in two weeks."
Assistant: suggests destination cities.
User: "Let's go for Rio! We're two adults. No specific dates. Budget less than 5000 euros."
At that point the planner should have advanced toward itinerary generation, since it had captured all necessary constraints: destinations, travellers, budget and date flexibility.
Instead, the extraction layer returned:
{
  "destinations": [{ "type": "fixed", "options": ["Rio"] }],
  "travellers": { "adults": 2 }
}
The budget and date flexibility were missing.
The state machine interpreted this as an incomplete planning state and routed the conversation back into information gathering.
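To make the failure mode concrete, here is a minimal sketch of the kind of completeness check that would send this conversation back to information gathering. The field names follow the JSON in this post; the `PlanningState` type, the step names, and the shape of `budget` are assumptions for illustration, not our actual planner code.

```typescript
// Hypothetical shape of the extracted planning state. Field names follow the
// JSON shown in this post; the exact types are assumptions.
interface PlanningState {
  destinations?: { type: string; options: string[] }[];
  travellers?: { adults: number };
  budget?: unknown;
  flexible_dates?: boolean;
}

const REQUIRED_FIELDS = ["destinations", "travellers", "budget", "flexible_dates"] as const;
type RequiredField = (typeof REQUIRED_FIELDS)[number];

// Returns the required fields that are absent from the extraction result.
function missingFields(state: PlanningState): RequiredField[] {
  return REQUIRED_FIELDS.filter((field) => state[field] === undefined);
}

// The planner advances only when nothing is missing; otherwise it keeps asking.
// Step names are hypothetical.
function nextStep(state: PlanningState): "generate_itinerary" | "gather_information" {
  return missingFields(state).length === 0 ? "generate_itinerary" : "gather_information";
}

// The extraction result from the failing flow: budget and flexible_dates absent.
const failing: PlanningState = {
  destinations: [{ type: "fixed", options: ["Rio"] }],
  travellers: { adults: 2 },
};
```

With a check like this, a silently dropped field is indistinguishable from a field the user never mentioned, which is why the extraction error surfaced as a conversation loop rather than as an exception.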
Theory 1: prompt specificity. Our first assumption was that the prompt was not explicit enough. The extraction prompt mentioned budget mostly in negative examples ("don't infer budget from destination", etc.), so we added explicit extraction guidance:
Set budget whenever the user explicitly states a price expectation, whether numeric or qualitative.
No meaningful improvement.
Theory 2: schema shape. Maybe the optional-field schema was making it too easy for the model to emit a minimally valid object and stop early. We tried two schema changes:
Add a mandatory field-scan checklist, in case the model was skipping optional fields too quickly.
Add a wrapper schema forcing every field into either extracted or not_mentioned, in case the model was emitting a minimally valid object and stopping early.
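The wrapper idea from the second item can be sketched as a type plus a structural check that every field lands in exactly one bucket. This is a minimal illustration, not our production schema: the `WrappedExtraction` type and the `partitionErrors` helper are hypothetical, and only the field and bucket names come from this post.

```typescript
// Field names come from this post; the wrapper type itself is an assumption
// about how such a schema could be modelled.
const ALL_FIELDS = ["destinations", "travellers", "budget", "flexible_dates"] as const;
type FieldName = (typeof ALL_FIELDS)[number];

interface WrappedExtraction {
  extracted: Partial<Record<FieldName, unknown>>;
  not_mentioned: FieldName[];
}

// Checks that every field is classified exactly once. This catches omissions
// and double classification, but not a field wrongly placed in not_mentioned.
function partitionErrors(result: WrappedExtraction): string[] {
  const errors: string[] = [];
  for (const field of ALL_FIELDS) {
    const isExtracted = result.extracted[field] !== undefined;
    const isNotMentioned = result.not_mentioned.indexOf(field) !== -1;
    if (isExtracted && isNotMentioned) errors.push(`${field}: in both buckets`);
    if (!isExtracted && !isNotMentioned) errors.push(`${field}: in neither bucket`);
  }
  return errors;
}

// A result that silently omits fields is now detectable.
const omitted: WrappedExtraction = {
  extracted: { destinations: [{ type: "fixed", options: ["Rio"] }] },
  not_mentioned: [],
};
console.log(partitionErrors(omitted));
```

A check like this only verifies structure. It cannot tell whether a field listed under `not_mentioned` was in fact stated by the user, so it moves the failure rather than removing it.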
Neither fixed it. The wrapper actually made things worse.
In some runs the model confidently classified explicitly stated fields as "not mentioned":
{
  "extracted": {
    "destinations": [{ "type": "fixed", "options": ["Rio"] }],
    "travellers": { "adults": 2 }
  },
  "not_mentioned": [
    "budget",
    "flexible_dates"
  ]
}
Theory 3: conversation history. Maybe assistant history was contaminating extraction. Removing assistant history made the extraction succeed reliably. Reintroducing assistant history caused failures to return.
Initially we thought the model was being confused by numerical values in the assistant response, such as:
flight prices
EUR amounts
durations
But the pattern was more specific than that.
| Assistant history | Result |
| --- | --- |
| Full destination suggestions with prices | Fail |
| Trailing conversational question only | Pass |
| Numbered destination list without prices | Fail |
| No assistant history | Pass |
The failure now looked more correlated with the shape of the assistant message than with its size or numerical content: a long conversational paragraph passed, while a short numbered list failed. We were close to implementing an architectural workaround, stripping suggestion lists before extraction, but first we reran the failing case several times to confirm the pattern. It did not hold. One run passed unexpectedly, then failed again, then passed. Something outside the prompt, schema and visible message history had to be changing.
After the prompt and schema theories failed, we started looking for anything that could differ between otherwise identical extraction calls. The application was using OpenRouter with this model:
model: "google/gemini-3-flash-preview"
and latency-based routing:
sort: "latency"
That meant two calls with the same model name, prompt, schema and temperature could still be routed to different upstream providers depending on real-time latency. We were not logging the routed provider (in retrospect, we should have been). We added a small helper:
export function extractProvider(result: unknown): string | null {
  // View the untyped result through the two places a provider name may appear.
  const r = result as {
    providerMetadata?: { openrouter?: { providerName?: string } };
    response?: { body?: { provider?: string } };
  };

  // Prefer providerMetadata, then the raw response body, else null.
  return (
    r?.providerMetadata?.openrouter?.providerName ??
    r?.response?.body?.provider ??
    null
  );
}
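One way to act on "treat provider identity as part of the request fingerprint" is to fold the routed provider into whatever key you use to group or compare calls in logs. The sketch below is an illustration under assumptions: `ExtractionRequest`, `requestFingerprint`, and the `schemaVersion` field are hypothetical names, and the FNV-1a hash is a stand-in chosen to keep the example dependency-free.

```typescript
// Hypothetical request descriptor; the field set is an assumption.
interface ExtractionRequest {
  model: string;
  prompt: string;
  schemaVersion: string;
  temperature: number;
}

// FNV-1a over UTF-16 code units: a small non-cryptographic hash, enough for
// grouping log lines. Use a cryptographic hash if collisions matter.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

// The provider is part of the key, so two calls that differ only in routing
// no longer look identical in logs.
function requestFingerprint(req: ExtractionRequest, provider: string | null): string {
  // Serialize as a fixed-order array so the same request always hashes the same.
  const canonical = JSON.stringify([
    req.model,
    req.prompt,
    req.schemaVersion,
    req.temperature,
    provider ?? "unknown",
  ]);
  return fnv1a(canonical);
}

const request: ExtractionRequest = {
  model: "google/gemini-3-flash-preview",
  prompt: "extract constraints",
  schemaVersion: "v1",
  temperature: 0,
};
// Placeholder provider names, not real OpenRouter identifiers.
const fingerprintA = requestFingerprint(request, "provider-a");
const fingerprintB = requestFingerprint(request, "provider-b");
```

In practice the provider passed in would be the value returned by `extractProvider` for that call, logged alongside the fingerprint.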
Once provider information was visible in logs, the behavior became reproducible almost immediately.
We reran the evaluation:
5 runs per provider
multiple history shapes
temperature 0
identical prompt and schema
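An evaluation like the one listed above can be sketched as a loop over providers and history shapes with a fixed number of repetitions. This is not our actual harness: `runCase` is a hypothetical callback that performs one pinned extraction and reports pass or fail, the provider and shape names are whatever you pass in, and `pinnedProviderOptions` assumes OpenRouter's provider routing options (`order` with `allow_fallbacks: false`) as a way to force a single upstream provider.

```typescript
// Hypothetical callback: runs one extraction pinned to `provider` with the
// given history shape and resolves to true if all expected fields came back.
type RunCase = (provider: string, historyShape: string) => Promise<boolean>;

interface PassRate {
  passed: number;
  total: number;
}

// Assumed mapping onto OpenRouter's provider routing options: a single
// provider in `order` with fallbacks disabled pins the call to that provider.
function pinnedProviderOptions(provider: string): { order: string[]; allow_fallbacks: boolean } {
  return { order: [provider], allow_fallbacks: false };
}

async function evaluateProviders(
  providers: string[],
  historyShapes: string[],
  runsPerCase: number,
  runCase: RunCase,
): Promise<Map<string, PassRate>> {
  const results = new Map<string, PassRate>();
  for (const provider of providers) {
    const tally: PassRate = { passed: 0, total: 0 };
    for (const shape of historyShapes) {
      for (let run = 0; run < runsPerCase; run++) {
        // Runs are awaited one at a time, so calls never overlap.
        if (await runCase(provider, shape)) tally.passed++;
        tally.total++;
      }
    }
    results.set(provider, tally);
  }
  return results;
}
```

Keeping the prompt, schema and temperature fixed inside `runCase` means the provider is the only variable that changes between tallies.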
| Provider | Pass rate |
| --- | --- |
| Google AI Studio | 20 / 20 |
Google...