Same model, different provider, different structured output


Model Provider Variance in Structured Extraction · Guilherme Costa


TL;DR

We spent several hours debugging what looked like a prompt-engineering issue in a structured extraction pipeline. It turned out not to be a prompt issue.

The same OpenRouter model (google/gemini-3-flash-preview) was being routed to different upstream providers, and one provider consistently failed to extract certain fields when assistant history contained list-shaped content.

Practical lesson:

log the routed provider

treat provider identity as part of the request fingerprint

be careful with latency-based routing on structured-output workloads

For structured-output workloads, provider choice can be part of correctness, not just latency or cost.
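One way to read "provider identity as part of the request fingerprint" is sketched below. This is a minimal illustration, not the pipeline's actual code: the `ExtractionRequest` fields, `fingerprintKey`, and `requestFingerprint` are hypothetical names, and the hash is a plain 32-bit FNV-1a chosen only to keep the example dependency-free.

```typescript
// Hypothetical request descriptor; field names are illustrative assumptions.
interface ExtractionRequest {
  model: string;
  provider: string | null; // upstream provider reported by the router, if known
  promptVersion: string;
  schemaVersion: string;
  temperature: number;
}

// Canonical string for a request. The provider is included so that two calls
// that differ only in routing do not share a key.
function fingerprintKey(req: ExtractionRequest): string {
  return JSON.stringify([
    req.model,
    req.provider ?? "unknown",
    req.promptVersion,
    req.schemaVersion,
    req.temperature,
  ]);
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

function requestFingerprint(req: ExtractionRequest): string {
  return fnv1a(fingerprintKey(req));
}
```

Logging this value next to each extraction result makes it possible to group failures by routed provider rather than by model name alone.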

We run an LLM workflow that extracts structured constraints from conversation history and feeds them into a planning state machine.

One of the user flows kept failing. The conversation looked roughly like this:

User: "I want to go to Brazil in two weeks."

A: suggests destination cities.

User: "Let's go for Rio! We're two adults. No specific dates. Budget less than 5000 euros."

At that point the planner should advance toward itinerary generation, given that it had captured all necessary constraints: destinations, travellers, budget and date flexibility.

Instead, the extraction layer returned:

"destinations": [{ "type": "fixed", "options": ["Rio"] }],
"travellers": { "adults": 2 }

The budget and date flexibility were missing.

The state machine interpreted this as incomplete planning state and routed the conversation back into information gathering.
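The routing decision described above can be sketched as a simple completeness check. This is an assumption-laden illustration: the field names come from the extraction output shown in the post, but `PlanningState`, `missingFields`, `nextPhase`, and the phase names are hypothetical, and the real state machine is likely richer.

```typescript
// Hypothetical shape of the extracted state. Only the field names are taken
// from the post; the budget and flexible_dates value types are unknown here.
interface PlanningState {
  destinations?: { type: string; options: string[] }[];
  travellers?: { adults: number };
  budget?: unknown;
  flexible_dates?: unknown;
}

const REQUIRED_FIELDS = ["destinations", "travellers", "budget", "flexible_dates"] as const;
type RequiredField = (typeof REQUIRED_FIELDS)[number];

// Lists the required fields that are absent (undefined or null) in the state.
function missingFields(state: PlanningState): RequiredField[] {
  return REQUIRED_FIELDS.filter((f) => state[f] === undefined || state[f] === null);
}

// Advances only when nothing is missing; otherwise keeps gathering information.
function nextPhase(state: PlanningState): "gather_information" | "generate_itinerary" {
  return missingFields(state).length === 0 ? "generate_itinerary" : "gather_information";
}
```

With the output shown above, `missingFields` would report `budget` and `flexible_dates`, so a check of this kind sends the conversation back to information gathering even though the user stated both.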

Theory 1: prompt specificity. Our first assumption was that the prompt was probably not explicit enough. The extraction prompt mentioned budget mostly in negative examples ("don't infer budget from destination", etc.), so we added explicit extraction guidance:

Set budget whenever the user explicitly states a price expectation, whether numeric or qualitative.

No meaningful improvement.

Theory 2: schema shape. Maybe the optional-field schema was making it too easy for the model to emit a minimally valid object and stop early. We tried two schema changes:

Add a mandatory field-scan checklist, in case the model was skipping optional fields too quickly.

Add a wrapper schema forcing every field into either extracted or not_mentioned, so that no field could be silently omitted.
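The wrapper idea can be sketched as a type plus a structural check that every field lands in exactly one bucket. This is a hypothetical illustration: `WrappedExtraction` and `partitionErrors` are invented names, and the actual schema was presumably enforced through the model's structured-output mechanism rather than in application code.

```typescript
// Hypothetical wrapper shape: every known field must appear either as a key
// of `extracted` or as an entry in `not_mentioned`.
interface WrappedExtraction {
  extracted: Record<string, unknown>;
  not_mentioned: string[];
}

// Returns one message per field that is in both buckets or in neither.
function partitionErrors(out: WrappedExtraction, fields: readonly string[]): string[] {
  const errors: string[] = [];
  for (const field of fields) {
    const inExtracted = Object.prototype.hasOwnProperty.call(out.extracted, field);
    const inNotMentioned = out.not_mentioned.indexOf(field) !== -1;
    if (inExtracted && inNotMentioned) errors.push(`${field}: listed in both buckets`);
    if (!inExtracted && !inNotMentioned) errors.push(`${field}: not accounted for`);
  }
  return errors;
}
```

A check like this validates structure only. It cannot tell whether a field placed in `not_mentioned` was in fact stated by the user, so a structurally valid answer can still be wrong.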

Neither fixed it. The wrapper actually made things worse.

In some runs the model confidently classified explicitly stated fields as "not mentioned":

"extracted": {
  "destinations": [{ "type": "fixed", "options": ["Rio"] }],
  "travellers": { "adults": 2 }
},
"not_mentioned": [
  "budget",
  "flexible_dates"
]

Theory 3: conversation history. Maybe assistant history was contaminating extraction. Removing assistant history made the extraction succeed reliably. Reintroducing assistant history caused failures to return.

Initially we thought the numerical values in the assistant response were confusing the model, such as:

flight prices

EUR amounts

durations

But the pattern was more specific than that.

| Assistant history | Result |
| --- | --- |
| Full destination suggestions with prices | Fail |
| Trailing conversational question only | Pass |
| Numbered destination list without prices | Fail |
| No assistant history | Pass |

The failure now looked more correlated with the shape of the assistant message than with its size or numerical content: a long conversational paragraph passed, while a short numbered list failed.

We were close to implementing an architectural workaround and stripping suggestion lists before extraction, but first we reran the failing case several times to confirm the pattern. It did not hold. One run passed unexpectedly, then failed again, then passed. Something outside the prompt, schema and visible message history had to be changing.

After the prompt and schema theories failed, we started looking for anything that could differ between otherwise identical extraction calls. The application was using OpenRouter with this model:

model: "google/gemini-3-flash-preview"

and latency-based routing:

sort: "latency"

That meant two calls with the same model name, prompt, schema and temperature could still be routed to different upstream providers depending on real-time latency. We were not logging the routed provider (in retrospect, we should have been). We added a small helper:

export function extractProvider(result: unknown): string | null {
  // Read the provider name from either of two optional locations on the
  // result; return null if neither is present.
  const r = result as {
    providerMetadata?: { openrouter?: { providerName?: string } };
    response?: { body?: { provider?: string } };
  };

  return (
    r?.providerMetadata?.openrouter?.providerName ??
    r?.response?.body?.provider ??
    null
  );
}

Once provider information was visible in logs, the behavior became reproducible almost immediately.
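To compare providers deliberately rather than wait for the router to pick one, a request can be pinned to a single upstream. The sketch below assumes OpenRouter's provider-routing request fields (`provider.order`, `provider.allow_fallbacks`, `provider.sort`); check these names against the current documentation before relying on them. `ChatRequestBody`, `pinProvider`, and the provider slug passed in are hypothetical.

```typescript
// Assumed subset of OpenRouter's provider routing options.
interface ProviderRouting {
  order?: string[];
  allow_fallbacks?: boolean;
  sort?: "price" | "throughput" | "latency";
}

// Hypothetical minimal request body for a chat completion call.
interface ChatRequestBody {
  model: string;
  temperature: number;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  provider?: ProviderRouting;
}

// Returns a copy of the body that targets exactly one provider. The original
// routing options (including any `sort`) are replaced, and fallbacks are
// disabled so that a different provider is never substituted silently.
function pinProvider(body: ChatRequestBody, providerSlug: string): ChatRequestBody {
  return {
    ...body,
    provider: { order: [providerSlug], allow_fallbacks: false },
  };
}
```

With fallbacks disabled, a request to an unavailable provider would be expected to fail instead of being rerouted, which is the desired behavior for an evaluation but probably not for production traffic.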

We reran the evaluation:

5 runs per provider

multiple history shapes

temperature 0

identical prompt and schema
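An evaluation of this shape can be sketched as a loop over providers and history shapes. This is a hypothetical harness, not the one used in the post: `Extractor`, `isPass`, and `runEvaluation` are invented names, and the extractor is injected so the loop can run against a stub. With four history shapes and 5 runs each, the loop produces 20 runs per provider.

```typescript
// Injected function that performs one extraction against a pinned provider
// for a given history shape and returns the parsed object.
type Extractor = (provider: string, historyShape: string) => Promise<Record<string, unknown>>;

interface Tally {
  passed: number;
  total: number;
}

// A run passes when every required field is present in the result.
function isPass(result: Record<string, unknown>, requiredFields: string[]): boolean {
  return requiredFields.every((f) => result[f] !== undefined && result[f] !== null);
}

// Runs each (provider, history shape) case `runsPerCase` times, sequentially,
// and returns a pass tally per provider.
async function runEvaluation(
  providers: string[],
  historyShapes: string[],
  runsPerCase: number,
  extract: Extractor,
  requiredFields: string[],
): Promise<Map<string, Tally>> {
  const tallies = new Map<string, Tally>();
  for (const provider of providers) {
    const tally: Tally = { passed: 0, total: 0 };
    for (const shape of historyShapes) {
      for (let run = 0; run < runsPerCase; run++) {
        const result = await extract(provider, shape);
        tally.total += 1;
        if (isPass(result, requiredFields)) tally.passed += 1;
      }
    }
    tallies.set(provider, tally);
  }
  return tallies;
}
```

Repeating each case several times matters even at temperature 0, since a single pass or fail says little when results can vary between calls.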

| Provider | Pass rate |
| --- | --- |
| Google AI Studio | 20 / 20 |

Google...
