Same model, different provider, different structured output

Model Provider Variance in Structured Extraction · Guilherme Costa


TL;DR

We spent several hours debugging what looked like a prompt-engineering issue in a structured extraction pipeline. This turned out not to be a prompt issue.

The same OpenRouter model (google/gemini-3-flash-preview) was being routed to different upstream providers, and one provider consistently failed to extract certain fields when assistant history contained list-shaped content.

Practical lesson:

log the routed provider

treat provider identity as part of the request fingerprint

be careful with latency-based routing on structured-output workloads

For structured-output workloads, provider choice can be part of correctness, not just latency or cost.
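A minimal sketch of the second point: build the fingerprint from everything that can change the output, including the provider that actually served the call. The names `RequestFingerprint` and `fingerprintKey`, and the version labels, are illustrative assumptions rather than part of any library.

```typescript
// Everything that can change the output of an extraction call.
type RequestFingerprint = {
  model: string;
  provider: string | null; // routed upstream provider; null when not reported
  promptVersion: string; // hypothetical version label
  schemaVersion: string; // hypothetical version label
  temperature: number;
};

// Serializes the fields in a fixed order so equal fingerprints give equal keys.
function fingerprintKey(f: RequestFingerprint): string {
  return JSON.stringify([
    f.model,
    f.provider ?? "unknown",
    f.promptVersion,
    f.schemaVersion,
    f.temperature,
  ]);
}

const base = {
  model: "google/gemini-3-flash-preview",
  promptVersion: "v3",
  schemaVersion: "v2",
  temperature: 0,
};

const viaStudio = fingerprintKey({ ...base, provider: "Google AI Studio" });
const viaUnknown = fingerprintKey({ ...base, provider: null });

// Same model, prompt, schema and temperature, but two distinct fingerprints.
console.log(viaStudio === viaUnknown); // false
```

Keying logs and cached results on such a fingerprint makes it visible when two "identical" calls were in fact served by different upstreams.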

We run an LLM workflow that extracts structured constraints from conversation history and feeds them into a planning state machine.

One of the user flows kept failing. The conversation looked roughly like this:

User: "I want to go to Brazil in two weeks."

Assistant: suggests destination cities.

User: "Let's go for Rio! We're two adults. No specific dates. Budget less than 5000 euros."

At that point the planner should advance toward itinerary generation, given that it had captured all necessary constraints: destinations, travellers, budget and date flexibility.

Instead, the extraction layer returned:

{
  "destinations": [{ "type": "fixed", "options": ["Rio"] }],
  "travellers": { "adults": 2 }
}

The budget and date flexibility were missing.

The state machine interpreted this as incomplete planning state and routed the conversation back into information gathering.
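The gating behaviour described above can be sketched as a completeness check over the extracted constraints. The type and function names (`ExtractedConstraints`, `missingFields`, `nextState`) and the state names are hypothetical; only the field names follow the extraction output shown in this post.

```typescript
type Destination = { type: string; options: string[] };

// Every field is optional because extraction may legitimately omit it.
type ExtractedConstraints = {
  destinations?: Destination[];
  travellers?: { adults: number };
  budget?: unknown; // the budget shape is not shown in the post
  flexible_dates?: boolean;
};

const REQUIRED = ["destinations", "travellers", "budget", "flexible_dates"] as const;
type RequiredField = (typeof REQUIRED)[number];

// Returns the required fields that are absent from the extraction result.
function missingFields(c: ExtractedConstraints): RequiredField[] {
  return REQUIRED.filter((field) => c[field] === undefined || c[field] === null);
}

// Hypothetical state names for the planner.
function nextState(c: ExtractedConstraints): "generate_itinerary" | "gather_information" {
  return missingFields(c).length === 0 ? "generate_itinerary" : "gather_information";
}

// The failing extraction: budget and date flexibility are absent.
const failing: ExtractedConstraints = {
  destinations: [{ type: "fixed", options: ["Rio"] }],
  travellers: { adults: 2 },
};

console.log(missingFields(failing).join(", ")); // budget, flexible_dates
console.log(nextState(failing)); // gather_information
```

A check like this cannot tell a field the user never mentioned from one the extractor dropped, which is why the failure surfaced as a routing problem rather than an error.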

Theory 1: prompt specificity

Our first assumption was that the prompt was not explicit enough. The extraction prompt mentioned budget mostly in negative examples ("don't infer budget from destination", etc.), so we added explicit extraction guidance:

Set budget whenever the user explicitly states a price expectation, whether numeric or qualitative.

No meaningful improvement.

Theory 2: schema shape

Maybe the optional-field schema was making it too easy for the model to emit a minimally valid object and stop early. We tried two schema changes:

Add a mandatory field-scan checklist, in case the model was skipping optional fields too quickly.

Add a wrapper schema forcing every field into either extracted or not_mentioned, so that no field could be omitted silently.

Neither fixed it. The wrapper actually made things worse.

In some runs the model confidently classified explicitly stated fields as "not mentioned":

{
  "extracted": {
    "destinations": [{ "type": "fixed", "options": ["Rio"] }],
    "travellers": { "adults": 2 }
  },
  "not_mentioned": ["budget", "flexible_dates"]
}
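One cheap guard for a wrapper like this is a consistency check: every known field should land in exactly one of extracted or not_mentioned. The sketch below assumes the wrapper shape shown above; the function name `checkWrapper` and the field list are illustrative. It only catches structural contradictions, so it passes the very output that was wrong here.

```typescript
const FIELDS = ["destinations", "travellers", "budget", "flexible_dates"] as const;
type Field = (typeof FIELDS)[number];

type Wrapper = {
  extracted: Partial<Record<Field, unknown>>;
  not_mentioned: string[];
};

// Returns a list of human-readable problems; an empty list means consistent.
function checkWrapper(w: Wrapper): string[] {
  const problems: string[] = [];
  for (const field of FIELDS) {
    const isExtracted = w.extracted[field] !== undefined;
    const isNotMentioned = w.not_mentioned.indexOf(field) !== -1;
    if (isExtracted && isNotMentioned) problems.push(`${field}: in both buckets`);
    if (!isExtracted && !isNotMentioned) problems.push(`${field}: in neither bucket`);
  }
  return problems;
}

// The output observed in the failing runs.
const observed: Wrapper = {
  extracted: {
    destinations: [{ type: "fixed", options: ["Rio"] }],
    travellers: { adults: 2 },
  },
  not_mentioned: ["budget", "flexible_dates"],
};

// Structurally consistent, even though the user stated both missing fields.
console.log(checkWrapper(observed).length); // 0
```

Schema-level validation therefore confirms shape, not correctness: the wrapper turned a silent omission into an explicit but false claim.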

Theory 3: conversation history

Maybe assistant history was contaminating extraction. Removing assistant history made the extraction succeed reliably. Reintroducing it caused the failures to return.

Initially we thought the model was being confused by numerical values in the assistant response, such as:

flight prices

EUR amounts

durations

But the pattern was more specific than that.

Assistant history | Result
Full destination suggestions with prices | Fail
Trailing conversational question only | Pass
Numbered destination list without prices | Fail
No assistant history | Pass

The failure now looked more correlated with the shape of the assistant message than with its size or numerical content: a long conversational paragraph passed, while a short numbered list failed.

We were close to implementing an architectural workaround, stripping suggestion lists before extraction, but first we reran the failing case several times to confirm the pattern. It did not hold. One run passed unexpectedly, then failed again, then passed. Something outside the prompt, schema and visible message history had to be changing.

After the prompt and schema theories failed, we started looking for anything that could differ between otherwise identical extraction calls. The application was using OpenRouter with this model:

model: "google/gemini-3-flash-preview"

and latency-based routing:

sort: "latency"

That meant two calls with the same model name, prompt, schema and temperature could still be routed to different upstream providers depending on real-time latency. We were not logging the routed provider (in retrospect, we should have been). We added a small helper:

export function extractProvider(result: unknown): string | null {
  // Check both locations where the routed provider name may appear.
  const r = result as {
    providerMetadata?: { openrouter?: { providerName?: string } };
    response?: { body?: { provider?: string } };
  };

  return (
    r?.providerMetadata?.openrouter?.providerName ??
    r?.response?.body?.provider ??
    null
  );
}

Once provider information was visible in logs, the behavior became reproducible almost immediately.
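With the provider visible, a case can be rerun against one upstream at a time. OpenRouter's request body accepts a provider object for routing preferences; the sketch below contrasts latency-sorted routing with a pinned provider and fallbacks disabled. The helper `buildRequestBody` and the slug `google-ai-studio` are assumptions for illustration, so check the slug against OpenRouter's provider list before relying on it.

```typescript
// Routing preferences; only the fields used in this sketch are modelled.
type ProviderPrefs = {
  sort?: "price" | "throughput" | "latency";
  order?: string[];
  allow_fallbacks?: boolean;
};

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

type RequestBody = {
  model: string;
  temperature: number;
  messages: ChatMessage[];
  provider: ProviderPrefs;
};

// Hypothetical helper: builds the JSON body for a chat-completions call.
function buildRequestBody(messages: ChatMessage[], provider: ProviderPrefs): RequestBody {
  return { model: "google/gemini-3-flash-preview", temperature: 0, messages, provider };
}

const messages: ChatMessage[] = [
  { role: "user", content: "I want to go to Brazil in two weeks." },
];

// Latency-sorted routing: the upstream provider may change between calls.
const latencyRouted = buildRequestBody(messages, { sort: "latency" });

// Pinned routing: one provider, no silent fallback to another.
// "google-ai-studio" is an assumed slug, used here for illustration only.
const pinned = buildRequestBody(messages, {
  order: ["google-ai-studio"],
  allow_fallbacks: false,
});

console.log(JSON.stringify(latencyRouted.provider));
console.log(JSON.stringify(pinned.provider));
```

Disabling fallbacks trades availability for determinism: a pinned call fails outright when that provider is down instead of quietly switching upstreams.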

We reran the evaluation:

5 runs per provider

multiple history shapes

temperature 0

identical prompt and schema
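A sketch of how such an evaluation might be tallied: each run records the routed provider alongside the pass/fail outcome, and results are grouped by provider. The record shape and the name `passRateByProvider` are hypothetical, and the sample records are made up to exercise the function; they are not measurements from this evaluation.

```typescript
type RunRecord = { provider: string; historyShape: string; passed: boolean };
type Tally = { passes: number; total: number };

// Groups run records by routed provider and counts passes per group.
function passRateByProvider(records: RunRecord[]): Map<string, Tally> {
  const tallies = new Map<string, Tally>();
  for (const r of records) {
    const t = tallies.get(r.provider) ?? { passes: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.passes += 1;
    tallies.set(r.provider, t);
  }
  return tallies;
}

// Placeholder provider names and outcomes, for illustration only.
const sample: RunRecord[] = [
  { provider: "provider-a", historyShape: "numbered-list", passed: true },
  { provider: "provider-a", historyShape: "no-history", passed: true },
  { provider: "provider-b", historyShape: "numbered-list", passed: false },
  { provider: "provider-b", historyShape: "no-history", passed: true },
];

const tallies = passRateByProvider(sample);
tallies.forEach((t, provider) => {
  console.log(`${provider}: ${t.passes} / ${t.total}`);
});
```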

Provider Pass rate

Google AI Studio 20 / 20

Google...
