Offline ablation predicted −0.19pp. Production delivered +1.11pp


The number we had was −0.19pp. Four seeds, same validation cohort as production, the same harness methodology that had already confirmed two previous findings. By every standard we had set for ourselves, it was a ship.

We merged it and ran the production retrain. The candidate came back at +1.11pp against the 14.09% incumbent. Not −0.19. Plus 1.11. A gap of 1.30 percentage points between what the offline harness predicted and what we actually got in production.

| Experiment | Offline Prediction | Production Result | Root Cause |
| --- | --- | --- | --- |
| Best Offer feature | Slight improvement | +0.12pp regression | Train/serve skew |
| Auction data backfill | Roughly neutral | +0.37pp regression | Unmeasured distribution shift |
| Outlier trimming | −0.19pp improvement | +1.11pp regression | Training population shift |
| CatBoost encoder | −0.199pp improvement | ~0 (noise) | Baseline instability |

Four different experiments. Four different failure modes. The same outcome: the offline evaluation was confidently wrong.

The context

In the last post we showed a feature that ranked first by gain importance across every seed and quantile and still degraded the model out of sample. The lesson was that gain-based importance is a training metric, and near the noise floor it will happily rank first a feature that fits training-label variance without generalizing any of it.

The immediate response to that lesson is an offline ablation. Instead of trusting importance rankings, retrain with and without the thing under test on a held-out split and measure the delta directly. This is what we did to kill the encoder from the first post. It worked cleanly there: offline said −0.28pp, we removed it, confirmed in production, done.
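The ablation loop described above can be sketched roughly as follows. This is a minimal illustration, not our harness: `fit_predict`, `group_mean_model`, the `price` column, and the toy rows are all hypothetical stand-ins, and the real pipeline trains a gradient-boosted model rather than a group mean. The sign convention matches the post: a negative delta means the feature lowers MAPE.

```python
import random
import statistics


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * statistics.fmean(
        abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)
    )


def group_mean_model(train, valid, features):
    """Toy stand-in model (hypothetical): predict the mean training price
    of rows that share the same feature values."""
    groups = {}
    for row in train:
        key = tuple(row[f] for f in features)
        groups.setdefault(key, []).append(row["price"])
    fallback = statistics.fmean(row["price"] for row in train)
    preds = []
    for row in valid:
        key = tuple(row[f] for f in features)
        preds.append(statistics.fmean(groups[key]) if key in groups else fallback)
    return preds


def ablation_delta(fit_predict, rows, feature, seeds, holdout_frac=0.2):
    """Per-seed MAPE delta in pp: (with feature) minus (without feature)."""
    all_features = sorted(k for k in rows[0] if k != "price")
    without = [f for f in all_features if f != feature]
    deltas = []
    for seed in seeds:
        shuffled = rows[:]
        random.Random(seed).shuffle(shuffled)  # same split for both arms
        cut = int(len(shuffled) * (1 - holdout_frac))
        train, valid = shuffled[:cut], shuffled[cut:]
        y = [row["price"] for row in valid]
        with_mape = mape(y, fit_predict(train, valid, all_features))
        without_mape = mape(y, fit_predict(train, valid, without))
        deltas.append(with_mape - without_mape)
    return deltas


# Toy data (made up): price depends on brand and on a binary flag.
base = {"A": 100.0, "B": 250.0}
rows = [
    {"brand": b, "flag": f, "price": base[b] * (1.2 if f else 1.0)}
    for b in base
    for f in (0, 1)
    for _ in range(25)
]
deltas = ablation_delta(group_mean_model, rows, "flag", seeds=[0, 1, 2, 3])
print(deltas, statistics.fmean(deltas))
```

Note that both arms are trained and scored on splits of the same historical rows. That is exactly the property that makes the number reproducible across seeds, and also the property that lets it be wrong in the ways described below.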

So we kept using it for everything. Offline ablation became the tool we reached for whenever we wanted to know if something was worth shipping. It is a reasonable tool. It is also the tool that told us, confidently and reproducibly, that a massive regression was an improvement.

The model is currently at 13.4% MAPE, down from 14.27% when the first post was written. Offline ablation correctly identified several successful improvements. The four experiments below were not among them. All four were blocked before they reached users, which is the only reason the number moved in the right direction rather than the wrong one.

Four experiments the harness got wrong

Each of these followed the same loop: a plausible hypothesis, an offline ablation that said "this looks fine" or "this is better," and a production retrain or follow-up measurement that said something materially different.

1. Best Offer as a training feature

A dominant fraction of our eBay sold comps close via accepted Best Offer rather than at the listing price. Those transactions carry a systematic price premium over standard fixed-price sales. An is_best_offer binary flag for historical sold rows looked like free signal: real market information that our feature matrix was not capturing.

The offline ablation said the feature was neutral to slightly positive. The production retrain came back at +0.12pp. Small, but in the wrong direction.

The cause is obvious in hindsight and is the cleanest example of a whole category of failures: train/serve skew. Our historical sold listings carry the flag because we have that information for completed transactions. But our active listings (the things the model is actually scoring at inference time) hardcode it to zero because a listing's offer status is unknown until the transaction clears. We had trained the model on a feature it would never observe at serve time.

The offline harness never caught this, and structurally it cannot. Both its training set and its validation split come from the same historical sold data, where the flag is populated everywhere. The skew is invisible to any evaluation that stays inside historical data. You only see it when the model meets live listings, where the feature never takes the distribution it was trained on.
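A check for this kind of skew has to compare training rows against what the model sees at inference, not against another slice of history. A minimal sketch, assuming rows are plain dicts: the function name `constant_at_serve`, the `condition` column, and the sample rows are hypothetical, and only `is_best_offer` comes from our actual feature set.

```python
def constant_at_serve(train_rows, serve_rows, features):
    """Return features that vary in training but take a single value
    across all serve-time rows (a strong hint of train/serve skew)."""
    flagged = []
    for feature in features:
        train_values = {row[feature] for row in train_rows}
        serve_values = {row[feature] for row in serve_rows}
        if len(train_values) > 1 and len(serve_values) == 1:
            flagged.append(feature)
    return flagged


# Historical sold rows: the flag is known and varies (illustrative data).
sold = [
    {"is_best_offer": 1, "condition": "used"},
    {"is_best_offer": 0, "condition": "new"},
    {"is_best_offer": 1, "condition": "new"},
]
# Active listings at inference time: the flag is hardcoded to zero.
active = [
    {"is_best_offer": 0, "condition": "used"},
    {"is_best_offer": 0, "condition": "new"},
]

skewed = constant_at_serve(sold, active, ["is_best_offer", "condition"])
print(skewed)
```

This only catches the degenerate case where a feature collapses to a constant at serve time. Subtler skew, where the serve-time distribution is shifted rather than collapsed, would need a per-feature distribution comparison instead.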

2. An auction house, parsed and backfilled

Our auction house ingestion services had a regex bug in the title parser that was killing the brand extraction step. We fixed the regex, backfilled the correction, and added roughly 1,000 newly verified auction hammer prices to the training set (a 19× jump in high-confidence matches from a source that had been nearly empty).

More real, accurate data should strictly improve the model. The offline ablation was ambiguous: roughly flat with a slight downward drift in some seeds. We treated that as a green light.

Production retrain came back at +0.37pp. Reverted.

The data was correct. The mechanism was distribution shift that the offline harness did not simulate. Auction houses systematically clear lower than retail marketplaces for the same reference. We verified this after the fact: the median auction house hammer price for identical references was roughly 15% below equivalent eBay or dealer transactions. Injecting...
