Benchmarking four open-source geo-experiment tools against known ground truth

Open-Source Geo-Experiment Tools — A Head-to-Head Simulation Study | Recast


CausalPy vs. Google Matched Markets vs. Meta GeoLift vs. Google Causal Impact: A Head-to-Head Simulation Study

We compared four leading open-source geo-experimentation tools (CausalPy, Google Matched Markets, Google Causal Impact, and Meta GeoLift) by running each one on thousands of simulated datasets where the true incremental campaign effect is known in advance. Throughout this report, "effect" means this incremental lift: the sales a campaign genuinely causes, beyond what would have happened anyway.


In this study, every tool sees the same data and the same treatment and control markets, and every tool is scored against the same ground truth. The only thing that differs is the tool doing the analysis.

Each simulated dataset consists of daily sales across many markets, ending with a two-week test period. In that window, one market gets either a 7.5% incremental sales lift (to see whether each tool detects a real effect) or no lift at all (to see how often each tool declares a false win). Each market behaves like a real one: it has its own size, grows slowly over time, has good and bad days of the week, and experiences autocorrelated random noise, where a rough patch tends to bleed into the next day rather than resetting completely.
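The data-generating process described above can be sketched in a few lines. This is a minimal illustration, not the study's simulator: the function name `simulate_market` is hypothetical, and every parameter value other than the 7.5% lift, the two-week test window, and the 90-day pre-period (base level, growth rate, day-of-week multipliers, autocorrelation `rho`, noise scale) is an assumption chosen for readability.

```python
import random


def simulate_market(n_pre=90, n_test=14, lift=0.075, base=1000.0,
                    daily_growth=0.0005, rho=0.6, noise_sd=0.03, seed=0):
    """Return (daily_sales, campaign_start_index) for one simulated market.

    All defaults except lift, n_test, and n_pre are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Assumed day-of-week multipliers; they average to 1.0.
    dow = [0.90, 0.95, 1.00, 1.00, 1.05, 1.10, 1.00]
    sales, shock = [], 0.0
    for t in range(n_pre + n_test):
        # AR(1) noise: a fraction rho of yesterday's shock carries into today.
        shock = rho * shock + rng.gauss(0.0, noise_sd)
        level = base * (1.0 + daily_growth) ** t * dow[t % 7] * (1.0 + shock)
        if t >= n_pre:
            # The incremental lift applies only inside the test window.
            level *= 1.0 + lift
        sales.append(level)
    return sales, n_pre


# Same seed with and without lift: the only difference is the campaign effect.
treated, start = simulate_market(lift=0.075, seed=1)
placebo, _ = simulate_market(lift=0.0, seed=1)
true_ratio = sum(treated[start:]) / sum(placebo[start:])  # 1.075 up to rounding
```

Reusing the seed makes the ground truth explicit: the pre-period series are identical, and the test-window totals differ by exactly the injected lift.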

For the full methodology and technical details, see the Appendix.

Scenarios: Four operating conditions practitioners actually face.

We tested the tools across four scenarios. Each represents a real-world operating condition you'll meet in the course of running an experimentation program.

Baseline

This is clean, well-behaved data with one treated geo and a diverse pool of 20 control markets.

Outlier market

The treated market is 5x larger than the median city in the control pool. In other words, what happens when your CMO insists on testing in New York?

Small control pool

There are only 9 control markets instead of 20, a common issue for companies operating in smaller markets.

Short calibration window

This scenario has 30 days of pre-treatment data instead of 90. It is another common situation: business pressure forces a brand to launch an experiment before it has collected sufficient pre-experiment data.

Figure 1. Simulated market data for each scenario, shown with no campaign effect applied. Gray lines represent individual control markets. Blue: control average. Coral: treated market. Dashed vertical line: campaign start.
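The four scenarios differ from the baseline in one setting each, which can be written down as a small configuration table. The sketch below is an assumption about how such a grid might be encoded, not the study's code: the `Scenario` class and its field names are hypothetical, and a treated-market size multiple of 1.0 for the non-outlier scenarios is assumed. The per-cell run count of 1,000 is taken from the Figure 2 caption; reading it as 1,000 runs per tool, scenario, and effect condition is an inference that happens to match the 32,000 model fits cited in the key findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    n_controls: int = 20                 # size of the control pool
    pre_days: int = 90                   # length of the pre-treatment window
    treated_size_multiple: float = 1.0   # treated size vs. median control (assumed default)


SCENARIOS = [
    Scenario("Baseline"),
    Scenario("Outlier market", treated_size_multiple=5.0),
    Scenario("Small control pool", n_controls=9),
    Scenario("Short calibration window", pre_days=30),
]

TOOLS = ["CausalPy", "Google Matched Markets", "Meta GeoLift", "Google Causal Impact"]
EFFECTS = [0.0, 0.075]   # no lift vs. 7.5% incremental lift
RUNS_PER_CELL = 1000     # assumed to apply per tool x scenario x effect

total_fits = len(TOOLS) * len(SCENARIOS) * len(EFFECTS) * RUNS_PER_CELL
print(total_fits)  # 32000
```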

Across these scenarios, no tool escaped the tradeoff between false positives and false negatives. The interesting question is which tradeoff each one made.

Key findings: What we learned across 32,000 model fits.

Finding 01

There's no free lunch — every tool forces a tradeoff between two kinds of mistakes.

Every tool in this study forces a choice between two errors: false positives (declaring a winning experiment that isn't real) and false negatives (declaring an experiment inconclusive when a real incremental effect exists). No tool escapes the tradeoff. Which mistake is more expensive for your business is the question that should drive which tool you pick.

Figure 2. Uncertainty intervals per tool × scenario, 7.5% effect condition, showing 50 iterations from the 1,000 simulation runs. The dashed vertical line marks the true ATT (average treatment effect on the treated) of 7.5%. Green = the CI contains 7.5%; red = it misses. The percentage in each subplot's top-right corner is the empirical coverage on those 50 bars (green if ≥ 90%, orange if 70–89%, red below that).

When a tool provides a confidence interval — for example, "the incremental lift is between 5.2% and 10.1% with 95% confidence" — that interval is supposed to contain the true answer 95% of the time. Keeping that promise comes at a cost. A tool can produce wide intervals that reliably contain the true lift. But if those intervals become so wide that they contain zero, the experiment can't declare a statistically significant effect, even when a real one exists.

On the other hand, a tool can produce tighter, more decisive intervals. But those intervals will miss the true incremental lift more often and generate more false alarms. No tool in this study escapes this tradeoff.
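The two error rates fall directly out of the intervals a tool reports. As a rough sketch (the function `summarize` and the intervals below are made up for illustration and are not results from this study), coverage is the share of intervals that contain the true lift, and a run counts as "significant" when its interval excludes zero. Under the 7.5% condition the significant share is the detection rate; under the no-lift condition it is the false positive rate.

```python
def summarize(intervals, true_lift):
    """intervals: list of (lower, upper) lift bounds, one per simulated run."""
    n = len(intervals)
    # Coverage: how often the interval contains the true lift.
    covered = sum(lo <= true_lift <= hi for lo, hi in intervals)
    # Significant: the interval excludes zero on either side.
    significant = sum(lo > 0 or hi < 0 for lo, hi in intervals)
    return {"coverage": covered / n, "significant_rate": significant / n}


# Hand-made intervals for illustration only (not study output).
effect_runs = [(0.052, 0.101), (-0.02, 0.16), (0.08, 0.12), (0.01, 0.09)]
null_runs = [(-0.03, 0.04), (0.005, 0.06), (-0.05, 0.02), (-0.01, 0.03)]

print(summarize(effect_runs, 0.075))  # {'coverage': 0.75, 'significant_rate': 0.75}
print(summarize(null_runs, 0.0))      # {'coverage': 0.75, 'significant_rate': 0.25}
```

The second interval in `effect_runs` shows the tradeoff in miniature: it is wide enough to contain the true 7.5% lift, but it also contains zero, so the run counts toward coverage and still fails to detect the effect.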

Meta GeoLift

Meta GeoLift is the strongest performer on three of the four metrics under consideration: its coverage is closest to the 95% target (92–95%), it has the lowest false positive rate (3–5%), and its point estimates are closest to the true incremental lift in most scenarios. The tradeoff is in its ability to detect real incremental effects: Meta GeoLift's confidence intervals are wide enough that they frequently contain both the true effect and zero at the same time. This conservatism can become a limitation for practitioners making decisions with...
