CompletionKit: test and improve your AI app
Open source · Hosted Cloud, standalone app, or Rails engine
Test your AI app properly. Improve it with confidence.
Every change to your AI is a guess until you test it. CompletionKit runs it against real inputs, scores the outputs, and shows you what improved and what broke before it ships.
Start free on Cloud →
View on GitHub
Free + self-hostable · OpenAI · Anthropic · Ollama · 100+ via OpenRouter
Read the docs · Browse the blog · API reference
runs / anatomy of a run
support-reply · v4 · claude-haiku-4-5 · ✓ complete · 200 / 200 scored
avg 4.3/5 · ↑ +0.5 vs v3
input | output | score
order #1421 lost in transit | I'm so sorry your order didn't arrive, that's frustrating… | 4.6
refund denied, want to escalate | I hear you. Let me walk through why this was flagged… | 4.2
delivery delayed by 4 days | Four days is a long wait. Here's where your package is… | 4.8
account locked after password reset | Let's get you back in. I can verify by email… | 4.1
charged twice for the same order | I see both charges, refunding the duplicate now… | 4.5
product arrived damaged | That shouldn't happen. I'll send a replacement today… | 3.9
+ 194 more rows
metrics: empathy · clarity · action · policy
Publish v4 · Suggest improvements
01 The problem
You changed the prompt. It feels better. You need to know.
"Seems better" isn't a metric you can ship behind — but without the right tools, it's usually all you've got.
You're shipping on vibes.
Without numbers, "looks right" is the bar — and "looks right" depends entirely on which inputs you happened to test.
Prompts drift, silently.
A one-line tweak to a string literal doesn't show up in review as "prompt change." Behavior shifts in prod, and nobody can point to when or why.
Every fix risks a quiet regression.
v4 nails the case that prompted the change. The inputs you didn't think to retest? They might be worse — and you'd never know.
02 How a run works
Four moves. Then you have evidence.
A run is the unit of work — reproducible by prompt, dataset, model, metrics.
1. Write a prompt. Write a template with {{vars}} and upload a CSV of real inputs. CompletionKit merges them, one prompt per row.
2. Run a model. OpenAI, Anthropic, Ollama, or any of 100+ via OpenRouter. Same dataset, any number of models in parallel.
3. Score against your metrics. An LLM judge scores each output 1–5 on the metrics you define. Empathy, clarity, policy: whatever good means for you.
4. Iterate. Re-run. Edit the prompt and it forks a new version. Or ask CompletionKit to suggest one, grounded in the judge's feedback on your data.
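To make the merge step concrete, here is a minimal Python sketch of filling a {{vars}} template from CSV rows, one prompt per row. It is an illustration only, not CompletionKit's implementation: the `render` and `merge` helpers, the `ticket` column, and the choice to raise on a missing column are all assumptions.

```python
import csv
import io
import re

# Hypothetical helpers for illustration; not CompletionKit's actual code.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, row: dict[str, str]) -> str:
    """Replace each {{name}} in the template with the matching CSV column value."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            # Assumption: a variable with no matching column is an error.
            raise KeyError(f"template variable {name!r} has no CSV column")
        return row[name]

    return PLACEHOLDER.sub(substitute, template)


def merge(template: str, csv_text: str) -> list[str]:
    """Return one rendered prompt per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [render(template, row) for row in reader]


template = "Reply to this support ticket: {{ticket}}"
csv_text = "ticket\norder #1421 lost in transit\ndelivery delayed by 4 days\n"
prompts = merge(template, csv_text)
```

Each entry in `prompts` is what would be sent to the model for one row of the dataset.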
Got outputs already — production logs, another model, hand-curated examples? Skip steps 1 and 2: a judge-only run grades any column in your dataset against your metrics, no generation needed.
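A judge-only run can be pictured as the sketch below: take an existing column of outputs, score each one 1–5 per metric, then average. This is a sketch under stated assumptions, not the product's API. The `Judge` signature, the `stub_judge` stand-in for a real LLM call, and averaging metrics into a single row score are all hypothetical.

```python
from statistics import mean
from typing import Callable

# Hypothetical signature: (output_text, metric_name) -> integer score from 1 to 5.
Judge = Callable[[str, str], int]


def judge_only_run(
    rows: list[dict[str, str]], column: str, metrics: list[str], judge: Judge
) -> list[dict]:
    """Score an existing column of outputs against each metric; no generation."""
    scored = []
    for row in rows:
        output = row[column]
        scores = {metric: judge(output, metric) for metric in metrics}
        for metric, score in scores.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{metric} score {score} is outside 1-5")
        # Assumption: the row score is the plain mean of its metric scores.
        scored.append(
            {"output": output, "scores": scores, "score": mean(list(scores.values()))}
        )
    return scored


def run_average(scored: list[dict]) -> float:
    """Average the per-row scores into one number for the run."""
    return mean([row["score"] for row in scored])


def stub_judge(output: str, metric: str) -> int:
    # Placeholder for an LLM judge call; a real judge would prompt a model.
    return 5 if "sorry" in output.lower() else 3


rows = [
    {"reply": "I'm so sorry your order didn't arrive."},
    {"reply": "Your package is in transit."},
]
scored = judge_only_run(rows, "reply", ["empathy", "clarity"], stub_judge)
```

Swapping `stub_judge` for a function that calls a model is the only change needed to try this against real outputs.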
03 The unlock
Point your coding agent at CompletionKit. Walk away.
REST, MCP, the standalone Bench — same shape. Plug Claude Code or Cursor into the MCP server and tell it to make a prompt better. It does.
THE LOOP YOUR CODING AGENT RUNS
Revise the prompt
Run it on your data
Judge scores it
Inspect the misses
until it clears your bar
Plug Claude Code or Cursor into the MCP server and it runs this loop on its own, around and around, until the prompt clears the bar you set.
You set the bar.
Define the metrics and the score a prompt has to clear. That judgment is yours; it's the one part of the loop the agent can't do for you.
The agent runs the loop.
Point Claude Code or Cursor at the MCP server. It revises the prompt, runs it on your dataset, and re-scores, pass after pass, with no babysitting.
You approve the result.
It stops at the first version that clears your bar and hands it back. Review the diff and ship it, or send it round again.
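The loop described above can be sketched in a few lines of Python. This is a hedged illustration, not how the MCP server or any agent is actually implemented: `improve_until`, `Attempt`, the stub `evaluate` and `revise` functions, and the `max_passes` safety cap are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    version: int
    prompt: str
    score: float


def improve_until(
    prompt: str,
    evaluate: Callable[[str], float],     # runs the prompt on the dataset, returns avg score
    revise: Callable[[str, float], str],  # proposes a new prompt from the last result
    bar: float,                           # the score you decided a prompt must clear
    max_passes: int = 10,                 # assumption: a cap so the loop always terminates
) -> list[Attempt]:
    """Revise, run, and re-score until a version clears the bar or passes run out."""
    history: list[Attempt] = []
    for version in range(1, max_passes + 1):
        score = evaluate(prompt)
        history.append(Attempt(version, prompt, score))
        if score >= bar:
            break  # stop at the first version that clears the bar
        prompt = revise(prompt, score)
    return history


# Stubs standing in for a real run and a real agent revision.
def evaluate(prompt: str) -> float:
    return min(5.0, 3.5 + 0.25 * prompt.count("[rev]"))


def revise(prompt: str, score: float) -> str:
    return prompt + " [rev]"


history = improve_until("Reply to the ticket.", evaluate, revise, bar=4.0)
```

The last entry in `history` is the version handed back for review; the earlier entries are the record of what was tried and how it scored.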
04 Compared
What you actually get, line by line.
Every tool here is doing real work. CompletionKit is the shape we kept needing.
OpenAI Evals · Workbench · Braintrust · Langfuse · Promptfoo · CompletionKit
Multi-provider
Local models (Ollama)
Custom scoring metrics
Partial
Partial
AI suggestions from your data
Generic
Versioned prompts via API
MCP server
Free + self-hostable
Partial
Partial
05 Three ways to run it
Same product. Three ways to ship it.
Cloud is the fastest start. The standalone app is for self-hosting on your own infra. The engine is a Ruby gem you mount into an existing Rails app. Same code underneath — pick the deployment that fits.
Fastest start
Cloud
Hosted by us · ready now
Sign up, paste your provider keys, run your first eval in under five minutes.
Free tier: 20 runs / month
Bring your own provider keys
Team workspaces and roles
Start free →
Self-host
Standalone app
Bundled Rails app · deploy it yourself
Same webapp, your infra. Clone the repo, point at a Postgres, run web + worker.
No multi-tenancy, no phone-home
Provider keys via env or Settings
Source-available · BSL 1.1
View on GitHub
Embed
Rails engine
Mount in your existing Rails 8...