How to Tell If Your Kafka Self-Service Is Working?

by Stéphane Derosiaux | Conduktor | Jun, 2026 | Medium


Conduktor

Stories from our engineering team to build the best collaborative development platform for Apache Kafka

How to Tell If Your Kafka Self-Service Is Actually Working?

How can you tell whether developers are well served by your self-service system, or are just retrying until the PR goes green?

Stéphane Derosiaux

5 min read

Kafka and GitOps

Most platform teams will tell you their Kafka self-service is working. The evidence they point to is almost always the same: developers can create their own topics and connectors, guardrails are in place, and bad configs get blocked before they reach production. The cluster looks sane, it’s governed, it has self-service, so it must be working great!

Not necessarily. “Nothing broke” and “our GitHub process rejects bad PRs” are not measures of whether self-service works. They measure whether your enforcement works, which is a different and much easier thing. Protecting the castle is great, but a business needs adoption and less friction so that teams can do more.

Real self-service has a job: let a developer who isn’t a Kafka expert provision a correct resource, quickly, without pulling the platform team in. You can have strict guardrails and still fail at that completely. Let’s see how to tell.

Let’s agree on what “self-serve works” means

Self-service is working when three things are true at once:

- developers can provision what they need without filing a ticket
- the result is correct on the first try
- the platform team’s involvement per request trends toward zero

If any of those is not true, you don’t have self-service; you have a slower ticket queue with extra steps.

Why? Enforcement and guidance are two separate pieces of the system. Enforcement is easy to build; guidance is harder. A guardrail (a policy, a CI check, a rejected pull request, a naming-convention regex) answers one question: is this allowed? That’s useful, but it’s the easy half. Helping developers arrive at yes (the right partition count, the right cleanup policy, the right replica settings) is far more difficult.

So the question for measurement becomes: are your developers reaching the right answer on their own, or are they retrying until the gate stops rejecting them?
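To make the enforcement/guidance split concrete, here is a minimal sketch. The naming pattern, the partition limit, and the suggested defaults are all hypothetical values chosen for illustration, not rules from any real platform.

```python
import re

# Hypothetical policy values, for illustration only.
NAME_PATTERN = re.compile(r"^[a-z]+\.[a-z0-9-]+\.v\d+$")
MAX_PARTITIONS = 48


def is_allowed(name: str, partitions: int) -> bool:
    """Enforcement: answers only 'is this allowed?'."""
    return bool(NAME_PATTERN.match(name)) and 1 <= partitions <= MAX_PARTITIONS


def suggest(team: str, stream: str) -> dict:
    """Guidance: returns a request that is built to pass the gate."""
    return {"name": f"{team}.{stream}.v1", "partitions": 6}


# A developer guessing gets a bare "no" and learns little from it.
guess_passes = is_allowed("Orders", 500)

# A developer starting from guidance arrives at "yes" directly.
suggestion = suggest("payments", "orders")
suggestion_passes = is_allowed(suggestion["name"], suggestion["partitions"])
```

The gate is the same in both cases. What changes is whether the developer had to discover the rules by being rejected.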

KR1: Lead time from request to working resource

Measure the time it takes to get from “I need a topic” to “I have a correct, running topic.” In working self-service this is minutes. If it’s hours or days, the gate is technically open but the developer is stuck in front of it:

- reading docs
- asking in Slack
- fitting configs against your policy by trial and error until something passes

A “self-service enabled” platform with a multi-day lead time means you’ve automated the rejection, not the provisioning.
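One way to compute this lead time, sketched under the assumption that you can export a first-request timestamp and a "correct and running" timestamp per resource. The record shape and the sample data below are invented for illustration.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (resource, first request, correct and running).
requests = [
    ("orders", "2026-06-01T09:00", "2026-06-01T09:04"),
    ("payments", "2026-06-01T10:00", "2026-06-03T16:30"),
    ("audit", "2026-06-02T14:00", "2026-06-02T14:10"),
]


def lead_time_minutes(requested: str, ready: str) -> float:
    """Minutes between the first request and a working resource."""
    delta = datetime.fromisoformat(ready) - datetime.fromisoformat(requested)
    return delta.total_seconds() / 60


lead_times = sorted(lead_time_minutes(req, ready) for _, req, ready in requests)
median_minutes = median(lead_times)
slowest_minutes = lead_times[-1]
```

Looking at the slowest requests alongside the median helps here: a healthy median can hide a few multi-day requests where a developer was stuck in front of the gate.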

KR2: Gate rejection rate

Track what fraction of create/update attempts your policies reject. Counterintuitively, a high rejection rate is bad news, not proof that your guardrails are serving people. It means developers can’t predict what will pass: they’re trying and failing. Each rejection teaches them only that one specific combination failed. A healthy system has a low rejection rate not because the rules are loose, but because developers start from defaults that already work.
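A rough sketch of how this could be measured, assuming your CI or policy engine can export one record per attempt. The log format and rule names are hypothetical.

```python
from collections import Counter

# Hypothetical gate log, in time order: (resource, outcome, rejecting rule).
attempts = [
    ("orders", "rejected", "naming"),
    ("orders", "rejected", "partition-count"),
    ("orders", "accepted", None),
    ("audit", "accepted", None),
    ("billing", "rejected", "partition-count"),
    ("billing", "accepted", None),
]

# Share of all attempts that the gate rejected.
rejected = [a for a in attempts if a[1] == "rejected"]
rejection_rate = len(rejected) / len(attempts)

# Which rules developers fail to predict most often.
by_rule = Counter(rule for _, _, rule in rejected)

# Share of resources whose very first attempt was accepted.
first_outcome = {}
for resource, outcome, _ in attempts:
    first_outcome.setdefault(resource, outcome)
first_try_rate = sum(o == "accepted" for o in first_outcome.values()) / len(first_outcome)
```

Breaking rejections down by rule points at where a default or a template is missing, and the first-try rate ties the metric back to the "correct on the first try" criterion.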

KR3: Repeat questions for the same config

Count how often the same questions resurface:

- “How many partitions should this have?”
- “Do I want compaction or deletion here?”
- “Which min-in-sync-replicas goes with replication factor three?”

If your platform channel keeps answering these, the knowledge that resolves them is trapped in people instead of encoded in the system. Working self-service makes that expertise “inheritable”: a new team gets it by default, without asking. It’s like Agent Skills: build once, reuse for anyone.

Recurring config questions are a direct signal of expertise that hasn’t been turned into a reusable asset.
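A deliberately naive sketch of counting recurring questions, assuming you can export messages from your platform channel. The keyword map is a hypothetical stand-in; real classification would need something more robust than substring matching.

```python
from collections import Counter
from typing import Optional

# Hypothetical keyword-to-subject map, for illustration only.
SUBJECTS = {
    "partition": "partition count",
    "compact": "cleanup policy",
    "retention": "cleanup policy",
    "in-sync": "replication",
    "replication": "replication",
}


def classify(message: str) -> Optional[str]:
    """Return the first config subject whose keyword appears in the message."""
    text = message.lower()
    for keyword, subject in SUBJECTS.items():
        if keyword in text:
            return subject
    return None


messages = [
    "How many partitions should this have?",
    "Do I want compaction or deletion here?",
    "Which min-in-sync-replicas goes with replication factor three?",
    "how many partitions for a low-volume topic?",
    "Is the staging cluster down?",
]

counts = Counter(s for s in map(classify, messages) if s is not None)
recurring = [subject for subject, n in counts.items() if n >= 2]
```

Each subject that shows up in `recurring` is a candidate for a default or a template, so the answer is given once by the system instead of repeatedly by a person.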

KR4: Creation from templates

What percentage of new topics and connectors are created from a vetted template versus from a blank form and the cluster’s generic defaults?

This is often the missing piece. Without templates, every developer has to remake decisions your experts already made, and they often get them wrong because they are not Kafka experts. When templates exist and are widely reused, the expert call gets made once, encoded into a template with its context and conditions, and reused from then on. Template adoption rate is the closest thing you have to a single metric for whether the guidance exists.
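As a sketch of what "encoded into a template" and "adoption rate" might look like in practice: the template names, the values, and the record shape below are hypothetical. `min.insync.replicas` and `cleanup.policy` are real Kafka topic configs, but the numbers shown are illustrative choices, not recommendations from the article.

```python
# Hypothetical template catalogue; every value here is illustrative.
TEMPLATES = {
    "event-stream": {
        "partitions": 6,
        "replication_factor": 3,
        "min.insync.replicas": 2,
        "cleanup.policy": "delete",
    },
    "changelog": {
        "partitions": 6,
        "replication_factor": 3,
        "min.insync.replicas": 2,
        "cleanup.policy": "compact",
    },
}


def from_template(template: str, name: str, **overrides) -> dict:
    """Build a topic request from a vetted template, allowing explicit overrides."""
    config = {**TEMPLATES[template], **overrides}
    return {"name": name, "template": template, "config": config}


created = [
    from_template("event-stream", "orders.placed"),
    from_template("changelog", "customers.latest", partitions=12),
    {"name": "tmp-test", "template": None, "config": {"partitions": 1}},  # blank form
    from_template("event-stream", "payments.settled"),
]

# Share of new topics that started from a vetted template.
adoption_rate = sum(t["template"] is not None for t in created) / len(created)
```

Recording the template name on each request is what makes the metric computable later. Overrides stay possible, but they become visible deviations from an expert default instead of silent guesses.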

KR5: Misconfiguration incidents

Find a way to trace production issues back to a provisioning-time choice (wrong partitions, wrong metadata, wrong replication, wrong data center, wrong semantics, wrong config, etc.). Partition count is the canonical example: it’s a deliberate capacity decision about parallelism, and...
