AI Agent Benchmark: API Bug Detection | KushoAI
Executive Summary
AI coding tools can generate API tests quickly. The harder question is whether those tests find bugs.
This report evaluates that question across 20 live API scenarios spanning seven application domains, with 97 planted functional bugs across three complexity tiers. Each system receives only a JSON schema and one valid sample payload, then must generate API test cases that expose failures in a live reference API. No source code. No documentation beyond the schema. No hints about where failures are planted.
The evaluation uses APIEval-20 v1.0, a black-box benchmark contributed by KushoAI. Because KushoAI is also one of the evaluated systems, this report includes the methodology, workflow definitions, repeated-run setup, and robustness checks so readers can understand where the performance difference comes from.
Seven systems were compared across three groups: general-purpose LLMs, coding agents, and KushoAI. The report also compares workflow modes engineering teams commonly try in practice: one-shot prompting, structured test-strategy prompting, prompt chaining, native coding-agent workflows, and native API test generation.
Simple structural bugs are no longer a meaningful differentiator. Most systems can generate missing-field, null, empty-array, and wrong-type tests. These tests are useful, but they are also the easiest class of failures to discover from the schema alone.
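To make this test class concrete, here is a minimal sketch of how single-field negative tests can be derived mechanically from one sample payload. The payload, field names, and the `field_mutations` helper are hypothetical illustrations, not part of APIEval-20 or any evaluated system.

```python
import copy

# Hypothetical sample payload; field names are illustrative only.
SAMPLE = {"order_id": "ord_123", "amount": 49.5, "items": ["sku_1"]}

def field_mutations(sample):
    """Return (name, payload) pairs, each changing exactly one top-level field."""
    cases = []
    for key, value in sample.items():
        # Missing-field case: drop the key entirely.
        missing = copy.deepcopy(sample)
        del missing[key]
        cases.append((f"missing_{key}", missing))

        # Null case: keep the key, set its value to None (JSON null).
        null_case = copy.deepcopy(sample)
        null_case[key] = None
        cases.append((f"null_{key}", null_case))

        # Wrong-type case: swap in a value of a different JSON type.
        wrong = copy.deepcopy(sample)
        wrong[key] = 12345 if isinstance(value, str) else "not-a-valid-value"
        cases.append((f"wrong_type_{key}", wrong))

        # Empty-array case: only applies to list-valued fields.
        if isinstance(value, list):
            empty = copy.deepcopy(sample)
            empty[key] = []
            cases.append((f"empty_array_{key}", empty))
    return cases

cases = field_mutations(SAMPLE)
```

Every case here changes one field in isolation, which is why this class of test is recoverable from the schema alone.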
Prompt engineering improves parameter coverage, formatting, and field-level negative tests. It makes suites broader, more explicit, and easier to parse, but it does not consistently make general coding tools reason about cross-field business states.
The gap opens on complex bugs. KushoAI detects 76% of complex planted bugs in this evaluation, compared with 53% for the strongest coding-agent workflow and 34% for the strongest general-purpose LLM.
KushoAI ranks first on the primary score and across all bug complexity tiers. The largest margin appears on the metric most tied to production risk: cross-field and business-logic bug detection.
Key Findings
1. Plausible-looking test suites can still miss bugs.
Several systems generated suites with readable names, valid payloads, and broad field coverage. The differences between those suites became visible only after running them against live APIs with known planted failures.
2. Simple schema-level tests are now table stakes.
Most systems can generate tests for missing fields, null values, empty arrays, and wrong types. These tests are useful, but they are not enough to evaluate whether a tool can find production-relevant failures.
3. Prompting helps breadth more than depth.
Structured prompts improved parameter coverage, JSON validity, and field-level negative tests. They did not consistently produce cross-field business-logic tests.
4. Complex bugs separate field mutation from API test design.
The hardest bugs required combining individually valid fields into invalid states, such as invalid refund state, role hierarchy violations, conflicting recurrence rules, or notification channels enabled before verification.
5. Test composition matters more than test volume.
Coding-agent workflows often generated many tests. The gap came from whether those tests explored meaningful field interactions.
6. Consistency matters for CI/CD adoption.
KushoAI showed the lowest run-to-run variance among all evaluated systems. For teams integrating generated tests into automated pipelines, output stability matters as much as peak performance.
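The distinction between field mutation and cross-field test design in findings 4 and 5 can be sketched as follows. The field names, value sets, and business rules are assumptions chosen for illustration; in a black-box setting a tester would have to hypothesize such rules from field names and the sample payload, and this is not the method of any evaluated system.

```python
from itertools import product

# Hypothetical per-field values for a refund request.
# Every value is valid on its own; only certain combinations are suspect.
VALID_VALUES = {
    "order_status": ["paid", "refunded", "cancelled"],
    "refund_amount": [10.0, 100.0],
    "paid_amount": [50.0, 100.0],
}

def violates_assumed_rules(payload):
    """Assumed business rules, for illustration only."""
    already_closed = payload["order_status"] in ("refunded", "cancelled")
    over_refund = payload["refund_amount"] > payload["paid_amount"]
    return already_closed or over_refund

def cross_field_cases(valid_values):
    """Yield payloads built from valid values that form an invalid state."""
    keys = list(valid_values)
    for combo in product(*(valid_values[k] for k in keys)):
        payload = dict(zip(keys, combo))
        if violates_assumed_rules(payload):
            yield payload

suspect = list(cross_field_cases(VALID_VALUES))
```

None of these payloads would be produced by mutating a single field to an invalid value, so a suite can be large and still contain none of them.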
Why API Bug Detection Needs a Different Evaluation
Most API test generation comparisons ask whether a tool can produce tests. That is too low a bar. Any current LLM can generate a list of plausible tests from an API schema.
The test names may sound comprehensive, and the payloads may be syntactically valid, but that does not tell an engineering team whether the suite actually reduces risk. Traditional coding benchmarks usually measure properties like code correctness, task completion, or whether generated tests execute. API testing has a different core objective: finding behavior that violates the intended contract of a live service.
A more useful evaluation question is narrower: Given only the request schema and one valid sample payload, with no source code, no documentation beyond the schema, and no hints about planted failures, can an AI system generate tests that trigger planted functional bugs in a live API?
That is the task evaluated in this report. The evaluation is end to end: the agent reads the schema and sample payload and constructs a test suite, the suite is executed against live reference implementations, and the score is determined by which planted bugs the suite triggers.
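As a rough illustration of scoring by triggered bugs, the sketch below computes a per-tier detection rate from an execution log. The bug ids, test names, the tier label "moderate", and the rule that a bug counts as detected when at least one test triggers it are all assumptions; this is not the APIEval-20 primary score.

```python
from collections import defaultdict

# Hypothetical planted-bug catalog: bug id -> complexity tier.
PLANTED = {
    "B1": "simple",
    "B2": "simple",
    "B3": "moderate",
    "B4": "complex",
    "B5": "complex",
}

# Hypothetical execution log: test name -> bug ids that test triggered.
TRIGGERED_BY_TEST = {
    "missing_amount": {"B1"},
    "null_items": {"B1", "B2"},
    "refund_on_cancelled_order": {"B4"},
    "valid_baseline": set(),
}

def detection_rate_by_tier(planted, triggered_by_test):
    """Fraction of planted bugs in each tier triggered by at least one test."""
    detected = set().union(*triggered_by_test.values()) & set(planted)
    totals, hits = defaultdict(int), defaultdict(int)
    for bug, tier in planted.items():
        totals[tier] += 1
        hits[tier] += bug in detected  # True counts as 1
    return {tier: hits[tier] / totals[tier] for tier in totals}

rates = detection_rate_by_tier(PLANTED, TRIGGERED_BY_TEST)
```

Under this assumed rule, two tests that trigger the same bug count once, so adding redundant tests does not raise the rate.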
This black-box constraint reflects a common practitioner reality. Teams often receive an OpenAPI schema or request payload examples before they have complete documentation, test data, or implementation context. In that setting, a...