Spec Kit Test run: Auditing as you go: what corrections look like

Auditing as you go: what corrections actually look like

“Claude Code might be over-eager and add components that you did not ask for.” is the warning spec kit includes in its own README.

In the previous post, three categories of Constitution-extraction error were shown: hallucination from gap, hallucination from over-tightening and omission. The first two were catchable through self-audit at phase boundaries whereas the third was not.

Spec kit was run on the actual brownfield code base. I made four audit moves. Three were user-driven: I noticed something, drafted an audit and sent it. One was tool-autonomous: the tool retracted its own hallucination during a Constitution Re-Check before my audit was sent. One additional finding, audit-class amplification, came from the tool during one of those user-driven audits: I had pointed at one instance of an error class, and the tool searched its own output for other instances of the same class.

Let us look at this small empirical record of what self-audit looks like when Constitution principles are sharp in detail.

Audit Move 1: The fabricated performance budget

The first hallucination from Post 1’s table was the one-second-per-ten-thousand-papers performance budget: The Constitution stated it as a quantitative principle. Yet, the codebase contained nothing to justify the number. I had never measured performance beyond ‘runs without pause and doesn’t annoy me’ - which is what most of my private projects use as benchmark.

I sent the audit during the spec phase: rewrite the corresponding success criterion (SC-006) as non-regression-against-baseline rather than against a fabricated absolute, amend the Constitution principle in parallel, and treat the fabricated number as the issue rather than the principle’s existence.

The tool’s response, verbatim:

“You’re right that the 1s/10k numbers were ungrounded; I inserted them to give the principle a testable shape but never grounded them in measurement. Treating this as constitution v1.1.0 (MINOR — principle predicate is redefined, but it’s a relaxation, not a removal).”

Five edits landed in parallel: spec FR-017, spec SC-006, the Constitution bumped from v1.0.0 to v1.1.0 with a new Sync Impact Report entry, the CLAUDE.md summary, and the requirements checklist. The new principle rationale, verbatim:

“The criterion is framed as ‘non-regression vs. the current baseline on a personal-library-sized dataset’ rather than as fabricated absolute numbers, because the current implementation has not been benchmarked and any specific bound would be a guess. Refactors are evaluated against measured baseline, not against a fictional target.”

Two things to note: The tool described its own past behaviour mechanically: “I inserted them to give the principle a testable shape but never grounded them in measurement” is the fabrication pattern from Post 1: the Constitution-template required a quantitative slot, but the code base didn’t supply one. Hence, the model produced a plausible default. The propagation across all five artefacts landed on the first response, with the version bump correctly classified as MINOR (relaxation rather than removal).

It cannot be determined whether the tool would have caught the fabrication unprompted. The next audit gives stronger evidence on that question.

Audit Move 2: NOT NULL drift

Two schema discrepancies surfaced in the design phase. The first: the modernised schema declared NOT NULL on most columns. Yet, the legacy create_tables() had bare INT/VARCHAR with no constraint. The second: bib.bibtex was declared NOT NULL alongside an otherwise-correct UNIQUE translation. But the original had no NOT NULL there either.

Both fall into Post 1’s over-tightening category: a pattern read correctly (uniqueness, non-empty content) was inflated into stricter policy than the code actually enforced. I drafted the audit listing the two issues and sent it.

The tool’s response, verbatim:

“Good catch — and there’s a third schema-tightening issue I introduced that you didn’t mention but is in the same class: I added FK constraints on authors_papers(author_id, paper_id) that don’t exist in the original (the original create_tables() declares them as bare INT with no FK). Going to fix all three together since they’re the same kind of mistake.”

Then, the corrections propagated across three artefacts:

Schema: verbatim with the legacy contract: no NOT NULL on the affected columns and no new FKs. The only FK defined was the single existing FK papers.bibtex_id → bib.bibtex_id

data-model.md entity tables: “Required, non-empty” claims had been removed from schema-constraint columns and moved into a separate “Application-level validation” subsection

SQLAlchemy mapping: nullable columns were now typed Mapped[str | None] / Mapped[int | None]. The ORM navigation had been rewritten to use relationship(primaryjoin=..., secondaryjoin=...). Thus, the ORM’s MetaData matches the schema 1:1 and no FK DDL is emitted at...

Spec Kit Test run: Auditing as you go: what corrections look like

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy

SpaceX not the behemoth everyone thought

Naphtha Shortages Having a Growing Impact in Japan