Spec Kit on a brownfield codebase: setup and first impressions
In April and May I ran the same Spec Kit pipeline against two codebases with the same prompt. One was a small full-stack toy app I had refactored several times before the run, with intentional bugs left in place: sync sessions under async routes, dual ORM mapping, a hardcoded "not implemented yet" description, a header typo. Spec Kit’s pipeline ran through it twice (Opus 4.6, then Sonnet 4.6 in a separate run) and reported all gates passing on completion. Tests claimed passing: 440 in run 1. Tests actually passing on the merged branch: 55, plus one error from a rate-limiter the pipeline added without adding a conftest.py to disable it during tests. External CI on each merged branch: 12 of 20 checks failing.
The second codebase was a three-year-old personal CLI of mine. PostgreSQL backend, Fernet-encrypted config file, around 2000 lines of Python. Real historical accidents, not curated bugs. A typo (bibtext_id) baked into the schema. Three procedural modules from before I bothered to write object-oriented code, sitting alongside the OO stack that replaced them. A placeholder test that always failed. Integration tests silently dependent on a live local database.
The first run mostly confirmed what’s already documented elsewhere, including in Spec Kit’s own README: pipeline self-validation is not a quality gate. The brownfield run was different. The failure mode shifts when the codebase is real, and the interesting hallucinations show up in the Constitution, not in the implementation.
This post is about the Constitution-generation step, because that’s where the pipeline first touches the codebase and what it produces there shapes everything downstream. Later posts cover the audit moves I made during the run, the implementation phase, and where this all does and doesn’t fit in real ERP modernization.
What Spec Kit does first
Before any of the documented pipeline phases (/specify, /plan, /tasks, /analyze, /implement), the Claude Code instance hosting the pipeline reads the codebase. On the brownfield run, that pre-pipeline read produced a six-finding CLAUDE.md document: README-vs-reality mismatch on the run command path, two distinct test problems, the three-layer architecture identified with its driver-isolation boundary, the four-table schema with FK relationships reverse-engineered, three legacy procedural modules flagged as not wired into run.py and as still using the bibtext_id typo, and the constitution recognized as an unfilled template.
I had not asked for any of this, and I had not even issued a pipeline command. The model had read the code on its own and surfaced what it noticed.
For comparison: the curated-toy runs produced no equivalent. There was nothing for a pre-pipeline pass to find beyond the bugs I had planted myself and already knew about.
That set the expectation that the brownfield run would surface more, and surface it earlier.
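A finding like "flagged as not wired into run.py" is cheap to verify independently. Below is a minimal sketch of one way to do that, assuming the check only needs the direct imports of the entry point; it does not follow transitive imports. The module names in the usage lines (`cli`, `storage`, `legacy_add`, `legacy_search`) are hypothetical stand-ins, not the names in the actual repository.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b" counts as an import of top-level module "a"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from a.b import c" also counts as an import of "a"
            names.add(node.module.split(".")[0])
    return names


def unwired(entry_source: str, candidates: set[str]) -> set[str]:
    """Candidates that the entry point never imports directly."""
    return candidates - imported_modules(entry_source)


# Hypothetical entry point and module names, for illustration only.
entry = "import cli\nfrom storage import Database\n"
print(sorted(unwired(entry, {"storage", "legacy_add", "legacy_search"})))
# → ['legacy_add', 'legacy_search']
```

A module that shows up here may still be reachable through another import chain, so the result is a list of things to check by hand, not a verdict.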
The Constitution
The Constitution is a four-principle document (code quality, testing, UX consistency, performance) that downstream pipeline phases must respect. Each principle carries a predicate: a condition concrete enough to test against.
For the brownfield run, the Constitution had to be reverse-engineered from observed code patterns. Six mandates surfaced across the four principles. Four matched real architectural decisions I had made; two had no commit or code evidence behind them. A seventh conscious decision, clearly visible in the commit history, was missing from the Constitution entirely; I’ve added it to the table for comparison.
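To make "predicate" concrete, here is a sketch of what a checkable form of the driver-isolation boundary could look like: psycopg2 may be imported in `psycopg_db.py` and nowhere else. This is an illustration under stated assumptions, not what Spec Kit generated and not code from the repository. The file contents, `legacy_add.py`, and the `PsycopgDb` class name are hypothetical.

```python
import ast

ALLOWED_FILE = "psycopg_db.py"  # the one file allowed to touch the driver
DRIVER = "psycopg2"


def imports_driver(source: str, driver: str = DRIVER) -> bool:
    """True if the source imports the driver package or any of its submodules."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == driver for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # level == 0 restricts the match to absolute imports
            if node.level == 0 and node.module and node.module.split(".")[0] == driver:
                return True
    return False


def isolation_violations(sources: dict[str, str], allowed: str = ALLOWED_FILE) -> list[str]:
    """File names, other than the allowed one, that import the driver."""
    return sorted(
        name for name, src in sources.items()
        if name != allowed and imports_driver(src)
    )


# Hypothetical file contents, for illustration only.
sources = {
    "psycopg_db.py": "import psycopg2\n",
    "run.py": "from psycopg_db import PsycopgDb\n",
    "legacy_add.py": "import psycopg2.extras\n",
}
print(isolation_violations(sources))
# → ['legacy_add.py']
```

A mandate with a predicate of this kind can be run against the repository. A mandate without one can only be taken on trust.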
The author-honesty table
I pulled the commit history for the relevant files and went through it line by line. Some of the patterns the tool surfaced are visible in commit messages as decisions I had documented at the time. Some are not. Here’s the breakdown after the verification pass:
| Mandate | Origin (verified by commit) | Tool behaviour |
| --- | --- | --- |
| Type hints everywhere | Conscious, Mar 4 2023 commit “add type hints”; Dec 25 2023 commit “add type hints for new methods” | Correctly extrapolated |
| OO architecture | Conscious, Mar 3 2023 commit “oop restructure” | Correctly extrapolated |
| Driver isolation (psycopg2 only in psycopg_db.py) | Conscious, Mar 20 2023 commit “extract psycopg into separate class - ETC” | Correctly extrapolated and elevated to architectural mandate |
| Pylint-clean | Conscious, three separate “satisfy pylint” / “fix pylint” commits across 9 months in 2023 | Correctly extrapolated |
| Rollback-on-add-failure | Conscious, Apr 23 2023 commit “change adding of database entry to remove incomplete information from DB if add fails” | Omitted: not elevated to a Constitution principle |
| 1 second / 10 000 papers performance budget | Not conscious; never measured | Fabricated to “give the principle a testable shape” |
| Legacy modules frozen | Not decided | Fabricated policy |
The four “correctly extrapolated” rows have commit evidence. The driver-isolation extraction is the most notable of these: the pipeline correctly identified that I was...