Spec-Driven Development: origin, methodology, tools, and what the hype doesn't tell you
tl;dr — SDD is a legitimate response to a real problem with AI agents. The concept makes sense, the tools are immature, the cost is high, and ROI evidence is still almost entirely anecdotal. Worth understanding. Worth being skeptical about.
The problem that started it all
To understand why Spec-Driven Development emerged, you need to take a step back.
In 2025, AI agents like Claude Code, Cursor, and Codex stopped being glorified autocomplete and started executing complex tasks autonomously. With that came the phenomenon of vibe coding: you describe what you want in natural language, the agent generates hundreds of lines of code, and you approve without fully understanding what was done.
The result? Code that works until it needs to be maintained. Invisible architectural decisions. Technical debt accumulating at industrial speed.
The most rigorous study on the subject — conducted by METR in 2025 with experienced developers on real open source projects — concluded that developers using AI tools were, on average, 19% slower than without them. Not faster. The reason: unstructured prompts created debugging loops that consumed all the time saved in initial code generation.
Spec-Driven Development is the industry’s attempt to solve that problem at the root.
Where the idea comes from
The idea of writing specifications before writing code is not new. It is as old as software engineering itself.
1963: Margaret Hamilton, managing the software for NASA’s Apollo missions, coined the term “software engineering” because programs had grown beyond any one person’s ability to fully comprehend them. She realized: this is engineering, it needs process.
1968: NATO organized a conference in Garmisch, Germany, that formally identified the Software Crisis: computers now made it possible to write programs so complex that they could not be managed adequately. The volume of code had surpassed the human capacity to reason about it.
1972: Dijkstra, in his Turing Award lecture, summed it up: “As long as there were no machines, programming was no problem at all. When we had a few weak computers, programming became a mild problem. Now we have gigantic computers, and programming has become an equally gigantic problem.”
The response of the era was process: Waterfall as a DoD standard, then Agile in 2001, then CI/CD in the cloud making Agile viable at scale.
Now we are in the next cycle. Drew Breunig, the researcher who popularized the term SDD in 2026, described it precisely: “Our current software crisis is our inability to manage the complex codebases that new models enable. Before, the problem was that we couldn’t keep all the code in our heads. Now we can’t even read all of our code.”
AI agents enable waterfall-level volume at an agile cadence. That is the problem SDD attempts to address.
What is Spec-Driven Development
SDD is a development methodology that treats specification as the primary artifact — not code.
Instead of the traditional cycle of prompt → code → iteration, the flow becomes:
Spec → Plan → Tasks → Code
The spec defines intent, constraints, acceptance criteria, and architecture before any implementation. The AI agent then executes against that structured input rather than interpreting a vague description.
An important distinction: with tools like Spec Kit, Kiro, and Tessl, the spec itself is generated by the AI. You describe the goal in natural language, and the agent produces the specification files — typically requirements.md, design.md, and tasks.md — that it will then use as context during implementation. The spec is not a document written manually upfront by analysts; it emerges from the conversation between developer and agent, before any code is written.
This changes the diagnosis of classic problems with specifications: the distance between spec and code is no longer a matter of weeks or different teams. Spec and code are generated in the same cycle, by the same tool, minutes apart. The problem that persists is not temporal separation — it is the upfront work, the token cost, context degradation across iterations, and what happens after deploy when reality diverges from what was specified.
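As a concrete (and entirely invented) illustration of that artifact set, a scaffold for one feature might look like the following. The file names follow the requirements.md / design.md / tasks.md convention described above; the directory layout and contents are placeholders, not the output of any specific tool.

```python
from pathlib import Path

# Hypothetical per-feature spec scaffold. The three file names follow the
# convention Spec Kit / Kiro-style tools use; the bodies are placeholders.
ARTIFACTS = {
    "requirements.md": "# Requirements\n\nWHEN user submits login form\nTHEN ...",
    "design.md": "# Design\n\nFrontend: ...\nBackend: ...",
    "tasks.md": "# Tasks\n\n- [ ] Task 1: ...",
}

def scaffold(feature_dir: str) -> list[str]:
    """Create the per-feature spec directory and return the files written."""
    root = Path(feature_dir)
    root.mkdir(parents=True, exist_ok=True)
    written = []
    for name, body in ARTIFACTS.items():
        (root / name).write_text(body)
        written.append(name)
    return written

print(scaffold("specs/user-auth"))
# → ['requirements.md', 'design.md', 'tasks.md']
```

The point of the per-feature directory is that the agent's context for implementation is exactly these three files, not the whole conversation history.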
There are different levels of SDD adoption:
Spec-first: the spec is generated before implementation and used as a guide for that task. Once complete, it is discarded.
Spec-anchored: the spec is retained after the task to guide future evolution and maintenance of the system. It is a living artifact that travels alongside the code.
Spec-as-source: the most ambitious level. The spec is the source code. The generated code is merely a compiled artifact of the spec, marked with // GENERATED FROM SPEC - DO NOT EDIT. Tessl is attempting to make this viable.
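One way to enforce spec-as-source is to reject hand edits to generated files. Tessl's actual enforcement mechanism is not public; the sketch below only illustrates the idea of the marker, as a hypothetical pre-commit guard.

```python
# Hypothetical pre-commit guard for a spec-as-source workflow: flag any
# file whose first line carries the generated-code marker, since such
# files should only ever be regenerated from the spec, never hand-edited.
MARKER = "// GENERATED FROM SPEC - DO NOT EDIT"

def manually_edited_generated_files(changed: dict[str, str]) -> list[str]:
    """changed maps file path -> proposed new content for a commit."""
    return [
        path for path, text in changed.items()
        if text.splitlines() and text.splitlines()[0].strip() == MARKER
    ]

changed = {
    "src/auth.ts": MARKER + "\nexport function login() {}",
    "specs/auth.md": "# Auth spec",
}
print(manually_edited_generated_files(changed))  # → ['src/auth.ts']
```

A real guard would compare against the git index rather than an in-memory dict, but the contract is the same: generated code is an output, not an input.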
The SDD Triangle — and why the cycle is harder than it looks
The most honest contribution to the debate came from Breunig in March 2026, after building the whenwords project (an open source library with zero code — only spec + 750 conformance tests) and observing how similar projects evolved.
His central insight: SDD is not a linear equation. It is a feedback cycle.
He proposes the SDD Triangle: spec, tests, and code are three nodes that must stay synchronized at all times. When the code advances, the spec must be updated. When the spec changes, new tests need to be written. When tests fail, the code needs to change — and sometimes the spec was wrong too.
The problem is that keeping these three nodes synchronized is hard:
- Writing specs is hard. They are never exhaustive and are written before the software encounters the real world.
- Writing tests is hard. Even before agents, nobody enjoyed writing tests.
- Updating specs and tests after implementation feels like overhead, especially when you are using agents precisely to move fast.
- LLMs make silent decisions during implementation. Those decisions rarely find their way back into the spec.
The practical result: the spec is written, the code is generated, the product is shipped. The spec becomes stale within days. Nobody goes back to update it. The original problem — lost intent, invisible decisions — has simply migrated to a different level.
The tool ecosystem
The SDD tool space exploded between late 2024 and early 2026. It helps to understand the layers:
Layer 1 — Spec frameworks: define and manage specification artifacts → Spec Kit, Tessl, Kiro, BMAD, OpenSpec, cc-sdd
Layer 2 — Planning and task systems: convert specs into executable task graphs → Taskmaster, Agent OS, Beads, Feature-Driven-Flow
Layer 3 — Execution agents: write and modify code → Claude Code, Cursor Agent, Codex, Devika, OpenDevin, CrewAI
Layer 4 — AI IDEs: integrate all layers into a single workflow → Kiro, Windsurf, Cursor, Claude Code, Copilot
Most developers today use only Layer 3 — which is exactly where the vibe coding problem lives.
Table of relevant tools
| Tool | Type | Approach | Status |
|---|---|---|---|
| Spec Kit (GitHub) | CLI | Constitutional spec + 4 phases | Open source, GA |
| Kiro (AWS) | IDE | EARS notation, 3 documents | GA, free tier |
| Tessl | CLI + Registry | Spec-as-source (most ambitious) | Closed beta |
| BMAD | CLI | Multi-agent, role-based personas | Open source |
| OpenSpec | CLI | Proposal + approval workflow | Open source |
| Plumb | CLI | Spec/test/code sync via git hooks | PoC (pip install plumb-dev) |
| smart-ralph | CLI | Minimal SDD scaffold | Open source |
How to use it in practice: Kiro as an example
Kiro is the most accessible entry point for developers already using VS Code. It is a fork of Code OSS (the open source core of VS Code) built by a small team inside AWS, deliberately positioned outside the AWS ecosystem — you do not need an AWS account to use it.
Installation
Go to kiro.dev and download the installer for your operating system. Sign in with GitHub or Google.
The three-step workflow
Step 1 — Requirements
You describe what you want to build in natural language. Kiro translates that into EARS notation (Easy Approach to Requirements Syntax):
```
WHEN user submits login form
AND credentials are valid
THEN system must authenticate the user
AND redirect to the dashboard
AND log the login event with timestamp
```
This notation enforces explicit, machine-readable constraints. You review and adjust the generated requirements.md before moving forward.
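Because the shape is so regular, the WHEN/AND/THEN pattern is trivially machine-readable. The sketch below parses that event-driven form into conditions and responses; it covers only the pattern shown in the example, not the full EARS notation, and the `system must` stripping is an assumption about phrasing.

```python
# Minimal parser for the event-driven WHEN/AND/THEN shape shown above.
# AND continues whichever section (conditions or responses) came last.
def parse_ears(text: str) -> dict[str, list[str]]:
    req: dict[str, list[str]] = {"conditions": [], "responses": []}
    bucket = None
    for line in text.strip().splitlines():
        word, _, rest = line.strip().partition(" ")
        if word == "WHEN":
            bucket = "conditions"
        elif word == "THEN":
            bucket = "responses"
        elif word != "AND":
            raise ValueError(f"unexpected keyword: {word}")
        # Strip the assumed "system must" phrasing from responses.
        req[bucket].append(rest.removeprefix("system must ").strip())
    return req

req = parse_ears("""
WHEN user submits login form
AND credentials are valid
THEN system must authenticate the user
AND redirect to the dashboard
""")
print(req["conditions"])  # → ['user submits login form', 'credentials are valid']
```

This machine-readability is what lets the agent treat each clause as a checkable constraint rather than prose.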
Step 2 — Design
Kiro analyzes your existing codebase and generates a design.md with architectural decisions, stack choices, and component structure. For a React + Node.js project, you will see something like:
```markdown
## Architecture
Frontend: React 18 with React Router v6
Backend: Express 4.x with JWT middleware
Database: PostgreSQL via Prisma ORM
Auth: bcrypt (salt rounds: 12) + JWT (access: 15min, refresh: 7d)
Testing: Jest + Supertest for integration
```
You review this document before any code is written. Disagreements with the actual project architecture surface here — not after 400 generated lines.
Step 3 — Tasks
Kiro generates a tasks.md with discrete implementation steps, sequenced by dependency:
- [ ] Task 1: Database setup and user schema
- [ ] Task 2: POST /auth/register endpoint with Joi validation
- [ ] Task 3: POST /auth/login endpoint with token generation
- [ ] Task 4: JWT authentication middleware
- [ ] Task 5: POST /auth/refresh endpoint
- [ ] Task 6: Integration tests for all endpoints
You control which tasks to execute and when. The agent implements one task at a time, with review checkpoints in between.
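The "one task at a time, in dependency order, with a checkpoint between each" loop amounts to a topological sort with an approval gate. The sketch below uses invented task identifiers mirroring the tasks.md above; the dependency edges and the approval callback are illustrative assumptions, not Kiro's internals.

```python
from graphlib import TopologicalSorter

# Invented dependency graph for the six tasks above (task -> prerequisites).
deps = {
    "db-setup": set(),
    "register-endpoint": {"db-setup"},
    "login-endpoint": {"db-setup"},
    "jwt-middleware": {"login-endpoint"},
    "refresh-endpoint": {"jwt-middleware"},
    "integration-tests": {"register-endpoint", "refresh-endpoint"},
}

def run_tasks(deps, approve=lambda task: True) -> list[str]:
    """Execute tasks in dependency order, pausing at each review checkpoint."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if not approve(task):   # the human review checkpoint between tasks
            break
        executed.append(task)   # here the agent would implement the task
    return executed

order = run_tasks(deps)
print(order[0], order[-1])  # → db-setup integration-tests
```

Rejecting a task halts the run, which is the property that keeps the agent's blast radius to one task at a time.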
Steering files
In addition to the three spec documents, Kiro supports steering files — persistent configuration files that define standards for the entire codebase:
```markdown
# .kiro/steering/code-style.md
- Use strict TypeScript, no implicit `any`
- Prefer `async/await` over callbacks
- Variable names in camelCase, constants in SCREAMING_SNAKE_CASE
- Public functions must have JSDoc
```
Hooks
Kiro supports event-based hooks — agents that trigger automatically in response to defined events:
```json
{
  "hooks": [
    {
      "name": "security-audit",
      "trigger": "on-save",
      "agent": "Check the saved file for security vulnerabilities"
    },
    {
      "name": "test-generator",
      "trigger": "on-file-create",
      "pattern": "src/**/*.ts",
      "agent": "Generate unit tests for the new file"
    }
  ]
}
```
What works well
- Reviewing a design document before implementation is fundamentally different from reviewing 500 generated lines of code afterward. Problems surface at the right level.
- Steering files eliminate a huge volume of repetitive prompts about style and conventions.
- The sequential task flow keeps the agent’s context focused — one problem at a time, not an entire system.
What does not work well
- In independent tests, Kiro generated 16 acceptance criteria for a simple bug fix. For small changes, the overhead is real.
- Spec generation for moderately complex features takes 30–45 seconds, which is a noticeable interruption in a fast development flow.
- EARS notation has a learning curve. It is not intuitive for anyone who has never worked with formal specifications.
- Kiro uses the Open VSX extension registry (not Microsoft’s), which means no official C# support: a serious limitation for .NET teams.
- Autopilot mode (executing multiple tasks without supervision) produces less predictable results. The per-task approval flow is where the real value lies.
Pricing
During the public preview: free (with interaction limits). GA: Free (50 interactions/month) · Pro $19/month (1,000 interactions) · Pro+ $39/month (3,000 interactions)
The real problems
1. Upfront work that goes against developer instinct
The promise of SDD with AI is that the spec is generated quickly, and it is. But “quickly” does not mean “free.” Before a single useful line of code, you go through multiple review cycles: reviewing the generated spec, correcting misinterpreted intent, adjusting the architectural plan, approving or rejecting tasks. Each cycle demands real human attention.
For a medium-complexity feature, that overhead can be less than the cost of debugging vibe-coded code afterward. For a targeted bug fix, the cost exceeds the benefit by a wide margin — Kiro generated 16 acceptance criteria for a simple fix in independent tests. The upfront work does not disappear because the AI generates the spec; it changes in nature. Instead of writing, you review and decide. It is less costly, but it is not zero.
2. Token cost multiplied across every phase
Each SDD phase (spec → plan → tasks → implementation) consumes tokens before a single line of production code is written. With reasoning models, agentic usage can be 100x greater than standard usage.
Heavy SDD sessions with Claude Code hit the context limit regularly. The automatic compaction process takes 3 to 12 minutes. The cost is not only financial — it is also time and flow interruption. No public benchmark compares the total cost (tokens + human review time) of SDD versus direct development.
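A back-of-envelope model makes the phase multiplication concrete. Every number below is an invented assumption (per-phase token counts, the blended price, the direct-prompting baseline); the only point is that each phase bills tokens before any production code exists.

```python
# Back-of-envelope cost model. All figures are invented assumptions,
# not measurements from any tool or public benchmark.
PRICE_PER_1K_TOKENS = 0.015   # assumed blended input/output price, USD

phases = {                    # assumed tokens consumed per SDD phase
    "spec":           40_000,
    "plan":           25_000,
    "tasks":          15_000,
    "implementation": 120_000,
}
direct_prompting = 60_000     # assumed tokens for prompt -> code directly

sdd_total = sum(phases.values())
print(f"SDD: {sdd_total:,} tokens -> "
      f"${sdd_total / 1000 * PRICE_PER_1K_TOKENS:.2f}")
print(f"Direct: {direct_prompting:,} tokens -> "
      f"${direct_prompting / 1000 * PRICE_PER_1K_TOKENS:.2f}")
print(f"Multiplier: {sdd_total / direct_prompting:.1f}x")
```

Under these made-up numbers SDD costs roughly 3x the tokens; the real ratio is unknown precisely because, as noted above, no public benchmark measures it.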
3. Context degradation across iterations
This is the least discussed problem and perhaps the most serious. AI-generated specs are fed back to the AI in the implementation phase. Each compaction cycle or session restart loses nuance. The agent implementing the code does not have full access to the decisions the spec-writing agent recorded.
The result: the code begins to subtly diverge from the spec as context degrades. The loop between spec and code, which should be synchronous, becomes a game of telephone — each phase amplifies small noise from the phase before.
4. Specs become misleading after deploy
With modern SDD tools, the spec is not “distant from the code” — it is generated in the same cycle, by the same tool, minutes before implementation. The classic problem of specs written by analysts disconnected from technical reality does not apply here.
The problem that persists is different: what happens after deploy. Edge cases only appear in production. Real user behavior diverges from what was modeled. Performance problems emerge under load. The spec is not updated. If it is used to guide future maintenance and evolution (spec-anchored), a stale spec becomes actively misleading: the agent trusts it, generates code based on a reality that no longer exists, and the developer only notices when the system breaks.
Breunig’s Plumb tool attempts to address this: a CLI that hooks into git commit, reads agent traces, extracts decisions made during implementation, and asks for developer approval before updating the spec. It is a PoC, not production — but it points in the right direction.
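The general shape of that idea can be sketched without reference to Plumb's internals (which are a PoC and not documented here): harvest decision records from commit metadata and queue them as candidate spec amendments for human approval. The `Decision:` trailer convention below is invented for illustration.

```python
# Generic illustration of the "feed implementation decisions back into the
# spec" idea. Not Plumb's actual mechanism: it uses an invented commit
# trailer convention instead of agent traces.
def extract_decisions(commit_message: str) -> list[str]:
    """Collect candidate spec amendments from 'Decision:' trailers."""
    return [
        line.removeprefix("Decision:").strip()
        for line in commit_message.splitlines()
        if line.startswith("Decision:")
    ]

msg = """Add refresh endpoint

Decision: refresh tokens rotate on every use
Decision: reuse detection revokes the whole token family
"""
print(len(extract_decisions(msg)))  # → 2
```

Whatever the capture mechanism, the approval step is the part that matters: decisions enter the spec only after a human confirms them.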
5. Artifact proliferation without real curation
Each feature generates multiple markdown files. The implicit promise that “the agent keeps the spec updated” does not hold — someone needs to review every spec change with the same rigor applied to code review. In practice, that does not happen. Specs accumulate, drift from the code, and become noise rather than signal.
6. Specifications do not eliminate non-determinism
The same spec can produce different implementations across different runs. Greater precision reduces variation but increases the cost of writing. And poorly written specs — the most likely outcome when the methodology is new to the team — produce well-organized code that does the wrong thing.
7. Risk of waterfall with AI in the loop
The most structural criticism comes from Thoughtworks and independent analyses: SDD as currently practiced risks being waterfall with AI in the loop. You are still defining everything upfront and hoping reality cooperates. For exploratory development with genuinely unknown requirements, context-driven approaches adapt better.
When it makes sense to use
SDD in its current form makes sense for:
- Enterprise teams developing on large, existing codebases where architectural drift is expensive
- Regulated environments where audit trails and requirements traceability are mandatory (EU AI Act, financial sector, healthcare)
- Stable domains with clear contracts: APIs, data schemas, compliance rules
- Teams with TDD or BDD maturity who want to extend that discipline to the AI layer
- Emulation and portability: the use case where SDD shines brightest is reimplementing an existing system in another language, using the original system’s tests as the specification
SDD is probably overkill for:
- Solo projects and rapid prototypes
- Exploratory development where requirements are genuinely unknown
- Small fixes or isolated bugs
- Teams without the discipline to keep specs updated after implementation
The honest diagnosis
SDD is a legitimate response to a real problem. Vibe coding generates code faster than teams can govern. Specifications are a tool for restoring that governance.
But the parallel with TDD is instructive. TDD is 25 years old, has extensive empirical evidence, and real-world adoption sits around 8% in the strict sense (writing tests before code, consistently). SDD is generating hype because it solves a visible problem in the agent era, but it inherits the same fundamental challenge: developers prefer to ship. Any methodology that adds upfront work will fight against that instinct.
The scarcity of easily findable public criticism is itself a signal. When the benefits are trivial to find and the tradeoffs are not, a methodology is still in its marketing phase, not its maturity phase. Spec-Driven Development has not yet earned its way out of that phase.
The tools worth watching: Spec Kit for open source flexibility and IDE independence, Kiro for structured IDE workflows, Tessl for the more ambitious spec-as-source vision (still unproven), and Plumb as a reference for the hardest and still unsolved problem — keeping specs alive after the initial implementation.
References
- The Rise of Spec Driven Development — Drew Breunig
- The Spec-Driven Development Triangle — Drew Breunig
- Spec-Driven Development Is Eating Software Engineering — Vishal Mysore
- Spec-Driven Development: Unpacking 2025’s key new practice — Thoughtworks
- Spec-Driven Development: When Architecture Becomes Executable — InfoQ
- Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl — Martin Fowler’s blog
- The Limits of Spec-Driven Development — Isoform
- GitHub Spec Kit
- Kiro IDE
- Tessl
- Measuring AI Impact on Experienced Developer Productivity — METR, 2025