Why Spec-Driven Development Is Replacing Vibe Coding

By AI Agent Engineering | 2026-03-11 | guide

A deployment broke at 2 AM last Tuesday. The on-call engineer pulled up the diff, traced the failure to an auth token validation change, and realized something unsettling: nobody on the team had written that line of code. An AI agent had generated it three days earlier while adding a new feature — and in the process, silently dropped a constraint from the original authentication flow.

This is the failure mode nobody warned you about when vibe coding felt like a superpower.

The Forgetting Problem

Vibe coding — building software by feeding prompts to an AI one at a time — delivers an intoxicating first week. You describe what you want, working code appears, and every feature ships faster than it should. The trap springs on week three.

When you prompt for Feature B, the model has no memory of the constraints you established for Feature A. Add Feature C, and something foundational breaks — not loudly with an error, but silently, in a way that passes every test you wrote for Feature C while violating an assumption Feature A depends on. Teams across Amazon's internal services hit this pattern repeatedly: prompt-by-prompt development produced a "two steps forward, one step back" dynamic on any codebase with history [1].

The math exposes the trap. If a developer previously spent 70% of their time writing code and 30% on process (testing, review, deployment) and AI doubles coding speed, the ratio shifts to roughly 50/50. The coding got twice as fast, yet overall delivery improved by only about 1.5x, because everything around the code stayed exactly the same speed [1].
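Spelled out with the figures above (100 hours of work as the baseline; the 70/30 split and the 2x coding speedup are the numbers from the episode, the rest is bookkeeping):

    # Baseline: 100 hours of work, split 70% coding / 30% process.
    coding_before, process_before = 70, 30

    # AI doubles coding speed; testing, review, and deployment are untouched.
    coding_after = coding_before / 2                        # 35 hours
    total_after = coding_after + process_before             # 65 hours

    print(coding_after / total_after)                       # ~0.54: roughly 50/50
    print((coding_before + process_before) / total_after)   # ~1.54x overall, not 2x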

This isn't an AI limitation. It's an architecture problem. Prompts are disposable by design — each one exists in isolation, unaware of what came before. Building complex software on disposable instructions is like navigating a city with turn-by-turn directions that reset after every intersection. It works for three blocks. Then you're lost.

The Map, Not the Directions

The fix isn't to abandon AI — it's to change what you hand it. Instead of a sequence of disposable prompts, you write a specification: a living document that describes what the software should do, who it serves, and what properties must always hold true.

The spec becomes the central artifact of the project. Not the code. The AI generates code from the spec, generates tests against the spec, and — critically — retains the full context of every requirement when you ask it to add a feature. Update the spec first, rebuild second. Nothing gets forgotten because nothing was ever in a prompt that disappeared.

In practice, teams using this approach follow a six-step workflow [1]:

  1. Define requirements with stakeholders — AI can draft, but humans decide what matters
  2. Write the spec — what the software does, its API contracts, its invariants
  3. Make architecture choices — language, frameworks, deployment model (decisions AI can't make for you)
  4. Generate code and tests from the spec, stepping through sections with checkpoints
  5. Validate against real users — does it solve the actual problem?
  6. Update the spec with what you learned, then regenerate

The spec doesn't need to be a 50-page formal document. A structured markdown file works. What matters is that it's explicit, versioned, and always reflects the current truth about what the software should be doing. Teams that adopted this approach internally at AWS found that the initial overhead — writing things down before generating code — paid back immediately in eliminated regressions [1].
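For a sense of scale, a spec for a small service might be no more than the sketch below. The service, endpoints, and invariants are hypothetical illustrations, not drawn from the episode:

    # payments-service specification (v0.4)

    ## Purpose
    Accept payment requests from the checkout flow and record them durably.

    ## API contract
    POST /payments accepts {order_id, amount_cents, currency} and returns 201 with
    a payment_id, or a structured error {code, message}. No other response shapes.

    ## Invariants (candidates for property-based tests)
    - Every response is either valid payment data or a structured error.
    - No request is processed without a verified auth token.
    - At most one payment exists per order_id (idempotent retries).

    ## Out of scope
    Refunds and chargebacks.

Because the file is versioned alongside the code, a feature request starts as a diff to this document, and the generated code and tests follow from it.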

Property-Based Testing Changes the Economics

Here's where specification-driven development unlocks something that prompt-by-prompt coding fundamentally cannot: property-based testing.

Traditional testing checks specific cases. "If input is X, output should be Y." You write ten of these, feel thorough, and miss the eleventh case that breaks production on a Saturday.

Property-based testing works differently. You define properties — invariants that must hold true across all inputs: "every API response either returns valid data or a structured error," "no connection attempt proceeds without an auth token," "no request takes longer than the SLA threshold." A testing engine then generates hundreds or thousands of random inputs and verifies each property holds across every one of them.
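A minimal sketch of what that looks like in Python using the Hypothesis library; handle_request and the response shape are hypothetical stand-ins for your own code, and the two properties mirror the examples above:

    from hypothesis import given, strategies as st

    # Hypothetical system under test; in a real project this is your handler.
    def handle_request(payload: dict) -> dict:
        if "auth_token" not in payload:
            return {"error": {"code": "UNAUTHORIZED", "message": "missing token"}}
        return {"data": {"echo": payload.get("body", "")}}

    # Property: every response is either valid data or a structured error.
    @given(st.dictionaries(st.text(), st.text()))
    def test_response_is_data_or_structured_error(payload):
        response = handle_request(payload)
        assert ("data" in response) ^ ("error" in response)
        if "error" in response:
            assert {"code", "message"} <= response["error"].keys()

    # Property: nothing is processed without an auth token.
    @given(st.dictionaries(st.text(), st.text()))
    def test_no_processing_without_auth_token(payload):
        payload.pop("auth_token", None)
        assert "error" in handle_request(payload)

Run under pytest, Hypothesis generates hundreds of randomized payloads per test and shrinks any failure to a minimal counterexample, which is exactly the "permutations no human would think to try" effect described below.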

When AWS built drivers for Aurora DSQL using this approach, they extracted properties directly from their specification — like "every connection attempt contains an authorization token" — and the testing engine automatically explored permutations no human would think to try [1]. The result: bugs that previously survived to production, discovered only through customer reports, started getting caught in the build pipeline.

This pairs naturally with specs because the properties already exist in the document — they just need extraction and formalization. The AI writes your code AND generates your edge cases. You define the rules once, and every future code generation is automatically validated against all of them. The silent regression that shipped last month? Caught in the next build, before anyone opens a PR.

Code Review Becomes Spec Review

If AI generates the code, what's the value of reading it line by line?

The emerging answer from teams deep into this workflow: not much. Reviewing AI-generated code is heading in the same direction as reviewing compiler-generated assembly — something engineers did in the 1960s and gradually stopped doing because reviewing the higher-level artifact was more productive [1].

What replaces line-by-line review is a two-layer system. First, a separate AI agent reviews the generated code — not the same agent that wrote it. Using the same agent for writing and reviewing is like grading your own exam. A different agent, with different instructions and a mandate to check for quality, security, and spec adherence, produces measurably better outcomes. Early evidence suggests that even switching to a different model for the review step improves catch rates [1].
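Structurally, the pattern is simple. A sketch under stated assumptions: complete_with is a hypothetical callable standing in for whatever model client you actually use, and the model names and prompts are illustrative only:

    from typing import Callable

    GENERATOR_MODEL = "model-a"  # writes code from the spec
    REVIEWER_MODEL = "model-b"   # separate model and instructions for review

    def generate_and_review(
        spec: str,
        complete_with: Callable[[str, str], str],  # (model, prompt) -> completion
    ) -> tuple[str, str]:
        # One agent generates from the spec...
        code = complete_with(
            GENERATOR_MODEL,
            "Implement the following specification. Return only code.\n\n" + spec,
        )
        # ...and a different agent, with a reviewer's mandate, checks the result.
        review = complete_with(
            REVIEWER_MODEL,
            "Review this code for quality, security, and adherence to the spec. "
            "List every violation you find.\n\nSPEC:\n" + spec + "\n\nCODE:\n" + code,
        )
        return code, review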

Second, human review shifts to the spec itself. Is this specification correct? Does it capture what the customer actually needs? Are the properties comprehensive enough? These questions require judgment that no model can substitute — and they're the questions that matter most for the software's long-term success.

Context Is a Design Problem

The hardest challenge in AI-assisted development right now isn't model intelligence — it's context management. Feed the model too little context and it makes wrong assumptions. Feed it too much irrelevant context and output quality degrades, the same way you'd confuse a barista by ordering a latte while also mentioning the weather and last night's game scores [1].

The solution turns out to be principles software engineers have championed for decades: modularity, clean APIs, strong typing, and encapsulation. If your codebase is well-modularized, the AI only needs context about the module it's changing. If your APIs have clear contracts, the AI reasons locally instead of needing to understand the entire system.

There's a diagnostic test for this: when you ask your AI agent to add a small feature, how many modules does it need to touch? If the answer is "all of them," you have a design problem that's throttling both human and AI productivity. If it touches one or two modules, your architecture will scale with AI tooling for years [1].
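One rough way to run that diagnostic from version control, assuming top-level directories map to modules (an assumption that may not hold in your repository):

    import subprocess
    from collections import Counter

    def modules_touched(rev_range: str = "HEAD~1..HEAD") -> Counter:
        """Count how many top-level directories ("modules") a change touches."""
        files = subprocess.run(
            ["git", "diff", "--name-only", rev_range],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        return Counter(path.split("/")[0] for path in files if path)

    if __name__ == "__main__":
        counts = modules_touched()
        print(f"{len(counts)} modules touched: {dict(counts)}")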

Languages with strong type systems accelerate this further. Rust's compiler catches contract violations before they become runtime bugs. TypeScript's type annotations constrain the solution space. Even Python's optional type hints, when used consistently, give the AI enough structure to reason about interfaces without loading entire codebases into context. Amazon's internal teams found that Rust's compile-time checking was particularly effective at catching bugs early in AI-generated code — the type system acts as a second specification that the compiler enforces automatically [1].
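In Python terms, a consistently typed module boundary might look like the sketch below; the names (AuthToken, ConnectionFactory) are hypothetical, and the point is only that the contract is visible without reading the implementation:

    from dataclasses import dataclass
    from typing import Protocol

    @dataclass(frozen=True)
    class AuthToken:
        value: str
        expires_at: float  # epoch seconds

    class Connection(Protocol):
        def close(self) -> None: ...

    class ConnectionFactory(Protocol):
        # The signature itself encodes the invariant: no connect() without a token.
        def connect(self, host: str, token: AuthToken) -> Connection: ...

A type checker such as mypy flags any caller that omits the token, so an invariant from the spec is enforced before anything runs, and an agent asked to change a caller only needs this interface in context, not the whole system.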

When Vibe Coding Is Exactly Right

This isn't a funeral for vibe coding. It's a boundary marker.

Prompt-by-prompt development remains the fastest way to prototype an idea, explore an unfamiliar API, or build a script you'll run once and delete. For greenfield projects under a few hundred lines, the AI's context window holds everything and nothing gets forgotten. Vibe coding is a legitimate tool — the mistake is treating it as the only tool.

The inflection point is predictable: when you start copying context between prompts to remind the AI what it already built, you've outgrown vibe coding. When you add a feature and something unrelated breaks, you've outgrown vibe coding. When you realize you're spending more time debugging regressions than building features, the spec earns its keep [1].

The upfront cost is real but modest. Writing a spec takes more time than firing off a prompt. But the teams that have made this transition report that the investment pays back within the first iteration cycle — because every subsequent change builds on a foundation that remembers everything, instead of starting from a blank context window [1].

The Role Shift

None of this shrinks the engineer's job. It changes what the job is.

Engineers who've moved past the vibe coding phase spend less time typing code and more time on the work that actually determines whether software succeeds: understanding customer problems deeply enough to specify them precisely, making architecture decisions that AI can't make (should this be a microservice or a library? Rust or TypeScript?), and building the testing and validation infrastructure that makes AI-generated code trustworthy [1].

The constraint on software has always been supply. There has never been enough good software to meet demand. If AI drives down the cost of production by 10x, the economic impact of software could expand by 100x [1]. The engineers who thrive will be the ones defining what code should exist and why — not the ones typing it into existence character by character.

The spec is how you express that "what" and "why." It's the artifact that captures your understanding, survives across iterations, and scales with complexity. The prompt was never designed to do any of those things.

The teams building this way today are writing the patterns everyone else will follow in two years. The question isn't whether the shift happens — it's whether you'll be setting those patterns or catching up to them.


References

[1] IEEE Computer Society, "SE Radio 710: Marc Brooker on Spec-Driven AI Dev," 2026-03-04. Video.