Tests Pass, Design Breaks: Why TDD Can't Hold the Line on Design Intent
There is a popular misconception that doing TDD also keeps your design correct: if the tests pass, quality is guaranteed. In AI-assisted development, this misconception does its damage quietly. The more tests you have, the more invisible damage can build up underneath.
All tests passed. The design was still broken.
Here is what happened today.
A function in `safe_post.py` had its signature changed. Two arguments, `notify_sh` and `doctor_sh`, were removed. The test suite passed in full.
But the callers were still using the old signature. They were silently broken.
Why did the tests pass? Because the test code itself was using the old signature. The tests had been written (by AI) at a time when the design intent was already misunderstood. The misunderstanding was baked into the tests from the start.
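One plausible way this can happen is sketched below. The post does not show the real code, so `safe_post`, `run_job`, and the argument values are hypothetical stand-ins, and the use of a mock is an assumption about how such a test might have been written. A permissive `MagicMock` accepts any arguments, so a test that encodes the old call shape keeps passing after the real function drops `notify_sh` and `doctor_sh`.

```python
from unittest.mock import MagicMock


def safe_post(path: str) -> str:
    """Hypothetical new signature: notify_sh and doctor_sh have been removed."""
    return f"posted {path}"


def run_job(post=safe_post) -> str:
    """Hypothetical caller that was never updated; it still passes the removed arguments."""
    return post("report.md", notify_sh="notify.sh", doctor_sh="doctor.sh")


def test_run_job_posts_report() -> bool:
    """Test written against the old signature, with the dependency mocked out."""
    fake_post = MagicMock(return_value="posted report.md")
    result = run_job(post=fake_post)
    # The assertion encodes the old call shape, so it agrees with the stale caller.
    fake_post.assert_called_once_with(
        "report.md", notify_sh="notify.sh", doctor_sh="doctor.sh"
    )
    return result == "posted report.md"


def real_call_fails() -> bool:
    """Running the caller against the real function exposes the break."""
    try:
        run_job()
    except TypeError:
        return True
    return False
```

In this sketch the test is green while the only code path that matters raises `TypeError` at runtime.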
Tests passing and the design being correct are two different things.
"All tests pass" tells you only one thing: the implementation matches what the tests expect. Whether the tests express the right design intent is a separate question.
TDD verifies "implementation against tests" — nothing more
Let me restate the TDD definition.
Red → Green → Refactor. Write a test. Write the implementation that passes the test. Refactor.
In this loop, what the test verifies is whether the implementation meets the test's expectation. That is one verification — and only one.
What TDD does not verify is whether the test itself correctly expresses the design intent.
The structure looks like this:
    Design intent → Tests   (← this link is not verified)
                      ↓
                Implementation   (← this link is verified by tests)
If the person writing the tests misunderstands the design intent, the tests will pass and the design will still be wrong. Machine learning engineer Hamel Husain calls this the "Gulf of Specification" — the gap between what you intended to measure and what your metric actually measures. Optimize hard against a flawed metric and you optimize hard in the wrong direction. The same dynamic plays out in TDD.
This is not a critique of TDD. It is a statement that TDD, by its structure, cannot solve this particular problem.
You can't escape the snowball
"Then review the tests," is the natural counter. Yes — but how do you review the review?
    Design intent → Tests → Implementation
                      ↑
        Human reviews (does it express intent?)
                      ↑
            Who reviews the reviewer?
                      ↑
             ... (infinite regress)
The only way out of this snowball is to design a terminator for the review chain. And the terminator must, eventually, be a human.
The problem is that AI accelerates this loop. AI writes the implementation quickly, writes the tests quickly, makes them pass quickly. The faster the AI side moves, the more "is this test expressing intent?" work piles up on the human side. The paradox is sharp: the more you automate, the more confirmation work humans inherit.
Speed and quality, paradoxically
As a Forward Deployed Engineer working on AI adoption in the field, I run into this paradox often. The pattern goes: "AI made our development faster" — then a few weeks later — "but the design is getting more tangled."
When speed goes up, the share of time allocated to design review goes down in relative terms. When the number of tests goes up, the cost of asking "is this test correct?" goes up with it. Use AI without being aware of this, and the speed benefit converts itself into a quality cost.
What happens when you combine AI and TDD
AI is good at writing tests. "Write tests for this code" — a few seconds, and you have a plausible test file.
That is exactly where the problem is.
The tests AI writes tend to be "tests reverse-engineered from the implementation." They describe what the code currently does. This is excellent for verifying "implementation against tests." It is nearly useless for verifying "tests against design intent."
The reason is simple: AI does not know the design intent. Unless it is in the context, AI reads the implementation, observes the behavior, and turns that behavior into tests. It converts "this is how it currently behaves" into a test, not "this is how it was supposed to behave."
The `safe_post.py` story is exactly this. The tests had been written against the old signature. Nobody noticed. The tests faithfully verified that the implementation matched a now-outdated assumption. After the signature changed, the tests stayed where they were.
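The difference between a test derived from behavior and a test derived from intent can be made concrete with a small sketch. Everything here is hypothetical and unrelated to the real `safe_post.py`: `post_with_retry` is an invented helper whose stated intent is "try up to three times", and its off-by-one bug is planted for illustration.

```python
def post_with_retry(send, max_attempts: int = 3) -> int:
    """Hypothetical helper. Intended behaviour: try up to max_attempts times.

    The loop below has a deliberate off-by-one bug and stops one attempt early.
    Returns the number of attempts made.
    """
    attempts = 0
    for _ in range(max_attempts - 1):  # bug: should be range(max_attempts)
        attempts += 1
        if send():
            break
    return attempts


def always_fails() -> bool:
    """A sender that never succeeds, so every allowed attempt is used."""
    return False


def characterization_test_passes() -> bool:
    """Derived from observed behaviour: 'it makes 2 attempts', so assert 2."""
    return post_with_retry(always_fails) == 2


def intent_test_passes() -> bool:
    """Derived from the stated intent: 'try up to 3 times', so assert 3."""
    return post_with_retry(always_fails) == 3
```

The characterization test passes and locks the bug in place. Only the intent-derived test fails, and it can only be written by someone who knows the intent.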
The "tests pass = OK" trap
What makes this nasty is that the discovery is delayed.
Normal bugs are caught the moment the implementation fails the test. But "tests don't express the design intent" bugs only surface when the actual runtime behavior diverges from what was intended. From the test output, everything looks fine.
In the `safe_post.py` case, the fact that callers were using the old signature didn't surface until the code path actually ran. From the test suite alone, the answer was "all green."
The one way out
The only way to stop the snowball is to separate what can be machine-verified from what cannot.
Machine-verifiable:
- Whether the implementation passes the tests (TDD's job)
- Whether signatures and types are consistent (the type checker's job)
- Whether boundary conditions hold (automated tests)

Not machine-verifiable:
- Whether the tests correctly express the design intent
- Whether the implementation's "why" matches the design intent's "why"
Humans confirm only the second category. Everything in the first goes to machines.
If you skip this split and march forward under the belief that "more tests = more safety," every new test adds another item to the "do I trust this test?" pile. Confirmation cost grows linearly with test count.
What a type checker really buys you
In the `safe_post.py` case, the signature change was something a type checker could have caught. With Python type annotations, `mypy` could have pointed straight at the caller using the old signature.
A different layer from TDD. A different mechanism. Widening the machine-verifiable surface is the realistic way to keep design integrity intact. Be explicit about which range tests own, which range the type checker owns, and which range humans own.
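As a rough illustration of the signature layer, the sketch below uses a hypothetical annotated `safe_post` and an old-style call site. The static check itself happens inside a tool such as `mypy` and cannot be shown as running code, so `call_matches_signature` is an invented runtime analogue built on `inspect.signature`, included only to make the mismatch observable.

```python
import inspect


def safe_post(path: str) -> str:
    """Hypothetical new signature, with notify_sh and doctor_sh removed."""
    return f"posted {path}"


def call_matches_signature(func, *args, **kwargs) -> bool:
    """Runtime analogue of the static check: can these arguments bind to func?"""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# Old-style call site. A static checker is expected to reject a call like this
# without running it, because the keyword arguments no longer exist on the
# annotated signature:
#
#     safe_post("report.md", notify_sh="notify.sh", doctor_sh="doctor.sh")

old_call_ok = call_matches_signature(
    safe_post, "report.md", notify_sh="notify.sh", doctor_sh="doctor.sh"
)
new_call_ok = call_matches_signature(safe_post, "report.md")
```

The point of the design is that this check needs no knowledge of intent. It compares a call site against a declaration, which is exactly the kind of work that belongs to a machine.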
Minimizing the human surface
To shrink the human surface, externalize design intent as context.
When asking AI to write tests, lead with the intent. Not "write tests for this function" but "this function's responsibility is X and Y; it does not handle Z; please write tests that verify those two." When you change a signature, write: "this function's responsibility now excludes the notification side; tests should reflect that exclusion."
Even then, misunderstandings happen. But the divergence between intent and generated test is smaller than when you hand AI nothing but implementation code.
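One way to make "lead with the intent" routine is to keep the intent in a small structure and render the prompt from it. This is a minimal sketch, not a prescribed format: `IntentBrief`, its fields, and the example responsibilities are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IntentBrief:
    """Hypothetical container for the design intent behind one function."""

    function_name: str
    responsibilities: list[str]
    out_of_scope: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render a test-writing request that states intent before code."""
        lines = [f"Write tests for `{self.function_name}`."]
        lines.append("Responsibilities the tests must verify:")
        lines += [f"- {item}" for item in self.responsibilities]
        if self.out_of_scope:
            lines.append("Out of scope (tests must not depend on these):")
            lines += [f"- {item}" for item in self.out_of_scope]
        return "\n".join(lines)


brief = IntentBrief(
    function_name="safe_post",
    responsibilities=["posts the given file", "reports failure without raising"],
    out_of_scope=["sending notifications"],
)
prompt = brief.to_prompt()
```

The rendered text doubles as a written record of the intent, which a reviewer can check the generated tests against.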
Not a criticism of TDD
To be clear: I am not against TDD.
Tests are necessary. Automated tests are the most practical way to verify boundary conditions. Alongside type checkers, they can also flag "did this signature change break the callers?", provided the prerequisite holds: the tests themselves correctly express the design intent.
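When a test does need to replace a dependency, one option is a mock that enforces the real signature. The sketch below uses `unittest.mock.create_autospec` for that. `safe_post` and `run_job` are hypothetical stand-ins, and whether the original tests used mocks at all is an assumption.

```python
from unittest.mock import create_autospec


def safe_post(path: str) -> str:
    """Hypothetical new signature: notify_sh and doctor_sh have been removed."""
    return f"posted {path}"


def run_job(post=safe_post) -> str:
    """Hypothetical caller that still uses the old call shape."""
    return post("report.md", notify_sh="notify.sh", doctor_sh="doctor.sh")


def stale_caller_detected() -> bool:
    """An autospec'd mock rejects calls that the real function would reject."""
    strict_post = create_autospec(safe_post, return_value="posted report.md")
    try:
        run_job(post=strict_post)
    except TypeError:
        return True
    return False
```

Here the test fails as soon as the signature changes, because the mock is derived from the current function rather than from what the test author remembered.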
The problem is the belief that "if you do TDD, your design is also safe."
TDD is a tool that raises implementation quality. It is not a tool that verifies design intent. Use it with that distinction in mind, and TDD becomes a powerful weapon. Confuse the two and confidence rises while the actual coverage of quality assurance shrinks.
In AI-assisted development this distinction matters more, not less. The faster AI can generate tests, the more the gap between "tests written" and "intent verified" widens — unless you deliberately design the mechanism that closes it.
A three-layer model for test design
A practical organizing frame:
**Layer 1: implementation correctness (TDD)** Tests carry expectations; the implementation must satisfy them. Red/Green/Refactor. The layer AI is best at.
**Layer 2: design integrity (types / static analysis)** Signature consistency, type matching, contracts with callers. Type checkers and linters do this. Machine-owned.
**Layer 3: alignment with design intent (humans)** Whether the test truly expresses "why this should behave this way." Whether the implementation's "why" matches the design intent. Humans only.
When AI accelerates test generation, Layers 1 and 2 stay machine-owned. Build the discipline of having humans confirm only Layer 3. That is the realistic design for keeping speed and quality together.
Why "verbalizing design intent" is the core skill of the AI era
The conversation broadens slightly from here.
As AI-assisted development accelerates, the value of being able to articulate design intent rises.
The cost of writing code has dropped. The cost of writing tests has dropped. Both can be generated in seconds. But "what should we build?" and "why does this design have to look like this?" are questions AI does not answer for you. More precisely: unless you put the intent into the context, AI defaults to "the design inferred from the current implementation."
A person who can verbalize design intent gives AI more concrete instructions. "This function's responsibility is X and Y. Z is out of scope. Tests should verify these two." Hand AI that, and the gap between intent and generated tests shrinks.
A person whose design intent lives only in their own head hands AI nothing concrete. Every confirmation step boomerangs back to the human. When the design intent is not verbalized, the faster AI goes, the more confirmation cost the human inherits.
I see this pattern more often in the field now: "we introduced AI, development sped up, but quality confirmation has become exhausting." The "exhausting" part is mostly the design-intent verbalization gap. Speed exposes what was tacit.
TDD does not guarantee design intent for the same reason AI does not guarantee design intent. Both are tools that process what is written. Design intent, unless humans put it into writing, lives nowhere a machine can read it.
Where to write the design intent
A concrete question: where should you put it?
**In code, via test names.** Not in comments, in the test name itself. The test name is the place to say "what should this implementation be doing, and why." `test_safe_post_handles_missing_file` says less than `test_safe_post_completes_without_notify_when_notify_sh_is_absent`. The longer name carries the intent.
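A test written under the longer name might look like the sketch below. The `safe_post` shown here is a hypothetical stand-in, not the real function: it assumes the notify script is looked up in a working directory, which the post does not state.

```python
import tempfile
from pathlib import Path


def safe_post(workdir: Path) -> dict:
    """Hypothetical stand-in: posts, and notifies only if notify.sh exists."""
    notified = (workdir / "notify.sh").exists()
    return {"posted": True, "notified": notified}


def test_safe_post_completes_without_notify_when_notify_sh_is_absent() -> None:
    # Intent: a missing notify script is not an error; the post still completes.
    with tempfile.TemporaryDirectory() as tmp:
        result = safe_post(Path(tmp))
    assert result == {"posted": True, "notified": False}
```

If a later change makes a missing script fatal, the failing test name alone tells the reader which design decision was reversed.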
**In documents, via ADRs (Architecture Decision Records).** Why you chose this design, what alternatives existed, the assumptions behind the choice. You do not need perfection. A single paragraph — "the current signature is X and Y for these two reasons" — drastically lowers the cost of judging a future signature change.
**In conversation, via PR comments and issue threads.** A code review comment that carries design intent leaves a trail that a future reader can follow when asking "why is it like this?"
The common move across all three: externalize design intent. Do not keep it in your head. Put it where a machine can reference it.
Conclusion
There is no shortcut to verifying design intent. The region machines cannot handle stays with humans.
What you can do is shrink the human region. Automate the machine-verifiable side aggressively. Confirm only what is left.
Not "tests pass, so we are correct." But "did I confirm that the tests express the design intent correctly?"
TDD is a powerful tool. Use it with a clear sense of what it covers and what it does not. Without that distinction, the faster AI development gets, the more quietly things break underneath.
That is the lesson from today.
References
- Hamel Husain, "Your AI Product Needs Evals" (2024). Origin of the "Gulf of Specification" concept.
- Jeannette Wing, "Computational Thinking" (2006, Communications of the ACM).
---
*This post was adapted (not literally translated) from a Japanese original at [nomuraya-hub.pages.dev](https://nomuraya-hub.pages.dev/). I am the same author writing under different pen names — "nomuraya / shimajima / 中翔" — depending on the medium.*