The Humans Become the Bottleneck: A Structural View of AI-Augmented Teams

2026-06-17

A strange thing happens in organizations that introduce AI agents: the faster the AI gets, the more work piles up at the humans.

Faros AI's 2025 study of 10,000+ developers across 1,255 enterprise engineering teams reported that teams with high AI adoption completed 21% more tasks ^{[Faros AI]} and merged 98% more pull requests ^{[Faros AI]} — while review times rose 91% ^{[Faros AI]}, bug rates rose 9%, and DORA metrics remained largely unchanged. The faster the AI ships, the further the human review falls behind. This pattern reproduces across many organizations.

The bottleneck that can't be removed, and the parts that can

Anyone running AI agents will eventually hit the same observation: "the human is the bottleneck."

You cannot fully remove this. As long as design intent originates from a human, the bottleneck is structural. "Make the AI smarter" doesn't solve it.

What you can reduce is the number of decisions that end up at a human, and what you can raise is their quality.

When an AI agent says "please confirm," the requests fall into two categories.

Genuinely human-only:

- Confirming design intent (only this one is intrinsically required)
- Conflicts with prior decisions
- Cost or scope overruns beyond the assumed envelope

Mechanically detectable but currently routed to humans:

- The implementation is broken (type checks and tests can detect this)
- The experiment design is wrong (defining verification criteria up front prevents this)
- The report has thin evidence (no escalation criteria defined)

The second category, masquerading as the first, is what makes humans the bottleneck. The full implementation comes back saying "please review," and the human spends time on it — when what should have been reviewed was just "are there any constraint violations?"
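One way to keep the second category away from humans is to run the mechanical checks first and forward only the intent question. The sketch below is illustrative: `failed_checks`, `next_reviewer`, and the check names are hypothetical, and the lambdas stand in for a real type checker or test runner.

```python
from typing import Callable

# A check is any zero-argument callable that returns True when it passes.
# In practice these would wrap a type checker or a test runner; here they
# are placeholders (an assumption, not part of the original article).
Check = Callable[[], bool]


def failed_checks(checks: dict[str, Check]) -> list[str]:
    """Run every mechanical check and return the names of those that fail."""
    return [name for name, check in checks.items() if not check()]


def next_reviewer(checks: dict[str, Check], intent_question: str) -> dict:
    """Decide who sees the work next: the agent again, or a human."""
    failures = failed_checks(checks)
    if failures:
        # Mechanically detectable problems go back to the agent.
        return {"to": "agent", "fix": failures}
    # Only the design-intent question reaches the human.
    return {"to": "human", "question": intent_question}


result = next_reviewer(
    {"type_check": lambda: True, "tests": lambda: False},
    intent_question="Does this match the intended retry behavior?",
)
print(result)  # {'to': 'agent', 'fix': ['tests']}
```

With this shape, a human is asked a single narrow question only after every machine-checkable constraint has passed.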

The "review everything / trust everything" binary

AI agent operations slide easily into one of two failure modes: everything bounces back for human review, or everything proceeds without oversight.

The first overloads humans. The second loses visibility.

The only exit from the binary is to make the escalation criteria explicit.

Can't resolve in 2 minutes → switch approach
Stuck for 15 minutes → return to human
Design change beyond scope → always return to human
Otherwise → proceed autonomously

Put these rules into the system prompt. That defines where the AI should stop and where it shouldn't, and unnecessary interruptions drop. Galileo's human-in-the-loop design guidance is explicit that escalation rates should be derived from your own task distributions rather than imported as generic industry numbers. That is the same point: the rate is a function of how clearly you've defined "where the AI should stop."
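The four escalation rules can also be written as a small decision function, which makes them testable outside the prompt. This is a sketch under assumptions: the `TaskState` fields are hypothetical, and both the precedence order (scope first, then the 15-minute rule, then the 2-minute rule) and the inclusive thresholds are one reading of the rules, not something the rules themselves specify.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed autonomously"
    SWITCH_APPROACH = "switch approach"
    RETURN_TO_HUMAN = "return to human"


@dataclass(frozen=True)
class TaskState:
    # Hypothetical fields; an orchestration layer would have to track these.
    minutes_on_current_approach: float
    minutes_stuck_total: float
    design_change_beyond_scope: bool


def decide(state: TaskState) -> Action:
    """Apply the escalation rules in an assumed order of precedence."""
    # Out-of-scope design changes always go back, regardless of time spent.
    if state.design_change_beyond_scope:
        return Action.RETURN_TO_HUMAN
    if state.minutes_stuck_total >= 15:
        return Action.RETURN_TO_HUMAN
    if state.minutes_on_current_approach >= 2:
        return Action.SWITCH_APPROACH
    return Action.PROCEED


print(decide(TaskState(3, 3, False)).value)   # switch approach
print(decide(TaskState(3, 16, False)).value)  # return to human
```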

Anthropic itself has noted that "approve every action" oversight tends to add friction without delivering meaningful safety gains. The model that works is "monitor while it runs, intervene when needed."

Shift review from "after implementation" to "during planning"

The other structural problem is review timing. If your design is "humans review the code after implementation completes," you will not catch up.

Human review speed will not keep up with the AI's output speed. The result: review degrades to "looks OK."
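A toy queue model shows why catching up is impossible once output outpaces review capacity. The numbers are hypothetical and chosen only for illustration; they are not taken from any of the studies cited in this article.

```python
def review_backlog(prs_per_day: float, reviews_per_day: float, days: int) -> list[float]:
    """Return the number of unreviewed PRs at the end of each day."""
    backlog = 0.0
    history = []
    for _ in range(days):
        # The backlog grows by arrivals, shrinks by completed reviews,
        # and can never go below zero.
        backlog = max(0.0, backlog + prs_per_day - reviews_per_day)
        history.append(backlog)
    return history


# Hypothetical team: reviewers clear 10 PRs a day.
before = review_backlog(prs_per_day=10, reviews_per_day=10, days=5)
# Same reviewers after PR output doubles (illustrative figure).
after = review_backlog(prs_per_day=20, reviews_per_day=10, days=5)

print(before)  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(after)   # [10.0, 20.0, 30.0, 40.0, 50.0]
```

In this model the backlog grows linearly for as long as arrivals exceed capacity, so the only stable options are fewer items per review or less to read per item.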

The fix is to shift review to the planning and design stage. Before implementation, confirm "is this direction correct?" The implementation details go to the AI. You don't have to read the code, so review is fast. The most expensive failure mode — "we realized the direction was wrong after the implementation is done" — also disappears.
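A planning-stage review can be enforced with a simple gate: implementation refuses to start until a human has approved the direction. The `Plan` fields and `start_implementation` are hypothetical names for this sketch, not part of any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Plan:
    goal: str                      # the "why": what the change is for
    approach: str                  # the direction the agent intends to take
    out_of_scope: list[str] = field(default_factory=list)
    approved_by: Optional[str] = None  # set by a human reviewer


def start_implementation(plan: Plan) -> str:
    """Refuse to implement until a human has approved the direction."""
    if plan.approved_by is None:
        raise PermissionError("plan has not been reviewed")
    return f"implementing: {plan.approach}"


plan = Plan(goal="cut checkout latency", approach="cache the price lookup")
plan.approved_by = "reviewer"  # the human reads the plan, not the code
print(start_implementation(plan))  # implementing: cache the price lookup
```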

Separate machine-verifiable from non-machine-verifiable, too.

- Interface (types, signatures) and boundary behaviors: testable and type-checkable, so the machines handle them.
- Design intent and the "why": not machine-verifiable.
- Non-functional concerns (performance, security): partially testable, partially measurable.

The only thing strictly not machine-verifiable is "the why of the design." Concentrate human time there. Send everything else to machines. Humans converge on the decisions humans actually have to make.
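The split can be kept as an explicit table so that a review agenda contains only what still needs human time. The concern labels and the `human_agenda` helper below are hypothetical; the three tiers follow the grouping described above.

```python
from enum import Enum


class Verifier(Enum):
    MACHINE = "machine"
    PARTIAL = "machine plus human judgment"
    HUMAN = "human"


# Hypothetical concern labels, following the three groups in the text.
VERIFIER_FOR = {
    "interface": Verifier.MACHINE,
    "boundary_behavior": Verifier.MACHINE,
    "performance": Verifier.PARTIAL,
    "security": Verifier.PARTIAL,
    "design_intent": Verifier.HUMAN,
}


def human_agenda(concerns: list[str]) -> list[str]:
    """Keep only the concerns that still need human time."""
    return [
        c for c in concerns
        # Unlisted concerns fall back to HUMAN rather than being skipped.
        if VERIFIER_FOR.get(c, Verifier.HUMAN) is not Verifier.MACHINE
    ]


print(human_agenda(["interface", "security", "design_intent", "boundary_behavior"]))
# ['security', 'design_intent']
```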

The paradox: introducing AI grows the workload

In February 2026, HBR published a study observing 200 employees at tech companies over 8 months. After AI tool introduction, the workload did not shrink — it grew.

The mechanism they called "workload creep" is simple. AI accelerates tasks → stakeholder expectations on speed rise → more tasks get taken on → workload and density increase. They also found cases where job boundaries collapsed — product managers writing code, designers doing data analysis — and adjacent functions' work was absorbed into existing roles.

ActivTrak's 2025 survey (10,000+ respondents) showed that after AI introduction, time spent on email increased by 104% and time on chat/messaging by 145%.

"Introduce AI and headcount goes down" is half right and half wrong. The mechanically verifiable work genuinely shrinks. But the most essential work — confirming design intent — actually tends to grow. The faster the AI ships complex implementation, the higher the cost of asking "is this implementation actually aligned with intent?"

Optimize the bottleneck, don't try to remove it

Another number from the Faros AI study: even with PRs increasing 98% ^{[Faros AI]}, merge approvals remained largely human-controlled. Stack Overflow's 2025 Developer Survey (49,000+ respondents) found that only 3.1% of developers "highly trust" the accuracy of AI tool output in their development workflow ^{[Stack Overflow (2025 Developer Survey)]} — among experienced developers, that figure drops to 2.5%. Most teams still gate AI-generated code with manual review.

Trust is not yet established, so approval gates remain human-fixed. This is a rational call. It is also a speed constraint.

"How do we remove the bottleneck?" is the wrong question. As long as the source of design intent is human, the bottleneck stays. The right question is "how do we optimize the number and quality of decisions that come back to humans?"

Three things, compounding:

1. Send the machine-verifiable to machines
2. Make escalation criteria explicit
3. Shift review to the planning stage

Stack those, and AI starts to actually lower human load — instead of merely reshuffling where the load lands.

The skill that AI use demands isn't "trust AI more." It is **deliberately designing what humans must decide, and how much of it**. AI is a tool. The design of how to use it stays human work.

References

- [The AI Productivity Paradox, Faros AI (2025)](https://www.faros.ai/ai-productivity-paradox): primary source for the 21% / 98% / 91% figures across 10,000+ developers / 1,255 teams
- [2025 Developer Survey: AI, Stack Overflow](https://survey.stackoverflow.co/2025/ai/): primary source for the 3.1% "highly trust" figure
- [AI Doesn't Reduce Work — It Intensifies It, HBR (Feb 2026)](https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies-it)
- [2026 State of the Workplace, ActivTrak](https://www.activtrak.com/blog/2026-state-of-the-workplace/)
- [How to Build Human-in-the-Loop Oversight for AI Agents, Galileo](https://galileo.ai/blog/human-in-the-loop-agent-oversight): referenced for HITL design framing, not for specific escalation rates

---

*Originally published in Japanese at [note.com/nomuraya](https://note.com/notes/nc446e0b6dd8e). Same author writing under "nomuraya / shimajima / 中翔" across media. The English version is adapted, not literally translated.*