Many small agents or one big one? Measuring AI orchestration on a real iOS feature

2026 · JUN 01

Fine-grained vs monolithic

The Agent Dashboard workflow drives Claude Code and Codex agents through a fixed pipeline — plan, implement, review, QA, verify, merge. For a long time its default shape was fine-grained: a planner breaks a feature into N tasks, each task gets its own focused agent in its own git worktree, reviewed independently before the next starts. It's the pattern everyone reaches for — small context per agent, narrow blast radius, a clean review at every step. It feels obviously correct.

On a real iOS feature, replayed nine times, fine-grained turned out to be ~5× slower and ~7× more token-expensive than just handing the whole feature to one agent, with no quality advantage that the pipeline's existing review gate didn't already provide on its own.

The cleanest single comparison, same model family on both sides so orchestration is the only variable:

	Fine-grained (12 task-agents)	Monolithic (1 agent)
task_execution wallclock	334 min	70 min
output tokens (impl phase)	1,275,885	174,540
code-review blockers (round 1)	0	4
final status	clean	4 blockers, all fixed cleanly

Fine-grained produced zero first-round blockers; monolithic produced four — but those four were caught by the review step that runs either way, fixed in one clean pass, and the whole thing still finished in a fifth of the wallclock. The four were integration-wiring gaps, not wrong code (detail in the appendix). The "free" per-task review fine-grained gives you is redundant with the explicit code_review step the pipeline already runs.
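
As a quick check on those multiples, here is a minimal Python sketch of the arithmetic. The figures are copied from the table above; the dictionary and function names are illustrative only.

```python
# Figures copied from the comparison table above.
fine_grained = {"wallclock_min": 334, "output_tokens": 1_275_885}
monolithic = {"wallclock_min": 70, "output_tokens": 174_540}


def ratio(metric: str) -> float:
    """How many times larger the fine-grained figure is than the monolithic one."""
    return fine_grained[metric] / monolithic[metric]


wallclock_ratio = ratio("wallclock_min")
token_ratio = ratio("output_tokens")

print(f"wallclock: {wallclock_ratio:.1f}x, output tokens: {token_ratio:.1f}x")
# prints "wallclock: 4.8x, output tokens: 7.3x"
```

Rounded for the headline, that is the "~5× slower, ~7× more token-expensive" claim.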

That's the orchestration axis. The other axis I varied was the model itself.

Sonnet vs Opus

Holding orchestration fixed at monolithic and looking only at the task_execution step — the cleanest model-vs-model cut, with no orchestration or pipeline-step confounds — Opus and Sonnet split along a throughput/cost line:

`task_execution`, monolithic	B-Sonnet	B-Opus 4.7
wallclock	85 min	69.6 min
output tokens	309,775	174,540
relative cost per token	1×	~5×

On the raw numbers it's a clear Opus win — ~18% faster and ~44% fewer output tokens for the same feature. On paper, you'd reach for Opus.

The catch is the last row. Opus bills roughly 5× more per token than Sonnet, and on these cache-heavy runs (cache reads, not output, dominate the bill) the dollar gap is wider than the token counts alone suggest: for the same feature, Opus runs about 10× the dollar cost of Sonnet (full breakdown in the appendix). So the faster-and-fewer-tokens picture inverts in dollars: the feature Opus shipped quicker costs multiples more to ship. The honest framing isn't "Opus is better"; it's "Opus buys throughput at a real per-token premium," and which model wins depends entirely on whether your binding constraint is wallclock or spend.

Enter Opus 4.8

Opus 4.8 landed just as I was wrapping up the matrix, so I re-ran the monolithic cell on it — same PRD, same Opus reviewer, same everything else. The result was the cleanest kind of upgrade: more reliable at no cost or time penalty.

monolithic, same config	Opus 4.7	Opus 4.8
task_execution wallclock	~70 min	~72 min
first-pass code-review blockers	4	1

First-pass blockers dropped from four to one at essentially identical wallclock — the quality lift was "free" on time, and (same model tier) free on cost too. It's one sample, but a clean enough signal that it changed my default: opus now points at 4.8.

What about Codex

I also ran the monolithic PRD through Codex (GPT-5.5) with a Claude Opus reviewer. The headline is that reasoning effort is load-bearing, not optional — and that, set correctly, Codex was the fastest implementer in the matrix. On the implementation step (task_execution) against the all-Opus run:

monolithic	task_execution	round-1 blockers
Opus 4.7	~70 min	4
Codex @ `effort=medium`	41.5 min	8 (destructive)
Codex @ `effort=high`	~55 min	0

At medium, Codex was faster but far worse — 8 destructive blockers, because it deleted load-bearing code (~3,000 lines of existing tests and several required behaviors) and swapped List for ScrollView, breaking native swipe actions. Bump the one config value to high and the destructive edits vanish: ~55 min and zero first-round blockers — beating Opus on both wallclock and quality, for ~32% more wallclock and ~50% more tokens than Codex's own medium run.

Two takeaways. The medium run wasn't slightly worse, it was categorically dangerous, and the gap was a single config value — so Codex is an effort=high-only tool here. And the cross-model reviewer (Claude Opus reviewing Codex) caught all of it, exactly the blind-spot coverage an implementer/reviewer model split is supposed to provide.

Why I ran this at all: iOS broke a quiet assumption

The Agent Dashboard workflow was designed for a SvelteKit web app: tests ran in milliseconds, builds in seconds, mockable boundaries kept the integration surface small. There, fine-grained's per-task overhead is trivially worth paying — the work dominates the coordination, and the test cycle each agent waits on is basically free. We never questioned whether splitting a feature into a dozen or so task-agents — however many the planner emitted — was the right shape, because the alternative, one agent doing everything, wasn't visibly faster on fast-test code.

Then I pointed the same workflow at fazon, my SwiftUI calorie/macro logger. iOS quietly invalidates four assumptions that make fine-grained cheap:

The simulator is serial. On web/backend, N task-agents run concurrently in N worktrees against N isolated test environments. iOS has no such thing — xcodebuild queues every test invocation onto the same simulator UDID, so two task-agents both running xcodebuild test just wait in line. The parallelism that's supposed to justify fine-grained's overhead is theoretical, not realized.
XCUITest cycles are minutes, not milliseconds. A single gesture-based UI test takes 10–30 seconds; a focused sub-suite is 5–8 minutes; the full suite 20–30. Per-task agents spend most of their wallclock waiting on a runner that only does useful work in bursts.
Compilation is slow and the cache is fragile. Swift compile, code-signing, simulator boot, DerivedData rebuilds — each seconds-to-minutes — and per-task worktrees don't share DerivedData, so switching between them forces incremental rebuilds. The "fresh environment per task" hygiene that's a feature on web is a cost on iOS.
SwiftUI's integration surface is unusually tight. A typical feature touches a view model, a view, a list container, a list row, a context menu, an action sheet, an undo toast, accessibility plumbing, and the parent navigation stack — all of which have to agree about state. Per-task scope keeps an agent focused on one piece, but the cross-cutting wiring that makes the feature actually work is exactly what per-task scope fragments.
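
The first point, the serial simulator, can be illustrated with a toy queueing model. This is a sketch under stated assumptions, not a measurement: every agent alternates a fixed edit phase with a fixed test phase, the shared simulator serves test requests first-come-first-served, and the numbers in the usage lines are hypothetical.

```python
import heapq


def simulate(agents: int, cycles: int, edit_min: float, test_min: float,
             shared_simulator: bool) -> float:
    """Return wallclock minutes until every agent has finished all its cycles."""
    # Heap entries: (time the agent is ready to test, agent id, cycles left).
    ready = [(edit_min, agent, cycles) for agent in range(agents)]
    heapq.heapify(ready)
    simulator_free_at = 0.0
    finish = 0.0
    while ready:
        ready_at, agent, left = heapq.heappop(ready)
        if shared_simulator:
            # One simulator: the test run waits until the previous one is done.
            start = max(ready_at, simulator_free_at)
            simulator_free_at = start + test_min
        else:
            # Isolated test environments: the test run starts immediately.
            start = ready_at
        done = start + test_min
        finish = max(finish, done)
        if left > 1:
            heapq.heappush(ready, (done + edit_min, agent, left - 1))
    return finish


# Hypothetical inputs: 4 agents, 3 edit/test cycles each, 5 min edit, 6 min test.
isolated = simulate(4, 3, 5, 6, shared_simulator=False)
shared = simulate(4, 3, 5, 6, shared_simulator=True)
print(isolated, shared)  # prints "33 77"
```

With isolated environments the four agents finish together after 33 minutes; queued on one simulator they take 77, because the simulator never goes idle once the first test starts. Under these assumptions, adding agents adds queue length rather than throughput.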

The visible symptom that kicked this off: fazon's first big PRDs burned 7.5 hours of task_execution for what is structurally a few thousand lines of view-model and gesture code.

The hypothesis that fell out: fine-grained orchestration pays a fixed per-task overhead that, on iOS, doesn't shrink as tasks get smaller. The dominant time-sink, test cycles, is paid once per agent, whether that agent builds the whole feature or a twelfth of it. So fine-grained multiplies that fixed cost by however many tasks the planner emits, without dividing the variable cost. Monolithic (one agent, full PRD scope, paying the test cost once) should be both faster and about as reliable, because the pipeline's review gates catch integration gaps regardless of how many agents produced them. The runs at the top of this post are what came back when I tested it.
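
Read literally, the hypothesis is a two-parameter linear model: wallclock = tasks × fixed per-task cost + variable work. The sketch below plugs in the two all-Opus runs (12 tasks in 334 min, 1 task in 70 min) to show what the model would imply. Two points fit any straight line exactly, so this illustrates the shape of the argument rather than testing it; the function names are illustrative.

```python
def wallclock(tasks: int, fixed_per_task: float, variable_work: float) -> float:
    """Linear cost model: every task pays the fixed cost, the work is paid once."""
    return tasks * fixed_per_task + variable_work


def fit_two_points(n1: int, t1: float, n2: int, t2: float) -> tuple[float, float]:
    """Solve t = n * fixed + variable for two (tasks, minutes) observations."""
    fixed = (t1 - t2) / (n1 - n2)
    variable = t2 - n2 * fixed
    return fixed, variable


# A-Opus: 12 tasks, 334 min. B-Opus: 1 task, 70 min.
fixed, variable = fit_two_points(12, 334, 1, 70)
print(f"fixed per task: {fixed:.0f} min, variable work: {variable:.0f} min")
# prints "fixed per task: 24 min, variable work: 46 min"
```

Under this model, roughly 24 of every task's minutes are overhead that recurs per agent, and only about 46 minutes is work that exists regardless of how the feature is split.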

Going forward

The experiment changed how I work day to day, and the short version is simple: Opus 4.8, monolithic, is now the default for everything. Fine-grained is reserved for the rare PRD that genuinely overflows the context budget (roughly >3–4k lines of expected diff at 1M, or >800 at 200K); below that, one agent. xhigh stays on the Opus implementer, the implementer/reviewer model split stays on (it's what caught Codex's destructive edits), Codex is a monolithic, effort=high-only tool for cost-sensitive one-shots, and Sonnet-monolithic stays "hold" until a run cleanly clears its fix loop. There's an argument for switching to Codex, but I have a Max subscription for Claude and only a Pro for Codex, so I'm sticking with Claude for now. And because I'm on a subscription rather than paying per API call, I stay on Opus rather than Sonnet.
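
The context-budget rule can be written down as a small decision function. This is a sketch, not the workflow's actual configuration: the function and constant names are hypothetical, and 3,000 is simply the conservative end of the "roughly 3–4k" range quoted above.

```python
# Rough overflow thresholds in expected diff lines, keyed by context tier.
# Assumption: 3,000 is the low end of the "roughly >3-4k lines at 1M" rule of thumb.
OVERFLOW_THRESHOLD_LINES = {"1M": 3_000, "200K": 800}


def choose_orchestration(expected_diff_lines: int, context_tier: str) -> str:
    """Default to one agent; split only when the diff would overflow the context."""
    threshold = OVERFLOW_THRESHOLD_LINES[context_tier]
    return "fine-grained" if expected_diff_lines > threshold else "monolithic"


# The monolithic PRD-009 diff (+1,986 / -491, about 2,477 changed lines) at 1M:
print(choose_orchestration(2_477, "1M"))    # prints "monolithic"
print(choose_orchestration(2_477, "200K"))  # prints "fine-grained"
```

The thresholds are the part to re-measure on another stack; the shape of the rule (monolithic unless the context overflows) is what the experiment supports.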

One caveat I want to state plainly: this is a single case study, not a benchmark. One feature, one project, one platform, one machine, mostly one sample per cell — and several of the differences (plan sizes, exact token counts) sit well within LLM run-to-run noise. The conclusions are directional. The ones I'd actually bet on generalizing are the structural ones — per-task overhead is fixed and roughly language-agnostic, and throughput and dollar cost pull in opposite directions on model choice. The claim I would not port anywhere is "monolithic always wins": on a fast-test web codebase with a wide, loosely-coupled feature, the margin should shrink or flip. Don't copy these numbers to your stack. Copy the method, and measure your own fixed-per-task cost before assuming smaller tasks are safer.

Appendix: setup and results

The feature. fazon PRD-009: per-item edit/delete plus a multi-add picker on the food log. 22 acceptance criteria across view models, views, list rows, swipe/long-press gestures, VoiceOver actions, and English + Swedish localization — real product code, not a benchmark toy. Scale: monolithic landed +1,986 / −491 across 24 files; fine-grained +5,445 / −486 across 46 files for the same feature.

The matrix. Same PRD, same starting commit, four axes: orchestration (fine-grained vs monolithic), model (Sonnet 4.6, Opus 4.7, Opus 4.8, GPT-5.5 via Codex), context tier (200K vs 1M), reasoning effort (high vs xhigh, Opus-only). Held constant: PRD content, starting commit per batch, strictly sequential execution, the operator-intervention recipes and server-side fixes, and an Opus reviewer on every cell — so the blocker counts are comparable. Same Mac, repo, Xcode, and simulator UDID throughout.

Two coupling notes. Context tier is tied to orchestration on purpose: fine-grained tasks fit 200K, but a monolithic agent holding the whole PRD needs 1M, and holding everyone at 200K would have hobbled monolithic into a misleading failure. The "Sonnet baseline" was never pure Sonnet: the mobile-app preset puts Opus on every reasoning step (planning, review, AC verification), so the baseline was Opus-plan/review + Sonnet-do; the all-Opus run moved the doer to Opus too, which is the only genuinely model-constant orchestration pair. One cell never ran — monolithic Sonnet at 1M isn't in my subscription tier — so I dropped it rather than fake it at 200K.

All nine runs. "Blockers" = round-1 code_review findings unless noted. Most runs were deliberately cancelled once the comparison signal was captured, so "why stopped" isn't a crash log. (Plan size is planner-dependent: the same Opus planner drew 11, 12, 16, and 17 tasks from the same baseline across runs.)

Run	Orchestration	Model	Tasks	task_exec	r1 blockers	Notes / why stopped
A-Sonnet (baseline)	fine-grained	Sonnet impl + Opus plan/review	16	450 min	26 (at QA)	doom-loops + compaction; abandoned mid fix-loop
A-Opus	fine-grained	all-Opus	12	334 min	0	clean; cancelled mid-fix once data was in
A-Codex	fine-grained	Codex impl, Opus review	11	43 min (5/11)	—	hit Codex Plus-tier usage window before finishing
B-Sonnet	monolithic	Sonnet	1	85 min	2	abandoned mid fix-loop (compaction + scope-creep)
B-Opus	monolithic	all-Opus, xhigh, 1M	1	70 min	4	integration-wiring gaps, all fixed cleanly; stopped at QA r2
B-Codex (medium)	monolithic	Codex impl, Opus review	1	41.5 min	8 (destructive)	deleted load-bearing code + ~3k lines of tests; terminated post-r1
B-Codex (high)	monolithic	Codex impl, Opus review	1	~55 min	0	clean — effort fix worked; cancelled mid-fix
B-Opus 4.8	monolithic	Opus 4.8 impl + 4.7 review, xhigh, 1M	1	~71.8 min	1	one wiring gap; cancelled at fix boundary
B-Sonnet 4.6 (re-run)	monolithic	Sonnet impl + Opus 4.7 review, 200K	1	64.4 min	0	clean; cut short before fix loop

Core orchestration finding: with the same model on both sides, A-Opus took 334 min and B-Opus 70 min, so monolithic is 4.8× faster and used 174,540 output tokens against 1,275,885, a 7.3× difference. A-Opus's per-task duration was nearly identical to the Sonnet baseline's (27.8 vs 28.2 min/task), so the win isn't faster individual tasks; it's that you stop paying the per-task fixed cost a dozen times.

The four monolithic-Opus blockers (the "is monolithic just sloppier?" question): all four were integration seams, not wrong code — an edit-mode CTA never wired to its commit path, .swipeActions placed outside its List (so it never fired), row sentinels nested in a Button that broke hit-testing, and a hard-delete timer present on one path but missing on another. Each piece was correct in isolation; they're the gaps a single agent leaves while holding 22 ACs in working memory at once — and exactly what the code_review gate exists to catch.

Cost — a correction. The original writeups under-counted dollars by ~30–60× by ignoring cache reads. On long agent runs the dominant billable class is cache reads (I measured ~99% cache-hit ratios), not output. Pricing the token classes separately at standard rates (Opus cache reads $1.50/M, Sonnet $0.30/M; base input $15/M and $3/M):

Step	B-Opus 4.8	B-Sonnet 4.6
task_execution cache reads	141.5M	80.2M
task_execution output	538,829	241,562
billed @ std rates	$280.14	$30.95

So a 70-minute monolithic Opus run is ~$200–300 at standard API rates (~$320 for Opus 4.8 impl+review, ~$31 for Sonnet), not the "about $8" the naive output-only math suggested. The ratios are what travel: Opus ~5× per token, monolithic ~7× cheaper than fine-grained.
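
To see how much of each bill is cache reads, the sketch below prices only that token class, using the cache-read rates and token counts quoted above, and compares it with the billed totals from the table. The remainder of each bill (output, uncached input, cache writes) is not broken out here, and the variable names are illustrative.

```python
# Cache-read rates in USD per million tokens, as quoted above.
CACHE_READ_RATE = {"opus": 1.50, "sonnet": 0.30}

# Token counts (millions) and billed totals copied from the cost table.
runs = {
    "B-Opus 4.8": {"model": "opus", "cache_read_mtok": 141.5, "billed_usd": 280.14},
    "B-Sonnet 4.6": {"model": "sonnet", "cache_read_mtok": 80.2, "billed_usd": 30.95},
}


def cache_read_cost(run: dict) -> float:
    """Dollar cost of the cache-read tokens alone."""
    return run["cache_read_mtok"] * CACHE_READ_RATE[run["model"]]


for name, run in runs.items():
    cost = cache_read_cost(run)
    share = cost / run["billed_usd"]
    print(f"{name}: cache reads ${cost:.2f} ({share:.0%} of ${run['billed_usd']:.2f})")
```

That works out to $212.25 of the Opus bill and $24.06 of the Sonnet bill, roughly three quarters in both cases, which is why output-only math lands so far off.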

Cost is really two dimensions, kept separate: subscription rate-window quota (a rolling 5-hour window metered by model tier — ~6% for Sonnet, ~7% for Opus on impl + review here) and API dollars (the table above). They don't translate into each other; weight whichever is your binding constraint.