
Mixing Claude and Codex per agent

Until this week every agent in every workflow step ran on Claude Code. The split inside the model family was familiar: planning-heavy steps — architect, brand, copy spec, PRD review — ran on Opus 4.7; mechanical steps — frontend implementation against a locked design, qa test scaffolding, devops touch-ups — ran on Sonnet 4.7. That split has held since I started running execution rounds at all.

The pressure point was usage limits. As the suite grew, more PRDs landed in execution at once, and the Opus quota burned faster than I expected. The honest version is that I was hitting the cap mid-round more often than I want to admit, and the workaround of "wait an hour" wasn't going to scale to the publishing cadence the studio needs.

I added a kickoff-time option in Agent Dashboard's execution flow this week: when a PRD enters the execution phase, the dashboard prompts for two choices instead of one.

> agent-dashboard run --phase=execution --prd PRD-009
  ↳ implementation: claude | codex
  ↳ review:         claude | codex

The dropdown is per-pass, not per-project. Implementation can run on Codex while review runs on Claude, or the other way round, or both can run on the same model if usage allows. The dashboard records which model handled which pass so a stalled or surprising round is debuggable later.
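The per-pass record could be as small as a struct with two model fields. This is a sketch under my own naming (`PassRouting` and its fields are illustrative, not Agent Dashboard's actual schema):

```python
# Hypothetical shape of the per-pass routing record the dashboard keeps
# so a stalled or surprising round is debuggable later. Names are mine.
from dataclasses import dataclass, field
from datetime import datetime, timezone

MODELS = {"claude", "codex"}

@dataclass
class PassRouting:
    prd: str
    implementation: str  # which model wrote the code: "claude" | "codex"
    review: str          # which model reviewed the diff: "claude" | "codex"
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # Reject identifiers outside the known set at kickoff time,
        # not halfway through an execution round.
        for pass_name, model in (("implementation", self.implementation),
                                 ("review", self.review)):
            if model not in MODELS:
                raise ValueError(f"unknown model for {pass_name}: {model}")

# A cross-model round: Codex writes, Claude reviews.
routing = PassRouting(prd="PRD-009", implementation="codex", review="claude")
```

Keeping the two passes as independent fields rather than one project-level setting is what makes the mixed configurations expressible at all.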

Two things have come out of this so far.

The first is the obvious one: usage limits stop being the bottleneck. Codex carries its own quota, and the moment the dashboard sees Claude approaching the cap, I can route the next PRD's implementation pass through Codex and keep the queue moving.

The second is more interesting. When implementation runs on one model and review runs on the other, the review pass reads the diff with a different prior. Claude reviewing a Codex implementation flags slightly different things than Claude reviewing a Claude implementation, and vice versa. I can't tell yet whether this catches more bugs than same-model review; I've only got a handful of PRDs through this setup. But the shape of the comments already differs in a way that feels load-bearing, and I'm watching for whether that translates into measurable outcome differences.

The longer-run direction is more model options, not fewer. I want to wire local models in as a third lane — for the agents whose work is small and structured enough that a 7B-parameter model running on the studio's own machine is plausible. The qa agent's "scaffold the E2E tests for this PRD" job feels like the first candidate; it's mechanical, it's well-specified, and it's currently spending Claude tokens on what an open-weights model could do for the cost of electricity.

The architecture inside the dashboard already supports this — agents are spawned as separate processes against a model identifier, and the model identifier is just a string the project config sets. Adding a local-llama identifier alongside claude and codex is hours of work, not days. The blocker is finding a local model that holds the studio's voice on the agents that need to write text — pm, marketing, brand — and that's a research problem, not an engineering one.
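The "model identifier is just a string" design can be pictured as a registry mapping identifiers to launch commands. The command lists below are placeholders I made up, not the real invocations of the claude, codex, or any local-model CLI:

```python
# Sketch of a string-keyed model registry. Adding a third lane is one
# more dict entry, not a redesign. All command lists are placeholders.
MODEL_COMMANDS: dict[str, list[str]] = {
    "claude": ["claude-cli-placeholder"],
    "codex": ["codex-cli-placeholder"],
    # Hypothetical local lane for small, well-specified jobs like
    # the qa agent's E2E test scaffolding:
    "local-llama": ["local-llama-placeholder"],
}

def agent_command(model_id: str, prompt: str) -> list[str]:
    """Build the launch command for an agent process; the dashboard
    would hand this to subprocess to spawn the agent."""
    if model_id not in MODEL_COMMANDS:
        raise KeyError(f"unknown model identifier: {model_id}")
    return MODEL_COMMANDS[model_id] + [prompt]
```

The point of the sketch is the shape, not the commands: because the project config carries only a string, the dashboard never needs to know what is behind it.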

For now: per-PRD model choice, per-pass. Cross-model review as a probably-good-thing I'll evaluate when the sample size is bigger. Local models on the roadmap as the next lane.