
    Claude Code and Codex Together: Driver/Worker Orchestration in Production


    The pattern that has held up across complex refactors, full WordPress migrations, and ground-up SaaS rebuilds is hierarchical: Claude Code (Opus 4.7) is the driver. Codex (GPT-5.5) is the worker. Claude Code plans, calls Codex to do the heavy execution, gets the results back, reasons over them, and decides what's next.

    The version stamps matter for an article like this. Opus 4.7 launched April 16, 2026. GPT-5.5 launched April 23, 2026. The framework we currently run on top of them — BEADS with Metaswarm v0.11.0 — landed mid-April. 

    The Quick Verdict

    | Workload | Where it lives | Why |
    |---|---|---|
    | Planning, architecture, ambiguous specs | Claude Code (driver) | Long-context coherence, self-verification sub-agents |
    | Long terminal runs, mechanical execution | Codex (worker) | Sustained 45+ minute runs, ~72% fewer output tokens |
    | Reasoning over returned work, integration, review | Claude Code (driver) | Review is folded into the driver's loop, not a separate step |
    | Single-tool work that fits in one context window | Either, alone | Driver/worker overhead doesn't earn its keep |

    Benchmark anchors: Lushbinary, April 2026, cross-checked against FwdSlash.

    What Each Is Specifically Better At (April 2026)


    Where Claude Code (Opus 4.7) Wins

    Practitioners running both consistently describe Claude Code as the tool for the thinking work: the ambiguous problem, the large codebase, the architecture decision that will outlast the session. Chandler Nguyen’s follow-up post in late April put it plainly after weeks of running both: “Codex took the coding seat and Claude Code took everything else.” The “everything else” covers planning, comprehension, reviewing what came back from the worker, deciding when something is actually done.

    The benchmarks line up with that read. Opus 4.7 leads on SWE-bench Pro at 64.3%, SWE-bench Verified at roughly 87.6%, CursorBench at 70%, and GPQA Diamond at 94.2%. Two operational features show up in daily use beyond what those numbers capture: CLAUDE.md persistent project context (so the agent reloads architecture decisions across sessions), and the harness's habit of spawning verification sub-agents without being asked, which Chandler called the killer feature. On long sessions, especially past 90 minutes of continuous work on the same problem, it holds the thread better than the alternatives we've tested.

    Claude Code’s token consumption is roughly 3-4x higher than Codex CLI on equivalent tasks. The harness is doing more (context preloading, sub-agent spawning, automatic verification passes) and you pay for that in tokens. For deep work, the cost is justified. For high-volume mechanical transformations, it isn’t. That gap is most of why the driver/worker split makes sense.

    Where Codex (GPT-5.5) Wins

    Among practitioners running both, Codex is where the long execution lives. It runs hard for stretches Claude Code wouldn’t sustain. Chandler’s experience report describes Codex working 45+ minutes continuously without losing the thread. The cloud-container architecture lets you fire-and-disconnect: hand off a task, close the laptop, come back when it’s done. That sustained-run profile is the operational reason it works as a worker. The driver doesn’t have to babysit it.

    GPT-5.5 leads on Terminal-Bench 2.0 at 82.7%, OSWorld-Verified (computer use) at 78.7%, GDPval at 84.9%, and Tau2-bench Telecom at 98.0%. OpenAI says 85%+ of the company uses Codex weekly across engineering, finance, comms, marketing, data science, and product. They run it because it executes.

    Token efficiency is where the gap compounds at scale. GPT-5.5 uses roughly 72% fewer output tokens than Opus 4.7 on equivalent coding tasks. When the worker is doing the bulk of the volume (terminal runs, mechanical transformations, parallelizable sub-tasks) that efficiency is what makes the dual-tool monthly bill defensible.


    The Harness Effect (Why This Comparison Is Mostly About the Harness, Not the Model)

    Matt Mayer ran the same model through two different harnesses on identical tasks: Claude Opus scored 77% in Claude Code and 93% in Cursor. Same model, same tasks, sixteen percentage points from the harness alone.

    CORE-Bench reproduced the pattern more dramatically. Claude Opus scored 42% with a minimal scaffold and 78% inside Claude Code’s full harness. Thirty-six points of capability appeared from the wrapper, not the weights. Nate’s Newsletter reported the same gap in independent testing: a 36-point spread on identical tasks driven entirely by harness differences.

    The harness has four components, per Jonathan Fulton’s architectural breakdown: a loop that decides when to call the model again, a context manager that handles compaction and memory, a tool registry with descriptions and schemas, and an approval system that intercepts tool calls. Codex and Claude Code converge on similar architectures here. The differences that drive the harness effect are subtler: how aggressively each one summarizes context, how many parallel sub-agents it manages, what the default tool descriptions look like, how the system prompt is structured.
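    To make those four components concrete, here is a minimal harness skeleton in Python. Everything about the model client (`complete()`, `reply.tool_calls`, `call.arguments`) is a hypothetical stand-in, not either tool's actual internals; treat it as a sketch of the shape, not an implementation.

    ```python
    # Minimal sketch of the four harness components. The model client interface
    # is an invented stand-in, not Claude Code's or Codex's real internals.
    import json

    class Harness:
        def __init__(self, model, tools, approve):
            self.model = model        # LLM client (assumed interface)
            self.tools = tools        # tool registry: name -> (schema, callable)
            self.approve = approve    # approval system: callable that gates tool calls
            self.history = []         # conversation state the context manager compacts

        def compact(self):
            """Context manager: summarize older turns when the window fills."""
            if len(self.history) > 50:
                summary = self.model.complete(
                    [{"role": "user",
                      "content": "Summarize this transcript:\n" + json.dumps(self.history[:-10])}]
                ).text
                self.history = [{"role": "system", "content": summary}] + self.history[-10:]

        def run(self, task):
            """The loop: keep calling the model until it stops requesting tools."""
            self.history.append({"role": "user", "content": task})
            while True:
                self.compact()
                reply = self.model.complete(
                    self.history, tools=[schema for schema, _ in self.tools.values()]
                )
                self.history.append({"role": "assistant", "content": reply.text})
                if not reply.tool_calls:          # no tool calls left: the task is done
                    return reply.text
                for call in reply.tool_calls:
                    if self.approve(call):        # approval system intercepts every call
                        _, fn = self.tools[call.name]
                        result = fn(**call.arguments)   # arguments assumed to be a dict
                    else:
                        result = "tool call denied by approval policy"
                    self.history.append({"role": "tool", "content": str(result)})
    ```

    Notice where the harness effect hides: two harnesses with this same skeleton can score sixteen points apart purely on how `compact()` summarizes, what the tool descriptions say, and how the system prompt is framed.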


    If 16-36 percentage points of capability come from the wrapper rather than the weights, then nesting harnesses (putting one inside another in a driver/worker topology) is a way of stacking those gains, not averaging them. The driver gets the planning and integration capability of one wrapper. The worker gets the terminal autonomy and token efficiency of another. The combined system is bigger than either side, and the cross-harness review that emerges from the topology is what catches the bugs neither single harness sees.

    How We Run Them Together: Driver/Worker Orchestration

    The pattern is hierarchical, not parallel. Driver/Worker Orchestration: Claude Code drives. Codex executes when the driver delegates. Results return up to the driver. Essentially the same topology circulates under other names, including the Planner-Driver Pattern and the Orchestrator/Worker Harness.

    | Layer | What happens | Why this side |
    |---|---|---|
    | Driver keeps (Claude Code) | Planning, codebase comprehension, architecture decisions, deciding what to delegate, deciding when the task is done | The driver's job is to hold the whole picture. Long-context coherence and the self-verification sub-agents make it the right tool for the work that has to remember why earlier decisions were made. |
    | Driver delegates to worker (Codex) | Long terminal runs, mechanical transformations, parallelizable sub-tasks, anything where 45+ minute uninterrupted execution and lower per-token cost are the right shape | The worker doesn't need to hold the whole picture. It needs a scoped task, the ability to run hard for an hour, and the discipline to report back cleanly. Codex's terminal autonomy and token efficiency fit that shape. |
    | Worker returns to driver | Codex reports results, diffs, test outcomes, and any unresolved questions back up. Claude Code reads the returned work in its own context, reasons over it, integrates it, decides next steps | Review is implicit in the topology rather than a separate "cross-model review pipeline step." The driver always re-reads the worker's output before merging it into the plan; cross-harness coverage is a side-effect, not a manual step bolted on at the end. |

    [Diagram: Claude Code as driver delegates to Codex as worker, which returns results in a continuous loop]

    The driver’s loop never closes. Claude Code spawns Codex, waits for it to finish, then re-engages with the returned work. The next task usually emerges from reasoning over what came back, not from a pre-planned queue. That’s why the topology compounds. Each worker run sharpens the driver’s plan; each driver decision changes the next thing the worker gets asked to do.
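    In sketch form, the open loop looks like this. The names (`plan_next`, `integrate`, `delegate`) are illustrative, not a real framework API; the `delegate` helper is sketched a few paragraphs below.

    ```python
    # The driver's open loop. There is no pre-planned queue: each iteration's
    # task comes from reasoning over what the previous worker run returned.
    while True:
        task = driver.plan_next()               # reason over everything returned so far
        if task is None:                        # the driver, not the worker, decides "done"
            break
        packet = delegate(task, cwd=repo_root)  # spawn the worker, wait, collect results
        driver.integrate(packet)                # re-read diffs and tests; update the plan
    ```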

    Shared context, separate context files. Claude Code reads CLAUDE.md at the project root; Codex reads from ~/.codex/skills/. Both have to know the same conventions or the worker’s output won’t fit cleanly back into the driver’s plan. Chandler’s cross-pollination workflow is the practical answer: have Codex study your existing Claude Code skills and produce equivalents under ~/.codex/skills. Same conventions, two file formats. The Skills standard is converging across both tools, but as of April 2026 you’re still translating between formats.

    The cleanest version of this runs Codex from inside the Claude Code session, through an orchestration framework that handles the spawn, wait, and return. The worker doesn’t see the user; it sees the driver. The user sees only the driver. That’s what makes the loop close: Claude Code is the only thing the engineer interacts with directly.

    The worker reports structured results: diffs, test results, log excerpts, unanswered questions. The driver reasons better when the worker’s return packet is shaped for reasoning rather than just for human review. This is mostly a matter of how the framework prompts the worker. Most orchestration frameworks now support structured return packets out of the box.
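    As a concrete shape, a minimal spawn-wait-return helper might look like the sketch below. It assumes the Codex CLI exposes a non-interactive `codex exec` invocation and that the worker runs in a git checkout; verify both against your installed versions, and expect a real framework to do far more bookkeeping.

    ```python
    # Minimal spawn/wait/return sketch: the driver shells out to the Codex CLI
    # and collects a return packet shaped for reasoning, not just human review.
    import subprocess

    def delegate(task: str, cwd: str) -> dict:
        """Spawn the worker, wait for it, and return a structured packet."""
        proc = subprocess.run(
            ["codex", "exec", task],              # assumed non-interactive invocation
            cwd=cwd, capture_output=True, text=True, timeout=3600,
        )
        diff = subprocess.run(
            ["git", "diff"], cwd=cwd, capture_output=True, text=True
        )
        return {
            "task": task,
            "exit_code": proc.returncode,         # did the run complete cleanly?
            "diff": diff.stdout,                  # what actually changed
            "log_tail": proc.stdout[-4000:],      # excerpt, not the full transcript
            "open_questions": [],                 # worker is prompted to populate these
        }
    ```

    The `open_questions` field is the part worth copying: a worker prompted to surface what it couldn't resolve gives the driver something to reason over rather than something to merely approve.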

    The Orchestration Framework Layer (BEADS+Metaswarm and the 2026 Ecosystem)

    The driver/worker topology runs through a framework: the substrate that handles spawn, context handoff, structured return, and session bookkeeping so the driver can pick up where the worker left off. As of April 2026 we run on BEADS with Metaswarm v0.11.0. Metaswarm provides the multi-agent orchestration layer; BEADS handles persistent issue tracking, context priming, and semantic summarization across sessions, exposed as a Claude Code plugin. It’s what we use today. It’s not what we’ll necessarily use next month.

    Framework choice is fluid in a way that didn’t exist before agentic coding. Switching between Metaswarm and an alternative is a per-project decision now, not a per-company one. You can scaffold one system, test a different framework on the next sprint, and migrate gradually if the new one earns it. The pattern (Driver/Worker Orchestration) is what holds across framework swaps.

    The wider 2026 ecosystem at the harness/framework layer:

    • BEADS + Metaswarm: our current stack. Metaswarm’s session hooks defer to the standalone BEADS plugin for context priming and decision tracking, which means the driver can survive context compaction without losing the thread.
    • Archon: described in April 2026 research as the first open-source harness builder for orchestrating Claude Code and Codex together. Worth a look if you want to build your own multi-tool flow rather than wire up shell scripts.
    • Citadel: agent orchestration harness for Claude Code and Codex with parallel agents in isolated worktrees, four-tier intent routing, and persistent campaign memory across sessions. The closest in scope to BEADS + Metaswarm if you want a different shape on the same problem.
    • HumanInLoop: open-source strategy harness on top of Claude Code — DAG-based multi-agent coordination with cascade safety, focused on telling each agent what to build and why before delegation. Different angle on the orchestration question.
    • awesome-harness-engineering: the canonical GitHub corpus on harness patterns. First read for anyone trying to understand what’s actually being built at this layer.


    The Codex CLI repo sits at 67k GitHub stars; Claude Code at 114k. The community of practice around both is active enough that the driver/worker topology is being independently rediscovered week by week. Most teams who run both for more than a month end up at some version of it.

    Where We’ve Run This (Three Production Categories)

    The pattern doesn’t pay for itself on small tasks. Three workload shapes earn it.

    Complex code refactoring. Multi-file refactors across a large codebase, where the architecture decision drives a series of mechanical transformations downstream. The driver holds the architecture and the invariants the refactor has to preserve. The worker does the long mechanical pass, file by file, returning diffs and test results. The driver re-reads each return, catches the cases where the mechanical transformation broke an architectural assumption, and either fixes them in-place or sends the worker back with a tightened spec.
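    What the driver hands down matters as much as what comes back. A delegation spec for one mechanical pass might look like this; the field names are ours, for illustration, not any framework's schema:

    ```python
    # Illustrative delegation spec for one mechanical refactor pass.
    spec = {
        "task": "Rename UserRepo.fetch() to UserRepo.get() across src/",
        "invariants": [                       # architecture the worker must not break
            "public API of the http layer is unchanged",
            "no call site switches from lazy to eager loading",
        ],
        "verify": "pytest tests/ -q",         # the worker runs this before returning
        "report": ["diff", "test_output", "files_skipped_and_why"],
    }
    ```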

    WordPress site and server migrations. Building or migrating an entire WordPress site, including the underlying server. The work is a mix of architectural decisions (theme structure, plugin selection, server topology) and long mechanical execution (block migration, content import, server provisioning, deployment scripts). The driver/worker split fits naturally: Claude Code reasons about the architecture and the migration order, Codex executes the long terminal sessions and reports back. Some of these runs go for hours.

    Ground-up SaaS rebuilds. Re-platforming an existing SaaS system with upgraded security, statefulness, and reliability. The driver holds the new architecture, the security model, the state-handling decisions. The worker rebuilds modules, runs migration scripts, executes the long test passes that catch regressions. The combined session has been the highest-leverage version of the pattern we run.


    The economics across these three categories: teams running this report roughly 80% higher result quality versus single-tool runs of comparable shape, with substantially more code shipped per session and a lower per-task cost (because the worker is doing the volume on the more token-efficient model). Wall-clock time per session is slightly longer than a single-tool run (the driver/worker handoffs add a few minutes each cycle), but you do other work while the worker runs, so wall-clock isn't the right unit. The longest single combined run we've executed start-to-finish was just under four hours. None of those numbers are A/B-clean; they're what we see in practice across these three workload shapes.

    The same pattern runs on our content side. Our multi-agent content pipeline uses the same driver/worker structure (a planning agent delegates execution to specialized workers and integrates the returned work) at a monthly cost equivalent to roughly 3 hours of a mid-level engineer. Different domain, same topology, one level of abstraction up.

    What This Costs (At Team Scale)

    | Scale | Monthly tooling spend | Reference point |
    |---|---|---|
    | Solo developer (driver + worker) | $120-$400 | Claude Max $100-$200 + ChatGPT Plus $20 or Pro $200 |
    | 4-engineer team | $480-$1,200 | 4× Claude Max + shared/individual ChatGPT seats |
    | Our internal pipeline (10+ agents) | ~$450-$600 | Cost equivalent to roughly 3 hours of a mid-level engineer per month |

    A mid-level engineer fully loaded runs $150K-$200K/year, roughly $12.5K-$16.7K/month. The 4-engineer dual-tool stack pays for itself with single-digit hours of replaced work per engineer per month. The only published case study at large-company scale we've seen is Anthropic's own Rust C-compiler internal study: roughly 2,000 sessions, ~$20K total cost, on a 100K-line codebase. That's vendor-published economics on a single-tool engagement, useful as a reference shape for what large-scale agentic work costs.
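    The break-even arithmetic is short enough to run inline, using the article's own figures (the 160 work-hours/month divisor is our assumption):

    ```python
    # Break-even sketch for the 4-engineer dual-tool stack, article's numbers.
    monthly_stack = 1_200                        # top of the 4-engineer range, $/month
    loaded_annual = (150_000 + 200_000) / 2      # midpoint of the fully loaded range
    hourly = loaded_annual / (12 * 160)          # ~$91/hr, assuming 160 work-hours/month
    break_even_hours = monthly_stack / hourly    # ~13 hours/month across the whole team
    per_engineer = break_even_hours / 4          # ~3.3 hours of replaced work/engineer
    ```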

    The driver/worker version of the bill comes out lower than running everything on Claude Code, because the worker is doing the volume on the more token-efficient model.

    A 90-Day Team Adoption Playbook

    The driver/worker pattern is teachable, but it doesn’t install itself. Teams that adopt it cleanly tend to follow some version of this rollout.

    Weeks 1-2: Get one engineer fluent on the driver

    Pick the driver first. Claude Code is the safer default driver for most teams: planning, comprehension, and review are the driver's work, and that's where Claude Code currently leads. Get one engineer fluent before involving anyone else. Set up CLAUDE.md for your codebase. Don't add the worker yet. The point of this phase is for the engineer to internalize what work the driver actually does and what work it should hand off.
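    A starting CLAUDE.md can be a single page. The shape below is illustrative; the project specifics are invented and the section names are ours, not a required schema:

    ```markdown
    # Project context for Claude Code

    ## Architecture
    - Monorepo: `api/` (FastAPI), `web/` (Next.js), `infra/` (Terraform).
    - Cross-service calls go through `api/client/`; never import a service directly.

    ## Conventions
    - Tests live next to source as `*_test.py`; run `make test` before calling work done.
    - Migrations are append-only; never edit an applied migration.

    ## Decisions (and why)
    - 2026-03: sessions moved to Redis because Postgres row locks were the bottleneck.
    ```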

    Weeks 3-4: Add the worker inside the driver’s harness

    Same engineer now adds Codex as the worker. Pick a framework (BEADS+Metaswarm, Archon, or roll your own) that handles the spawn-and-return mechanics. The single calibration question this phase answers: what work should the driver delegate, and what should it keep? The answer is codebase-specific. By end of week 4, the engineer should have a one-page allocation document that captures it. Run cross-harness review on every non-trivial PR by virtue of the topology, not as a separate step.

    Weeks 5-8: Roll out to the team

    Other engineers adopt the driver first, then add the worker. Publish your CLAUDE.md, your ~/.codex/skills, and your framework configuration in the repo so the team inherits the same context. Hold a weekly 30-minute review: what did the driver/worker flow catch that single-tool would have missed? What did the framework get in the way of? Adjust the framework config rather than the topology. The topology is the whole point.

    Weeks 9-12: Measure and decide on the framework

    Three numbers to track. Token cost split between the two harnesses (worker should be doing meaningfully more of the volume; if it isn’t, the driver is over-keeping). Pull requests per engineer per week (delta from before adoption). Regression catch rate (driver re-reads of worker output should catch things that single-tool runs would have shipped). At the 12-week mark, the decision is usually about the framework, not the topology: keep BEADS+Metaswarm, swap to Archon, or move to whatever has appeared in the months since this article was written. The topology survives the framework swap.
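    None of these require tooling beyond what the harnesses already log. A rough weekly roll-up, assuming you can export per-harness token counts (the input shape here is invented for illustration):

    ```python
    # Weekly roll-up of the three adoption metrics. Token counts come from each
    # harness's usage export; the dict shape is invented for illustration.
    week = {
        "driver_tokens": 2_100_000, "worker_tokens": 6_800_000,
        "prs_merged": 19, "caught_at_driver_review": 4,
    }

    worker_share = week["worker_tokens"] / (week["driver_tokens"] + week["worker_tokens"])
    print(f"worker token share: {worker_share:.0%}")      # well under 50% => driver over-keeping
    print(f"PRs merged this week: {week['prs_merged']}")  # compare to pre-adoption baseline
    print(f"caught at driver review: {week['caught_at_driver_review']}")
    ```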

    Common Pitfalls

    • Treating the worker as a peer. The point isn’t redundancy or parallel allocation. The worker doesn’t see the user, doesn’t hold the architecture, doesn’t decide when something is done. Treating it as a peer collapses the pattern back into the parallel version that doesn’t compound.
    • Skipping the result-integration step in the driver. The whole topology depends on the driver re-reading the worker’s output before integrating it. If you let the worker’s diffs auto-merge, you’ve removed most of the value.
    • Over-anchoring on the framework. Framework switching is cheap now. Pick one, run with it, swap it when something better lands. Don’t build the team’s entire workflow around any specific framework’s idiosyncrasies.
    • Ignoring token-cost monitoring. Both harnesses can spike unexpectedly. Set thresholds and alerts; the cost-control pattern is detailed in the cost circuit breaker post.

    When You Should Not Use Both

    The driver/worker pattern earns its overhead on a specific shape of work. Outside that shape, single-tool is the right answer.

    If your work fits in one context window or sits cleanly in one category, pick the matching tool and go deep. Driver/worker pays off when the work is large enough that the driver has something to hand off; on small focused tasks or uniform workloads, the handoff overhead exceeds the gain. If your work is 100% terminal-heavy ops, Codex alone is fine. If it’s 100% deep architectural reasoning over a small codebase you can hold in your head, Claude Code alone is fine.

    Teams without operational discipline for the handoff topology should skip the second tool until they have it. Running two harnesses without the driver re-reading worker output is just running two harnesses; you get the cost of both with the catch rate of one. The structural discipline matters more than the tool count.

    If your team is on one tool and shipping fine, the upgrade priority is probably not adding the second tool. It’s getting better at the one you have. The harness-effect data above (16-36 percentage points hidden in better harness configuration) suggests most teams have meaningful headroom on their current tool before they need a second.


    Where Fountain City Fits

    We run Driver/Worker Orchestration in our own pipeline and on client engagements. We teach it through agentic coding training for development teams and agencies. When teams want the orchestration built and operated for them rather than learning to run it themselves, that's the work behind managed autonomous AI agents (also see our agentic development service for build-only engagements). The same driver/worker logic carries into other agent applications; agentic SEO is a different-domain example.

    If you want to run this yourself, you have what you need. If you want help, that’s the conversation we have.

    Frequently Asked Questions

    Is GPT-5.5 better than Claude Opus 4.7 for coding?

    Neither is uniformly better. Opus 4.7 leads on SWE-bench Pro (64.3% vs 58.6%) and on architecture-heavy benchmarks (CursorBench 70%, GPQA Diamond 94.2%). GPT-5.5 leads on Terminal-Bench 2.0 (82.7%), OSWorld-Verified (78.7%), and Tau2-bench Telecom (98.0%), and uses ~72% fewer output tokens on equivalent tasks. The right answer depends on what shape of work dominates your team. For mixed workloads, the answer is to use both, with Claude Code as the driver and Codex as the worker, per the topology described above.

    Should I use Codex or Claude Code if I can only afford one?

    If your work is heavily terminal-based, ops-heavy, or token-cost-sensitive, pick Codex. If your work is architecture-heavy, involves long multi-file refactors, or requires sustained reasoning over ambiguous specs, pick Claude Code. Solo developers with mixed workloads typically default to Claude Code for the planning sophistication and add ChatGPT Plus ($20/mo) only when they hit a workload Claude Code is poor at, at which point they’re effectively running the driver/worker pattern at a small scale.

    Can I use Claude Code’s CLAUDE.md context with Codex?

    Not directly. Codex reads from ~/.codex/skills/. The practical workaround is the cross-pollination pattern: ask Codex to study your CLAUDE.md and your Claude Code plugins, then generate equivalent skills under ~/.codex/skills. The Skills standard is converging across both tools, so over time this is becoming more portable, but as of April 2026 you’re still translating between formats.

    What is the harness effect, and why does it matter for the driver/worker pattern?

    The harness effect is the capability gap between the same model running in two different harnesses. Matt Mayer’s research found Claude Opus scoring 77% in Claude Code and 93% in Cursor on identical tasks, with 16 percentage points coming purely from the harness. CORE-Bench found a 36-point gap in similar testing. The implication for the driver/worker pattern: nesting harnesses stacks the harness gains rather than averaging them. The driver gets one wrapper’s planning capability; the worker gets another’s terminal autonomy and token efficiency. That’s what makes the topology compound rather than dilute.

    Are there open-source alternatives to Claude Code and Codex?

    Yes. OpenCode is the most prominent: open-source with an apply_patch tool tuned for Codex-model performance. Archon is the open-source harness builder for orchestrating multiple coding agents. The Skills standard (Anthropic-originated, now multi-tool) makes cross-tool portability practical. The awesome-harness-engineering GitHub repo is the canonical inventory. We currently run BEADS+Metaswarm on top of Claude Code as the driver and Codex as the worker; the framework choice is fluid.

    How long does it take a team to adopt the driver/worker workflow?

    Roughly 90 days from cold start to measured rollout. Two weeks for the first engineer to get fluent on the driver alone. Two more weeks to add the worker and calibrate the delegation pattern for that codebase. Four weeks of team rollout. Four weeks of measurement before deciding whether to keep the framework or swap it. The full playbook is in the section above.

    Last updated: April 2026. Both Codex and Claude Code update frequently, and the framework layer (BEADS+Metaswarm, Archon, OpenCode, others) moves faster than either model. We’ll refresh this article as Opus 4.8 and GPT-5.6 land, and as the framework choice changes.