The Model Got Good Enough to Run a Team of Itself

I posted two screenshots last week. One said "Claude Opus 4.8 is so smart." The other said "Ok. Anthropic team definitely cooked with it." Those were the whole captions. Here's what was on the screen, and why I reacted that way.

Both shots are the same thing happening: a fan-out. Not one model answering one question — a coordinator spinning up a row of independent agents, each given a slice of the problem, all running at once, then a final pass that reconciles what they found.

[Screenshot: a Claude Code session fanning out five read-only performance-review agents across an iOS codebase, each owning one performance surface.] Five agents, one performance surface each, all running at once. The caption I gave this on X came down to two words: so smart.

What I Was Actually Running

The first screenshot is a performance audit of an iOS app. The codebase is too big to hold in one pass and reason about carefully — about a hundred files. So instead of one agent skimming all of it, I split the app by performance surface: the main scroll-and-render hot path, the image cache and prefetch layer, the state-observation layer, the map and gallery, and the AI/onboarding views. Five agents, each read-only, each owning one surface, each producing a report in the same format. Then I synthesize.
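For anyone who wants the shape in code, here is a minimal sketch of that fan-out. It is not what Claude Code does internally. The `Surface` and `Report` types, the `review_surface` stub, and every file name are hypothetical stand-ins I made up for illustration; a real run would replace the stub with an actual read-only agent call.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    """One slice of the app that a single agent owns (hypothetical type)."""
    name: str
    files: tuple[str, ...]


@dataclass
class Report:
    """The shared report format every agent returns (hypothetical type)."""
    surface: str
    findings: list[str]


async def review_surface(surface: Surface) -> Report:
    """Stand-in for one read-only agent run; it only lists the files it was given."""
    await asyncio.sleep(0)  # placeholder for the real, slow agent call
    return Report(surface.name, [f"reviewed {path}" for path in surface.files])


async def fan_out(surfaces: list[Surface]) -> list[Report]:
    """Run one agent per surface concurrently; gather keeps the input order."""
    return await asyncio.gather(*(review_surface(s) for s in surfaces))


def synthesize(reports: list[Report]) -> dict[str, list[str]]:
    """Trivial merge step: index findings by surface for the final pass."""
    return {report.surface: report.findings for report in reports}


# The five surfaces from the audit; the file names are invented examples.
SURFACES = [
    Surface("scroll-and-render hot path", ("FeedView.swift",)),
    Surface("image cache and prefetch", ("ImageCache.swift", "Prefetcher.swift")),
    Surface("state observation", ("AppState.swift",)),
    Surface("map and gallery", ("MapView.swift", "GalleryView.swift")),
    Surface("AI and onboarding views", ("OnboardingView.swift",)),
]
```

Calling `synthesize(asyncio.run(fan_out(SURFACES)))` gives one entry per surface. The part that matters is that every agent returns the same `Report` shape, which is what makes the merge step mechanical instead of a reading exercise.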

The second screenshot is bigger: a security-and-performance audit across an app's whole API surface — the admin endpoints, the data delivery, the cloud functions proxy, the web client, the iOS networking. Eleven agents in the audit phase, then a verification phase where findings get attacked adversarially before anything is trusted, then a synthesis. The counter in the corner read fourteen of eighty-two agents when I grabbed the shot.
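The verify phase is the piece worth sketching. Below is a rough illustration, under heavy assumptions: the `Finding` type and the `has_evidence` check are hypothetical, and in the real workflow the verifier is another agent trying to disprove the claim, not a one-line predicate. The structure is the point: nothing reaches synthesis unless it survives every check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """One claim from an audit agent (hypothetical type)."""
    agent: str
    claim: str
    evidence: str  # e.g. a file-and-line reference; empty if none was given


# A verifier returns True when the finding survives its attack.
Verifier = Callable[[Finding], bool]


def has_evidence(finding: Finding) -> bool:
    """Cheapest possible check: reject claims that cite nothing."""
    return bool(finding.evidence.strip())


def verify(
    findings: list[Finding], verifiers: list[Verifier]
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (kept, rejected); a finding must pass every verifier."""
    kept: list[Finding] = []
    rejected: list[Finding] = []
    for finding in findings:
        survived = all(check(finding) for check in verifiers)
        (kept if survived else rejected).append(finding)
    return kept, rejected


def synthesize(kept: list[Finding]) -> dict[str, list[str]]:
    """Group surviving claims by the agent that raised them."""
    summary: dict[str, list[str]] = {}
    for finding in kept:
        summary.setdefault(finding.agent, []).append(finding.claim)
    return summary
```

Keeping the rejected list around, rather than dropping it, makes it possible to spot-check what the adversarial pass threw away.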

[Screenshot: a multi-agent audit workflow with eleven agents in the audit phase across an app's API surface, then verify and synthesize phases, the counter showing fourteen of eighty-two agents.] Eleven agents auditing an API surface, then an adversarial verify pass, then synthesis. Fourteen of eighty-two done when I hit screenshot.

Why This Felt Different

I've been running multi-agent setups for a while. What changed isn't that fan-out became possible. It's that the agents stopped dropping things.

Each one runs with a very large context window, so it can hold an entire subsystem — every file in its surface, the conventions, the related call sites — in its head at the same time, instead of paging pieces in and out and losing the thread halfway through. The reports come back specific. They cite real lines. They understand why the code is shaped the way it is before they suggest changing it.

And the synthesis at the end actually reconciles. When two agents disagree, or one finds a bug that another's fix would reintroduce, the merge step catches it instead of stapling the reports together. That used to be the part I had to do by hand. Now I mostly check that it did it right.

That's the moment that earned the screenshot. Not a benchmark. The feeling of kicking off a piece of work that would have taken a real team a day, watching the row of agents light up, and getting back something I'd have been happy to receive from people.

The Honest Part

It's not magic and I'm not going to pretend it is. The agents are only as good as the scoping. If I hand them a vague surface, I get vague reports. They need explicit boundaries — this file, not that one — or they collide. And the verification phase exists precisely because a confident-sounding finding is sometimes wrong; the adversarial pass is there to kill the plausible-but-false ones before they reach me. I still read everything. I still make the calls.
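The "this file, not that one" rule is easy to check before anything runs. Here is a small sketch, assuming scopes are written down as a mapping from agent name to the set of files it owns; `find_collisions` is a hypothetical helper, not part of any tool.

```python
from collections import defaultdict


def find_collisions(scopes: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return every file claimed by more than one agent, with its claimants."""
    owners: dict[str, list[str]] = defaultdict(list)
    for agent, files in scopes.items():
        for path in files:
            owners[path].append(agent)
    # Keep only the files with two or more owners; sort names for stable output.
    return {
        path: sorted(agents)
        for path, agents in owners.items()
        if len(agents) > 1
    }
```

An empty result means the scopes are disjoint. Anything else is a boundary to fix before the agents start.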

But the floor moved. The work that used to need me sitting in the loop, nudging, re-prompting, catching dropped context, now mostly runs. I set the shape, it executes, I review.

What This Means for a Company of One

I've written before about replacing a team of five with one tool. The honest update is that it doesn't feel like one tool doing the work of five people anymore. It feels like one model that's good enough to coordinate a team of itself — to split a job across a dozen copies, keep them from stepping on each other, and pull the results back into something coherent.

That's a strange thing to have on a laptop. A solo founder with, functionally, a reviewing department, an audit department, and a synthesis lead — all the same underlying model, all running in parallel, all done before lunch.

So yeah. They cooked.