The Orchestrator: agents that build Django apps, without the runaway token bill

I’ve been shifting more of my Claude Code work onto subagents over the past few months. It works well enough, but a normal session is still a lot of waiting around: kick off an agent, wait, answer a question, give the next instruction, wait again. The natural next move is to let it run further without me babysitting every step. The thing stopping me is simple. I don’t want to set something loose that chews through my entire token budget while my attention is somewhere else.

So instead of a fire-and-forget agent, I built a small Django app to sit in the middle and keep me in charge of the expensive moments. I call it the Orchestrator.

There was a second reason too. When you hand an LLM a vague “build me an app” prompt, the easiest thing for it to do is reach for some existing project shape and copy it. You end up with something that works but that you don’t really understand, built to a pattern you didn’t choose. I’d rather own the base. These days I scaffold new projects from a fixed stack (Django, uv, Tailwind, DaisyUI, htmx and Alpine, plus a small instructions file the agent reads automatically), and the agent fills in the gaps rather than inventing the skeleton. The side benefit is that when the structure is fixed, it’s much easier to see how good the model really is at the part that’s left. Authentication, for example, I can wire up the way the project needs it: Okta or AD for the enterprise stuff, a social login for something personal. The blueprint is mine; the agent does the filling-in.

Two repos, and keeping them apart

The whole design rests on not conflating two codebases.

The first is the orchestrator itself, this repo. It’s a plain, boring Django app, and that’s the point. It copies a repo, launches a container, reads a result file, updates a row in the database, maybe queues the next step. It contains no AI at all. It does no reasoning. If that sounds underwhelming, good.

The second is the base repo: a Django template that’s green by default, meaning its test suite passes the moment you clone it. It gets copied once per run and handed to an agent as a workspace to fill in.

All the actual intelligence lives inside a container, in a Claude agent. The orchestrator never calls a model, never improvises, never makes a judgement call. It’s a state machine and nothing more. Once I’d accepted that split, most of the rest of the design fell out of it on its own.

Three agents, two gates

A run isn’t one big agent doing everything. It’s a fixed line of three single-shot agents, each in its own throwaway container, each with its own model and effort level. Between them sit two points where the run stops and waits for me.

pending → designing →[ pick a design ]→ planning →[ approve the plan ]→ running → succeeded
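The line above is small enough to write down as an explicit transition table. This is a minimal sketch, not the repo's code: the gate state names and the `FAILED` state are assumptions (the post only shows the happy path), and `advance` is a hypothetical helper.

```python
from enum import Enum


class RunState(str, Enum):
    PENDING = "pending"
    DESIGNING = "designing"
    AWAITING_DESIGN_CHOICE = "awaiting_design_choice"  # gate 1 (name assumed)
    PLANNING = "planning"
    AWAITING_PLAN_APPROVAL = "awaiting_plan_approval"  # gate 2 (name assumed)
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"  # assumed: the post only shows the happy path


# Every legal move is listed here; anything else is rejected.
TRANSITIONS = {
    RunState.PENDING: {RunState.DESIGNING},
    RunState.DESIGNING: {RunState.AWAITING_DESIGN_CHOICE, RunState.FAILED},
    # Regenerating designs goes back to DESIGNING from the same gate.
    RunState.AWAITING_DESIGN_CHOICE: {RunState.PLANNING, RunState.DESIGNING},
    RunState.PLANNING: {RunState.AWAITING_PLAN_APPROVAL, RunState.FAILED},
    RunState.AWAITING_PLAN_APPROVAL: {RunState.RUNNING},
    RunState.RUNNING: {RunState.SUCCEEDED, RunState.FAILED},
    RunState.SUCCEEDED: set(),
    RunState.FAILED: set(),
}

# Gates are states where only a human action moves the run forward.
GATES = {RunState.AWAITING_DESIGN_CHOICE, RunState.AWAITING_PLAN_APPROVAL}


def advance(current: RunState, target: RunState) -> RunState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

With a table like this there is no code path that skips a gate: `advance(RunState.PENDING, RunState.RUNNING)` raises instead of quietly starting a build.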

The Designer goes first. It takes the plain-English description and produces a handful of distinct visual takes, each one a self-contained HTML mockup with its own palette and a short note on the thinking behind it. I look at them side by side and pick the one I want. This runs on Sonnet at medium effort; it doesn’t need to be clever, it needs to give me options.

The Planner takes the design I picked and writes a concrete plan for it: the models, the endpoints, the tests, notes on the UI. This one runs on Opus at the highest effort I’ll pay for, because the plan is where all the leverage is. Get the plan right and the build is mostly mechanical. I can edit the plan freely before I approve it, which matters more than it sounds. Fixing a wrong assumption in a short text file is trivial; fixing it after a full build is a wasted build.

The Implementer takes the approved plan and the chosen design and builds the thing. It edits files and runs commands inside its own container, looping against the test suite until everything passes.

One thing I was deliberate about: each agent’s model and effort come from static config, not from anything the system decides on the fly. There’s no clever routing layer picking models for me. It’s a dial I set, and if I want Opus somewhere else I change the config.
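As a rough illustration of what "static config" can look like, here is a sketch with one frozen entry per role. The names `AgentConfig`, `AGENTS` and `config_for` are hypothetical, and so are the values: the post names tiers rather than exact model IDs or effort strings, and it does not say which model the Implementer uses, so that entry is a pure placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    model: str
    effort: str


# Placeholder values. The designer and planner tiers follow the post;
# the implementer entry is an assumption made only to complete the table.
AGENTS = {
    "designer": AgentConfig(model="sonnet", effort="medium"),
    "planner": AgentConfig(model="opus", effort="high"),
    "implementer": AgentConfig(model="sonnet", effort="high"),
}


def config_for(role: str) -> AgentConfig:
    """Plain lookup; an unknown role is an error, never a fallback model."""
    try:
        return AGENTS[role]
    except KeyError:
        raise ValueError(f"no static config for role {role!r}") from None
```

The lookup deliberately has no default. Moving Opus somewhere else means editing the table, which is the whole idea of a dial rather than a router.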

Why it stops between stages

The gates aren’t a limitation I’m apologising for. They’re the reason I built this instead of using a fully autonomous agent.

A “describe it and walk away” agent makes a hundred small decisions you never see, and you find out about the wrong ones at the end, after the money is already gone. Stopping at the two right moments flips that around. “Which of these looks right?” is a five-second call for a human and very hard for a model to make on my behalf. “Is this plan sound?” is something I can answer by reading a short, editable document before any real work happens.

The part I like most is that a stopped gate costs nothing. It isn’t a process sitting in a loop waiting on me. It’s just the absence of a queued next step. A run can sit at “awaiting approval” for a week and burn zero tokens. Clicking approve is what queues the implementer in the first place. And if I don’t like any of the designs, I can run the designer again from the same gate.
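The "absence of a queued next step" idea fits in a few lines. This is a sketch under assumptions: `queue` is an in-memory stand-in for the real task queue, and `Run`, `finish_planning` and `approve_plan` are hypothetical names, not the repo's.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Run:
    id: int
    state: str


# Stand-in for the real task queue: (phase, run_id) pairs waiting for a worker.
queue: deque[tuple[str, int]] = deque()


def finish_planning(run: Run) -> None:
    """Planner is done: record the gate state and enqueue nothing."""
    run.state = "awaiting_plan_approval"


def approve_plan(run: Run) -> None:
    """The human click is the only thing that queues the implementer."""
    if run.state != "awaiting_plan_approval":
        raise ValueError(f"run {run.id} is not waiting at the plan gate")
    run.state = "running"
    queue.append(("implement", run.id))
```

Between `finish_planning` and `approve_plan` the queue is empty for this run, so there is nothing for a worker to pick up and nothing to bill.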

The container is the boundary

The agent writes and runs arbitrary code, and that code can’t be trusted. The entire safety story comes down to one decision: the Agent SDK runs inside the container, never on the host.

This matters because the SDK’s bash tool executes wherever the SDK process is running. On the host, that’s just a shell on my machine, no sandbox at all. Inside a docker run --rm container, it’s a blast radius that disappears when the run ends. So the host launches exactly one container per phase and does no model work itself.

Two details here are easy to get wrong, and both bit me while I was thinking it through.

The container gets a CLAUDE_CODE_OAUTH_TOKEN so it bills against my Max subscription, and I explicitly strip ANTHROPIC_API_KEY out of its environment. If the API key ever leaks in, the SDK quietly prefers it and you start paying API rates with no warning. So the token goes in and the key stays out. Getting that backwards is a great way to discover a bill you didn’t expect.
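One way to make "token in, key out" hard to get backwards is to build the container's environment from an allowlist instead of filtering the host's. The sketch below is an assumption about how that could look: `container_env` and `docker_run_argv` are hypothetical helpers, and the `/workspace` mount point and image name are placeholders.

```python
def container_env(host_env: dict[str, str]) -> dict[str, str]:
    """Build the agent container's environment from an explicit allowlist."""
    token = host_env.get("CLAUDE_CODE_OAUTH_TOKEN")
    if not token:
        raise RuntimeError("CLAUDE_CODE_OAUTH_TOKEN is not set on the host")
    # Only the OAuth token is copied. ANTHROPIC_API_KEY is never read here,
    # so it cannot leak in even if it is set on the host.
    return {"CLAUDE_CODE_OAUTH_TOKEN": token}


def docker_run_argv(image: str, workspace: str, env: dict[str, str]) -> list[str]:
    """Assemble a `docker run --rm` command line for one phase."""
    if "ANTHROPIC_API_KEY" in env:
        raise ValueError("ANTHROPIC_API_KEY must never reach the agent container")
    argv = ["docker", "run", "--rm", "-v", f"{workspace}:/workspace"]
    for name in sorted(env):
        # `-e NAME` with no value copies NAME from the environment of the
        # process that runs docker, keeping the secret out of the argv.
        argv += ["-e", name]
    return argv + [image]
```

To launch, the caller would pass the same `env` to the docker process, for example `subprocess.run(argv, env={**os.environ, **env})`. Docker only forwards the variables named with `-e`, so a stray API key in the host shell stays on the host.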

The host and the agent also never talk to each other directly; everything crosses as files. Before launch, the host writes _task.txt into the mounted workspace. When the agent finishes, it writes _result.json (status, summary, files changed, token counts) and _transcript.json (every message and tool call). The result file is the finish signal. The orchestrator never reads free-text prose to work out what happened. It reads a JSON file, because parsing an agent’s chatter to decide what it did is asking for trouble.

Why I trust what comes out

The base repo’s test suite passes out of the box, and the implementer’s container checks that it’s still green before the agent touches anything. That’s what makes the result trustworthy: if the suite is red at the end, that’s provably the agent’s doing and not some mess that was already there. (A red baseline means the base image itself is broken, which is a different problem, and it gets reported as one.)

From there the agent iterates until it’s green again. The green check is kept narrow: makemigrations, migrate, test. CSS compilation runs too, so the design’s palette actually reaches the built app, but it sits off the green gate. A styling hiccup should never fail a build that’s logically correct.
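A sketch of a green check where CSS runs but cannot change the verdict. The three gate steps are the ones named above; invoking them through `uv run python manage.py` and the `npm run build:css` script name are assumptions, as are the function names.

```python
import subprocess
from typing import Callable, Sequence

# The three steps that decide pass/fail, in order.
GREEN_STEPS = [
    ["uv", "run", "python", "manage.py", "makemigrations"],
    ["uv", "run", "python", "manage.py", "migrate"],
    ["uv", "run", "python", "manage.py", "test"],
]
# Hypothetical script name; the post only says CSS compilation runs.
CSS_STEP = ["npm", "run", "build:css"]


def run_step(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(argv).returncode


def check_green(run: Callable[[Sequence[str]], int] = run_step) -> dict:
    """Run the gate steps, then CSS; only the gate steps affect `green`."""
    green = True
    for step in GREEN_STEPS:
        if run(step) != 0:
            green = False
            break  # later gate steps are meaningless once one fails
    css_ok = run(CSS_STEP) == 0  # reported, never part of the verdict
    return {"green": green, "css_ok": css_ok}
```

Passing the runner in as a parameter keeps the rule itself testable without a Django project on disk: a fake runner that fails only the CSS step should still produce `green: True`.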

The dashboard, and watching the cost

The front end is a single-page Django app, htmx for live updates and no websockets. It gives me the list of runs with live status, token spend and cost; a detail view that refreshes itself while a phase is running; the design picker, with each mockup rendered in a sandboxed iframe; and the plan editor for the approval step. The gates are driven by me clicking things, so they don’t poll. The active phases do.

Cost isn’t an afterthought here, it’s most of the point. Every agent run records its input and output tokens. A regenerate is just another run, so the cost of a build is the sum over every run it took, retries and regenerates included. This is the cheap way to learn the real burn rate of the pipeline before I have to make any decision about scaling it: whether Opus on the planner pays for itself, what a regenerate really costs me, which roles are worth which tier. The numbers are there from the first run.
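The roll-up itself is simple arithmetic over recorded runs. In this sketch the `AgentRun` shape and the function names are hypothetical, and the prices are placeholder numbers chosen to make the sums easy to follow, not real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    role: str
    model: str
    input_tokens: int
    output_tokens: int


# Placeholder USD prices per million tokens, NOT real rates; fill in your own.
PRICES = {
    "sonnet": {"input": 1.0, "output": 2.0},
    "opus": {"input": 10.0, "output": 20.0},
}


def run_cost(run: AgentRun) -> float:
    """Cost of a single agent run at the configured per-million prices."""
    price = PRICES[run.model]
    return (run.input_tokens * price["input"]
            + run.output_tokens * price["output"]) / 1_000_000


def build_cost(runs: list[AgentRun]) -> float:
    """Total for a build: every agent run counts, regenerates included."""
    return sum(run_cost(r) for r in runs)


def cost_by_role(runs: list[AgentRun]) -> dict[str, float]:
    """Same total, split by role, to see which stage is worth which tier."""
    totals: dict[str, float] = {}
    for r in runs:
        totals[r.role] = totals.get(r.role, 0.0) + run_cost(r)
    return totals
```

Two designer runs for the same build (an original and a regenerate) simply show up as two rows, so the per-role split answers "what does a regenerate cost me?" without any special casing.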

Once a build succeeds I can launch its app and click around, through the same container boundary as everything else. A preview is a docker run -d --rm of the agent image with the entrypoint swapped out: uv sync --frozen, migrate, check, runserver. The --frozen and the check make a broken app fail loudly and exit, instead of serving me a confusing 500. Docker picks a free port and the dashboard offers an “Open” link, but only after reconciling against docker ps, since the recorded port is just intent and a container can die and leave it stale. The preview runs no SDK, so it gets neither the token nor the key. It’s only serving code.
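Here is a sketch of the two halves of a preview: the command line, and the reconcile step against what is actually running. The helper names, the `/workspace` path, port 8000 and the `uv run python manage.py` invocations are assumptions. The Docker flags are standard ones: `--entrypoint` swaps the entrypoint, and `-p 127.0.0.1::8000` leaves the host port empty so Docker assigns a free one.

```python
def preview_argv(image: str, workspace: str, name: str) -> list[str]:
    """Detached, self-removing container that only serves the built app."""
    # Chained with &&: if any step exits non-zero, the container exits too.
    script = (
        "uv sync --frozen"
        " && uv run python manage.py migrate"
        " && uv run python manage.py check"
        " && uv run python manage.py runserver 0.0.0.0:8000"
    )
    return [
        "docker", "run", "-d", "--rm",
        "--name", name,
        "-p", "127.0.0.1::8000",  # empty host port: Docker picks a free one
        "-v", f"{workspace}:/workspace",
        "-w", "/workspace",
        "--entrypoint", "sh",
        image, "-c", script,
    ]  # note: no -e flags, so neither the token nor the key is passed


def live_previews(recorded: dict[str, int], docker_ps_names: str) -> dict[str, int]:
    """Keep only recorded previews whose container is still running.

    `docker_ps_names` is the text printed by
    `docker ps --format '{{.Names}}'`, one container name per line.
    """
    running = {line.strip() for line in docker_ps_names.splitlines() if line.strip()}
    return {name: port for name, port in recorded.items() if name in running}
```

The dashboard would only render an "Open" link for entries that survive `live_previews`, which treats the database row as intent and `docker ps` as the source of truth.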

What I left out on purpose

I’m keeping the scope tight here, and I treat that as a feature. This is explicitly not heading towards:

Parallel, recursive, or dynamically spawned agents. It’s a fixed line of three, one container at a time, never concurrent.
A cross-agent fix-it loop. An agent self-correcting inside its own container is fine and free; an orchestrator shuffling work back and forth between agents, a reviewer kicking things back to the implementer, is the hard part, and it stays out.
Model routing or a budget governor. Usage gets logged and models come from static config, but nothing dynamic and nothing that caps spend.
An LLM running the show. The orchestrator stays a dumb, deterministic state machine until this simple version has earned the right to be something smarter.

If a change I’m tempted by seems to need one of these, I take that as a sign to stop and reconsider rather than to start building.

The stack

Django 6, Python 3.12+, Postgres. uv for everything, no pip or poetry. Background work runs on django-tasks-db rather than Celery, which is the small decision that makes everything else possible: a phase becomes a queued task instead of a long-running process, which is exactly why a gate can pause for a week at no cost. The dashboard is Tailwind CSS 4 with DaisyUI, Alpine, htmx and django-cotton. The agent container is python:3.12-slim plus the Claude Agent SDK plus Node 22, and the Node is only there so the built app’s CSS can compile. The SDK itself doesn’t need it.

It’s early, and I haven’t run it at anything like scale. But it does the one thing I wanted: it lets agents do the building while I keep my hand on the parts that cost money and the parts that need taste.

The repo is linked below if you want to try it.

https://github.com/psgandalf/agent-dashboard/
