Auditing and Optimizing a Runaway Claude API Orchestrator: From $45/day to $2-3/day
When your infrastructure bill spikes without warning, the instinct is to panic and start cutting. But the real engineering work happens when you systematically trace every dollar to its source. This post documents how we identified and fixed a cost explosion in our Claude API orchestration system running on EC2/Lightsail—reducing daily spend by ~95% through targeted model and termination logic changes.
The Problem Statement
Our system was burning approximately $45 per day across 4–5 API orchestrator sessions. With multiple Python scripts, Lambda functions, and daemon processes touching the Anthropic API, the cost wasn't obviously tied to any single component. We needed a complete audit before making cuts.
Discovery: The Culprit Was the Daemon, Not the Scripts
The investigation started by mapping every Anthropic API call across our codebase. We examined:
- `/Users/cb/Documents/repos/sites/queenofsandiego.com/tools/jada_daily.py`: scheduled Python ingestion tasks
- `/Users/cb/Documents/repos/portfolio-intel/daily.py`: portfolio analysis batch jobs
- `/Users/cb/Documents/repos/sites/quickdumpnow.com/tools/qdn_clean_load_daily.py`: data pipeline processing
- `jada_daemon.sh`: the long-running orchestrator daemon on the Lightsail instance at `34.239.233.28`
The scheduled Python scripts were cheap: combined, they cost only ~$0.38 per day. The real issue was the daemon.
Technical Details: Why the Daemon Was Hemorrhaging Money
jada_daemon.sh runs on our Lightsail instance and processes "agent-work" tasks from a queue by invoking the Claude CLI directly:
claude [task-from-queue] [no model specified] [no max-turns limit]
Each invocation inherited a massive context injection from ACTIVE.md, our active task memory file (~475 lines, approximately 15,000 tokens). The daemon would then:
- Start a new Claude session
- Inject 15K–25K tokens of context upfront
- Let the agent loop run indefinitely (no `--max-turns` flag)
- Allow context to grow to 150K–300K tokens over 30–100 agent turns
- Use the default model (at that time, not explicitly pinned to a cheaper tier)
At Sonnet 4.6 pricing (~$0.003 per 1K input tokens, ~$0.015 per 1K output tokens), each session cost $8–15. With 4–5 sessions per day, that's roughly $32–75/day, consistent with our observed spend.
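The per-session figure checks out with a back-of-envelope calculation. The turn count, context growth, and output volume below are illustrative assumptions (not measured values), chosen to reflect the runaway sessions we saw:

```shell
# Estimate one session's cost at Sonnet pricing ($0.003/1K in, $0.015/1K out),
# assuming the full context is resent each turn and grows linearly.
SESSION_COST=$(awk 'BEGIN {
  in_rate  = 0.003 / 1000   # dollars per input token
  out_rate = 0.015 / 1000   # dollars per output token
  turns = 40                # assumed turns in a runaway session
  start_ctx = 15000         # injected context on turn 1
  end_ctx   = 150000        # context size by the final turn
  out_per_turn = 1000       # assumed output tokens per turn
  cost = 0
  for (t = 1; t <= turns; t++) {
    ctx = start_ctx + (end_ctx - start_ctx) * (t - 1) / (turns - 1)
    cost += ctx * in_rate + out_per_turn * out_rate
  }
  printf "%.2f", cost
}')
echo "estimated session cost: \$${SESSION_COST}"
# -> estimated session cost: $10.50
```

Under these assumptions a single session lands squarely in the observed $8–15 range, with input tokens dominating the bill.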
Root Cause Analysis
Three design gaps converged to create the problem:
- No termination boundary: The CLI invocation had no `--max-turns` limit, allowing agents to loop indefinitely.
- No model tier selection: The daemon wasn't explicitly specifying Haiku for routine orchestration tasks that didn't need Sonnet or Opus intelligence.
- No cost guard: No token cap, no spend ceiling, no monitoring alert.
The Fix: Two Simple Changes on the Server
We made two targeted edits to jada_daemon.sh on the Lightsail instance:
Change 1: Cap the agent loop at 30 turns
claude [input] --max-turns 30
Most orchestration tasks complete within 15–20 turns. Capping at 30 gives breathing room while preventing the 100+ turn runaway scenarios we observed. This alone cuts token consumption by ~60–70%.
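The savings are outsized because, when the full conversation is resent each turn, cumulative input tokens grow roughly quadratically with turn count. A quick sketch, using an assumed ~2K tokens of growth per turn (illustrative, not measured):

```shell
# Cumulative input tokens over an agent loop, assuming the full context is
# resent every turn: 15K starting context, ~2K tokens added per turn.
cumulative_input() {
  awk -v turns="$1" 'BEGIN {
    start = 15000; growth = 2000; total = 0
    for (t = 1; t <= turns; t++) total += start + growth * (t - 1)
    print total
  }'
}
echo "30 turns:  $(cumulative_input 30) input tokens"
echo "100 turns: $(cumulative_input 100) input tokens"
```

Under these assumptions a 100-turn run consumes roughly 8–9x the input tokens of a 30-turn run, which is why the turn cap alone dominates the savings.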
Change 2: Pin the model to Haiku for orchestration
export ANTHROPIC_MODEL=claude-haiku-4-5-20251001
claude [input] --max-turns 30
Haiku is 5–10x cheaper than Sonnet and handles task coordination, routing, and decision-making well. The daemon doesn't need Sonnet's reasoning depth or Opus's extended context handling. By pinning the environment variable before the CLI call, we ensure every invocation defaults to Haiku unless explicitly overridden.
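Put together, the relevant section of `jada_daemon.sh` now looks roughly like this. The queue-reading loop and `QUEUE_FILE` variable are simplified placeholders for illustration, not the daemon's actual code:

```shell
# Sketch of the daemon's invocation path after both changes.
export ANTHROPIC_MODEL=claude-haiku-4-5-20251001   # pin Haiku for orchestration
MAX_TURNS=30                                        # circuit breaker for agent loops
QUEUE_FILE=${QUEUE_FILE:-/dev/null}                 # placeholder task queue path

while IFS= read -r TASK; do
  # Each child process inherits ANTHROPIC_MODEL from the exported environment,
  # so no per-call model flag is needed.
  claude "$TASK" --max-turns "$MAX_TURNS"
done < "$QUEUE_FILE"
```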
After making these changes and restarting the jada-agent service on the Lightsail instance:
sudo systemctl restart jada-agent
Spend dropped from ~$45/day to ~$2–3/day.
Why These Changes Work
Max-turns as a circuit breaker: The --max-turns flag is not a bug-hiding hack. It's a principled architectural constraint. Orchestration tasks that genuinely need more than 30 turns should be redesigned—either split into subtasks or use a different pattern. Forcing that redesign is healthy.
Model selection as a cost center: Not every AI task needs Sonnet. Haiku handles routing, data transformation, and basic reasoning at a fraction of the cost. We reserve Sonnet and Opus for specific high-value tasks (deep analysis, complex planning) that we invoke explicitly, not as a default.
Environment variable precedence: By setting ANTHROPIC_MODEL in the daemon's shell environment before spawning Claude CLI processes, we ensure consistent model selection without hardcoding it into multiple scripts. Future changes to the daemon's model strategy only require editing one line.
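The inheritance behavior here is plain POSIX environment semantics, easy to sanity-check:

```shell
# Exported variables propagate to every child process the daemon spawns,
# so one export line governs all claude invocations in the run.
export ANTHROPIC_MODEL=claude-haiku-4-5-20251001
MODEL_SEEN_BY_CHILD=$(sh -c 'printf "%s" "$ANTHROPIC_MODEL"')
echo "child sees: $MODEL_SEEN_BY_CHILD"
```

A per-invocation override (`ANTHROPIC_MODEL=... claude ...` prefixed on a single command) still wins for the one-off tasks where we deliberately want Sonnet.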
Monitoring and Verification
After deploying the changes, we:
- Verified all model strings across the codebase matched the new strategy (no Sonnet/Opus in the daemon code path)
- Confirmed the service restarted cleanly and tasks continued processing normally
- Monitored Anthropic API billing for 24 hours to validate the spend reduction
- Spot-checked agent output quality—no degradation observed
Lessons and Next Steps
This incident taught us several things:
- Model selection matters more than you think: The default isn't always appropriate. Audit your model choices quarterly.
- Termination logic is a cost control: Unbounded loops in AI orchestration are a liability. Always set a reasonable upper bound.
- Injected context adds up: Our 15K-token context was hidden but real. We should audit and prune `ACTIVE.md` regularly.
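A cheap way to watch that file's footprint is the ~4-characters-per-token heuristic. This is a rough estimate only, not the model's actual tokenizer:

```shell
# Rough token estimate for a context file: bytes / 4.
# Heuristic only; the real tokenizer will differ somewhat.
estimate_tokens() {
  chars=$(wc -c < "$1")
  echo $(( chars / 4 ))
}
```

Running `estimate_tokens ACTIVE.md` in a cron job and alerting when it drifts past ~15K tokens would turn the pruning reminder into an automatic check.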
Going forward, we're implementing:
- A cost budget in the daemon startup script that kills the process if spend exceeds a threshold per run
- Quarterly audits of all Anthropic API call sites using automated grep and token counting
- A monitoring dashboard that tracks token spend per daemon run, per script, and per model
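The cost-budget guard can be as small as a threshold check in the daemon's loop. A sketch, where the ceiling and the idea of tracking spend in integer cents are our assumptions (the real daemon would still need to record actual per-call costs):

```shell
# Hypothetical per-run spend ceiling, tracked in integer cents to avoid
# floating-point comparisons in shell.
MAX_SPEND_CENTS=300   # $3.00 ceiling per run (assumed threshold)

check_budget() {
  # $1: spend so far this run, in cents
  if [ "$1" -gt "$MAX_SPEND_CENTS" ]; then
    echo "budget exceeded: ${1}c > ${MAX_SPEND_CENTS}c"
    return 1
  fi
  return 0
}
```

The daemon would call `check_budget` between turns and terminate the run on a nonzero return, converting a runaway session into a bounded, logged failure.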
The full audit report, including file:line citations for every API call and detailed termination logic diagrams, was sent to the engineering team separately. This post captures the essence: systematic investigation, root cause identification, and surgical fixes with immediate, measurable results.