Diagnosing and Fixing a $45/Day Claude API Cost Bleed in Production
The Problem
Our orchestration system, running on an AWS Lightsail instance, was burning approximately $45 per day across 4–5 daemon sessions. Without visibility into where those tokens were going, we couldn't optimize. The investigation revealed a classic case of unbounded context growth in production agent loops.
What Was Done
We conducted a complete API cost audit across the codebase by:
- Identifying every Anthropic SDK call site and the model being used
- Tracing token consumption patterns for each call path
- Locating the main orchestration loop and its termination logic
- Applying targeted fixes to reduce per-session token usage
- Validating changes on the production Lightsail instance
Technical Details: The Root Cause
The culprit was jada_daemon.sh running on the Lightsail instance at 34.239.233.28. This daemon picks up "agent-work" tasks from a queue and spawns Claude CLI sessions to process them. The problem was in how these sessions were configured:
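The daemon's work loop has roughly the following shape. This is a hypothetical sketch, not the actual script: the queue layout, task format, and `-p` invocation are assumptions; only the unbounded `claude` call at the center is the documented problem.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the jada_daemon.sh work loop.
QUEUE_DIR="${QUEUE_DIR:-/var/jada/queue}"

next_task() {
  # Oldest pending task file wins; empty output means the queue is idle.
  ls -1 "$QUEUE_DIR" 2>/dev/null | head -n 1
}

process_task() {
  local task_file="$1"
  # The original, problematic invocation: no model pinned, no turn
  # limit, so every session inherits ~25K tokens of context and runs
  # open-ended on whatever the default model is.
  claude -p "$(cat "$QUEUE_DIR/$task_file")"
  rm -f "$QUEUE_DIR/$task_file"
}

# Loop only when invoked with "run", so the functions above can be
# sourced and inspected without blocking.
if [ "${1:-}" = "run" ]; then
  while true; do
    task="$(next_task)"
    if [ -n "$task" ]; then process_task "$task"; else sleep 10; fi
  done
fi
```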
```shell
# Original invocation (problematic)
claude [task context]   # no turn limit, no model specified
```
Each session inherited approximately 25K tokens of injected context. The file ACTIVE.md alone—containing conversation history, system state, and task metadata—was 475 lines, roughly 15K tokens. Over the course of a single agent run with 30–100 turns (depending on task complexity), sessions would balloon to 150K–300K tokens.
Because the full context is re-sent as input on every turn, a session whose context balloons like this bills millions of cumulative input tokens. At Sonnet 4.6 pricing (~$2.50 per million input tokens, ~$10 per million output tokens), each session cost $8–15. With 4–5 sessions daily, we were at $40–75/day, almost entirely from these unbounded daemon loops.
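Since input is billed on every turn, cumulative cost compounds even though the final context "only" reaches 150K–300K tokens. A quick sanity check at the Sonnet rates quoted above, where the starting context, per-turn growth, and per-turn output are illustrative assumptions:

```shell
# Cumulative session cost: the full context is re-sent as input each
# turn. Starting context (25K), per-turn growth (3K), and per-turn
# output (1K) are assumed figures consistent with the ranges above.
session_cost() {
  awk -v n="$1" 'BEGIN {
    ctx = 25000
    for (i = 1; i <= n; i++) { input += ctx; output += 1000; ctx += 3000 }
    printf "%.2f\n", input / 1e6 * 2.50 + output / 1e6 * 10.00
  }'
}
session_cost 50   # a mid-range run lands inside the observed $8-15 band
```

At 50 turns this comes to roughly $12–13, and the final context (~175K tokens) sits inside the observed 150K–300K range.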
By contrast, all scheduled Python scripts in the codebase—including /Users/cb/Documents/repos/portfolio-intel/daily.py, /Users/cb/Documents/repos/sites/queenofsandiego.com/tools/jada_daily.py, and /Users/cb/Documents/repos/sites/quickdumpnow.com/tools/qdn_clean_load_daily.py—consumed only ~$0.38/day combined, because they had explicit model selection and bounded loops.
The Fix: Two Simple Changes
Step 1: Add Turn Limit
We inserted --max-turns 30 into the daemon's Claude invocation. This prevents any single session from running beyond 30 turns, creating a hard stop regardless of task complexity.
```shell
claude [task] --max-turns 30
```
Step 2: Force Model Selection
We added an explicit environment variable export before the Claude call:
```shell
export ANTHROPIC_MODEL=claude-haiku-4-5-20251001
claude [task] --max-turns 30
```
We chose Haiku because the daemon primarily orchestrates simpler tasks like calendar queries, email dispatch, and data validation. These don't require Opus-level reasoning. If a task truly needs more capability, it should be routed differently—not escalated silently mid-run.
Deployment and Validation
On the Lightsail instance, we:
- Edited the daemon script directly at the shell
- Inserted the hard stop logic at the correct termination point in the loop
- Restarted the `jada-agent` service: `sudo systemctl restart jada-agent`
- Verified changes persisted by re-reading the daemon and confirming both the model export and max-turns flag were in place
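To keep a future redeploy from silently dropping these guardrails, the verification step can be scripted. A minimal sketch, assuming the grep patterns below match the daemon's actual text:

```shell
# Fail loudly if either guardrail is missing from a daemon script.
check_guardrails() {
  local script="$1" ok=0
  grep -q 'ANTHROPIC_MODEL=' "$script" || { echo "missing model pin"; ok=1; }
  grep -q -- '--max-turns' "$script"   || { echo "missing turn limit"; ok=1; }
  return "$ok"
}

# Demonstration against a stand-in file; on the Lightsail box the
# argument would be the real path to jada_daemon.sh.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
export ANTHROPIC_MODEL=claude-haiku-4-5-20251001
claude "$task" --max-turns 30
EOF
check_guardrails "$demo" && echo "guardrails present"
```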
Architecture and Context Management
The orchestrator maintains state in DynamoDB and shared files. For example, ACTIVE.md is injected into every Claude session to provide real-time context. This is necessary but expensive at scale. Our solution respects this pattern while capping runaway sessions:
- Bounded injection: The context file is still loaded into every session, but each session is now capped at 30 turns. Tasks that exceed the limit fail gracefully and can be requeued with refined instructions.
- Model matching: Haiku is cheaper and faster than Sonnet and is sufficient for deterministic, templated work (email dispatch, calendar queries, etc.). Opus was never justified for daemon tasks.
- Immutable infrastructure: The fix is in the service control script, not in application code, making it easy to roll back or adjust.
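The "fail gracefully and requeue" behavior above can be sketched as a small wrapper. The retry cap, output strings, and the `CLAUDE_BIN` indirection are illustrative assumptions, not the daemon's actual code:

```shell
# Cap-and-requeue: run the session with a hard turn limit; on failure,
# requeue with a retry counter up to a small cap, then dead-letter.
MAX_RETRIES=2

run_bounded() {
  local task="$1" retries="${2:-0}"
  if "${CLAUDE_BIN:-claude}" -p "$task" --max-turns 30; then
    echo "done"
  elif [ "$retries" -lt "$MAX_RETRIES" ]; then
    echo "requeue retries=$((retries + 1))"
  else
    echo "dead-letter"
  fi
}

# Stand-ins: `true` simulates a session that completes, `false` one
# that hits the turn limit.
CLAUDE_BIN=true run_bounded "calendar query"   # prints "done"
CLAUDE_BIN=false run_bounded "stuck task" 2    # prints "dead-letter"
```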
Cost Impact
The expected result: reduction from ~$45/day to ~$2–3/day. The math:
- 4 sessions/day × 30 turns × ~5K tokens per turn (input + output averaged) = 600K tokens/day
- Haiku at ~$0.80 per 1M input and ~$4 per 1M output prices those 600K tokens at roughly $1.50–2.50/day, depending on the input/output split
This estimate assumes every session uses all 30 turns; real-world sessions will usually terminate early on task completion, so actual costs should come in lower.
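The arithmetic above can be parameterized so it's easy to re-check as session counts or turn limits change. The even input/output split is an assumption:

```shell
# Post-fix daily cost at the Haiku rates above ($0.80/M input, $4/M
# output), assuming an even input/output split across the daily tokens.
daily_cost() {
  awk -v s="$1" -v t="$2" -v p="$3" 'BEGIN {
    tokens = s * t * p               # sessions x turns x tokens/turn
    printf "%.2f\n", (tokens / 2) / 1e6 * 0.80 + (tokens / 2) / 1e6 * 4.00
  }'
}
daily_cost 4 30 5000   # prints 1.44, inside the $2-3/day target
```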
What's Next
With the daemon cost under control, we can focus on secondary optimizations:
- Monitor `jada_daily.py` and other scheduled jobs to ensure they're not creeping toward higher models unnecessarily
- Consider context compression for `ACTIVE.md`, e.g., archiving completed tasks to a separate file
- Implement per-task cost budgets in the queue system so expensive tasks fail fast rather than consuming tokens silently
- If daemon tasks occasionally need Sonnet capability, add a `--model-override` flag to the queue message so high-value tasks can opt in, rather than defaulting to expensive models
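That opt-in could resolve the model per task while preserving the cheap default. A sketch, where the override field and the accepted name prefixes are assumptions about the queue schema:

```shell
# Per-task model resolution: tasks stay on Haiku unless the queue
# message explicitly opts in to a bigger model.
resolve_model() {
  local override="${1:-}"
  case "$override" in
    claude-sonnet-*|claude-opus-*) echo "$override" ;;   # explicit opt-in
    *) echo "claude-haiku-4-5-20251001" ;;               # cheap default
  esac
}

export ANTHROPIC_MODEL="$(resolve_model "")"   # default path stays on Haiku
```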
The lesson: in production agent systems, always specify model, max-turns, and timeout. Unbounded context growth is a silent cost killer.