Diagnosing and Fixing a $45/Day Claude API Cost Bleed in Production
The Problem
Our orchestration system, running on an AWS Lightsail instance, was burning approximately $45 per day across 4–5 daemon sessions. Without visibility into where those tokens were going, we couldn't optimize. The investigation revealed a classic case of unbounded context growth in production agent loops.
What Was Done
We conducted a complete API cost audit across the codebase by:
- Identifying every Anthropic SDK call site and the model being used
- Tracing token consumption patterns for each call path
- Locating the main orchestration loop and its termination logic
- Applying targeted fixes to reduce per-session token usage
- Validating changes on the production Lightsail instance
Technical Details: The Root Cause
The culprit was jada_daemon.sh running on the Lightsail instance at 34.239.233.28. This daemon picks up "agent-work" tasks from a queue and spawns Claude CLI sessions to process them. The problem was in how these sessions were configured:
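The daemon's work loop has roughly the following shape. This is a hypothetical sketch, not the actual script: the queue layout, task format, and `-p` invocation are assumptions; only the unbounded `claude` call at the center is the documented problem.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the jada_daemon.sh work loop.
QUEUE_DIR="${QUEUE_DIR:-/var/jada/queue}"

next_task() {
  # Oldest pending task file wins; empty output means the queue is idle.
  ls -1 "$QUEUE_DIR" 2>/dev/null | head -n 1
}

process_task() {
  local task_file="$1"
  # The original, problematic invocation: no model pinned, no turn
  # limit, so every session inherits ~25K tokens of context and runs
  # open-ended on whatever the default model is.
  claude -p "$(cat "$QUEUE_DIR/$task_file")"
  rm -f "$QUEUE_DIR/$task_file"
}

# Loop only when invoked with "run", so the functions above can be
# sourced and inspected without blocking.
if [ "${1:-}" = "run" ]; then
  while true; do
    task="$(next_task)"
    if [ -n "$task" ]; then process_task "$task"; else sleep 10; fi
  done
fi
```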
```shell
# Original invocation (problematic)
claude [task context]   # no turn limit, no model specified
```
Each session inherited approximately 25K tokens of injected context. The file ACTIVE.md alone—containing conversation history, system state, and task metadata—was 475 lines, roughly 15K tokens. Over the course of a single agent run with 30–100 turns (depending on task complexity), sessions would balloon to 150K–300K tokens.
Because the full context is re-sent as input on every turn, a session whose context balloons like this bills millions of cumulative input tokens. At Sonnet 4.6 pricing (~$2.50 per million input tokens, ~$10 per million output tokens), each session cost $8–15. With 4–5 sessions daily, we were at $40–75/day, almost entirely from these unbounded daemon loops.
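Since input is billed on every turn, cumulative cost compounds even though the final context "only" reaches 150K–300K tokens. A quick sanity check at the Sonnet rates quoted above, where the starting context, per-turn growth, and per-turn output are illustrative assumptions:

```shell
# Cumulative session cost: the full context is re-sent as input each
# turn. Starting context (25K), per-turn growth (3K), and per-turn
# output (1K) are assumed figures consistent with the ranges above.
session_cost() {
  awk -v n="$1" 'BEGIN {
    ctx = 25000
    for (i = 1; i <= n; i++) { input += ctx; output += 1000; ctx += 3000 }
    printf "%.2f\n", input / 1e6 * 2.50 + output / 1e6 * 10.00
  }'
}
session_cost 50   # a mid-range run lands inside the observed $8-15 band
```

At 50 turns this comes to roughly $12–13, and the final context (~175K tokens) sits inside the observed 150K–300K range.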
By contrast, all scheduled Python scripts in the codebase—including /Users/cb/Documents/repos/portfolio-intel/daily.py, /Users/cb/Documents/repos/sites/queenofsandiego.com/tools/jada_daily.py, and /Users/cb/Documents/repos/sites/quickdumpnow.com/tools/qdn_clean_load_daily.py—consumed only ~$0.38/day combined, because they had explicit model selection and bounded loops.
The Fix: Two Simple Changes
Step 1: Add Turn Limit
We inserted --max-turns 30 into the daemon's Claude invocation. This prevents any single session from running beyond 30 turns, creating a hard stop regardless of task complexity.
```shell
claude [task] --max-turns 30
```
Step 2: Force Model Selection
We added an explicit environment variable export before the Claude call:
```shell
export ANTHROPIC_MODEL=claude-haiku-4-5-20251001
claude [task] --max-turns 30
```
We chose Haiku because the daemon primarily orchestrates simpler tasks like calendar queries, email dispatch, and data validation. These don't require Opus-level reasoning. If a task truly needs more capability, it should be routed differently—not escalated silently mid-run.
Deployment and Validation
On the Lightsail instance, we:
- Edited the daemon script directly at the shell
- Inserted the hard stop logic at the correct termination point in the loop
- Restarted the `jada-agent` service: `sudo systemctl restart jada-agent`
- Verified changes persisted by re-reading the daemon and confirming both the model export and max-turns flag were in place
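To keep a future redeploy from silently dropping these guardrails, the verification step can be scripted. A minimal sketch, assuming the grep patterns below match the daemon's actual text:

```shell
# Fail loudly if either guardrail is missing from a daemon script.
check_guardrails() {
  local script="$1" ok=0
  grep -q 'ANTHROPIC_MODEL=' "$script" || { echo "missing model pin"; ok=1; }
  grep -q -- '--max-turns' "$script"   || { echo "missing turn limit"; ok=1; }
  return "$ok"
}

# Demonstration against a stand-in file; on the Lightsail box the
# argument would be the real path to jada_daemon.sh.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
export ANTHROPIC_MODEL=claude-haiku-4-5-20251001
claude "$task" --max-turns 30
EOF
check_guardrails "$demo" && echo "guardrails present"
```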
Architecture and Context Management
The orchestrator maintains state in DynamoDB and shared files. For example, ACTIVE.md is injected into every Claude session to provide real-time context. This is necessary but expensive at scale. Our solution respects this pattern while capping runaway sessions:
- Bounded injection: The context file is still loaded into every session, but each session is now capped at 30 turns. Tasks that exceed the limit fail gracefully and can be requeued with refined instructions.
- Model matching: Haiku is cheaper and faster than Sonnet and is sufficient for deterministic, templated work (email dispatch, calendar queries, etc.). Opus was never justified for daemon tasks.
- Immutable infrastructure: The fix is in the service control script, not in application code, making it easy to roll back or adjust.
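The "fail gracefully and requeue" behavior above can be sketched as a small wrapper. The retry cap, output strings, and the `CLAUDE_BIN` indirection are illustrative assumptions, not the daemon's actual code:

```shell
# Cap-and-requeue: run the session with a hard turn limit; on failure,
# requeue with a retry counter up to a small cap, then dead-letter.
MAX_RETRIES=2

run_bounded() {
  local task="$1" retries="${2:-0}"
  if "${CLAUDE_BIN:-claude}" -p "$task" --max-turns 30; then
    echo "done"
  elif [ "$retries" -lt "$MAX_RETRIES" ]; then
    echo "requeue retries=$((retries + 1))"
  else
    echo "dead-letter"
  fi
}

# Stand-ins: `true` simulates a session that completes, `false` one
# that hits the turn limit.
CLAUDE_BIN=true run_bounded "calendar query"   # prints "done"
CLAUDE_BIN=false run_bounded "stuck task" 2    # prints "dead-letter"
```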
Cost Impact
The expected result: reduction from ~$45/day to ~$2–3/day. The math:
- 4 sessions/day × 30 turns × ~5K tokens per turn (input + output averaged) = 600K tokens/day
- Haiku at ~$0.80 per 1M input and ~$4 per 1M output prices those 600K tokens at roughly $1.50–2.50/day, depending on the input/output split
This estimate assumes every session uses all 30 turns; real-world sessions will usually terminate early on task completion, so actual costs should come in lower.
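The arithmetic above can be parameterized so it's easy to re-check as session counts or turn limits change. The even input/output split is an assumption:

```shell
# Post-fix daily cost at the Haiku rates above ($0.80/M input, $4/M
# output), assuming an even input/output split across the daily tokens.
daily_cost() {
  awk -v s="$1" -v t="$2" -v p="$3" 'BEGIN {
    tokens = s * t * p               # sessions x turns x tokens/turn
    printf "%.2f\n", (tokens / 2) / 1e6 * 0.80 + (tokens / 2) / 1e6 * 4.00
  }'
}
daily_cost 4 30 5000   # prints 1.44, inside the $2-3/day target
```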
What's Next
With the daemon cost under control, we can focus on secondary optimizations:
- Monitor `jada_daily.py` and other scheduled jobs to ensure they're not creeping toward higher models unnecessarily
- Consider context compression for `ACTIVE.md`, e.g., archiving completed tasks to a separate file
- Implement per-task cost budgets in the queue system so expensive tasks fail fast rather than consuming tokens silently
- If daemon tasks occasionally need Sonnet capability, add a `--model-override` flag to the queue message so high-value tasks can opt in, rather than defaulting to expensive models
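That opt-in could resolve the model per task while preserving the cheap default. A sketch, where the override field and the accepted name prefixes are assumptions about the queue schema:

```shell
# Per-task model resolution: tasks stay on Haiku unless the queue
# message explicitly opts in to a bigger model.
resolve_model() {
  local override="${1:-}"
  case "$override" in
    claude-sonnet-*|claude-opus-*) echo "$override" ;;   # explicit opt-in
    *) echo "claude-haiku-4-5-20251001" ;;               # cheap default
  esac
}

export ANTHROPIC_MODEL="$(resolve_model "")"   # default path stays on Haiku
```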
The lesson: in production agent systems, always specify model, max-turns, and timeout. Unbounded context growth is a silent cost killer.