
Diagnosing and Fixing a $45/Day Claude API Cost Bleed in Production

The Problem

Our orchestration system running on AWS Lightsail was burning approximately $45 per day across 4–5 daemon sessions. Without visibility into where those tokens were going, we couldn't optimize. The investigation revealed a classic case of unbounded context growth in production agent loops.

What Was Done

We conducted a complete API cost audit across the codebase by:

  • Identifying every Anthropic SDK call site and the model being used
  • Tracing token consumption patterns for each call path
  • Locating the main orchestration loop and its termination logic
  • Applying targeted fixes to reduce per-session token usage
  • Validating changes on the production Lightsail instance

Technical Details: The Root Cause

The culprit was jada_daemon.sh running on the Lightsail instance at 34.239.233.28. This daemon picks up "agent-work" tasks from a queue and spawns Claude CLI sessions to process them. The problem was in how these sessions were configured:

# Original invocation (problematic)
claude [task context] [no turn limit] [no model specified]

Each session inherited approximately 25K tokens of injected context. The file ACTIVE.md alone—containing conversation history, system state, and task metadata—was 475 lines, roughly 15K tokens. Over the course of a single agent run with 30–100 turns (depending on task complexity), sessions would balloon to 150K–300K tokens.
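The ~15K estimate for ACTIVE.md follows from a chars-per-token heuristic. A minimal sketch (the ~4-characters-per-token ratio is a rough English-text assumption, and est_tokens is an illustrative helper, not part of the daemon):

```shell
#!/usr/bin/env sh
# Rough token estimate for an injected context file.
# Assumes ~4 characters per token, a common English-text heuristic.
est_tokens() {
  awk '{ chars += length($0) + 1 }        # +1 for each newline
       END { print int(chars / 4) }' "$1"
}

# usage: est_tokens ACTIVE.md   (a 475-line file of ~60K chars lands near 15K)
```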

At Sonnet 4.6 pricing (~$2.50 per million input tokens, ~$10 per million output tokens), each session cost $8–15. With 4–5 sessions daily, we were at $40–75/day—almost entirely from these unbounded daemon loops.
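The per-session figure is driven by the fact that an agentic session re-sends its whole context on every turn, so input cost grows quadratically with turn count. A back-of-envelope model (the 25K base context, 5K-per-turn growth, and 1K output per turn are assumptions, not measured values):

```shell
#!/usr/bin/env sh
# Toy cost model for one agent session that re-sends its full context each turn.
# Args: turns, input rate ($/1M tokens), output rate ($/1M tokens).
session_cost() {
  awk -v turns="$1" -v in_rate="$2" -v out_rate="$3" \
      -v base=25000 -v growth=5000 -v out=1000 'BEGIN {
    for (t = 0; t < turns; t++) {
      input_total  += base + t * growth   # entire context re-sent on turn t
      output_total += out
    }
    printf "%.2f", input_total / 1e6 * in_rate + output_total / 1e6 * out_rate
  }'
}

session_cost 30 2.50 10    # 30 turns at the Sonnet rates quoted above
```

Under these assumptions a 30-turn session lands near the bottom of the $8–15 range, and longer runs climb quickly, since the re-sent context dominates.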

By contrast, all scheduled Python scripts in the codebase—including /Users/cb/Documents/repos/portfolio-intel/daily.py, /Users/cb/Documents/repos/sites/queenofsandiego.com/tools/jada_daily.py, and /Users/cb/Documents/repos/sites/quickdumpnow.com/tools/qdn_clean_load_daily.py—consumed only ~$0.38/day combined, because they had explicit model selection and bounded loops.

The Fix: Two Simple Changes

Step 1: Add Turn Limit

We inserted --max-turns 30 into the daemon's Claude invocation. This prevents any single session from running beyond 30 turns, creating a hard stop regardless of task complexity.

claude [task] --max-turns 30

Step 2: Force Model Selection

We added an explicit environment variable export before the Claude call:

export ANTHROPIC_MODEL=claude-haiku-4-5-20251001
claude [task] --max-turns 30

We chose Haiku because the daemon primarily orchestrates simpler tasks like calendar queries, email dispatch, and data validation. These don't require Opus-level reasoning. If a task truly needs more capability, it should be routed differently—not escalated silently mid-run.
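Put together, the patched call site looks roughly like this (a sketch, not the daemon's actual code; process_task and task_prompt are illustrative names):

```shell
#!/usr/bin/env bash
# Sketch of the patched jada_daemon.sh call site (names are illustrative).

# Pin the cheap model for every session the daemon spawns.
export ANTHROPIC_MODEL=claude-haiku-4-5-20251001

process_task() {
  local task_prompt="$1"
  # -p runs Claude Code non-interactively; --max-turns is the hard stop.
  claude -p "$task_prompt" --max-turns 30
}
```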

Deployment and Validation

On the Lightsail instance, we:

  1. Edited the daemon script directly at the shell
  2. Inserted the hard stop logic at the correct termination point in the loop
  3. Restarted the jada-agent service: sudo systemctl restart jada-agent
  4. Verified changes persisted by re-reading the daemon and confirming both the model export and max-turns flag were in place
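Step 4 can be scripted so it doesn't depend on eyeballing the file. A minimal check (check_daemon is an illustrative helper; point it at wherever the daemon script lives on the instance):

```shell
#!/usr/bin/env sh
# Verify both fixes survived in the deployed daemon script.
check_daemon() {
  grep -q 'ANTHROPIC_MODEL=claude-haiku-4-5' "$1" || { echo "model pin missing"; return 1; }
  grep -q -- '--max-turns 30' "$1"                || { echo "turn cap missing"; return 1; }
  echo "daemon config OK"
}

# usage: check_daemon /path/to/jada_daemon.sh && sudo systemctl restart jada-agent
```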

Architecture and Context Management

The orchestrator maintains state in DynamoDB and shared files. For example, ACTIVE.md is injected into every Claude session to provide real-time context. This is necessary but expensive at scale. Our solution respects this pattern while capping runaway sessions:

  • Bounded sessions: The context file is still injected at session start, but each session is now capped at 30 turns. Tasks that exceed this limit fail gracefully and can be requeued with refined instructions.
  • Model matching: Haiku is a more efficient model than Sonnet for deterministic, templated work (email dispatch, calendar queries, etc.). Opus was never justified for daemon tasks.
  • Contained change: The fix lives in the daemon's service script, not in application code, making it easy to roll back or adjust.
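The "fail gracefully and requeue" path can be a thin wrapper around the session call. A sketch (CLAUDE_BIN, REQUEUE_FILE, and the assumption that a capped session exits nonzero are all illustrative; the real daemon's queue mechanics may differ):

```shell
#!/usr/bin/env sh
# Bounded runner: if the capped session fails (e.g. hits the turn limit),
# push the task back onto a requeue file for later refinement.
CLAUDE_BIN=${CLAUDE_BIN:-claude}
REQUEUE_FILE=${REQUEUE_FILE:-/var/spool/jada/requeue}   # assumed path

run_bounded() {
  if ! "$CLAUDE_BIN" -p "$1" --max-turns 30; then
    echo "$1" >> "$REQUEUE_FILE"
  fi
}
```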

Cost Impact

The expected result: reduction from ~$45/day to ~$2–3/day. The math:

  • 4 sessions/day × 30 turns × ~5K tokens per turn (input + output averaged) = 600K tokens/day
  • Haiku 4.5 at ~$1 per 1M input, ~$5 per 1M output: roughly $1–2/day at face value, ~$2–3/day once re-sent context overhead is included

The 600K figure assumes every session runs the full 30 turns; in practice many sessions terminate early on task completion, so realized costs should come in lower.
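The daily figure follows from a simple blended-rate calculation (the 80/20 input/output split and the per-1M-token rates passed in are assumptions, not measurements):

```shell
#!/usr/bin/env sh
# Blended daily cost for a given token budget and per-1M-token rates.
daily_cost() {
  awk -v tokens="$1" -v in_rate="$2" -v out_rate="$3" 'BEGIN {
    input  = tokens * 0.8                 # assumed 80% input
    output = tokens * 0.2                 # assumed 20% output
    printf "%.2f", input / 1e6 * in_rate + output / 1e6 * out_rate
  }'
}

daily_cost 600000 1 5    # 600K tokens/day at assumed Haiku 4.5 rates
```

This face-value number sits around a dollar a day; re-sent context, retries, and requeues are what push the realized figure toward the top of the estimate.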

What's Next

With the daemon cost under control, we can focus on secondary optimizations:

  • Monitor jada_daily.py and other scheduled jobs to ensure they're not creeping toward higher models unnecessarily
  • Consider context compression for ACTIVE.md—e.g., archiving completed tasks to a separate file
  • Implement per-task cost budgets in the queue system so expensive tasks fail fast rather than consuming tokens silently
  • If daemon tasks occasionally need Sonnet capability, add a --model-override flag to the queue message so high-value tasks can opt in, rather than defaulting to expensive models
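
The last bullet can be prototyped with a tiny dispatch shim (the "model=<name>|<prompt>" message format, the field names, and the Sonnet model ID are invented for illustration):

```shell
#!/usr/bin/env sh
# Choose the model from an opt-in override field in the queue message.
# Hypothetical message format: "model=<name>|<prompt>"
pick_model() {
  case "$1" in
    "model=sonnet|"*) echo "claude-sonnet-4-5-20250929" ;;   # opt-in upgrade
    *)                echo "claude-haiku-4-5-20251001"  ;;   # cheap default
  esac
}

# usage: export ANTHROPIC_MODEL="$(pick_model "$msg")"; claude -p "${msg#*|}" --max-turns 30
```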

The lesson: in production agent systems, always specify model, max-turns, and timeout. Unbounded context growth is a silent cost killer.
