Optimizing Claude Agent Orchestration: Upgrading Model Tiers and Tuning System Resource Limits

What Was Done

During this development session, we addressed two constraints in our JADA agent orchestration system: the default Claude model (Haiku 4.5) was inadequate for complex task decomposition, and the file descriptor limit was too low for concurrent agent spawning. We did three things:

Upgraded the default Claude model from Haiku 4.5 to Sonnet 4.6 in the agent configuration
Analyzed and tuned the ulimit -n file descriptor limit for high-concurrency scenarios
Verified orchestrator health on the AWS Lightsail instance running the JADA agent service

Technical Details: The ulimit Decision

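The command ulimit -n 2147483646 requests a maximum of 2^31 - 2 open file descriptors per process, one below the largest 32-bit signed integer (2^31 - 1 = 2147483647). It applies to the current shell and the processes it starts, and it only succeeds if the hard limit allows it; on Linux the hard limit is itself capped by the fs.nr_open sysctl (1048576 by default). Here's why this matters for our architecture: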

ulimit -n 2147483646

In our orchestrator pattern, when the primary Claude agent decomposes a complex task into subtasks, it spawns multiple specialist agents. Each agent maintains:

One or more socket connections to the Claude API
File handles for logging and state persistence
Pipe connections to the parent orchestrator process
Potential connections to auxiliary services (S3 for state sync, CloudWatch for metrics)

With the default soft limit (typically 1024 on Linux), concurrent agent spawning can exhaust file descriptors, causing cascading EMFILE errors ("Too many open files"). Raising the limit to ~2.1 billion is meant to take the descriptor ceiling out of the picture even in highly parallel decomposition scenarios, so that the practical limits become memory and network bandwidth rather than OS-level resource accounting.
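
A back-of-the-envelope budget makes the pressure on a 1024 limit concrete. The sketch below is only illustrative: the per-agent descriptor counts and the 64 descriptors reserved for the orchestrator are hypothetical placeholders, not measurements from JADA, and it assumes all of these descriptors are held by a single process (the limit is enforced per process).

```shell
# Rough file-descriptor budget for concurrent agents.
# All counts below are illustrative assumptions, not measured values.

# max_agents LIMIT RESERVED PER_AGENT
# Prints how many agents fit under LIMIT descriptors once RESERVED are set
# aside for the orchestrator itself, at PER_AGENT descriptors each.
max_agents() {
  echo $(( ($1 - $2) / $3 ))
}

api_sockets=2      # assumed: connections to the Claude API
log_state_files=3  # assumed: log file plus state files
pipes=3            # assumed: stdin/stdout/stderr pipes to the parent
aux_sockets=2      # assumed: S3 and CloudWatch connections
per_agent=$(( api_sockets + log_state_files + pipes + aux_sockets ))

echo "agents under a 1024 limit:  $(max_agents 1024 64 "$per_agent")"
echo "agents under a 65536 limit: $(max_agents 65536 64 "$per_agent")"
```

With these placeholder numbers a 1024 limit leaves room for 96 agents, which suggests that a limit in the tens of thousands would already cover very wide fan-out.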

Why not set it to unlimited? The value 2147483646 was chosen to sit just below the largest value a 32-bit signed integer can hold. The intent is to avoid overflow in any code path that stores the limit in a 32-bit signed integer while keeping capacity effectively unbounded. On Linux, unlimited is also typically rejected for this resource, because the kernel enforces a finite cap on open files per process.
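
Before requesting a very large value, it can help to see what the current shell is actually allowed to use. The following is a minimal sketch, assuming a POSIX-style shell; the /proc/sys/fs/nr_open check only applies on Linux and is skipped elsewhere.

```shell
# Inspect the file-descriptor limits that apply to the current shell.
soft=$(ulimit -S -n)   # soft limit: the value processes actually hit
hard=$(ulimit -H -n)   # hard limit: ceiling for unprivileged raises
echo "soft=$soft hard=$hard"

# On Linux the kernel caps the hard limit at fs.nr_open.
if [ -r /proc/sys/fs/nr_open ]; then
  echo "fs.nr_open=$(cat /proc/sys/fs/nr_open)"
fi

# Raise the soft limit to the hard limit for this shell and its children.
# No privileges are needed for this; the guard skips the call when the
# hard limit is reported as "unlimited".
if [ "$hard" != "unlimited" ] && [ "$soft" != "$hard" ]; then
  ulimit -S -n "$hard"
fi
echo "soft limit now: $(ulimit -S -n)"
```

Raising the soft limit to the existing hard limit is the conservative option: it cannot fail for lack of privileges, and it shows how much headroom is available before any system-wide setting has to change.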

Infrastructure: Configuration Management

The model upgrade was persisted in the configuration layer rather than hardcoded, which is critical for reproducibility across environments:

File: /Users/cb/.claude/settings.json
Field: "model"
Old value: "claude-haiku-4.5"
New value: "claude-sonnet-4.6"
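
To confirm which model a session will pick up, the field can be read back from the file. This is a rough sketch: read_model is a hypothetical helper, CLAUDE_SETTINGS is a hypothetical override variable used only here, and the sed pattern assumes the "model" field sits on its own line, so a JSON-aware tool would be more robust where one is available.

```shell
# read_model FILE
# Prints the value of the "model" field in a settings file.
# Assumes the field appears on its own line as: "model": "value"
read_model() {
  sed -n 's/^[[:space:]]*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1
}

# CLAUDE_SETTINGS is a hypothetical override for this sketch only.
settings="${CLAUDE_SETTINGS:-$HOME/.claude/settings.json}"

if [ -r "$settings" ]; then
  echo "configured model: $(read_model "$settings")"
else
  echo "no settings file at $settings"
fi
```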

This configuration file is the source of truth for the claude --dangerously-skip-permissions command used in the development workflow:

cd ~/Documents/repos && claude --dangerously-skip-permissions

The --dangerously-skip-permissions flag disables the permission prompts that normally ask for confirmation before the agent edits files or runs commands. In an orchestrator context, this is necessary because the parent process needs to spawn child agents without interactive prompts blocking the task decomposition pipeline.

Orchestrator Health Verification

We verified that the JADA orchestrator instance (an AWS Lightsail instance in us-east-1) was running and responsive:

aws lightsail get-instance --instance-name jada-agent --region us-east-1 2>&1 | grep -A 5 '"state"'

This command queries the AWS Lightsail API for the instance metadata and filters the output down to the state block. The orchestrator service status was confirmed via:

ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no ubuntu@34.239.233.28 "systemctl status jada-agent.service 2>&1 | head -20"

The specific flags here are important:

-o ConnectTimeout=5: Fail fast if the instance is unreachable (prevents long hangs)
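-o StrictHostKeyChecking=no: Accepts unknown host keys without prompting, so the check never blocks on an interactive question. This is convenient for automation but leaves the connection open to man-in-the-middle attacks; in production, pin the host key in known_hosts instead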

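This confirms that the instance is up and the jada-agent systemd unit is active, which is the precondition for delegating tasks to the orchestrator. It does not by itself show that delegated messages are being processed; that needs an end-to-end test.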
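
The two manual checks can be folded into a single pass/fail result for scripting. The sketch below is one possible shape, not the workflow used in this session: is_healthy, probe_instance, probe_unit and check_orchestrator are hypothetical helper names, the probes need AWS credentials and SSH access and so are defined but not invoked, and the SSH probe assumes the host key is already present in known_hosts.

```shell
# is_healthy INSTANCE_STATE UNIT_STATE
# Succeeds only when the instance is "running" and the unit is "active".
is_healthy() {
  [ "$1" = "running" ] && [ "$2" = "active" ]
}

# Prints the Lightsail instance state name, e.g. "running".
probe_instance() {
  aws lightsail get-instance --instance-name jada-agent --region us-east-1 \
    --query 'instance.state.name' --output text
}

# Prints the systemd unit state, e.g. "active" or "failed".
probe_unit() {
  ssh -o ConnectTimeout=5 ubuntu@34.239.233.28 \
    "systemctl is-active jada-agent.service"
}

# Runs both probes and reports one result; returns non-zero when unhealthy.
check_orchestrator() {
  if is_healthy "$(probe_instance 2>/dev/null)" "$(probe_unit 2>/dev/null)"; then
    echo "healthy"
  else
    echo "unhealthy"
    return 1
  fi
}
```

Calling check_orchestrator from a cron job or a pre-deploy script gives an exit code to act on, instead of output that has to be read by eye.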

Key Architectural Decisions

Why Sonnet 4.6 over Opus 4.7 for the orchestrator?

Sonnet 4.6 offers the optimal balance for orchestration tasks:

Task decomposition capability: Sonnet is significantly more capable than Haiku at breaking down complex multi-step workflows into coherent subtasks with proper dependency ordering
Cost efficiency: While 2-3x more expensive than Haiku per token, we expect it to need fewer retries and less refinement, which should reduce total token consumption
Latency: Faster inference than Opus (which would be overkill for orchestration logic) and acceptable for most booking workflows
Specialist agent flexibility: Allows specialist agents to also run at Sonnet tier, creating a homogeneous model environment that reduces debugging complexity

Session-based configuration persistence: The model change takes effect on the next terminal session, not the current one. This is intentional: it forces validation in a fresh environment rather than relying on potentially cached state from the current shell session.

What's Next

To fully optimize this orchestrator setup:

Monitor actual file descriptor usage: Add CloudWatch metrics that track the number of entries in /proc/[pid]/fd/ during peak agent spawning to confirm that the tuning is actually necessary
Test cost impact: Run a representative batch of complex booking workflows (multi-step hotel + transport + activity coordination) with Sonnet to establish baseline token usage before scaling
Implement circuit breakers: Add logic in the orchestrator to limit concurrent child agent spawning based on memory/CPU utilization, not just file descriptor limits
Production hardening: Replace StrictHostKeyChecking=no with a pinned host key (a known_hosts entry for the instance) in production deployment
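
The monitoring and circuit-breaker items above could start from something as small as the sketch below. It is a hedged illustration rather than a design: fd_count, mem_available_kb and can_spawn are hypothetical helpers, the 80% descriptor threshold and the 512 MiB memory floor are arbitrary placeholders, and the /proc paths make it Linux-only.

```shell
# fd_count PID
# Prints the number of open descriptors for PID (reads /proc, Linux only).
fd_count() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# mem_available_kb
# Prints MemAvailable from /proc/meminfo in kB (Linux only).
mem_available_kb() {
  awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# can_spawn FDS_IN_USE FD_LIMIT MEM_AVAIL_KB MIN_MEM_KB
# Succeeds when descriptor usage is below 80% of the limit and at least
# MIN_MEM_KB of memory is available. Both thresholds are placeholders.
can_spawn() {
  [ $(( $1 * 100 )) -lt $(( $2 * 80 )) ] && [ "${3:-0}" -ge "$4" ]
}

# Example gate for the current shell; skipped where /proc is unavailable
# or the soft limit is reported as "unlimited".
limit=$(ulimit -S -n)
if [ -d "/proc/$$/fd" ] && [ "$limit" != "unlimited" ]; then
  if can_spawn "$(fd_count $$)" "$limit" "$(mem_available_kb)" 524288; then
    echo "ok to spawn another agent"
  else
    echo "hold off: low descriptor or memory headroom"
  fi
fi
```

The value printed by fd_count for the orchestrator's PID is also the number that would be published as a metric, so the same helper can serve both the monitoring and the gating steps.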

The current setup is validated and ready for integration testing with the full multi-agent pipeline.