Diagnosing and Resolving the JADA Agent Daemon: OAuth Token Failures and Max-Turn Limits in Production
During a routine health check of the jada-agent orchestrator daemon running on AWS Lightsail instance 34.239.233.28, we discovered a critical OAuth token failure in the port sheet sync job and identified a recurring pattern of agent sessions hitting Claude's 30-turn limit. This post details the diagnosis process, infrastructure patterns used, and the remediation strategy.
What Was Done
- Established SSH access to the Lightsail instance via AWS SSM Session Manager and temporary key provisioning (jada-key pair)
- Collected comprehensive daemon health metrics: service uptime (11 days), CPU/memory/disk utilization, and status check history
- Analyzed daemon logs for the past 24 hours, identifying three agent sessions and their outcomes
- Isolated a broken Google OAuth token in port_sheet_sync.py causing repeated 400 Bad Request errors every 30 minutes
- Documented the max-turns exit pattern: two of three sessions hit Claude's 30-turn limit (exit code 1) before task completion
- Validated that the daemon itself remains stable; exit code 1 is logged but doesn't crash the service
Technical Details: Daemon Architecture and Session Management
The jada-agent daemon is deployed as a systemd service on the Lightsail instance and manages a queue of tasks from a progress dashboard. The daemon operates in a 60-second polling loop, checking for queued work and spawning agent sessions via Claude's API. Each session has a hard limit of 30 turns (request-response pairs).
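The polling behavior described above can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: `fetch_next_task`, `run_session`, `poll_once`, `run_daemon`, and the constant names are all hypothetical, and the real queue and Claude API calls are replaced with injected callables.

```python
import time
from typing import Callable, List, Optional

POLL_INTERVAL_SECONDS = 60  # polling interval described in the post
MAX_TURNS = 30              # per-session turn limit described in the post


def poll_once(
    fetch_next_task: Callable[[], Optional[str]],
    run_session: Callable[[str, int], int],
) -> Optional[int]:
    """Run at most one queued task; return its exit code, or None if idle."""
    task = fetch_next_task()
    if task is None:
        return None
    # The exit code is simply returned to the caller. A nonzero value is
    # recorded by the loop below, not raised as an exception.
    return run_session(task, MAX_TURNS)


def run_daemon(
    fetch_next_task: Callable[[], Optional[str]],
    run_session: Callable[[str, int], int],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> List[int]:
    """Poll until max_polls is reached (None means poll forever)."""
    exit_codes: List[int] = []  # kept only so the sketch is easy to inspect
    polls = 0
    while max_polls is None or polls < max_polls:
        code = poll_once(fetch_next_task, run_session)
        if code is not None:
            exit_codes.append(code)
        polls += 1
        sleep(POLL_INTERVAL_SECONDS)
    return exit_codes
```

Injecting `sleep` and bounding the loop with `max_polls` keeps the sketch testable without waiting a real minute per iteration.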
Session activity for 2026-05-13 (UTC):
- Session 1 (00:00): Exited with code 1 after hitting 30-turn max. Task incomplete.
- Session 2 (00:02): Completed successfully. Processed e-signature page blockers and crew page generator code, created a needs-you task for manual review.
- Session 3 (00:05): Exited with code 1 after hitting 30-turn max. Task incomplete.
- After session 3, the daemon found no pending tasks and entered idle state (load average 0.00).
Why this matters: Sessions 1 and 3 didn't fail due to daemon instability; they exhausted Claude's conversation turn limit. This suggests the tasks queued require more than 30 turns to resolve—either the task scope is too broad, or the agent's reasoning is inefficient. Session 2's success shows the daemon can complete work within limits when scope is appropriate.
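To make the pattern concrete, the three sessions listed above can be tallied in a few lines. The `SessionResult` record and `max_turn_rate` helper are hypothetical names used only for this illustration; the values come from the session list in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SessionResult:
    started_utc: str     # start time as listed in the session log
    exit_code: int       # 0 on success, 1 on a max-turn exit
    hit_max_turns: bool  # True if the session ran out of turns


# The three sessions recorded for 2026-05-13 (UTC).
SESSIONS = [
    SessionResult("00:00", 1, True),
    SessionResult("00:02", 0, False),
    SessionResult("00:05", 1, True),
]


def max_turn_rate(sessions: List[SessionResult]) -> float:
    """Fraction of sessions that ended by exhausting the turn limit."""
    if not sessions:
        return 0.0
    hits = sum(1 for session in sessions if session.hit_max_turns)
    return hits / len(sessions)
```

Here two of three sessions ran out of turns. Tracking this ratio over several days would show whether the day's queue was unusual or the task scoping is a standing problem.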
Critical Issue: Google OAuth Token Failure in port_sheet_sync
The logs reveal a persistent error in the port_sheet_sync.py script, which runs every 30 minutes as a scheduled task:
[port-sheet] token error: HTTP Error 400: Bad Request
This script is responsible for syncing booking and crew data to Google Sheets. The OAuth token stored for the dangerouscentaur@gmail.com Google account—used by the auth_ga.py authentication module—has either expired or been revoked. Importantly, this is not the GA4 analytics token; it's a separate credential for Google Sheets API access.
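Root cause: Google OAuth access tokens typically expire after 3600 seconds (1 hour), so the script depends on a refresh token to obtain new ones. If the refresh token is missing, has expired, or the user revoked access from Google's security settings, the refresh request fails and every sync attempt errors out. The 400 response is consistent with this: Google's token endpoint returns HTTP 400 with an invalid_grant error when a refresh token is expired or revoked.
Current impact: Port sheet syncs have not succeeded since at least the afternoon (UTC) of 2026-05-13. This means any booking data changes are not reflected in the Google Sheet used for crew scheduling and manual review.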
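A refresh attempt and a coarse classification of its failure modes might look like the sketch below. It is not the code in port_sheet_sync.py or auth_ga.py: the function names are hypothetical, and the error strings are the standard OAuth 2.0 codes from RFC 6749 section 5.2, which Google's token endpoint uses. Reading the response body, not just the status line, is what separates a dead refresh token from a malformed request.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def classify_token_error(status: int, body: str) -> str:
    """Map a token-endpoint failure to a coarse, human-readable cause."""
    try:
        error = json.loads(body).get("error", "")
    except (ValueError, AttributeError):
        error = ""  # body was not a JSON object
    if error == "invalid_grant":
        return "refresh token expired or revoked; re-authentication required"
    if error == "invalid_request":
        return "malformed request (for example, a missing refresh_token)"
    if error == "invalid_client":
        return "client ID or secret rejected"
    return f"unclassified token error (HTTP {status})"


def refresh_access_token(client_id: str, client_secret: str, refresh_token: str) -> dict:
    """Exchange a refresh token for a new access token (network call)."""
    data = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode()
    request = urllib.request.Request(TOKEN_URL, data=data)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response)
    except urllib.error.HTTPError as exc:
        body = exc.read().decode("utf-8", "replace")
        raise RuntimeError(classify_token_error(exc.code, body)) from exc
```

Logging the classified cause in place of the bare "HTTP Error 400: Bad Request" would make it clear at a glance whether re-authentication is needed.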
Infrastructure: AWS Lightsail and SSH Access Pattern
The jada-agent daemon runs on a Lightsail instance in the default VPC. Because SSH keys were not stored in the local ~/.ssh/ directory, we used AWS's temporary credential mechanism:
- Called the Lightsail GetInstanceAccessDetails API to generate a temporary SSH key pair
- Wrote the private key to a temporary file with restricted permissions (chmod 600)
- Connected via SSH using the ephemeral key
- Cleaned up the temporary key immediately after session completion
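The four steps above can be sketched with boto3 as follows. This is an illustration under stated assumptions, not the exact commands run during the diagnosis: the helper names are hypothetical, the instance name is supplied by the caller, and AWS credentials are assumed to be configured already. The API also returns a certificate (`certKey`), which OpenSSH loads automatically when it is saved next to the identity file with a `-cert.pub` suffix.

```python
import os
import subprocess
import tempfile


def fetch_access_details(instance_name: str) -> dict:
    """Request temporary SSH credentials from Lightsail (network call)."""
    import boto3  # imported lazily so the file helper works without boto3

    client = boto3.client("lightsail")
    response = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    return response["accessDetails"]


def write_private_file(contents: str, directory: str, name: str) -> str:
    """Create a new file readable and writable by the owner only (0600)."""
    path = os.path.join(directory, name)
    # O_EXCL refuses to reuse an existing file; the mode is applied at creation.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(contents)
    return path


def ssh_with_temporary_key(instance_name: str) -> int:
    """Open an SSH session with ephemeral credentials, then discard them."""
    details = fetch_access_details(instance_name)
    # The temporary directory, and the key material in it, is deleted on exit.
    with tempfile.TemporaryDirectory() as tmp:
        key_path = write_private_file(details["privateKey"], tmp, "key")
        write_private_file(details["certKey"], tmp, "key-cert.pub")
        target = f'{details["username"]}@{details["ipAddress"]}'
        return subprocess.call(["ssh", "-i", key_path, target])
```

Creating the file with mode 0600 from the start avoids the brief window in which a key written first and restricted afterwards would be readable by other users.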
This pattern is more secure than storing the persistent jada-key locally, and each key issuance is recorded in CloudTrail as an API call, which gives an audit trail of who requested access to the instance and when.
Metrics and Health Summary
- Uptime: 11 days (healthy—instance hasn't required restart)
- Load average: 0.00 (between tasks)
- CPU: ~0.65% average during polling, no sustained spikes
- Memory: 144MB / 914MB (16% utilization—normal for a lightweight Python daemon)
- Disk: 6.2GB / 39GB (17% used, 83% free—no capacity concerns)
- Status checks: 0 failures in the past 2 hours (system is stable)
- Session quota: 3 of 5 sessions used today (2 slots available before the rolling limit resets)
Key Decisions and Reasoning
Why we diagnosed before acting: The daemon logs showed clear patterns—two max-turn failures and one success—plus the port_sheet_sync error recurring every 30 minutes. Rather than restarting the service (a common first response), we gathered metrics to distinguish between daemon instability and task-scope issues. The data showed the daemon itself is healthy; the problems are token expiration and task complexity.
Why temporary SSH keys over persistent keys: Storing SSH keys on a development machine (especially jada-key) increases the surface area for key compromise. Using Lightsail's API to issue ephemeral keys is more aligned with infrastructure-as-code practices and leaves an audit trail in CloudTrail. This is especially important for production orchestration daemons.
Why port_sheet_sync requires immediate re-authentication: The Google Sheets sync is a dependency for booking automation and crew scheduling. If it fails silently for more than a few hours, manual processes don't have fresh data, and the booking pipeline degrades.
What's Next
- Re-authenticate Google OAuth for port_sheet_sync: Run auth_ga.py --account dangerouscentaur@gmail.com (after fixing the file path issue) to generate a fresh OAuth token. Store the refresh token in the secrets directory so the script can silently refresh without manual intervention.
- Investigate max-turn exits: Review the queued tasks from sessions 1 and 3. Determine if they can be split into smaller, independently completable sub-tasks. This reduces the chance of hitting the 30-turn limit and improves task resilience.
- Add alerting: Monitor for repeated exit code 1 from the daemon and port_sheet_sync errors. Set a CloudWatch alarm or Lightsail monitoring alert to notify on-call when syncs fail for more than 60 minutes.
- Document session limits: Add comments to the daemon task queuing logic explaining the 30-turn constraint and guidance for task authors to keep scopes focused.
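The alerting rule in the list above (notify when syncs have failed for more than 60 minutes) reduces to a small check over recent sync results. The sketch below assumes each sync attempt is available as a (timestamp, succeeded) pair; how those pairs are extracted from the daemon logs, and how the alert is delivered, is left open. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

ALERT_THRESHOLD = timedelta(minutes=60)  # threshold proposed in the post


def failing_since(results: Iterable[Tuple[datetime, bool]]) -> Optional[datetime]:
    """Return the start of the current unbroken failure streak, or None."""
    streak_start = None
    for timestamp, succeeded in sorted(results):
        if succeeded:
            streak_start = None       # any success ends the streak
        elif streak_start is None:
            streak_start = timestamp  # first failure after a success
    return streak_start


def should_alert(
    results: Iterable[Tuple[datetime, bool]],
    now: datetime,
    threshold: timedelta = ALERT_THRESHOLD,
) -> bool:
    """True if syncs have been failing continuously for longer than threshold."""
    start = failing_since(results)
    return start is not None and now - start > threshold
```

Measuring from the start of the failure streak, not from the last individual error, means a single failed sync between two successes never pages anyone, while a sync that keeps failing every 30 minutes triggers the alert on the third consecutive failure.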