Diagnosing and Remediating the JADA Agent Daemon: OAuth Token Expiration and Turn-Limit Patterns

On 2026-05-13, we conducted a health audit of the JADA agent orchestrator daemon running on our Lightsail instance (34.239.233.28). The daemon itself is stable and performant, but we found two problems: an expired OAuth token that breaks the port_sheet_sync.py script, and a recurring pattern in which complex agent tasks hit the 30-turn limit configured for each session. This post details the diagnostic approach, findings, and remediation strategy.

The Diagnostic Challenge: Accessing a Secured Instance Without Stored Keys

The initial constraint was that the JADA SSH private key was not stored in the local ~/.ssh directory. Rather than maintaining plaintext keys in distributed locations, we leverage AWS Lightsail's built-in temporary credential API.

Approach:

  • Query the Lightsail API endpoint for GetInstanceAccessDetails with the instance name jada-agent
  • Parse the response to extract the temporary SSH certificate and the protocol details
  • Write the ephemeral key to a secure temporary file with chmod 600 permissions
  • Establish an SSH session and immediately remove the temporary credentials after use
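
The steps above can be sketched in Python with boto3. This is a minimal illustration, not the script used during the audit: it assumes boto3 is installed and AWS credentials are already configured, and the function names and the file name lightsail_key are hypothetical. It relies on OpenSSH's behaviour of automatically loading a certificate named <key>-cert.pub that sits next to the private key.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def fetch_access_details(instance_name: str = "jada-agent") -> dict:
    """Ask Lightsail for temporary SSH credentials for one instance."""
    import boto3  # imported lazily so the helpers below work without AWS access

    client = boto3.client("lightsail")
    response = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    return response["accessDetails"]


def write_ephemeral_credentials(details: dict, directory: str) -> Path:
    """Write the private key and certificate with owner-only (0600) permissions."""
    key_path = Path(directory) / "lightsail_key"  # hypothetical file name
    cert_path = Path(directory) / "lightsail_key-cert.pub"
    for path, body in ((key_path, details["privateKey"]),
                       (cert_path, details["certKey"])):
        # O_EXCL refuses to reuse an existing file; 0o600 is applied at creation.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
    return key_path


def build_ssh_command(details: dict, key_path: Path) -> list:
    """Build the ssh invocation; ssh finds the -cert.pub file on its own."""
    target = f"{details['username']}@{details['ipAddress']}"
    return ["ssh", "-i", str(key_path), "-o", "IdentitiesOnly=yes", target]


def open_session(instance_name: str = "jada-agent") -> int:
    """Fetch credentials, connect, and delete the key material afterwards."""
    details = fetch_access_details(instance_name)
    with tempfile.TemporaryDirectory() as tmp:
        key_path = write_ephemeral_credentials(details, tmp)
        # The temporary directory and both files are removed when this block exits.
        return subprocess.call(build_ssh_command(details, key_path))
```

Keeping the key inside a TemporaryDirectory means cleanup happens even if the SSH session ends with an error.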

Why this matters: This pattern eliminates long-lived SSH keys from developer machines. The Lightsail API generates certificates valid for only 60 seconds, reducing the blast radius if a session is compromised. For CI/CD and daemon-to-daemon communication, we're shifting toward AWS Systems Manager Session Manager, which provides full audit logging through CloudTrail.

JADA Agent Health: The Good News

The daemon is fundamentally healthy:

  • Service Status: jada-agent.service has been running continuously since May 10 (3 days uptime)
  • Resource Utilization: CPU averaging 0.65% (within normal bounds for a polling loop), memory consumption at 144MB of 914MB available
  • System Health: AWS Lightsail status checks show 0 failures in the last 2 hours; disk usage at 17% of capacity
  • Task Processing: The daemon successfully consumed 3 of 5 daily sessions (UTC), including one high-value completion that generated a needs-you task for manual crew page code review

Instance load average is near zero between tasks, indicating that the polling loop started by /etc/systemd/system/jada-agent.service sleeps between polls rather than busy-waiting.

The Critical Issue: Port Sheet Sync OAuth Token Expiration

Every 30-minute execution of port_sheet_sync.py has been failing with the same error signature:

[port-sheet] token error: HTTP Error 400: Bad Request

Root Cause: The Google OAuth2 refresh token used by port_sheet_sync.py has expired or been revoked. This script is responsible for syncing crew scheduling data from Google Sheets into our internal port sheet database.

Impact: No port sheet syncs have completed since at least the afternoon of May 13. Updates to the crew scheduling sheet are therefore not reaching the downstream systems that depend on this sync.

Technical Details: The script uses the google-auth-oauthlib library to authenticate against the Google Sheets API v4. The OAuth2 flow is initialized in /Users/cb/Documents/repos/tools/auth_ga.py (or a similar authentication utility). When the access token expires, the library uses the refresh token to request a new one. If the refresh token itself has expired or been revoked, Google's token endpoint rejects that request with a 400 Bad Request, which is the error we see in the logs.
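
The logged message carries only the HTTP status line, while the body of the token endpoint's response usually says more. The sketch below shows one way to surface it, using only the standard library. It is an illustration under assumptions rather than the code in port_sheet_sync.py: the function names are hypothetical, and it relies on Google returning a JSON body with an "error" field, where invalid_grant is the value reported for an expired or revoked refresh token.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def classify_token_error(status: int, body: str) -> str:
    """Turn a token-endpoint failure into an actionable message."""
    try:
        error = json.loads(body).get("error", "")
    except (ValueError, AttributeError):
        error = ""  # body was not a JSON object
    if status == 400 and error == "invalid_grant":
        return "refresh token expired or revoked: re-run the consent flow"
    if error == "invalid_client":
        return "client ID or secret rejected: check the OAuth client settings"
    return f"unclassified token error (HTTP {status}, error={error!r})"


def refresh_access_token(client_id: str, client_secret: str,
                         refresh_token: str) -> dict:
    """Exchange a refresh token for a new access token."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode()
    request = urllib.request.Request(TOKEN_URL, data=form)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as exc:
        # Read the error body instead of logging only "HTTP Error 400".
        detail = exc.read().decode("utf-8", errors="replace")
        raise RuntimeError(classify_token_error(exc.code, detail)) from exc
```

Logging the classified message would let a future failure distinguish a dead refresh token from a misconfigured client without a manual investigation.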

Remediation Path:

  • Re-run the OAuth2 authentication flow to obtain a fresh token pair (access token + refresh token)
  • Store the new credentials securely in the secrets backend (AWS Secrets Manager or encrypted environment variables)
  • Update the port_sheet_sync.py script to load credentials from the secrets backend rather than a local file
  • Trigger a manual sync run to confirm the token is valid
  • Monitor the next 3–5 scheduled sync cycles (every 30 minutes) to confirm consistent success
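
For the step that moves credentials into a secrets backend, a loader along these lines could replace the local-file read. It is a sketch assuming AWS Secrets Manager is the chosen backend and that the secret stores a JSON object. The field names and the function name are hypothetical, and the client parameter exists only so the function can be exercised without AWS access.

```python
import json

# Hypothetical field names for the stored OAuth client and refresh token.
REQUIRED_FIELDS = ("client_id", "client_secret", "refresh_token")


def load_oauth_credentials(secret_id: str, client=None) -> dict:
    """Fetch OAuth credentials from Secrets Manager and validate their shape."""
    if client is None:
        import boto3  # lazy import keeps the function testable with a stub client

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    payload = json.loads(response["SecretString"])
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        raise ValueError(
            f"secret {secret_id!r} is missing fields: {', '.join(missing)}"
        )
    return {field: payload[field] for field in REQUIRED_FIELDS}
```

Failing fast on a missing field means a badly written secret shows up at startup rather than as another opaque token error thirty minutes later.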

Secondary Pattern: Agent Turn Limit Exits

Two of today's three agent sessions exited with code 1, and the daemon logs these as errors:

  • Session 1 (00:00 UTC): Exited after hitting max 30 turns — complex task not fully completed
  • Session 2 (00:02 UTC): Completed successfully within turn budget — processed e-signature blockers and crew page generation
  • Session 3 (00:05 UTC): Exited after hitting max 30 turns — task scope exceeded available turns

Why This Happens: Each agent session is capped at a maximum number of turns (30 in our configuration). For complex multi-step tasks requiring iterative refinement, planning, or error recovery, the agent can exhaust this budget before completing the task.

This is Not a Crash: The daemon logs the non-zero exit code and keeps running. Tasks can be re-queued or continued in a subsequent session. However, if high-value tasks consistently hit this limit, we have two options:

  • Increase the turn limit (if API costs and latency are acceptable)
  • Refactor task decomposition in the agent's prompt engineering to produce more focused subtasks that fit within 30 turns

Next Steps: We should instrument the agent logs to track turn consumption per task type. If certain task categories consistently exceed 25 turns, those are candidates for prompt optimization or task-splitting logic.
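
The turn-tracking idea can be prototyped before touching the daemon. The sketch below assumes each finished session can be reduced to a (task_type, turns) pair. That record shape, the task type names in the test data, and the rule that a task type counts as "consistent" when at least half of its sessions exceed the threshold are all assumptions for illustration, not something the agent logs provide today.

```python
from collections import defaultdict
from statistics import mean


def summarize_turn_usage(sessions, threshold=25, min_share=0.5):
    """Group sessions by task type and flag types that often exceed the threshold.

    sessions: iterable of (task_type, turns) pairs.
    threshold: turn count above which a session counts as turn-heavy.
    min_share: fraction of turn-heavy sessions that marks a split candidate.
    """
    turns_by_type = defaultdict(list)
    for task_type, turns in sessions:
        turns_by_type[task_type].append(turns)

    report = {}
    for task_type, turns in turns_by_type.items():
        share = sum(1 for count in turns if count > threshold) / len(turns)
        report[task_type] = {
            "sessions": len(turns),
            "mean_turns": mean(turns),
            "share_over_threshold": share,
            "split_candidate": share >= min_share,
        }
    return report
```

Task types flagged as split candidates are the ones to examine first for prompt optimization or task-splitting logic.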

Infrastructure and Monitoring

The JADA agent runs on an AWS Lightsail instance with the following configuration:

  • Instance: jada-agent at 34.239.233.28
  • Service: Managed via systemd unit file at /etc/systemd/system/jada-agent.service
  • Metrics: Lightsail's built-in instance metrics provide CPU utilization, network traffic, and status check data; memory is not among the built-in instance metrics, so it has to be measured on the host
  • SSH Access: Controlled via Lightsail's temporary credential API; no persistent keys stored on developer machines

We're pulling CPU utilization and network metrics via the Lightsail API to supplement the daemon's internal logs. This hybrid approach (daemon logs + AWS metrics) gives us both application-level and infrastructure-level visibility.
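
Pulling the CPU figures might look like the sketch below, which calls the Lightsail GetInstanceMetricData operation through boto3. It assumes boto3 and configured AWS credentials. The function names, the two-hour window, and the five-minute period are illustrative choices, and the client parameter exists only to allow a stub in tests.

```python
from datetime import datetime, timedelta, timezone


def fetch_cpu_datapoints(instance_name="jada-agent", hours=2, client=None):
    """Return average CPU utilization datapoints for the last `hours` hours."""
    if client is None:
        import boto3  # lazy import keeps the function testable with a stub client

        client = boto3.client("lightsail")
    end = datetime.now(timezone.utc)
    response = client.get_instance_metric_data(
        instanceName=instance_name,
        metricName="CPUUtilization",
        period=300,  # seconds per datapoint
        startTime=end - timedelta(hours=hours),
        endTime=end,
        unit="Percent",
        statistics=["Average"],
    )
    return response["metricData"]


def summarize_cpu(datapoints):
    """Reduce datapoints to sample count, mean, and peak; None if there are none."""
    values = [point["average"] for point in datapoints if "average" in point]
    if not values:
        return None
    return {
        "samples": len(values),
        "mean": sum(values) / len(values),
        "peak": max(values),
    }
```

The same call with metricName set to NetworkIn or NetworkOut, and a matching unit, would cover the network side.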

Key Decisions and Trade-offs

Why Temporary SSH Credentials Instead of Persistent Keys: Persistent SSH keys, if leaked, provide indefinite access. Temporary credentials expire after 60 seconds and require valid AWS credentials to regenerate. This reduces operational risk in a team environment.

Why OAuth Re-authentication is Necessary: Google's OAuth2 tokens have fixed lifespans. Refresh tokens typically last months but can be revoked if the user changes their password, enables 2FA, or revokes app permissions. For long