Diagnosing and Remediating the JADA Agent Daemon: Service Health, OAuth Token Expiration, and Turn Limit Management

During a routine infrastructure audit on 2026-05-13, we found that the JADA orchestrator daemon running on our Lightsail instance (34.239.233.28) was itself operationally sound, but a dependent sync script was failing on authentication. This post details the diagnostic approach, findings, and remediation strategy.

The Situation

The jada-agent.service systemd unit was running healthily, active since May 10 (about 3 days of uptime), with negligible resource consumption. However, deeper log inspection revealed that the port_sheet_sync.py script, which syncs booking data to Google Sheets every 30 minutes, had been failing since at least the afternoon of May 13 with a consistent OAuth token error.

Diagnostic Approach: SSH Access via AWS APIs

Since the private key for the jada-key pair wasn't available locally, we skipped the stored-key route and used the AWS Lightsail API to generate temporary SSH credentials. This pattern avoids the brittleness of key management sprawl:

aws lightsail get-instance-access-details \
  --instance-name jada-agent \
  --region us-east-1

This API call returns a temporary SSH certificate and private key (valid for 60 seconds), allowing one-time access without maintaining a copy of the persistent private key on the development machine. The temporary key was written to disk, used for connection, and immediately deleted post-session.

Why this approach? It reduces the attack surface by eliminating persistent key storage on developer machines while maintaining audit trails through AWS CloudTrail. For orchestrator daemons and CI/CD systems, this pattern is preferable to distributing long-lived keys.
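
The temporary-credential flow above can be sketched in Python. This is a minimal sketch, not the exact procedure used in the audit: it assumes the JSON printed by the get-instance-access-details call has already been parsed into a dict, and that the credential fields sit under accessDetails as certKey, privateKey, username, and ipAddress. The helper name prepare_ssh_command and the file names are hypothetical.

```python
import os
from pathlib import Path


def prepare_ssh_command(response: dict, workdir: str) -> list[str]:
    """Write the temporary key and certificate to workdir and return an ssh argv.

    `response` is assumed to be the parsed JSON output of
    `aws lightsail get-instance-access-details`.
    """
    details = response["accessDetails"]
    key_path = Path(workdir) / "jada_temp_key"            # hypothetical file name
    cert_path = Path(workdir) / "jada_temp_key-cert.pub"  # hypothetical file name

    # Create the private key with 0600 permissions from the start; ssh refuses
    # private keys that other users can read.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(details["privateKey"])
    cert_path.write_text(details["certKey"])

    return [
        "ssh",
        "-i", str(key_path),
        "-o", f"CertificateFile={cert_path}",
        "-o", "IdentitiesOnly=yes",
        f"{details['username']}@{details['ipAddress']}",
    ]
```

Calling this inside a tempfile.TemporaryDirectory() block and passing the returned list to subprocess.run gives the "write, connect, delete" behavior described above, since the directory and both files are removed when the block exits.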

Service Health Findings

Once connected, we checked four things: process state, resource utilization, Lightsail status checks, and agent session activity.

- Uptime and Process State: jada-agent.service had been active since May 10 (3 days). The systemd journal showed clean starts with no restart loops or segmentation faults.
- Resource Utilization: CPU averaged 0.65% with no spikes, which is normal for a 60-second polling loop. Memory consumption was 144MB of 914MB available. Disk usage sat at 6.2GB of 39GB (17%), leaving healthy headroom.
- Status Checks: AWS Lightsail's automated health checks (TCP connectivity and instance reachability) reported zero failures in the preceding 2 hours.
- Session Activity: The daemon consumed 3 of its 5 daily sessions on May 13. Session 2 completed successfully and generated a meaningful task (blocking issues on the e-signature and crew page generator). Sessions 1 and 3 exited with code 1 after hitting the daemon's 30-turn limit on Claude API calls.
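
The process-state part of this check can be scripted. The following is a minimal sketch, assuming the properties come from `systemctl show jada-agent.service --property=ActiveState,SubState,NRestarts`, which prints one KEY=VALUE pair per line. The function names and the health rule (active, running, zero restarts) are illustrative assumptions, not the tooling used during the audit.

```python
import subprocess


def parse_systemctl_show(output: str) -> dict[str, str]:
    """Turn KEY=VALUE lines from `systemctl show` into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def unit_looks_healthy(props: dict[str, str]) -> bool:
    """Assumed rule: the unit is active, running, and has never auto-restarted."""
    return (
        props.get("ActiveState") == "active"
        and props.get("SubState") == "running"
        and int(props.get("NRestarts", "0")) == 0
    )


def check_unit(unit: str = "jada-agent.service") -> bool:
    """Query systemd on the instance itself and apply the rule above."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState,NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return unit_looks_healthy(parse_systemctl_show(result.stdout))
```

Keeping the parsing separate from the subprocess call means the rule can be exercised against captured output without access to the instance.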

The OAuth Token Failure: Root Cause and Impact

The port_sheet_sync.py script failed consistently with:

[port-sheet] token error: HTTP Error 400: Bad Request

Every 30-minute execution since at least 17:00 UTC on May 13 produced this error. A 400 response at the token step most likely means that the stored Google OAuth token (likely a refresh token) had expired or been revoked; the log line omits the response body, which would normally carry the specific OAuth error code (for example invalid_grant). The script authenticates to the Google Sheets API to sync booking data captured through the BookingAutomation.gs Apps Script bound to the QOS booking form.
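
The logged message matches the string form of Python's urllib.error.HTTPError, which leaves out the response body where OAuth token endpoints report the actual error code. Below is a minimal sketch of surfacing that body. It assumes port_sheet_sync.py uses urllib (the log format suggests this, but the script's source is not shown here), and the function name describe_token_error is hypothetical.

```python
import json
import urllib.error


def describe_token_error(exc: urllib.error.HTTPError) -> str:
    """Build a log line that includes the OAuth error code from the response body."""
    raw = exc.read().decode("utf-8", errors="replace")
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        body = {}
    if not isinstance(body, dict):
        body = {}

    # OAuth 2.0 token endpoints report failures as {"error": ..., "error_description": ...}.
    code = body.get("error", "unknown")
    detail = body.get("error_description", "")
    suffix = f"{code}: {detail}" if detail else code

    # str(exc) is the familiar "HTTP Error 400: Bad Request".
    return f"[port-sheet] token error: {exc} ({suffix})"
```

With this in place, an invalid_grant code would confirm the expired-or-revoked refresh token hypothesis, while a code such as invalid_client would point at the client ID or secret instead.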

Impact Chain:

- Booking data from the Jada form is no longer syncing to the master sheet in real time.
- The JADA daemon's task queue depends on clean data flow; incomplete syncs can leave task statuses stale.
- Port forwarding sheet updates, which are critical for crew logistics, have been blocked for at least 8 hours.

The 30-Turn Limit: Context and Trade-offs

Two of the three agent sessions on May 13 exited with code 1 after reaching the daemon's 30-turn per-session limit. This isn't a service failure; rather, it's a constraint of the agentic orchestration design. Each turn represents one Claude API call, and the 30-turn cap is a cost and latency control mechanism.

Why this limit exists: Without it, a single complex task could spiral into 100+ API calls, inflating costs and increasing latency. The limit forces task decomposition and encourages the daemon to offload work into discrete, completable units.

Observed behavior: Session 2 (which completed successfully) processed a task within the turn budget and generated a follow-up task for human intervention. Sessions 1 and 3, which hit the limit, likely attempted more complex multi-step orchestrations that exceeded the budget. The daemon continued running and picked up new tasks at the next polling cycle, so there was no hard failure, just incomplete task processing.
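
The turn budget amounts to a bounded loop. Here is a minimal sketch, assuming each turn is one call to a step function that reports whether the task is finished. The names run_session and step, and the mapping to exit codes, are hypothetical and chosen only to mirror the observed behavior (exit code 1 at the 30-turn cap); this is not the daemon's actual code.

```python
from typing import Callable

MAX_TURNS = 30  # per-session cap described in this post


def run_session(step: Callable[[int], bool], max_turns: int = MAX_TURNS) -> int:
    """Call step(turn) until it returns True or the turn budget runs out.

    Returns 0 if the task completed within the budget and 1 otherwise.
    """
    for turn in range(1, max_turns + 1):
        # In a real daemon, each step would correspond to one Claude API call.
        if step(turn):
            return 0
    return 1
```

A task that finishes on turn 5 returns 0 after five calls; a task that never finishes costs exactly 30 calls and returns 1, which is the bounded worst case the cap is meant to guarantee.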

Infrastructure Components Involved

Lightsail Instance: jada-agent in us-east-1, running systemd-managed jada-agent.service
Scripts:
- /Users/cb/Documents/repos/tools/auth_ga.py — Google Analytics authentication utility
- port_sheet_sync.py — 30-minute sync loop to Google Sheets API
- BookingAutomation.gs — Apps Script on the QOS Google Form, triggers booking events
AWS Services: Lightsail API for temporary SSH credentials, CloudWatch Metrics for CPU/memory/disk
Google Services: Sheets API, OAuth 2.0 credential flow

Key Decisions Made

- Temporary SSH Over Persistent Keys: Using the Lightsail API to generate short-lived credentials instead of storing the private key on disk reduces key sprawl and improves auditability.
- Metrics-Driven Assessment: Rather than relying on service status alone, we pulled CPU, memory, disk, and status check metrics to paint a complete picture of daemon health.
- Token Expiration as a Known Failure Mode: Google OAuth access tokens expire by design, and refresh tokens can also be revoked or expire (for example, after long disuse or while the OAuth app is still in testing status). The remediation path is to re-authenticate the port_sheet_sync.py script with fresh credentials, likely via a CLI auth flow, with the resulting credential stored securely in the deployment environment.

Remediation Path Forward

Two action items emerge:

- Re-authenticate port_sheet_sync.py: Run the Google OAuth flow for the dangerouscentaur@gmail.com account (or whichever account owns the booking sheet), capture the new refresh token, and update the stored credential in the Lightsail instance environment (likely in a repos.env file or secrets management system).
- Monitor 30-Turn Exits: If turn-limit exits become a recurring blocker, consider increasing the per-session limit, breaking complex tasks into smaller async units, or implementing a resumption mechanism where incomplete tasks are automatically re-queued.
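
The re-queue option could look like the following. This is a minimal sketch, assuming tasks live in an in-memory queue and that a session returns exit code 1 when it runs out of turns. Task, process_queue, and max_attempts are hypothetical names; the attempt cap is an added assumption so that a task which never fits the budget is escalated instead of being retried forever.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    attempts: int = 0


def process_queue(
    queue: deque,
    run: Callable[[Task], int],
    max_attempts: int = 3,
) -> list[str]:
    """Run each task; re-queue turn-limit exits up to max_attempts.

    Returns the names of tasks that exhausted their attempts and need a human.
    """
    needs_human = []
    while queue:
        task = queue.popleft()
        task.attempts += 1
        if run(task) == 0:
            continue  # completed within the turn budget
        if task.attempts < max_attempts:
            queue.append(task)  # try again later, behind the other tasks
        else:
            needs_human.append(task.name)
    return needs_human
```

In a polling daemon the re-queued task would more likely be picked up on the next cycle than in the same loop, and a useful resumption mechanism would also need to persist partial progress so a retry does not repeat the same 30 turns.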

The daemon itself is healthy and performant. The failure is isolated to external credential management—a solvable problem with a clear remediation path.