Diagnosing and Remediating the JADA Agent Daemon: Service Health, OAuth Token Expiration, and Turn Limit Management
During a routine infrastructure audit on 2026-05-13, we discovered that the JADA orchestrator daemon running on our Lightsail instance (34.239.233.28) was operationally sound but suffered from a critical authentication failure in a dependent service. This post details the diagnostic approach, findings, and remediation strategy.
The Situation
The jada-agent.service systemd unit was running healthily with 3 days of uptime (active since May 10) and negligible resource consumption. However, deeper log inspection revealed that the port_sheet_sync.py script, which syncs booking data to Google Sheets every 30 minutes, had been failing since at least the afternoon of May 13 with a consistent OAuth token error.
Diagnostic Approach: SSH Access via AWS APIs
Since the private key for the jada-key pair wasn't available locally, we used the AWS Lightsail API to obtain temporary SSH credentials instead of relying on a stored key. This pattern avoids the key management sprawl that comes with copying long-lived keys between machines:
aws lightsail get-instance-access-details \
--instance-name jada-agent \
--region us-east-1
This API call returns a temporary SSH certificate and private key (valid for 60 seconds), allowing one-time access without maintaining a copy of the persistent private key on the development machine. The temporary key was written to disk, used for connection, and immediately deleted post-session.
Why this approach? It reduces the attack surface by eliminating persistent key storage on developer machines while maintaining audit trails through AWS CloudTrail. For orchestrator daemons and CI/CD systems, this pattern is preferable to distributing long-lived keys.
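The write-use-delete handling of the temporary key can be sketched in Python. This is a minimal sketch, not the exact procedure used during the audit: it assumes the JSON from get-instance-access-details has already been parsed and that its accessDetails object carries privateKey, certKey, username, and ipAddress fields. The function name, file names, and placeholder values are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_temp_credentials(access_details: dict, directory: str) -> list[str]:
    """Write the temporary key and certificate, then return an ssh argv list."""
    key_path = Path(directory) / "lightsail_temp_key"
    # OpenSSH automatically loads a certificate named <identity>-cert.pub.
    cert_path = Path(directory) / "lightsail_temp_key-cert.pub"

    # Create the private key with owner-only permissions from the start.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(access_details["privateKey"])
    cert_path.write_text(access_details["certKey"])

    target = f'{access_details["username"]}@{access_details["ipAddress"]}'
    return ["ssh", "-i", str(key_path), "-o", "IdentitiesOnly=yes", target]


if __name__ == "__main__":
    # Placeholder values for illustration only (not real credentials).
    details = {
        "privateKey": "PLACEHOLDER PRIVATE KEY\n",
        "certKey": "PLACEHOLDER CERTIFICATE\n",
        "username": "ubuntu",
        "ipAddress": "203.0.113.10",
    }
    # The directory and both files are removed when the block exits.
    with tempfile.TemporaryDirectory() as workdir:
        print(" ".join(write_temp_credentials(details, workdir)))
```

Wrapping the call in a temporary directory means the key material is removed even if the SSH session fails partway through, which matches the "deleted post-session" step described above.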
Service Health Findings
Once connected, we collected comprehensive metrics across multiple dimensions:
- Uptime and Process State: jada-agent.service has been active since May 10 (3 days). The systemd journal showed clean starts with no restart loops or segmentation faults.
- Resource Utilization: CPU averaged 0.65% with no spikes, which is normal for a 60-second polling loop. Memory consumption was 144MB of 914MB available. Disk usage sat at 6.2GB of 39GB (17%), indicating healthy headroom.
- Status Checks: AWS Lightsail's automated health checks (TCP connectivity and instance reachability) reported zero failures in the preceding 2 hours.
- Session Activity: The daemon consumed 3 of its 5 daily sessions on May 13. Session 2 completed successfully and generated a meaningful task (blocking issues on the e-signature and crew page generator). Sessions 1 and 3 exited with code 1 after hitting the 30-turn Claude API limit.
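The process-state check in the first bullet could be automated roughly as follows. This is a sketch under assumptions: it expects the key=value text printed by `systemctl show jada-agent.service --property=ActiveState,SubState,NRestarts`, and the function names and the zero-restart threshold are illustrative choices rather than anything taken from the daemon.

```python
def parse_systemctl_show(output: str) -> dict[str, str]:
    """Parse the key=value lines printed by `systemctl show`."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def looks_healthy(props: dict[str, str], max_restarts: int = 0) -> bool:
    """Treat the unit as healthy if it is running and has not been restarting."""
    return (
        props.get("ActiveState") == "active"
        and props.get("SubState") == "running"
        and int(props.get("NRestarts", "0")) <= max_restarts
    )


# Illustrative sample text, not captured from the instance.
sample = "ActiveState=active\nSubState=running\nNRestarts=0\n"
print(looks_healthy(parse_systemctl_show(sample)))
```

A check like this only covers the unit itself; as the findings above show, a dependent script can fail while the unit reports a healthy state.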
The OAuth Token Failure: Root Cause and Impact
The port_sheet_sync.py script failed consistently with:
[port-sheet] token error: HTTP Error 400: Bad Request
Every 30-minute execution since at least 17:00 UTC on May 13 produced this error. The log line carries only the HTTP status, so the cause is inferred rather than confirmed: the most likely explanation is that the stored Google OAuth token (likely a refresh token) had expired or been revoked. The script authenticates to the Google Sheets API to sync booking data captured through the BookingAutomation.gs Apps Script bound to the QOS booking form.
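The logged message matches the format Python's urllib uses for an HTTPError, which suggests (as an assumption about the script) that only the status line is being recorded. Google's token endpoint normally returns a JSON body with an error field, and invalid_grant is the value it uses for an expired or revoked refresh token. A small helper along these lines could make the next failure self-explanatory; the function name and messages are hypothetical.

```python
import json


def classify_token_error(status: int, body: str) -> str:
    """Turn a token-endpoint error response into an actionable hint."""
    try:
        error = json.loads(body).get("error", "")
    except (ValueError, AttributeError):
        error = ""  # body was not a JSON object
    if status == 400 and error == "invalid_grant":
        return "refresh token expired or revoked; re-run the consent flow"
    if error == "invalid_client":
        return "client ID or secret rejected; check the OAuth client config"
    return f"unclassified token error (HTTP {status}, error={error!r})"


# With urllib, the body is available on the exception object, for example:
#     except urllib.error.HTTPError as exc:
#         hint = classify_token_error(exc.code, exc.read().decode())
print(classify_token_error(400, '{"error": "invalid_grant"}'))
```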
Impact Chain:
- Booking data from the Jada form is no longer syncing to the master sheet in real time.
- The JADA daemon's task queue depends on clean data flow; incomplete syncs can leave stale task status.
- Port forwarding sheet updates, which are critical for crew logistics, have been blocked for at least 8 hours.
The 30-Turn Limit: Context and Trade-offs
Two of today's three agent sessions exited with code 1 after reaching the 30-turn limit on their Claude conversations. This isn't a service failure; rather, it's a constraint of the agentic orchestration design. Each turn represents a Claude API call, and the 30-turn cap is a cost and latency control mechanism.
Why this limit exists: Without it, a single complex task could spiral into 100+ API calls, inflating costs and increasing latency. The limit forces task decomposition and encourages the daemon to offload work into discrete, completable units.
Observed behavior: Session 2 (which completed successfully) processed a task within the turn budget and generated a follow-up task for manual human intervention. Sessions 1 and 3, which hit the limit, likely attempted more complex multi-step orchestrations that exceeded the budget. The daemon continued running and picked up new tasks at the next polling cycle, so there's no hard failure—just incomplete task processing.
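The turn budget described here amounts to a bounded loop. The sketch below is not the daemon's code: take_turn is a hypothetical callable standing in for one Claude API call, and the exit codes simply mirror the behavior reported above (0 on completion, 1 when the budget runs out).

```python
from typing import Callable

TURN_LIMIT = 30  # per-session budget described in the post


def run_session(take_turn: Callable[[int], bool], turn_limit: int = TURN_LIMIT) -> int:
    """Return 0 if the task finished within budget, 1 if the budget ran out."""
    for turn in range(1, turn_limit + 1):
        # take_turn returns True once the task reports completion.
        if take_turn(turn):
            return 0
    return 1


# A task that finishes on turn 12 succeeds; one that never finishes exits with 1.
print(run_session(lambda turn: turn == 12))
print(run_session(lambda turn: False))
```

Because the loop returns instead of raising, the caller can treat a budget exit as an ordinary outcome and move on to the next polling cycle.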
Infrastructure Components Involved
- Lightsail Instance: jada-agent in us-east-1, running systemd-managed jada-agent.service
- Scripts:
  - /Users/cb/Documents/repos/tools/auth_ga.py: Google Analytics authentication utility
  - port_sheet_sync.py: 30-minute sync loop to Google Sheets API
  - BookingAutomation.gs: Apps Script on the QOS Google Form, triggers booking events
- AWS Services: Lightsail API for temporary SSH credentials, CloudWatch Metrics for CPU/memory/disk
- Google Services: Sheets API, OAuth 2.0 credential flow
Key Decisions Made
- Temporary SSH Over Persistent Keys: Using the Lightsail API to generate one-time credentials instead of storing the private key on disk reduces key sprawl and improves auditability.
- Metrics-Driven Assessment: Rather than relying on service status alone, we pulled CPU, memory, disk, and status check metrics to paint a complete picture of daemon health.
- Token Expiration as a Known Failure Mode: Google OAuth access tokens expire by design, and refresh tokens can also expire or be revoked. The remediation path is to re-authenticate the port_sheet_sync.py script with fresh credentials, likely via a CLI auth flow, with the resulting credential stored securely in the deployment environment.
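For the re-authentication step, the first stage of a CLI auth flow is building the consent URL. The sketch below uses only the standard library and makes several assumptions: the Sheets scope shown may not be the one port_sheet_sync.py actually requests, and the client ID and redirect URI are placeholders. The access_type=offline parameter is what asks Google for a refresh token, and prompt=consent forces a fresh one to be issued.

```python
from urllib.parse import urlencode

AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
# Assumed scope; the script may use a narrower or different one.
SHEETS_SCOPE = "https://www.googleapis.com/auth/spreadsheets"


def build_consent_url(client_id: str, redirect_uri: str) -> str:
    """Build the URL the account owner opens to grant access again."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": SHEETS_SCOPE,
        "access_type": "offline",  # request a refresh token
        "prompt": "consent",       # force a new refresh token to be issued
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"


# Placeholder client ID and redirect URI for illustration only.
print(build_consent_url("example-client-id", "http://localhost:8080/"))
```

The authorization code returned to the redirect URI is then exchanged at the token endpoint for the new refresh token.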
Remediation Path Forward
Two action items emerge:
- Re-authenticate port_sheet_sync.py: Run the Google OAuth flow for the dangerouscentaur@gmail.com account (or whichever account owns the booking sheet), capture the new refresh token, and update the stored credential in the Lightsail instance environment (likely in a repos.env file or secrets management system).
- Monitor 30-Turn Exits: If turn-limit exits become a recurring blocker, consider either increasing the per-session limit, breaking complex tasks into smaller async units, or implementing a resumption mechanism where incomplete tasks are automatically re-queued.
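The resumption option in the second item could look something like this. It is a sketch of one possible design, not an existing feature of the daemon: the Task class, the attempt cap of three, and the treatment of every nonzero exit code as retryable are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    attempts: int = 0


def requeue_incomplete(queue: deque, task: Task, exit_code: int,
                       max_attempts: int = 3) -> bool:
    """Put a task back on the queue after a failed session.

    Returns True if the task was re-queued, False if it completed or
    has used up its attempts and needs human attention.
    """
    if exit_code == 0:
        return False  # finished normally, nothing to resume
    task.attempts += 1
    if task.attempts >= max_attempts:
        return False  # stop retrying and escalate instead
    queue.append(task)
    return True


pending = deque()
task = Task("example-task")  # hypothetical task name
print(requeue_incomplete(pending, task, exit_code=1), len(pending))
```

Capping the attempts keeps a task that can never fit in the turn budget from consuming every daily session.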
The daemon itself is healthy and performant. The failure is isolated to external credential management—a solvable problem with a clear remediation path.