Diagnosing and Stabilizing the JADA Agent Orchestrator: OAuth Token Refresh and Task Execution Analysis

```html

Over the past development session, we conducted a comprehensive health audit of the JADA agent daemon running on our Lightsail instance (34.239.233.28), uncovered a critical OAuth token failure in the port sheet sync pipeline, and implemented diagnostic tooling to prevent future silent failures in scheduled jobs. This post walks through the investigation methodology, infrastructure decisions, and architectural patterns we used to surface and address these issues.

What Was Done

Established secure SSH access to the Lightsail instance via AWS Systems Manager and temporary credential injection (avoiding hardcoded key management)
Pulled real-time service metrics, daemon logs, and process state from the running jada-agent.service
Identified a broken Google OAuth token in port_sheet_sync.py that causes the sync to fail on every 30-minute run
Documented task execution patterns, turn limits, and session accounting for the Claude agent loop
Created auth_ga.py as a reusable OAuth refresh utility for Google APIs (Google Analytics 4, Sheets, etc.)
Verified daemon health across CPU, memory, disk, network, and status check metrics

Technical Details: Service Status and Metrics

The jada-agent.service systemd unit on the Lightsail instance has been running continuously for 3 days with no restarts. Key observations:

CPU utilization: 0.65% average over the polling window—expected for a 60-second task loop with idle time between invocations
Memory footprint: 144 MB of 914 MB available—well within acceptable bounds
Disk usage: 6.2 GB of 39 GB (17%)—ample headroom for logs and task artifacts
Instance uptime: 11 days; load average near zero, indicating the instance spends most time idle between scheduled agent runs
Status checks: Zero failures in the last 2 hours—system-level health is nominal

The daemon follows a poll-and-execute pattern: it checks the progress dashboard every 60 seconds for new tasks, claims available work, executes the Claude agent loop with a 30-turn limit per session, and logs outcomes. Today (May 13), the daemon consumed 3 of its 5 allowed sessions:

Session 1 (00:00 UTC): Hit the 30-turn limit and exited with code 1. No task completion, but no crash.
Session 2 (00:02 UTC): Completed successfully. Processed e-signature and crew page generator blockers, and created a needs-you task for manual intervention.
Session 3 (00:05 UTC): Hit the 30-turn limit again and exited with code 1.
After 00:05: Daemon idled—no new tasks queued to the progress dashboard.

The two max-turn exits are logged as errors but do not crash the daemon. This is by design: complex tasks may require more than 30 turns, and we need visibility into when that happens. Sessions 1 and 3 likely represent high-complexity work that needs either scope reduction or a turn limit increase.
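
The poll-and-execute pattern described above can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: `claim_task` and `step` are hypothetical callables standing in for the dashboard client and a single agent turn, and the daily reset of the session counter is left out.

```python
import time
from typing import Callable, Optional

MAX_TURNS = 30             # per-session turn limit described in the post
MAX_SESSIONS_PER_DAY = 5   # daily session budget described in the post
POLL_INTERVAL_S = 60       # dashboard polling interval described in the post


class MaxTurnsExceeded(Exception):
    """Raised when a session uses every allowed turn without finishing."""


def run_session(task: str, step: Callable[[str, int], bool],
                max_turns: int = MAX_TURNS) -> int:
    """Run one agent session and return the number of turns used.

    `step(task, turn)` is a hypothetical callable that performs one agent
    turn and returns True once the task is complete.
    """
    for turn in range(1, max_turns + 1):
        if step(task, turn):
            return turn
    raise MaxTurnsExceeded(f"{task!r} not finished after {max_turns} turns")


def poll_once(claim_task: Callable[[], Optional[str]],
              step: Callable[[str, int], bool],
              sessions_used: int,
              log: Callable[[str], None] = print) -> int:
    """Perform one poll iteration and return the updated session count."""
    if sessions_used >= MAX_SESSIONS_PER_DAY:
        return sessions_used          # budget exhausted: stay idle
    task = claim_task()
    if task is None:
        return sessions_used          # nothing queued on the dashboard
    try:
        turns = run_session(task, step)
        log(f"session ok: {task} in {turns} turns")
    except MaxTurnsExceeded as exc:
        # Logged as an error (the "exit code 1" case) without crashing the loop.
        log(f"session error (exit 1): {exc}")
    return sessions_used + 1


def run_daemon(claim_task, step, sleep=time.sleep) -> None:
    """Poll forever; a real daemon would also reset the counter each day."""
    sessions_used = 0
    while True:
        sessions_used = poll_once(claim_task, step, sessions_used)
        sleep(POLL_INTERVAL_S)
```

Catching the max-turn condition inside `poll_once` is what keeps a long-running task from taking the whole loop down, while still counting the attempt against the daily budget.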

Critical Issue: Broken OAuth Token in Port Sheet Sync

While reviewing daemon logs, we discovered that port_sheet_sync.py has been failing every 30 minutes with the same error:

[port-sheet] token error: HTTP Error 400: Bad Request

An HTTP 400 from Google's token endpoint on a refresh request most often means the stored refresh token has expired or been revoked (reported as invalid_grant in the response body). The script syncs booking data from our Google Sheets backend (housed in /Users/cb/Documents/repos/sites/queenofsandiego.com/) to downstream databases, but the stored token is no longer valid, so no booking updates have flowed since at least this afternoon.
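
A bare "400: Bad Request" hides the OAuth error code that the token endpoint returns in the response body. The sketch below shows one way the log line could be made more specific. It assumes the script catches `urllib.error.HTTPError`, which the source does not confirm, and `describe_token_error` is a hypothetical helper name; the error codes are the standard ones from the OAuth 2.0 specification.

```python
import json

# Suggested follow-ups for the standard OAuth 2.0 token-endpoint error codes.
HINTS = {
    "invalid_grant": "refresh token expired or revoked; re-authorize the account",
    "invalid_client": "client ID or secret rejected; check the client credentials",
    "invalid_request": "malformed refresh request; check the request parameters",
}


def describe_token_error(status: int, body: bytes) -> str:
    """Turn a token-endpoint error response into a log-friendly message."""
    try:
        payload = json.loads(body.decode("utf-8", errors="replace"))
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    code = payload.get("error", "unknown")
    detail = payload.get("error_description", "")
    hint = HINTS.get(code, "inspect the raw response body")
    return f"HTTP {status} {code}: {detail} ({hint})"


# Possible call site inside the sync script (assumed structure):
#
#     except urllib.error.HTTPError as exc:
#         print("[port-sheet] token error:",
#               describe_token_error(exc.code, exc.read()))
```

With this in place, an expired or revoked token shows up in the log as invalid_grant rather than an undifferentiated 400.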

To address this systematically, we created /Users/cb/Documents/repos/tools/auth_ga.py, a reusable OAuth authentication utility. This script:

Uses the google-auth-oauthlib library to handle the OAuth 2.0 flow interactively
Stores refreshed credentials in a secrets directory with restricted file permissions (chmod 600)
Supports multiple Google accounts (we tested with dangerouscentaur@gmail.com)
Can be invoked with account identifiers: python3 auth_ga.py --account dangerouscentaur@gmail.com
Reuses client credentials already stored in the secrets vault (avoiding the need to create new OAuth apps)

The script is designed to be run once per credential refresh cycle and stores the token where downstream scripts (like port_sheet_sync.py) can read it. This follows the principle of credential rotation without requiring manual Secret Manager updates.
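
For readers who want a starting point, a utility of this shape might look like the sketch below. It is not the contents of auth_ga.py: the secrets directory, the token file naming, the `--client-secrets` flag, and the scope list are all assumptions made for illustration. Only the `--account` flag, the use of google-auth-oauthlib, and the 600 file mode come from the description above.

```python
import argparse
import os
from pathlib import Path

# Assumed scopes: read/write Sheets plus read-only Analytics.
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/analytics.readonly",
]
SECRETS_DIR = Path.home() / ".secrets"   # hypothetical location


def token_path(account: str, secrets_dir: Path = SECRETS_DIR) -> Path:
    """Map an account identifier to a per-account token file (assumed naming)."""
    safe = account.replace("@", "_at_").replace("/", "_")
    return secrets_dir / f"token_{safe}.json"


def save_token(path: Path, token_json: str) -> None:
    """Write the token so that only the owning user can read it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(token_json)
    os.chmod(path, 0o600)   # tighten the mode if the file already existed


def authorize(client_secrets: Path, account: str) -> Path:
    """Run the interactive OAuth flow and store the resulting credentials."""
    # Imported lazily so the helpers above work without the library installed.
    from google_auth_oauthlib.flow import InstalledAppFlow

    flow = InstalledAppFlow.from_client_secrets_file(str(client_secrets), scopes=SCOPES)
    # prompt="consent" asks Google to issue a fresh refresh token.
    creds = flow.run_local_server(port=0, login_hint=account, prompt="consent")
    destination = token_path(account)
    save_token(destination, creds.to_json())
    return destination


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Refresh a Google OAuth token")
    parser.add_argument("--account", required=True)
    parser.add_argument("--client-secrets", type=Path, required=True)  # hypothetical flag
    args = parser.parse_args(argv)
    print(f"token written to {authorize(args.client_secrets, args.account)}")
```

Calling `main()` from an `if __name__ == "__main__":` guard would give the command-line behavior described above. Creating the file with mode 600 at open time, rather than running chmod afterwards, avoids a window in which the token is world-readable.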

Infrastructure: SSH Access Pattern and Metric Collection

We did not rely on locally stored private keys for SSH access. Instead, we used:

AWS Systems Manager Session Manager: A secure, auditable, key-free alternative for interactive shell access (requires IAM permissions but no key management).
Lightsail temporary credential API: We called the Lightsail API to retrieve a temporary SSH certificate signed by the instance's key pair, avoiding the need to store private keys on disk.
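Metrics via CloudWatch / Lightsail API: CPU, network, and status check data were pulled from the Lightsail metrics endpoint, reducing the need for agent-side instrumentation. Lightsail's instance metrics do not cover memory or disk usage, so those two figures have to be read on the instance itself.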

This pattern is more resilient than managing SSH keys in ~/.ssh/, since credential rotation is automatic and audit logs flow through AWS CloudTrail.
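
The temporary-credential step can be sketched with boto3's Lightsail client, whose `get_instance_access_details` call returns a short-lived private key and certificate. This is an illustration under assumptions rather than the commands we ran: the function names and file names are hypothetical, and the key material is written only to a temporary directory that is deleted when the session ends.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def fetch_access_details(instance_name: str, region: str) -> dict:
    """Ask Lightsail for temporary SSH credentials for one instance."""
    import boto3   # third-party; imported lazily

    client = boto3.client("lightsail", region_name=region)
    response = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    return response["accessDetails"]


def write_key_material(details: dict, directory: Path) -> list[str]:
    """Write the temporary key and certificate; return an ssh command line."""
    key_file = directory / "lightsail_temp_key"
    cert_file = directory / "lightsail_temp_key-cert.pub"
    for path, content in ((key_file, details["privateKey"]),
                          (cert_file, details["certKey"])):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as fh:
            # ssh expects key files to end with a newline.
            fh.write(content if content.endswith("\n") else content + "\n")
    return [
        "ssh", "-i", str(key_file),
        "-o", f"CertificateFile={cert_file}",
        f"{details['username']}@{details['ipAddress']}",
    ]


def connect(instance_name: str, region: str) -> int:
    """Open an interactive session; the key files vanish when it ends."""
    details = fetch_access_details(instance_name, region)
    with tempfile.TemporaryDirectory() as tmp:
        return subprocess.call(write_key_material(details, Path(tmp)))
```

Because the credentials expire on their own, nothing long-lived needs to be kept in ~/.ssh/ between sessions.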

Metrics queries included:

CPUUtilization — 60-second granularity over the last 2 hours
NetworkIn / NetworkOut — to verify network connectivity and outbound traffic (task API calls, daemon checkins)
StatusCheckFailed — system and instance-level health
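
Queries like these can be issued through boto3's `get_instance_metric_data`. The sketch below is one possible shape, with hypothetical helper names; the instance name and region are parameters the caller must supply, and the granularity Lightsail accepts can differ by metric, so `period` is left configurable.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def fetch_metric(instance_name: str, metric_name: str, unit: str, region: str,
                 hours: int = 2, period: int = 60) -> list:
    """Fetch datapoints for one Lightsail instance metric.

    Typical pairs: ("CPUUtilization", "Percent"), ("NetworkIn", "Bytes"),
    ("NetworkOut", "Bytes"), ("StatusCheckFailed", "Count").
    """
    import boto3   # third-party; imported lazily

    client = boto3.client("lightsail", region_name=region)
    end = datetime.now(timezone.utc)
    response = client.get_instance_metric_data(
        instanceName=instance_name,
        metricName=metric_name,
        period=period,
        startTime=end - timedelta(hours=hours),
        endTime=end,
        unit=unit,
        statistics=["Average", "Maximum", "Sum"],
    )
    return response["metricData"]


def summarize(datapoints: list, stat: str = "average") -> Optional[dict]:
    """Reduce datapoints to a count, mean, and max of one statistic."""
    values = [p[stat] for p in datapoints if stat in p]
    if not values:
        return None
    return {"count": len(values), "mean": sum(values) / len(values), "max": max(values)}


def any_failures(datapoints: list) -> bool:
    """True if any StatusCheckFailed datapoint recorded a failure."""
    return any(p.get("sum", 0) > 0 for p in datapoints)
```

For example, `summarize(fetch_metric(name, "CPUUtilization", "Percent", region))` would yield the kind of average reported earlier, and `any_failures` applied to the StatusCheckFailed series answers the health question directly.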

Key Decisions and Rationale

Why avoid hardcoded SSH keys? Long-lived private keys stored in ~/.ssh/ are a maintenance burden and an attack surface: they have to be distributed, rotated, and revoked by hand. Using the Lightsail API for temporary credentials is more auditable and scales better across team members.

Why create a separate auth_ga.py utility? Google OAuth access tokens have a finite lifetime (typically 1 hour) and must be renewed using a refresh token, which can itself expire or be revoked. Rather than embed OAuth logic in every script that needs Google API access, we centralized it into a utility that can be called during maintenance windows or CI/CD deployments. This reduces code duplication and makes credential rotation a single point of control.
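
On the consuming side, a downstream script could load the stored token and renew the access token only when it is close to expiry. The sketch below assumes the token file is the JSON produced by google-auth's `Credentials.to_json()`, including its `token` and `expiry` fields; the helper names and the five-minute safety margin are illustrative choices, not something the source specifies.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional


def parse_expiry(value: str) -> datetime:
    """Parse an ISO 8601 expiry such as '2024-05-13T01:00:00Z' as UTC."""
    return datetime.fromisoformat(value.rstrip("Z")).replace(tzinfo=timezone.utc)


def needs_refresh(token: dict, now: Optional[datetime] = None,
                  skew: timedelta = timedelta(minutes=5)) -> bool:
    """True if the access token is missing or expires within `skew`."""
    expiry = token.get("expiry")
    if not token.get("token") or not expiry:
        return True
    now = now or datetime.now(timezone.utc)
    return parse_expiry(expiry) - skew <= now


def load_credentials(path: Path):
    """Load stored credentials, refreshing and re-saving them if needed."""
    # Third-party imports are lazy so needs_refresh() works without them.
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials

    token = json.loads(path.read_text())
    creds = Credentials.from_authorized_user_info(token)
    if needs_refresh(token):
        # Raises google.auth.exceptions.RefreshError if the refresh token
        # itself is expired or revoked, which is the failure seen in the logs.
        creds.refresh(Request())
        path.write_text(creds.to_json())
    return creds
```

Keeping the expiry check free of third-party imports also makes it usable as a cheap pre-flight check before a scheduled job starts.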

Why is the 30-turn limit important? The daemon caps each Claude session at 30 turns to prevent runaway costs and infinite loops. Our agent sessions hit this limit when tasks are structurally complex (e.g., large code reviews, multi-file refactoring). Rather than treating this as a fatal failure, we log it transparently and allow the daemon to re-attempt the task in the next session. This is a form