Orchestrator Daemon Health Monitoring & OAuth Token Recovery: Diagnosing the jada-agent Service
During a routine infrastructure health check, we discovered that the jada-agent orchestrator daemon on our Lightsail instance (34.239.233.28) was running smoothly, but a critical dependency—the Google OAuth token for port sheet synchronization—had expired. This post walks through the diagnostic approach, infrastructure patterns, and remediation strategy we implemented.
What Was Done
We performed a comprehensive health audit of the jada-agent.service running on a dedicated AWS Lightsail instance, uncovering:
- Service Status: Active and healthy with 3 days of uptime, stable resource utilization
- Token Failure: Google OAuth token for port_sheet_sync.py broken; 30-minute sync intervals failing with HTTP 400 errors
- Agent Execution: Task queue processing healthy; two of today's three sessions hit the 30-turn Claude limit (by design), one completed successfully
- Infrastructure Metrics: CPU 0.65% average, memory 144MB/914MB, disk 6.2GB/39GB—all nominal
Technical Details: Diagnostic Workflow
Since the SSH private key was not stored locally at the standard ~/.ssh/jada-key path, we used a multi-pronged approach:
1. AWS Systems Manager Session Manager + Lightsail API
Rather than hunting for a missing private key, we leveraged the Lightsail API to request temporary SSH credentials:
aws lightsail get-instance-access-details \
--instance-name jada-agent-orchestrator \
--region us-east-1
This returned a temporary certificate and private key valid for 60 seconds, which we wrote to a temporary file and immediately used for SSH connection. The rationale: AWS-managed credentials are auditable, time-bound, and don't require storing long-lived keys on the workstation.
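The "write to a temporary file and connect" step can be sketched in Python as below. This is a minimal illustration, not the exact commands we ran: the function name stage_temp_credentials is hypothetical, and the field names (privateKey, certKey, username, ipAddress) assume the accessDetails object in the API response, so verify them against your own output.

```python
import os
import stat
import tempfile


def stage_temp_credentials(access_details, workdir=None):
    """Write the temporary key and certificate to disk; return an ssh argv list.

    `access_details` is assumed to be the `accessDetails` object from the
    get-instance-access-details response, already parsed into a dict.
    """
    workdir = workdir or tempfile.mkdtemp(prefix="lightsail-ssh-")
    key_path = os.path.join(workdir, "id_temp")
    cert_path = key_path + "-cert.pub"

    # Create the private key with 0600 permissions from the start; ssh refuses
    # to use a private key that other users can read.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(access_details["privateKey"])
    with open(cert_path, "w") as fh:
        fh.write(access_details["certKey"])

    target = f'{access_details["username"]}@{access_details["ipAddress"]}'
    # CertificateFile points ssh at the signed certificate that accompanies the key.
    return ["ssh", "-i", key_path, "-o", f"CertificateFile={cert_path}", target]
```

The returned list can be handed to subprocess.run. Because the credentials expire quickly, stage them and connect in a single step rather than caching them.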
2. Service Health Collection
Once connected, we gathered health signals via:
- systemctl status jada-agent.service: confirmed active, running since May 10, 3-day uptime
- journalctl -u jada-agent.service -n 100: reviewed last 100 log lines for errors
- ps aux | grep jada: verified daemon process was consuming reasonable CPU/memory
- free -h and df -h: confirmed sufficient RAM and disk space
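We ran these checks by hand, but the service-state check is easy to script. The sketch below assumes the input is the KEY=VALUE text printed by systemctl show jada-agent.service --property=ActiveState,SubState. The helper names are hypothetical and were not part of our audit.

```python
def parse_systemctl_show(output):
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines that have no '='
            props[key.strip()] = value.strip()
    return props


def is_healthy(props):
    """A unit counts as healthy here only when it is active and running."""
    return props.get("ActiveState") == "active" and props.get("SubState") == "running"
```

For example, is_healthy(parse_systemctl_show(text)) gives a single boolean that a cron job or monitoring hook could act on.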
3. CloudWatch Metrics via Lightsail API
We pulled 2-hour historical metrics directly from AWS to avoid relying solely on in-instance tools:
aws lightsail get-instance-metric-data \
--instance-name jada-agent-orchestrator \
--metric-name CPUUtilization \
--statistics Average \
--start-time 2026-05-13T15:00:00Z \
--end-time 2026-05-13T17:00:00Z \
--period 60
This confirmed no CPU spikes or anomalous load patterns over the observation window.
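To make "no CPU spikes" checkable rather than eyeballed, the JSON response can be summarized with a few lines of Python. This sketch assumes the response has a metricData list whose entries carry an average field. The 50% spike threshold is an arbitrary illustrative value, not one we used.

```python
import json


def summarize_cpu(response_json, spike_threshold=50.0):
    """Summarize CPUUtilization datapoints from a get-instance-metric-data response."""
    points = json.loads(response_json).get("metricData", [])
    values = [p["average"] for p in points if "average" in p]
    if not values:
        return None  # no datapoints in the requested window
    return {
        "samples": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
        # Count datapoints above the (assumed) threshold, in percent.
        "spikes": sum(1 for v in values if v > spike_threshold),
    }
```

A result with spikes equal to 0 and a max close to the mean corresponds to the flat profile described above.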
Infrastructure Architecture
The jada-agent orchestrator follows a stateless daemon + task queue pattern:
- Compute: AWS Lightsail instance running systemd-managed jada-agent.service
- Task Queue: External progress dashboard polled every cycle (no local queue persistence; tasks are fetched on-demand)
- Session Management: Claude API with 30-turn-per-session limits; multiple short sessions preferred over single long-running sessions for fault isolation
- Dependent Integrations: Google Sheets API (via OAuth token in port_sheet_sync.py), S3 deployments, CloudFront invalidations
The daemon's idle baseline (0.00 load average) indicates the polling loop's 60-second interval is efficient; CPU only spikes when tasks are dequeued and processed.
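The stateless polling pattern can be sketched as follows. This is not the jada-agent source: fetch_tasks and handle_task are hypothetical callables standing in for the dashboard query and the agent session launcher. The max_cycles and sleep parameters exist only to make the loop testable.

```python
import time


def run_polling_loop(fetch_tasks, handle_task, interval_seconds=60,
                     max_cycles=None, sleep=time.sleep):
    """Poll for tasks, process whatever is returned, then sleep until the next cycle.

    No task state is kept between cycles: each cycle asks the external
    source for its current tasks, so a restart loses nothing.
    """
    cycles = 0
    handled = 0
    while max_cycles is None or cycles < max_cycles:
        for task in fetch_tasks():
            handle_task(task)
            handled += 1
        cycles += 1
        sleep(interval_seconds)  # the process is idle here, hence the near-zero load
    return handled
```

Because nearly all wall-clock time is spent in sleep, CPU usage rises only while handle_task is running, which matches the idle baseline noted above.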
Critical Finding: Google OAuth Token Expiration
The port_sheet_sync.py script, scheduled to run every 30 minutes, has been failing with consistent HTTP 400 errors in its OAuth token refresh attempt. This indicates the stored Google OAuth token (managed via auth_ga.py and referenced in the local credentials store) is either:
- Expired and failing silent refresh (typical if the refresh token is invalid)
- Revoked by the user or Google
- Associated with a client ID/secret pair that no longer has valid permissions
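The error field in the token endpoint's JSON response body usually narrows down which of these cases applies. The sketch below maps the standard OAuth 2.0 error codes (RFC 6749, section 5.2) to the candidate causes. The function name and messages are illustrative, and we have not confirmed which code the failing sync actually receives.

```python
import json

# Standard OAuth 2.0 token-endpoint error codes, mapped to likely causes.
LIKELY_CAUSES = {
    "invalid_grant": "refresh token expired or revoked; re-run the consent flow",
    "invalid_client": "client ID/secret not recognized; check the client credentials",
    "unauthorized_client": "client not permitted to use this grant; check the OAuth client configuration",
}


def diagnose_refresh_failure(status_code, body):
    """Return a short, human-readable guess at why a token refresh failed.

    `body` is the raw response text from the token endpoint.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None  # body was not JSON (for example, an HTML error page)
    error = payload.get("error") if isinstance(payload, dict) else None
    cause = LIKELY_CAUSES.get(error, "unrecognized error; inspect the full response body")
    return f"HTTP {status_code} ({error or 'no error code'}): {cause}"
```

Logging this string next to the raw response in port_sheet_sync.py would make the next failure faster to triage.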
Token lifecycle is managed by auth_ga.py on the local workstation (/Users/cb/Documents/repos/tools/auth_ga.py). The production instance stores a cached token that is no longer valid.
Session Execution Analysis
Today's three agent sessions show expected behavior:
| Session | Time (UTC) | Exit Code | Notes |
|---|---|---|---|
| 1 | 00:00 | 1 (max turns) | Hit 30-turn limit; complex task |
| 2 | 00:02 | 0 (success) | Completed; e-sig page + crew generator blockers resolved |
| 3 | 00:05 | 1 (max turns) | Complex multi-step task; hit turn limit again |
Sessions 1 and 3 exiting with code 1 are not crashes—they're intentional halts when the 30-turn budget is exhausted. This is a feature, not a bug: it prevents runaway token consumption and forces task decomposition. However, if tasks remain incomplete after hitting this limit, we may need to increase the turn budget or refactor task scope.
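One way to decide when the turn budget needs revisiting is to track how often sessions end at the limit. The sketch below takes the exit-code meanings from the table above (0 for success, 1 for max turns) and treats any other code as unexpected. The function name is hypothetical, and in practice exit code 1 may also cover other failures, so treat the ratio as a rough signal.

```python
from collections import Counter

# Exit-code meanings as reported in the session table; other codes are unexpected.
EXIT_MEANINGS = {0: "success", 1: "max_turns"}


def summarize_sessions(exit_codes):
    """Count session outcomes and the share of sessions that hit the turn limit."""
    counts = Counter(EXIT_MEANINGS.get(code, "unexpected") for code in exit_codes)
    total = len(exit_codes)
    return {
        "counts": dict(counts),
        "max_turns_ratio": counts["max_turns"] / total if total else 0.0,
    }
```

For today's sessions, summarize_sessions([1, 0, 1]) reports two of three sessions ending at the limit. A ratio that stays high over several days would argue for smaller tasks or a larger budget.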
Key Decisions
Why SSH via Lightsail API Instead of Stored Keys
Storing long-lived SSH private keys on workstations introduces rotation and revocation complexity. By requesting temporary credentials via the Lightsail API, we:
- Eliminate the need to manage persistent key material on disk
- Create