Orchestrator Daemon Health Monitoring & OAuth Token Recovery: Diagnosing the jada-agent Service

```html

During a routine infrastructure health check, we discovered that the jada-agent orchestrator daemon on our Lightsail instance (34.239.233.28) was running smoothly, but a critical dependency—the Google OAuth token for port sheet synchronization—had expired. This post walks through the diagnostic approach, infrastructure patterns, and remediation strategy we implemented.

What Was Done

We performed a comprehensive health audit of the jada-agent.service running on a dedicated AWS Lightsail instance, uncovering:

Service Status: Active and healthy with 3 days of uptime, stable resource utilization
Token Failure: The Google OAuth token used by port_sheet_sync.py is no longer valid; the sync, which runs every 30 minutes, is failing with HTTP 400 errors
Agent Execution: Task queue processing healthy; two of today's three sessions hit the 30-turn Claude limit (by design), and one completed successfully
Infrastructure Metrics: CPU 0.65% average, memory 144MB/914MB, disk 6.2GB/39GB—all nominal

Technical Details: Diagnostic Workflow

Since the SSH private key was not stored locally at the standard ~/.ssh/jada-key path, we used a multi-pronged approach:

1. AWS Systems Manager Session Manager + Lightsail API

Rather than hunting for a missing private key, we leveraged the Lightsail API to request temporary SSH credentials:

aws lightsail get-instance-access-details \
  --instance-name jada-agent-orchestrator \
  --region us-east-1

This returned a temporary certificate and private key valid for 60 seconds, which we wrote to a temporary file and used immediately for the SSH connection. The rationale: AWS-managed credentials are auditable, time-bound, and don't require storing long-lived keys on the workstation.
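The credential handling described above can be sketched in Python. This is a minimal illustration, not the tooling we actually ran: it assumes the `accessDetails` object from the API response has already been parsed into a dict with `privateKey`, `certKey`, `username`, and `ipAddress` fields, and the function and file names are hypothetical.

```python
import os
from pathlib import Path


def write_temp_ssh_credentials(access_details: dict, directory: str) -> Path:
    """Write the temporary key and certificate; return the private key path.

    `access_details` is assumed to be the `accessDetails` object from the
    get-instance-access-details response.
    """
    key_path = Path(directory) / "lightsail-temp-key"
    cert_path = Path(directory) / "lightsail-temp-key-cert.pub"

    # Create the private key with owner-only permissions from the start;
    # ssh rejects private keys that other users can read.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(access_details["privateKey"])

    # OpenSSH looks for a certificate named <identity>-cert.pub automatically.
    cert_path.write_text(access_details["certKey"])
    return key_path


def ssh_command(access_details: dict, key_path: Path) -> list:
    """Build the ssh argument list for the instance in `access_details`."""
    target = f'{access_details["username"]}@{access_details["ipAddress"]}'
    return ["ssh", "-i", str(key_path), target]
```

Writing both files into a throwaway directory (for example one created with `tempfile.mkdtemp()`) and deleting it after the session keeps the short-lived key material off the workstation's permanent storage.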

2. Service Health Collection

Once connected, we gathered health signals via:

systemctl status jada-agent.service — confirmed active, running since May 10, 3-day uptime
journalctl -u jada-agent.service -n 100 — reviewed last 100 log lines for errors
ps aux | grep jada — verified daemon process was consuming reasonable CPU/memory
free -h and df -h — confirmed sufficient RAM and disk space

3. CloudWatch Metrics via Lightsail API

We pulled 2-hour historical metrics directly from AWS to avoid relying solely on in-instance tools:

aws lightsail get-instance-metric-data \
  --instance-name jada-agent-orchestrator \
  --metric-name CPUUtilization \
  --statistics Average \
  --unit Percent \
  --start-time 2026-05-13T15:00:00Z \
  --end-time 2026-05-13T17:00:00Z \
  --period 60

This confirmed no CPU spikes or anomalous load patterns over the observation window.
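To make "no CPU spikes" checkable rather than eyeballed, the returned datapoints can be summarized with a few lines of Python. The sketch below assumes the response's `metricData` list has been loaded into dicts carrying `average` and `timestamp` keys; the 80% spike threshold and the function name are illustrative choices, not values from our setup.

```python
def summarize_cpu(metric_data: list, spike_threshold: float = 80.0) -> dict:
    """Summarize CPUUtilization datapoints and list timestamps above the threshold."""
    points = [p for p in metric_data if "average" in p]
    if not points:
        # No datapoints in the window: report that explicitly instead of dividing by zero.
        return {"count": 0, "mean": None, "peak": None, "spikes": []}

    averages = [p["average"] for p in points]
    spikes = [p["timestamp"] for p in points if p["average"] > spike_threshold]
    return {
        "count": len(averages),
        "mean": sum(averages) / len(averages),
        "peak": max(averages),
        "spikes": spikes,
    }
```

An empty `spikes` list over the observation window corresponds to the conclusion stated above.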

Infrastructure Architecture

The jada-agent orchestrator follows a stateless daemon + task queue pattern:

Compute: AWS Lightsail instance running systemd-managed jada-agent.service
Task Queue: External progress dashboard polled every cycle (no local queue persistence—tasks are fetched on-demand)
Session Management: Claude API with 30-turn-per-session limits; multiple short sessions preferred over single long-running sessions for fault isolation
Dependent Integrations: Google Sheets API (via OAuth token in port_sheet_sync.py), S3 deployments, CloudFront invalidations

The daemon's idle baseline (0.00 load average) indicates the polling loop's 60-second interval is efficient; CPU only spikes when tasks are dequeued and processed.

Critical Finding: Google OAuth Token Expiration

The port_sheet_sync.py script, scheduled to run every 30 minutes, has been failing with consistent HTTP 400 errors in its OAuth token refresh attempt. This indicates the stored Google OAuth token (managed via auth_ga.py and referenced in the local credentials store) is either:

Expired, with the silent refresh failing (typical when the refresh token itself is invalid)
Revoked by the user or Google
Associated with a client ID/secret pair that no longer has valid permissions

The token lifecycle is managed by auth_ga.py, located at /Users/cb/Documents/repos/tools/auth_ga.py on the local workstation. The production instance stores a cached token that is no longer valid.
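The three candidate causes can usually be narrowed down from the error body returned with the HTTP 400. The following sketch is not part of port_sheet_sync.py; it assumes the token endpoint responds with a JSON object containing an `error` field, using the standard OAuth 2.0 error codes from RFC 6749, and the cause descriptions are our own interpretation.

```python
import json

# Standard OAuth 2.0 token-endpoint error codes mapped to a likely cause.
LIKELY_CAUSES = {
    "invalid_grant": "refresh token expired, revoked, or otherwise invalid; re-run the consent flow",
    "invalid_client": "client ID/secret not recognized; check the OAuth client configuration",
    "unauthorized_client": "client is not permitted to use this grant type",
    "invalid_scope": "requested scope is invalid or exceeds what was originally granted",
    "invalid_request": "malformed refresh request; check the parameters being sent",
}


def diagnose_refresh_failure(status_code: int, body: str) -> str:
    """Turn a failed token-refresh response into a one-line diagnosis."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return f"HTTP {status_code}: error body is not valid JSON"
    if not isinstance(payload, dict):
        return f"HTTP {status_code}: unexpected error body shape"

    error = payload.get("error", "unknown")
    cause = LIKELY_CAUSES.get(error, "unrecognized error code; inspect the full response")
    return f"HTTP {status_code} {error}: {cause}"
```

Logging this one-line diagnosis alongside each failed sync would distinguish an expired or revoked token from a client configuration problem without a manual investigation.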

Session Execution Analysis

Today's three agent sessions show expected behavior:

Session	Time (UTC)	Exit Code	Notes
1	00:00	1 (max turns)	Hit 30-turn limit; complex task
2	00:02	0 (success)	Completed; e-sig page + crew generator blockers resolved
3	00:05	1 (max turns)	Complex multi-step task; hit the turn limit again

Sessions 1 and 3 exiting with code 1 are not crashes—they're intentional halts when the 30-turn budget is exhausted. This is a feature, not a bug: it prevents runaway token consumption and forces task decomposition. However, if tasks remain incomplete after hitting this limit, we may need to increase the turn budget or refactor task scope.
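The distinction between a budget halt and a real failure can be encoded in a small triage helper. This is a hypothetical sketch: it assumes exit code 1 is reserved for an exhausted turn budget, as the table suggests, and treats any other non-zero code as a genuine failure. A production version would likely confirm the halt reason in the logs as well.

```python
MAX_TURNS_EXIT = 1  # assumption: exit code 1 always means the 30-turn budget ran out


def triage_sessions(sessions: list) -> dict:
    """Group (session_number, exit_code) pairs by outcome."""
    result = {"completed": [], "turn_limit": [], "failed": []}
    for number, exit_code in sessions:
        if exit_code == 0:
            result["completed"].append(number)
        elif exit_code == MAX_TURNS_EXIT:
            result["turn_limit"].append(number)
        else:
            # Anything else is unexpected and deserves a look at the journal.
            result["failed"].append(number)
    return result


# Today's sessions from the table above.
today = [(1, 1), (2, 0), (3, 1)]
print(triage_sessions(today))
```

Tracking how often a task lands in `turn_limit` is one way to decide between raising the budget and splitting the task.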

Key Decisions

Why SSH via Lightsail API Instead of Stored Keys

Storing long-lived SSH private keys on workstations introduces rotation and revocation complexity. By requesting temporary credentials via the Lightsail API, we:

Eliminate the need to manage persistent key material on disk
Create