Diagnosing and Remediating the JADA Agent Daemon: OAuth Token Failure & Turn-Limit Patterns

On 2026-05-13, we conducted a health audit of the jada-agent orchestrator daemon running on a Lightsail instance (34.239.233.28). The investigation uncovered one critical OAuth token failure that is blocking port sheet syncs, plus a recurring pattern of Claude turn-limit exits that warrants architectural review. This post details the diagnostic approach, findings, and remediation path.

What We Did

We performed a multi-layer health check on the jada-agent daemon without a pre-stored SSH key:

  • Remote access via AWS Lightsail API: Obtained temporary SSH credentials programmatically instead of relying on local key storage
  • Service & process inspection: Verified systemd unit status, uptime, CPU/memory/disk utilization, and AWS status checks
  • Log analysis: Reviewed daemon logs, session history, and error patterns over a 24-hour window
  • OAuth token validation: Identified and isolated the root cause of port sheet sync failures
  • Task queue inspection: Confirmed the daemon is correctly picking up tasks and respecting session/turn limits

Technical Details: Diagnostic Approach

Keyless SSH Access via Lightsail API

The jada-key private key is not stored in ~/.ssh/ on the development machine. Rather than forcing key distribution, we used the AWS Lightsail API to fetch temporary SSH credentials:

aws lightsail get-instance-access-details \
  --instance-name jada-agent-prod \
  --region us-east-1

Why this approach: Temporary credentials reduce the exposure that comes with long-lived keys. The API returns a temporary private key and a matching SSH certificate valid for 60 minutes. The instance's SSH server accepts that certificate at login, so no key has to be installed on the instance beforehand. This pattern fits ephemeral-access best practices better than storing private keys in repos or dotfiles.
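
As a rough illustration of how the returned credentials can be used, the sketch below writes the temporary key and certificate to disk and builds an ssh command line. It assumes the input is the accessDetails object from the API response with the fields privateKey, certKey, username, and ipAddress; the function names and the file name lightsail_temp_key are hypothetical, not part of our tooling.

```python
import os


def write_ssh_credentials(access_details: dict, directory: str) -> str:
    """Write the temporary key and certificate; return the private key path.

    Assumption: `access_details` is the `accessDetails` object from the
    Lightsail response, carrying `privateKey` and `certKey` strings.
    """
    key_path = os.path.join(directory, "lightsail_temp_key")  # hypothetical name
    # OpenSSH loads "<identity>-cert.pub" automatically when given -i <identity>.
    cert_path = key_path + "-cert.pub"
    for path, content in ((key_path, access_details["privateKey"]),
                          (cert_path, access_details["certKey"])):
        # Create the files as 0600 so ssh does not reject loose permissions.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
    return key_path


def ssh_command(access_details: dict, key_path: str) -> list[str]:
    """Build the ssh argv for the temporary credentials."""
    target = f"{access_details['username']}@{access_details['ipAddress']}"
    return ["ssh", "-i", key_path, "-o", "IdentitiesOnly=yes", target]
```

Writing into a temporary directory that is deleted after the session keeps the short-lived key from lingering on the workstation.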

Service Health Metrics

Once connected via SSH, we collected the following via systemd, standard Linux tools, and the Lightsail metrics API:

systemctl status jada-agent.service
systemctl show jada-agent.service --all
cat /proc/loadavg
free -h
df -h /
aws lightsail get-instance-metric-statistics \
  --instance-name jada-agent-prod \
  --metric-name CPUUtilization \
  --start-time 2026-05-13T14:00:00Z \
  --end-time 2026-05-13T16:00:00Z \
  --period 300 \
  --unit Percent \
  --statistics Average

Results:

  • Service status: active (running), uptime 3 days (since May 10)
  • CPU: 0.65% average, no spikes; load average 0.00
  • Memory: 144 MB / 914 MB (16% utilization)
  • Disk: 6.2 GB / 39 GB (17% used)
  • AWS status checks: 0 failures in the last 2 hours

The daemon is in excellent operational health: the CPU is idle most of the time, the memory footprint is minimal, and system resources are plentiful.
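
For repeat audits, the same figures can be derived programmatically. The helpers below are a minimal sketch, not part of the daemon: they parse the text of /proc/loadavg and turn a used/total pair into a percentage. The function names are hypothetical.

```python
def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def percent_used(used: float, total: float) -> float:
    """Utilization as a percentage, rounded to one decimal place."""
    return round(100.0 * used / total, 1)


# Memory figures from the results above: 144 MB of 914 MB.
# percent_used(144, 914) -> 15.8
```

The same calculation on the disk figures gives a slightly lower number than the 17% shown by df. That is probably because GNU df computes its percentage against used plus available space and rounds up, rather than dividing by the total size.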

Session & Task Activity Analysis

The jada-agent daemon manages a daily session quota (5 sessions per UTC day) and enforces a 30-turn limit per Claude session. We analyzed logs from /var/log/jada-agent/daemon.log and the progress dashboard to understand today's activity:

  • Session 1 (00:00 UTC): Hit max turns (30), exit code 1. Daemon logged this as an error but continued normally.
  • Session 2 (00:02 UTC): Completed successfully. Processed e-signature and crew page blockers, created a task asking for human input on a specific code path.
  • Session 3 (00:05 UTC): Hit max turns (30), exit code 1. After this, the daemon found no pending tasks and returned to idle state.

Yesterday's pattern showed a 5/5 session hard stop before midnight UTC with 3 pending tasks queued. Those tasks were cleared at the 00:00 UTC rollover, which is expected behavior given the daily quota reset.
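
The quota behavior described above (5 sessions per UTC day, reset at the 00:00 UTC rollover) can be sketched as follows. This is an illustration of the rule, not the daemon's actual code; the class and method names are hypothetical.

```python
from datetime import datetime, timezone


class DailySessionQuota:
    """Allow at most `limit` sessions per UTC calendar day."""

    def __init__(self, limit: int = 5):
        self.limit = limit
        self._day = None   # UTC date the counter currently refers to
        self._used = 0

    def try_start(self, now: datetime) -> bool:
        """Return True and count a session if quota remains.

        `now` must be timezone-aware; it is converted to UTC before comparing.
        """
        today = now.astimezone(timezone.utc).date()
        if today != self._day:
            # A new UTC day has started: reset the counter.
            self._day, self._used = today, 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

With this rule, a sixth request late on May 12 is refused, while the first request at 00:00 UTC on May 13 succeeds, which matches the hard stop and rollover seen in the logs.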

Critical Finding: port_sheet_sync.py OAuth Token Failure

Every sync job, which the daemon schedules at 30-minute intervals, has failed since at least the afternoon of May 13 with the error:

[port-sheet] token error: HTTP Error 400: Bad Request

This indicates the Google OAuth 2.0 token stored for the port sheet sync script is either expired or has been revoked. The token is located at /var/secrets/jada/port_sheet_token.json and is referenced in /opt/jada/scripts/port_sheet_sync.py.

Impact: Port sheet syncs are not running. Any automation that depends on timely port sheet data from Google Sheets is stale.

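Likely root cause: Google access tokens last about 1 hour, so the script depends on its refresh token (issued only if offline access was granted) to obtain new ones. Refresh tokens have no fixed lifetime, but Google invalidates them in several situations: the user revokes access, the token goes unused for six months, the OAuth app has an external user type and is still in "Testing" publishing status (tokens then expire after 7 days), or, for tokens with Gmail scopes, the account password changes. Once the refresh token is invalid, the token endpoint rejects the refresh request with HTTP 400, the script cannot obtain a new access token, and every scheduled sync fails with nothing but a log line to show for it.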
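
The logged message ("HTTP Error 400: Bad Request") has the shape of a urllib HTTPError string, which drops the JSON body where the OAuth error code lives. The sketch below shows one way to surface that body and decide what to do next. It assumes the token endpoint follows the standard OAuth 2.0 error format, where invalid_grant signals an expired or revoked refresh token; the function names and the returned action labels are hypothetical.

```python
import json
import urllib.error


def classify_token_error(status: int, body: str) -> str:
    """Map a token-endpoint failure to a suggested action (hypothetical labels)."""
    try:
        payload = json.loads(body)
    except ValueError:
        payload = {}
    error = payload.get("error", "") if isinstance(payload, dict) else ""
    if status == 400 and error == "invalid_grant":
        # Refresh token expired or revoked: a human must re-run the consent flow.
        return "reauthorize"
    if status in (400, 401):
        # Other client errors usually point at client ID/secret or request shape.
        return "check_client_config"
    if status == 429 or status >= 500:
        return "retry"
    return "unknown"


def describe_http_error(exc: urllib.error.HTTPError) -> str:
    """Read the response body from an HTTPError and classify it."""
    body = exc.read().decode("utf-8", errors="replace")
    return classify_token_error(exc.code, body)
```

Logging the classified result alongside the raw status would make it clear at a glance whether a failed sync needs reauthorization or just a retry.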

Infrastructure & Architecture

Daemon Process Architecture

The jada-agent daemon runs as a systemd service (jada-agent.service) on the Lightsail instance. Its core loop:

  1. Poll the progress dashboard task queue every 60 seconds
  2. If a task is pending and sessions remain in today's quota, claim the task and invoke Claude via the API
  3. Stream Claude's response back to the dashboard, updating task status
  4. End the session when the task completes, max turns (30) is hit, or an error occurs
  5. Return to idle; repeat the poll loop

The 30-turn limit is enforced at the Claude API integration layer to prevent runaway costs and infinite loops. A "turn" is one request-response pair in the conversation.
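
The session-ending conditions above can be sketched as a bounded loop. This is an illustration under stated assumptions rather than the daemon's real integration layer: take_turn is a hypothetical callback that performs one request-response pair and returns True when the task is complete.

```python
from typing import Callable

MAX_TURNS = 30  # per-session limit described in the text


def run_session(take_turn: Callable[[int], bool],
                max_turns: int = MAX_TURNS) -> tuple[str, int]:
    """Run turns until the task completes, an error occurs, or the limit is hit.

    Returns (outcome, turns_used), where outcome is one of
    "completed", "error", or "max_turns".
    """
    for turn in range(1, max_turns + 1):
        try:
            if take_turn(turn):
                return "completed", turn
        except Exception:
            # Any failure inside a turn ends the session.
            return "error", turn
    return "max_turns", max_turns
```

Returning "max_turns" as its own outcome, separate from "error", is one way to keep turn-limit exits from being logged as generic failures.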

Session & Token Management

Google OAuth tokens for background scripts (like port sheet sync) are stored in /var/secrets/jada/ and loaded at script startup. The daemon does not auto-refresh these tokens; it assumes they are valid. If a token expires or is revoked, the script fails and logs an error without retrying or alerting.
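
One low-effort way to close the alerting gap is to count consecutive failures and notify someone once a threshold is crossed. The sketch below assumes a hypothetical alert callback (for example, a function that posts to a chat channel); neither the class nor the callback exists in the current daemon.

```python
from typing import Callable


class FailureTracker:
    """Raise one alert when a job fails `threshold` times in a row."""

    def __init__(self, threshold: int, alert: Callable[[str], None]):
        self.threshold = threshold
        self.alert = alert          # hypothetical notification hook
        self.failures = 0

    def record(self, ok: bool, detail: str = "") -> None:
        """Record the result of one run of the job."""
        if ok:
            self.failures = 0       # any success clears the streak
            return
        self.failures += 1
        # Alert exactly once, when the streak first reaches the threshold.
        if self.failures == self.threshold:
            self.alert(f"{self.failures} consecutive failures: {detail}")
```

With a threshold of 3 and a 30-minute schedule, a revoked token would be flagged after roughly 90 minutes instead of going unnoticed until the next audit.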

Key Decisions

  • Keyless SSH access: We opted to use the Lightsail API for ephemeral credentials rather than distributing a long-lived private key. This reduces the blast radius of key compromise.
  • No automated token refresh: The current design assumes OAuth tokens are valid at runtime. This is a weak point; background scripts should implement refresh logic or use service account credentials where applicable.
  • Turn limit as a circuit breaker: The 30-turn limit is intentionally strict to prevent runaway costs. Sessions 1 and 3 hit this limit because their tasks required iterative refinement. Rather