Diagnosing and Remediating the JADA Agent Daemon: Service Health, OAuth Token Expiration, and Turn Limit Constraints

Over the past development session, we conducted a comprehensive health audit of the JADA agent orchestrator daemon running on AWS Lightsail instance 34.239.233.28. The investigation revealed a fundamentally stable service with three distinct issues: an expired Google OAuth token blocking port sheet synchronization, recurrent Claude API turn-limit exits on complex tasks, and the need for infrastructure-level credential management improvements. This post details the diagnostic process, findings, and remediation steps.

Why a Daemon Health Audit Was Necessary

The JADA agent system is a critical orchestration layer that autonomously processes tasks from a progress dashboard, manages multi-step workflows, and integrates with external APIs (Google Analytics, Google Sheets, etc.). When the daemon stops, or starts failing without surfacing errors, tasks queue up and downstream workflows break. The goal of this audit was to establish a baseline of service health, identify failure modes, and quantify the extent of any data loss or stalled processing.

Infrastructure: Lightsail Instance and Service Architecture

The JADA daemon runs as a systemd service on a single AWS Lightsail instance. The instance configuration:

Instance IP: 34.239.233.28 (11 days uptime at time of audit)
Resource Allocation: 914MB RAM, 39GB disk, shared CPU with 0.65% average utilization
Service Name: jada-agent.service (systemd unit)
Status Checks: 0 failures in the preceding 2 hours; instance is healthy at the infrastructure layer
SSH Access: Provisioned via AWS Lightsail API temporary key generation (no permanently stored private key on local machines)

This architecture is intentionally minimal: the daemon is stateless, relies on external task queues (the progress dashboard) for work, and writes logs to the instance's local systemd journal. Metrics are pulled via the Lightsail API (GetInstanceMetricStatistics) rather than a persistent monitoring agent.
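As an illustration of pulling metrics through the API instead of an agent, here is a minimal sketch using boto3's Lightsail client. The helper names (build_metric_request, mean_of_averages, fetch_cpu_average) and the 300-second period are assumptions made for this example; the instance name and region are the ones used elsewhere in this post. It is not the tooling used during the audit.

```python
from datetime import datetime, timedelta, timezone


def build_metric_request(instance_name, metric_name, unit, hours=2, period=300):
    """Build keyword arguments for Lightsail's GetInstanceMetricStatistics."""
    end = datetime.now(timezone.utc)
    return {
        "instanceName": instance_name,
        "metricName": metric_name,
        "period": period,  # seconds per data point (assumed value)
        "startTime": end - timedelta(hours=hours),
        "endTime": end,
        "unit": unit,
        "statistics": ["Average"],
    }


def mean_of_averages(metric_data):
    """Average the per-period 'average' values; None if there are no points."""
    values = [point["average"] for point in metric_data if "average" in point]
    return sum(values) / len(values) if values else None


def fetch_cpu_average(instance_name="jada-agent", region="us-east-1"):
    """Call the API. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("lightsail", region_name=region)
    request = build_metric_request(instance_name, "CPUUtilization", "Percent")
    response = client.get_instance_metric_statistics(**request)
    return mean_of_averages(response["metricData"])
```

Keeping the request builder and the averaging step separate from the network call makes both easy to check without AWS access.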

Diagnostic Process and Key Commands

Access to the instance required working around a missing SSH key pair. The initial approach—searching for jada-key in ~/.ssh and repos.env—failed because the private key is not persisted locally. Instead, we used the AWS Lightsail API to generate temporary SSH credentials:

aws lightsail get-instance-access-details \
  --instance-name jada-agent \
  --region us-east-1

This returned temporary SSH credentials (a private key and a signed certificate) along with the instance's host keys. We wrote the private key to a temporary file, connected, and removed the key immediately afterward to minimize credential exposure.
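The write, connect, remove sequence can be sketched as a small context manager. This is an illustrative sketch, not the exact commands run during the audit; temporary_key_file is a hypothetical helper name, and the code assumes a POSIX system where tempfile.mkstemp creates the file readable and writable only by its owner.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_key_file(private_key: str):
    """Write key material to an owner-only temp file and delete it on exit."""
    # mkstemp creates the file with mode 0600 on POSIX systems.
    fd, path = tempfile.mkstemp(prefix="lightsail-key-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(private_key)
        yield path
    finally:
        # Runs even if the SSH command inside the block raises.
        os.remove(path)
```

Inside the with block, the yielded path can be passed to ssh -i. The certificate returned by the same API call would need the same handling.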

Once connected, we collected daemon health via standard Linux diagnostics:

systemctl status jada-agent.service — confirmed active/running since May 10
journalctl -u jada-agent.service -n 100 — last 100 log entries for recent errors
top -b -n 1 — snapshot of CPU, memory, and process list
df -h — disk utilization (6.2GB of 39GB in use)
AWS Lightsail API calls for CPU, network, and status check metrics over the preceding 2 hours

Findings: What We Learned

1. Service is Fundamentally Healthy

The jada-agent.service has been running normally for 3 days (since May 10), well within the instance's 11 days of uptime. The instance has not crashed, restarted unexpectedly, or exhibited resource exhaustion. Load average is near zero between task runs, indicating the daemon is idle-looping correctly.

2. Session Activity and Turn-Limit Constraints

On May 13, the daemon executed three sessions:

Session 1 (00:00 UTC): Hit the 30-turn Claude API limit and exited with code 1. No work was completed.
Session 2 (00:02 UTC): Completed successfully. Processed e-signature page blockers and created a follow-up task.
Session 3 (00:05 UTC): Hit the 30-turn limit again. Exited with code 1.
After Session 3: No new tasks available; daemon resumed idle-loop.

The turn-limit exits are not crashes; the daemon logs them as errors but continues running and polling for new tasks. However, they represent incomplete work. Session 1 and Session 3 were killed mid-execution, leaving tasks in an undefined state.

Why this happens: Complex, multi-step tasks (like debugging JavaScript booking widgets, generating new content pages, or refactoring site layouts) can require more than 30 turns of back-and-forth reasoning with Claude. When the limit is hit, the session aborts. The task remains in the progress dashboard but is not marked as failed—it simply stops advancing.
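One way to stop turn-limit exits from looking like tasks that merely stopped advancing is to classify each finished session explicitly. The sketch below is hypothetical: the post does not describe how the daemon records sessions, so SessionResult, classify_session, and the status strings are illustrative names only. The limit of 30 is the one reported in the logs.

```python
from dataclasses import dataclass

MAX_TURNS = 30  # the limit reported in the session logs


@dataclass
class SessionResult:
    exit_code: int   # process exit code of the session
    turns_used: int  # number of turns consumed before exit


def classify_session(result: SessionResult, max_turns: int = MAX_TURNS) -> str:
    """Map a finished session to a status a dashboard could record."""
    if result.exit_code == 0:
        return "completed"
    if result.turns_used >= max_turns:
        # Incomplete work: a candidate for retry or for splitting the task.
        return "turn_limit"
    return "failed"
```

With a status like "turn_limit" written back to the dashboard, the tasks behind Sessions 1 and 3 would show up as needing attention.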

3. Critical Issue: Expired Google OAuth Token

The most actionable finding: the Google OAuth token used by port_sheet_sync.py (located at /home/ubuntu/jada/port_sheet_sync.py on the instance) is expired or revoked. Every 30-minute sync attempt since at least the afternoon of May 13 has logged:

[port-sheet] token error: HTTP Error 400: Bad Request

This means port sheet synchronization has been failing for hours without raising any alert; the errors appear only in the log. No data loss has occurred yet, but syncs will continue to fail until the token is refreshed.

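Root cause: Google OAuth access tokens are short-lived (typically 1 hour) and are renewed using a longer-lived refresh token. The refresh token stored in the daemon's credential cache has most likely either expired (Google invalidates refresh tokens that go unused for 6 months) or been manually revoked by the user. The HTTP 400 is consistent with this: Google's token endpoint typically rejects an invalid refresh token with a 400 response carrying the error code invalid_grant. The Python script uses google-auth-oauthlib to handle token refresh, but if the refresh token itself is invalid, the refresh fails and the only fix is to re-authenticate.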
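The logged message shows only the status line, which makes it hard to tell an invalid refresh token from other failures. A minimal sketch of a more informative diagnosis follows. The function name diagnose_token_error and the hint strings are illustrative; the error codes invalid_grant and invalid_client come from the OAuth 2.0 specification. How port_sheet_sync.py actually handles the response is not shown in this post.

```python
import json


def diagnose_token_error(status: int, body: str) -> str:
    """Turn a token-endpoint error response into an actionable hint."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    # The body may be valid JSON without being an object; treat that as empty.
    error = payload.get("error", "") if isinstance(payload, dict) else ""
    if status == 400 and error == "invalid_grant":
        return "refresh token expired or revoked: re-run the interactive OAuth flow"
    if status in (400, 401) and error == "invalid_client":
        return "client ID or secret rejected: check the OAuth client configuration"
    return f"unexpected token error (HTTP {status}): {error or 'no error code in body'}"
```

If the script makes the request with urllib, the status and body are available on the raised urllib.error.HTTPError as its code attribute and read() method, so logging the hint alongside the status line would be a small change.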

Remediation Steps

Re-authenticating the Google OAuth Token

The auth_ga.py script (which was modified during the session) is designed to walk through the OAuth flow for a Google account and store the resulting credentials. To fix port sheet sync:

cd /home/ubuntu/jada
python3 auth_ga.py --account dangerouscentaur@gmail.com --scope sheets

This will prompt for interactive browser-based authentication, retrieve a fresh access token and refresh token, and store them in the credential cache, which is restricted to the owning user by file permissions (mode 600 on the files in the secrets directory).

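After re-authentication, the next 30-minute sync cycle should succeed. Confirm this by checking that the token error no longer appears in the log after that cycle has run.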
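A quick way to verify the fix is to count occurrences of the error marker in recent log output. The sketch below assumes the sync errors are written to the jada-agent.service journal, which this post implies but does not state outright. The helper names and the 40-minute window are illustrative.

```python
import subprocess

TOKEN_ERROR_MARKER = "[port-sheet] token error"


def count_token_errors(log_lines):
    """Count log lines that contain the port-sheet token error marker."""
    return sum(1 for line in log_lines if TOKEN_ERROR_MARKER in line)


def recent_journal_lines(unit="jada-agent.service", since="40 minutes ago"):
    """Read recent journal entries for a unit (must run on the instance)."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "--since", since, "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

A count of zero from count_token_errors(recent_journal_lines()), taken after at least one full sync interval has passed, suggests the new token is being accepted.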

Addressing Turn-Limit Exits

Two approaches:

Increase the turn limit in the daemon's configuration (if the limit is externally configurable) from 30 to 50 or higher. This allows longer reasoning chains but increases API costs.
Break complex tasks into smaller subtasks at the progress dashboard level. For example, instead of "debug and fix the booking widget," create separate tasks: "identify the issue in booking widget," "implement the fix,"