```html

Diagnosing and Stabilizing the JADA Agent Orchestrator: Daemon Health Checks, OAuth Token Failure Recovery, and Lightsail Infrastructure Validation

During a routine health audit of the JADA agent orchestrator infrastructure running on AWS Lightsail (instance 34.239.233.28), we discovered a stable daemon with healthy resource utilization but a critical OAuth token failure blocking scheduled syncs. This post walks through the diagnostic process, the findings, and the remediation strategy we implemented.

What Was Done

We performed a comprehensive health check of the JADA agent daemon, including service status verification, resource utilization analysis, session activity logging, and error diagnosis. The investigation revealed:

  • The jada-agent.service systemd unit is running stably with 3 days of uptime (continuous since May 10)
  • Resource consumption is normal (0.65% CPU average, 144MB / 914MB memory)
  • Session execution is following expected patterns with appropriate error handling
  • The Google OAuth token used by port_sheet_sync is failing (expired or revoked), blocking the sync that runs every 30 minutes
  • Two of three agent sessions today hit the configured 30-turn limit on Claude conversations (expected, not a failure)

Technical Details: The Diagnostic Process

SSH Access via AWS Lightsail API

The private key for the jada-key key pair was not stored in the standard ~/.ssh/ directory. Rather than search for a missing static key, we used the AWS Lightsail API to request temporary SSH credentials:

aws lightsail get-instance-access-details \
  --instance-name jada-agent \
  --region us-east-1

Why this approach: Temporary credentials are more secure than long-lived keys and don't require local key distribution. The API returns a temporary private key together with a signed SSH certificate (plus the instance IP address and username), valid for 60 minutes, eliminating the need to hunt for static key files.
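The same request can be scripted. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names and the file name lightsail_temp_key are illustrative and not part of the JADA tooling. It writes the temporary key and certificate returned by the API to disk and builds the matching ssh command.

```python
import os
import tempfile


def fetch_access_details(instance_name="jada-agent", region="us-east-1"):
    """Request temporary SSH credentials from the Lightsail API."""
    import boto3  # imported lazily so the helper below works without boto3

    client = boto3.client("lightsail", region_name=region)
    response = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    return response["accessDetails"]


def write_ssh_credentials(details, directory):
    """Write the temporary key and certificate; return the ssh command as a list."""
    key_path = os.path.join(directory, "lightsail_temp_key")  # hypothetical name
    # OpenSSH automatically loads "<identity>-cert.pub" next to an identity file.
    cert_path = key_path + "-cert.pub"

    # Create the private key with mode 0600; ssh refuses keys readable by others.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(details["privateKey"])

    with open(cert_path, "w") as fh:
        fh.write(details["certKey"])

    return ["ssh", "-i", key_path, f"{details['username']}@{details['ipAddress']}"]
```

Typical use would be to call fetch_access_details(), pass the result to write_ssh_credentials() with a directory from tempfile.mkdtemp(), and run the returned command before the expiresAt timestamp in the response passes.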

Service Status and Uptime Verification

Once connected via SSH, we checked the systemd service state:

systemctl status jada-agent.service
journalctl -u jada-agent.service -n 50 --no-pager

The daemon has been running continuously since May 10 (3 days of uptime at the time of the check). The service file is configured to restart automatically after a failure or any other ungraceful exit. Because a restart would reset the unit's start timestamp, the unbroken 3-day uptime indicates no crashes or hard stops in that period.
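For a repeatable version of this check, the unit's state can be read in machine-friendly form with systemctl show. The sketch below is an illustration rather than part of the JADA codebase: the function names are made up, and it assumes a systemd version recent enough to expose the NRestarts property.

```python
import subprocess


def parse_systemctl_show(output):
    """Turn the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def unit_is_stable(props):
    """True if the unit is running and systemd has not auto-restarted it."""
    return (
        props.get("ActiveState") == "active"
        and props.get("SubState") == "running"
        # A missing NRestarts (older systemd) is treated as "no restarts".
        and props.get("NRestarts", "0") == "0"
    )


def query_unit(unit="jada-agent.service"):
    """Run systemctl on the instance itself and parse the result."""
    result = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,SubState,NRestarts,ActiveEnterTimestamp"],
        capture_output=True, text=True, check=True,
    )
    return parse_systemctl_show(result.stdout)
```

Calling unit_is_stable(query_unit()) on the instance gives a single boolean, and the ActiveEnterTimestamp entry holds the "running since" time that the uptime figure is based on.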

Resource Utilization Metrics

We collected metrics via two channels:

  • Local system state: ps aux, free -h, df -h, uptime for real-time snapshots
  • AWS Lightsail metrics API: CPU, network, and status check metrics for the past 2 hours

Results show the instance is essentially idle between task execution cycles (load average 0.00), using only 17% of disk capacity (6.2GB / 39GB), with zero status check failures in the last 2 hours. The clean status checks indicate that the instance is reachable and operating normally at the instance level.
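The local half of this check can be captured with the Python standard library. This is a minimal sketch; the function names and the 80% disk threshold are assumptions chosen for illustration, not values taken from the JADA setup.

```python
import os
import shutil


def local_snapshot(path="/"):
    """Collect disk usage and load averages for the machine this runs on."""
    usage = shutil.disk_usage(path)
    return {
        "disk_used_gb": round(usage.used / 1e9, 1),
        "disk_total_gb": round(usage.total / 1e9, 1),
        # df computes Use% against used + available space, so it can read
        # slightly higher than this used / total figure.
        "disk_used_pct": round(100 * usage.used / usage.total, 1),
        "load_avg": os.getloadavg(),  # 1, 5, and 15 minute averages (Unix only)
    }


def flag_issues(snapshot, cpu_count=1, disk_pct_limit=80.0):
    """Return human-readable warnings; an empty list means nothing stands out."""
    issues = []
    if snapshot["disk_used_pct"] >= disk_pct_limit:
        issues.append(f"disk usage at {snapshot['disk_used_pct']}%")
    if snapshot["load_avg"][0] > cpu_count:
        issues.append(f"1-minute load {snapshot['load_avg'][0]} exceeds CPU count")
    return issues
```

With the figures reported above (17% disk, load average 0.00), flag_issues returns an empty list.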

Session Activity Analysis

The daemon maintains a structured log of agent sessions in the progress dashboard. Today's activity (May 13, UTC) shows:

  • Session 1 (00:00-00:01): Completed with exit code 1, reason: "Reached max turns (30)"
  • Session 2 (00:02-00:04): Completed successfully (exit code 0), processed electronic signature / crew page blockers, created a needs-you task for manual intervention
  • Session 3 (00:05-00:06): Completed with exit code 1, reason: "Reached max turns (30)"
  • After 00:06: No new tasks enqueued; daemon idling normally

Why max-turns exits are expected: The Claude API client for JADA operations uses a 30-turn conversation limit per session to control costs and prevent runaway sessions. When complex tasks exceed this limit, the daemon logs exit code 1 but remains healthy and resumes on the next task cycle. Session 2's successful run demonstrates that multi-part work completes when task scope is appropriately sized.
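The behavior described above can be sketched as a capped loop. This is an illustration only and not the daemon's actual code: take_turn is a hypothetical callable standing in for one model turn, and the success reason string is invented. Only the "Reached max turns (30)" message and the exit codes come from the session log.

```python
MAX_TURNS = 30  # per-session cap described above


def run_session(take_turn, max_turns=MAX_TURNS):
    """Drive one agent session and return (exit_code, reason).

    `take_turn` is a hypothetical callable that performs one turn and
    returns True once the task is finished.
    """
    for turn in range(1, max_turns + 1):
        if take_turn(turn):
            return 0, f"Completed in {turn} turns"
    # The cap was hit: report it, but leave the daemon free to take the next task.
    return 1, f"Reached max turns ({max_turns})"


def is_expected_exit(exit_code, reason):
    """Exit 0, or exit 1 caused only by the turn cap, needs no alert."""
    return exit_code == 0 or (
        exit_code == 1 and reason.startswith("Reached max turns")
    )
```

Separating is_expected_exit from the loop lets a monitor treat turn-cap exits as routine while still alerting on any other non-zero exit.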

Critical Finding: port_sheet_sync OAuth Token Failure

Every 30-minute sync cycle since at least May 13 afternoon (UTC) has logged the same error:

[port-sheet] token error: HTTP Error 400: Bad Request

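This error is consistent with the stored Google OAuth token for port_sheet_sync.py having expired or been revoked (e.g., after a password change, an account security event, or manual revocation). Google's token endpoint answers a refresh request for such a token with HTTP 400 and the error code invalid_grant, so the response body is worth checking to confirm. The sync script is likely located at: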

/opt/jada-agent/port_sheet_sync.py

Impact: Port sheet syncs have not executed in at least 12 hours. Any downstream processes depending on up-to-date port sheet data will be working with stale information.

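Likely root cause: Google invalidates OAuth refresh tokens after six months without use, when the user revokes access, after certain security events such as a password change (for tokens that include Gmail scopes), and after 7 days for OAuth clients whose consent screen is still in "Testing" status. Because the sync uses its token every 30 minutes, inactivity is unlikely here, which leaves revocation, a security event, or a testing-mode client as the more plausible causes. The token file, likely stored in: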

/opt/jada-agent/secrets/google_oauth_token.json

...needs to be regenerated by completing a new OAuth consent flow.
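Before regenerating the token, it is worth confirming the diagnosis by reading the body of the 400 response. The sketch below is an assumption-laden illustration: the helper names are made up, and how port_sheet_sync.py stores its client ID, client secret, and refresh token is not known, so they are passed in as plain arguments. The endpoint URL and the invalid_grant error code are Google's documented values.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def classify_token_error(status, body):
    """Map a token-endpoint error response to a suggested next step."""
    try:
        error = json.loads(body).get("error", "")
    except (ValueError, AttributeError):
        error = ""  # body was not a JSON object
    if status == 400 and error == "invalid_grant":
        return "refresh token expired or revoked: redo the OAuth consent flow"
    if error == "invalid_client":
        return "client ID or secret rejected: check the OAuth client configuration"
    return f"unclassified error (HTTP {status}, error={error!r}): inspect the body"


def try_refresh(client_id, client_secret, refresh_token):
    """Attempt one refresh; return (verdict, token_response_or_None)."""
    data = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode()
    request = urllib.request.Request(TOKEN_URL, data=data)  # data implies POST
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return "ok", json.loads(response.read())
    except urllib.error.HTTPError as exc:
        body = exc.read().decode("utf-8", "replace")
        return classify_token_error(exc.code, body), None
```

An invalid_grant verdict confirms that a new consent flow is required, while any other verdict points at the client configuration or the request itself rather than the token.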

Infrastructure and Architecture Decisions

Lightsail Instance Configuration

The JADA orchestrator runs on a single AWS Lightsail instance (jada-agent) in the us-east-1 region. Key specs:

  • Instance size: Small footprint (about 1 GB of RAM and a 40 GB disk, going by the 914MB and 39GB totals reported above; sufficient for the current polling/task execution pattern)
  • OS: Linux with systemd (likely Ubuntu or Debian; the distribution was not confirmed, and systemd and journalctl are also present on other distributions)
  • Service management: systemd with auto-restart on failure
  • Credentials: Temporary SSH access via Lightsail API (no static keys in local repos)

Why Temporary Credentials Over Static Keys

The decision to use the Lightsail API for temporary SSH credentials instead of distributing static private keys aligns with AWS security best practices:

  • No key distribution: Engineers don't need static keys in ~/.ssh, reducing attack surface if a developer machine is compromised
  • Automatic expiration: Temporary credentials expire after 60 minutes, limiting the window for credential misuse
  • Audit trail: AWS CloudTrail logs the API calls that issue credentials, so we can audit who requested access to the instance and when
  • Scalability: Adding new engineers doesn't require distributing or rotating keys; they simply call the API with their own AWS credentials

Session Limits and Graceful Degradation

The 30-turn limit on Claude agent sessions is intentional. Without this limit, a single complex task could consume hundreds of turns and run up large, unplanned API costs. By capping at 30 turns:

  • Tasks have to be scoped so they are solvable within the limit
  • Oversized tasks fail gracefully with exit code 1 rather than silently consuming resources