```html

Diagnosing and Resolving OAuth Token Expiration in the JADA Agent Orchestrator

During a routine health check of the JADA agent daemon running on AWS Lightsail (34.239.233.28), we discovered that the port_sheet_sync service—responsible for syncing booking data to Google Sheets—had been silently failing for the past 18+ hours due to an expired or revoked Google OAuth token. This post documents the diagnosis methodology, the infrastructure used to identify the issue, and the remediation strategy.

What We Found: Silent Failure in the Sync Pipeline

The JADA agent daemon itself was healthy: the systemd service jada-agent.service was running continuously with 11 days of uptime and normal CPU/memory metrics. However, logs from the port sheet sync cron job revealed a recurring error pattern:

[port-sheet] token error: HTTP Error 400: Bad Request

This error appeared in every 30-minute sync attempt since at least the afternoon of May 13 (UTC). A 400 Bad Request from Google's OAuth API is a client-side rejection, not a transient network fault: it points to a token that is expired, revoked, or malformed. Critically, the failure was silent from the perspective of the booking workflow. Nothing was queued as a failed task; each scheduled sync started, hit the token error, and exited without syncing anything.
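To put numbers on a failure like this, the log can be scanned for the error marker. The sketch below is illustrative only: it assumes traditional syslog timestamps (for example "May 13 14:30:01") at the start of each line, and the names summarize_token_errors and FailureSummary are hypothetical, not part of the JADA codebase.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Marker taken from the log line quoted above.
ERROR_MARKER = "[port-sheet] token error:"
# Assumed traditional syslog prefix such as "May 13 14:30:01".
SYSLOG_TS = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s")


class FailureSummary(NamedTuple):
    count: int
    first_seen: Optional[str]
    last_seen: Optional[str]


def summarize_token_errors(lines: Iterable[str]) -> FailureSummary:
    """Count token-error lines and keep the first and last timestamps seen."""
    count = 0
    first = last = None
    for line in lines:
        if ERROR_MARKER not in line:
            continue
        count += 1
        match = SYSLOG_TS.match(line)
        stamp = match.group(1) if match else None
        if first is None:
            first = stamp
        last = stamp
    return FailureSummary(count, first, last)
```

Called with an open log file (the actual log path on the instance is not stated here), the summary gives the failure count and the window in which failures occurred.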

Technical Diagnosis: How We Accessed the Remote Daemon

The jada-key SSH private key was not stored in the local ~/.ssh directory, so we needed an alternative authentication path. The diagnosis involved three steps:

  • Temporary SSH credential retrieval: We called the Lightsail GetInstanceAccessDetails API to generate temporary SSH credentials for the instance, avoiding the need for long-lived key storage on the development machine.
  • Remote daemon inspection: Via SSH, we pulled service status from systemctl status jada-agent.service, examined the daemon's stdout/stderr logs, and checked process-level metrics using ps and free.
  • Cron log analysis: We reviewed syslog entries for the port_sheet_sync cron job to identify the exact error and frequency of failures.

This multi-layer approach confirmed that the daemon itself was operational and processing tasks correctly (3 agent sessions completed today, with 1 task successfully created), but the Google Sheets sync pipeline had stalled.
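The credential retrieval step can be sketched roughly as follows. This is a minimal sketch, not the exact commands we ran: it assumes a boto3 Lightsail client is passed in, and the function name write_temp_ssh_credentials, the key file name, and the instance name in the usage comment are all hypothetical.

```python
import os
from typing import Any, List


def write_temp_ssh_credentials(client: Any, instance_name: str, out_dir: str) -> List[str]:
    """Fetch short-lived SSH credentials and return an ssh argument list."""
    details = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )["accessDetails"]

    key_path = os.path.join(out_dir, "lightsail_temp_key")  # hypothetical file name
    cert_path = key_path + "-cert.pub"
    # Create both files with owner-only permissions; ssh rejects private keys
    # that other users can read.
    for path, content in ((key_path, details["privateKey"]),
                          (cert_path, details["certKey"])):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(content)

    return [
        "ssh", "-i", key_path,
        "-o", f"CertificateFile={cert_path}",
        f"{details['username']}@{details['ipAddress']}",
    ]


# Example wiring (needs boto3 and AWS credentials; names are placeholders):
#   import boto3, subprocess, tempfile
#   client = boto3.client("lightsail")
#   with tempfile.TemporaryDirectory() as tmp:
#       subprocess.run(write_temp_ssh_credentials(client, "example-instance", tmp))
```

Writing the key into a temporary directory keeps the short-lived credentials off the development machine once the session ends.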

Infrastructure Context: Multi-Site, Multi-Service Setup

The development environment manages several properties and services:

  • Primary sites:
    • /Users/cb/Documents/repos/sites/queenofsandiego.com/ — Booking automation via Google Apps Script (BookingAutomation.gs)
    • /Users/cb/Documents/repos/sites/sailjada.com/ — Main booking platform frontend
    • /Users/cb/Documents/repos/sites/86from.com/ — Recently renamed from 86dfrom.com; SEO-focused content site with GA4 integration
  • Deployment targets: S3 buckets (production and staging) with CloudFront CDN distribution invalidation on each deploy
  • Authentication layer: /Users/cb/Documents/repos/tools/auth_ga.py — Python script that manages Google Analytics OAuth token refresh for the dangerouscentaur@gmail.com account

The port sheet sync job runs as a cron task on the Lightsail instance and requires a valid Google OAuth token to authenticate API calls to Google Sheets. That token is stored server-side and referenced by port_sheet_sync.py.

Root Cause: Stale OAuth Token

Google OAuth access tokens expire after a fixed period, typically 1 hour. Refresh tokens last longer but can become invalid if the user revokes access or changes their password, or if Google's revocation policies are triggered. The port_sheet_sync.py script does not implement automatic token refresh; it simply uses the stored token. Once that token became invalid, every request was rejected with 400 Bad Request, and the script had no refresh flow to fall back on.
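A refresh-before-use step could look something like the sketch below. It is an assumption-laden illustration, not the contents of port_sheet_sync.py: the token dictionary layout (client_id, client_secret, refresh_token, access_token, expires_at) and the names ensure_fresh_token and post_form are hypothetical. The endpoint and form fields follow Google's documented refresh-token grant.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional

TOKEN_URL = "https://oauth2.googleapis.com/token"
EXPIRY_MARGIN_SECONDS = 300  # refresh five minutes before expiry


def post_form(url: str, fields: Dict[str, str]) -> Dict:
    """POST form-encoded fields and decode the JSON reply."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def ensure_fresh_token(token: Dict, now: Optional[float] = None,
                       post: Callable[[str, Dict[str, str]], Dict] = post_form) -> Dict:
    """Return the token unchanged if still valid, else a refreshed copy."""
    now = time.time() if now is None else now
    remaining = token.get("expires_at", 0) - now
    if token.get("access_token") and remaining > EXPIRY_MARGIN_SECONDS:
        return token
    reply = post(TOKEN_URL, {
        "grant_type": "refresh_token",
        "refresh_token": token["refresh_token"],
        "client_id": token["client_id"],
        "client_secret": token["client_secret"],
    })
    refreshed = dict(token)  # leave the caller's dictionary untouched
    refreshed["access_token"] = reply["access_token"]
    refreshed["expires_at"] = now + reply["expires_in"]
    return refreshed
```

If the refresh token itself has been revoked, the POST raises urllib.error.HTTPError with status 400, which is the point at which a human has to re-authorize.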

Why this wasn't immediately visible: the cron job ran silently in the background, logging only to syslog. The daemon itself continued operating normally because the sync service failure did not propagate as a task creation event or exception that would alert the operator.

Key Decisions and Architecture Patterns

1. Temporary credential retrieval via Lightsail API instead of key distribution: We used the GetInstanceAccessDetails

2. Cron-based sync vs. event-driven sync: The current architecture uses a 30-minute polling cron job for port sheet sync. This is simple but creates a maximum 30-minute delay between booking state change and Sheets sync, and failures are silent. A more resilient pattern would be event-driven: each booking update publishes a message to SQS or SNS, and a Lambda function (with built-in retry and DLQ support) handles the sync. However, this requires refactoring the booking workflow.

3. Token refresh strategy: The auth_ga.py script exists to manage token refresh for GA4 reporting, but port_sheet_sync.py does not use the same pattern. Moving forward, both should share a common token refresh library that automatically renews access tokens before expiration. Alternatively, the sync could authenticate as a Google Cloud service account instead of using user-account OAuth. Service account access tokens still expire, but new ones are minted from the account's private key without user interaction, so there is no user-granted refresh token to expire or be revoked.
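The event-driven alternative from point 2 might be shaped like the Lambda handler below. This is a sketch under stated assumptions: the message body is assumed to be a JSON booking record, sync_booking_to_sheet is a hypothetical placeholder for the real Sheets write, and the partial-batch response only takes effect if ReportBatchItemFailures is enabled on the SQS event source mapping.

```python
import json
from typing import Any, Callable, Dict


def sync_booking_to_sheet(booking: Dict[str, Any]) -> None:
    """Hypothetical placeholder for the real Google Sheets write."""
    raise NotImplementedError


def handle_batch(event: Dict[str, Any],
                 sync: Callable[[Dict[str, Any]], None]) -> Dict[str, Any]:
    """Sync each SQS record; report failed message IDs so only those retry."""
    failures = []
    for record in event.get("Records", []):
        try:
            sync(json.loads(record["body"]))
        except Exception:
            # Any failure (bad JSON, token error, API error) marks the
            # message for redelivery instead of dropping it silently.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def lambda_handler(event, context):
    return handle_batch(event, sync_booking_to_sheet)
```

With a redrive policy on the queue, messages that keep failing land in the dead-letter queue, so an expired token shows up as a growing backlog rather than as a quiet gap in the sheet.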

Immediate Remediation

To restore port sheet sync functionality:

  • Re-authenticate the Google OAuth token for port_sheet_sync.py using the auth_ga.py script or a similar flow
  • Verify that the refreshed token is stored in the correct location on the Lightsail instance (likely /home/jada/.config/port_sheet_sync/token.json or similar)
  • Manually trigger one sync cycle to confirm the 400 error is resolved
  • Monitor syslog over the next 24 hours to ensure the 30-minute sync cadence resumes without errors
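Before triggering the manual sync, a quick sanity check of the stored token file can catch obvious mistakes. The sketch below assumes the file is JSON containing client_id, client_secret, and refresh_token; that layout and the function name check_token_file are hypothetical, and the real file location should be confirmed on the instance.

```python
import json
import os
import stat
from typing import List

# Assumed token file layout; adjust to whatever the sync script actually reads.
REQUIRED_FIELDS = ("client_id", "client_secret", "refresh_token")


def check_token_file(path: str) -> List[str]:
    """Return a list of problems found; an empty list means the file looks usable."""
    if not os.path.isfile(path):
        return [f"missing file: {path}"]
    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    try:
        with open(path) as handle:
            token = json.load(handle)
    except ValueError as exc:
        return problems + [f"not valid JSON: {exc}"]
    if not isinstance(token, dict):
        return problems + ["top-level JSON value is not an object"]
    for field in REQUIRED_FIELDS:
        if not token.get(field):
            problems.append(f"missing field: {field}")
    return problems
```

This only checks that the file is present and well formed. Whether Google still accepts the token is confirmed by the manual sync itself.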

What's Next: Hardening the Sync Pipeline

Beyond the immediate token refresh, we should implement:

  • Token expiration alerts: Add a health check cron job that tests the stored OAuth token against Google's API every 6 hours and alerts if a 400 is received
  • Automatic token rotation: Integrate port_sheet_sync.py with a token refresh library (using the same pattern as auth_ga.py) so tokens refresh automatically before expiration
  • Explicit error propagation: Modify the cron job to write sync failures to a CloudWatch log group or SNS topic, ensuring visibility into sync outages
  • Service account migration: Long-term, migrate from user-account OAuth to a Google Cloud service account with a private key, removing the dependency on a user-granted refresh token that can expire or be revoked
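For the health check and error propagation items, one possible building block is a function that turns a token-endpoint response into an alerting decision. This is a sketch: the verdict names and the functions classify_token_response and needs_alert are hypothetical, while the invalid_grant and invalid_client error codes come from the OAuth 2.0 specification (RFC 6749, section 5.2). How the status and body are obtained, and where alerts are sent, is left out.

```python
import json

# Verdicts that should notify the operator right away.
ALERTING_VERDICTS = {"reauth_required", "misconfigured", "unknown"}


def classify_token_response(status: int, body: str) -> str:
    """Map an OAuth token-endpoint response to a coarse health verdict."""
    if 200 <= status < 300:
        return "ok"
    try:
        error = json.loads(body).get("error", "")
    except (ValueError, AttributeError, TypeError):
        error = ""  # body was not a JSON object
    if status == 400 and error == "invalid_grant":
        # Refresh token expired or revoked: a human must re-authorize.
        return "reauth_required"
    if status in (400, 401):
        # For example invalid_client or a malformed request.
        return "misconfigured"
    if status == 429 or status >= 500:
        return "transient"
    return "unknown"


def needs_alert(verdict: str) -> bool:
    """Transient errors wait for the next run; everything else alerts."""
    return verdict in ALERTING_VERDICTS
```

Separating reauth_required from transient matters because only the former needs a person. A cron wrapper could exit non-zero or publish to SNS when needs_alert returns True.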

The daemon health check also revealed that two out of three agent sessions today hit the 30-turn Claude limit,