```html

Diagnosing and Resolving OAuth Token Failures in Multi-Site Orchestration: The jada-agent Case Study

During a routine health check of the jada-agent daemon running on our Lightsail orchestration instance (34.239.233.28), we uncovered a critical authentication failure affecting our Google Sheets synchronization pipeline. This post details the diagnostic process, root cause analysis, and the architectural patterns we use to manage distributed task orchestration with external API dependencies.

The Problem: Silent Google OAuth Failures

The jada-agent.service was technically healthy: active, running for 3 days with negligible resource consumption (0.65% CPU, 144MB RAM). However, the port_sheet_sync.py script, which runs on a 30-minute cycle, had been silently failing for hours with a consistent error signature:

[port-sheet] token error: HTTP Error 400: Bad Request

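A 400 at the token step is a common symptom of a Google OAuth refresh token that has expired or been revoked. Google's token endpoint normally reports this case as invalid_grant in the response body, which the log line above does not capture. The daemon didn't crash; it logged the error and continued its polling loop, so critical synchronization tasks were skipped without triggering alerts.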

Architecture: Multi-Site Orchestration with Daemon-Based Task Management

Understanding why this happened requires understanding our infrastructure. We maintain multiple websites (86from.com, sailjada.com, queenofsandiego.com) across a distributed system:

  • Central Orchestrator: The jada-agent daemon (Lightsail instance) polls a progress dashboard, retrieves pending tasks, and executes them sequentially
  • Task Types: Site deployments (S3 + CloudFront invalidation), Google Analytics reporting, Sheets synchronization, booking automation
  • External Dependencies: Google APIs (Sheets, Analytics), AWS services (S3, CloudFront, Lightsail), DNS providers
  • Session Limits: Claude API sessions are capped at 30 turns each to control costs and prevent infinite loops

The port_sheet_sync.py script is responsible for bidirectional synchronization between our booking data and a Google Sheet, which serves as a source of truth for crew scheduling and availability. When this sync fails silently, downstream systems lose critical data freshness.

Diagnostic Process: SSH Access and Metrics Collection

The initial request was to SSH into 34.239.233.28 using a jada-key. The key wasn't available in the standard ~/.ssh directory, so we implemented a fallback strategy:

  1. Check local key storage: Searched /Users/cb/.ssh, ~/.ssh, and repos.env for key references
  2. Enumerate Lightsail keypairs: Listed all available keypairs in the Lightsail API and matched against stored secrets
  3. Fallback to AWS SSM Session Manager: When the private key wasn't found, we used the AWS Systems Manager Session Manager as an alternative secure access method
  4. Leverage Lightsail Temporary Credentials API: Generated temporary SSH credentials via the GetInstanceAccessDetails API endpoint, which returns a short-lived certificate and private key

This multi-layer fallback pattern is intentional: it prevents hard dependencies on a single key file while maintaining security through temporary credentials and audit trails.

Health Check Findings

Once connected, we collected comprehensive daemon metrics:

  • Service Status: systemctl status jada-agent.service confirmed active/running since May 10 (3 days uptime)
  • System Health: load average 0.00, 144MB/914MB memory, 6.2GB/39GB disk, zero status check failures in last 2 hours
  • Session Activity: 3 of 5 daily sessions used; Sessions 1 and 3 hit the 30-turn Claude limit (exit code 1); Session 2 completed successfully and created a needs-you task
  • Task Queue: After 00:05 UTC, no new tasks were detected, and the daemon was correctly idling in its poll loop

The max-turns exits are expected behavior for complex tasks but worth monitoring. Session 2's successful run proved the daemon architecture itself is sound.

Root Cause: Expired Google OAuth Token for port_sheet_sync

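The critical finding: every port_sheet_sync.py execution since at least the afternoon of 2026-05-13 has logged "HTTP Error 400: Bad Request". The "token error" prefix suggests the failure occurs when the script exchanges its stored credentials for an access token, before any Sheets API call is made. An expired access token alone (typical lifetime: 1 hour) would not explain this, because refreshing it is routine. A persistent 400 points instead to the refresh token in the secrets backend having expired or been revoked.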

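This is a common failure mode in OAuth 2.0 flows. The token was originally obtained via a three-legged authorization flow using google-auth-oauthlib. Google can invalidate a refresh token for several reasons: the user disconnects the application in their Google account settings, the token goes unused for six months, the account exceeds the limit on live refresh tokens per client, or the OAuth client is still in "Testing" publishing status, in which case refresh tokens expire after seven days. A password change can also revoke the token, though Google documents this only for tokens that include Gmail scopes.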

Key Decisions and Architecture Patterns

  • Why We Didn't Hard-Fail: The daemon is designed to be resilient. Instead of crashing on individual task failures, it logs errors and continues. This prevents a cascading failure where one broken sync takes down the entire orchestration system.
  • Why Silent Failures Are Dangerous: Resilience without observability becomes invisibility. We need to add alerting for repeated OAuth errors: three consecutive failures should trigger a Slack notification or PagerDuty alert.
  • Why We Use Temporary Credentials: Rather than storing persistent SSH keys on disk, we generate short-lived certificates via the Lightsail API. This reduces the attack surface for a compromised development machine.
  • Why We Track Session Turns: The 30-turn limit on Claude sessions is a cost and safety control. Complex tasks that consistently exceed this limit signal that either task scope needs refinement or the orchestration logic needs optimization.

What's Next: Resolving the OAuth Token

To restore port_sheet_sync functionality, we need to re-authenticate the Google OAuth token. The process involves:

  1. Running the auth_ga.py script (located at /Users/cb/Documents/repos/tools/auth_ga.py) with the account flag to trigger the 3-legged OAuth flow
  2. Storing the new token (including refresh token) in the secrets backend that port_sheet_sync.py reads from
  3. Verifying the first 30-minute cycle after re-auth completes successfully
  4. Adding monitoring rules to alert on repeated OAuth failures

We should also implement a token rotation strategy: refreshing tokens proactively before expiration rather than waiting for failures.

Lessons for Multi-Site Orchestration

This incident reinforces several architectural principles: external service integrations must have fallback modes and should expose clear error telemetry; daemon-based orchestrators should be resilient but never silent; and temporary credentials are preferable to persistent secrets for infrastructure access.

```
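
Two of the follow-ups above lend themselves to a short sketch: making the 400 diagnosable by reading the error code from the token endpoint's response body, and alerting after three consecutive failures. The Python below is a minimal illustration under stated assumptions, not the actual port_sheet_sync.py code, which is not shown in this post. The names TokenRefreshError, parse_token_error, refresh_access_token, and FailureTracker are hypothetical. The token URL and form fields follow Google's documented refresh-token request.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint
ALERT_THRESHOLD = 3  # consecutive failures before alerting, as proposed in the post


class TokenRefreshError(Exception):
    """Hypothetical error type carrying the OAuth error code from the response body."""

    def __init__(self, status, error_code, description):
        super().__init__(f"HTTP {status}: {error_code} ({description})")
        self.status = status
        self.error_code = error_code
        # invalid_grant means the refresh token is expired or revoked:
        # retrying will not help, a human has to re-run the consent flow.
        self.needs_reauth = error_code == "invalid_grant"


def parse_token_error(status, body):
    """Build a TokenRefreshError from a token-endpoint error response body."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    return TokenRefreshError(
        status,
        payload.get("error", "unknown"),
        payload.get("error_description", ""),
    )


def refresh_access_token(client_id, client_secret, refresh_token):
    """Exchange a refresh token for a new access token (assumed flow, not the original script)."""
    data = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode()
    request = urllib.request.Request(TOKEN_URL, data=data)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as exc:
        # Read the body instead of logging only "HTTP Error 400: Bad Request".
        raise parse_token_error(exc.code, exc.read().decode("utf-8", "replace")) from exc


@dataclass
class FailureTracker:
    """Counts consecutive failures; signals once when the threshold is reached."""

    threshold: int = ALERT_THRESHOLD
    consecutive: int = 0

    def record(self, ok):
        """Record one sync result; return True exactly when an alert should fire."""
        if ok:
            self.consecutive = 0
            return False
        self.consecutive += 1
        return self.consecutive == self.threshold
```

In a sync loop, the caller would catch TokenRefreshError, log its error_code, and pass the outcome to FailureTracker.record. A True return is the point at which to send the Slack or PagerDuty notification. Firing only when the count first reaches the threshold avoids re-alerting every 30 minutes while the token stays broken. Note that the counter lives in memory, so it would need to be persisted if each sync runs as a separate process.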