Diagnosing and Resolving a Multi-Site CI/CD Pipeline: Daemon Health, OAuth Token Rotation, and SEO Content Deployment
This session involved troubleshooting the jada-agent orchestrator daemon running on a Lightsail instance, diagnosing OAuth token expiration across multiple Google services, and coordinating deployments across three separate static sites. The work demonstrates typical challenges in maintaining a distributed agent-driven CI/CD system with third-party authentication dependencies.
Infrastructure Overview
The deployment spans three primary domains, each with distinct purposes:
- queenofsandiego.com: Main booking automation site with Google Apps Script integration (BookingAutomation.gs)
- sailjada.com: Secondary booking and information site
- 86from.com: SEO-focused informational property (recently migrated from 86dfrom.com)
All static assets are served through CloudFront distributions backed by S3 buckets. The jada-agent daemon runs on a Lightsail instance at 34.239.233.28 and orchestrates multi-step tasks through Claude sessions, managing up to 5 concurrent sessions with a 30-turn limit per session.
What Was Done: Three Parallel Workstreams
1. Daemon Health Assessment
Initial request: verify the jada-agent.service status and confirm active task processing. The challenge was that the SSH private key (jada-key) was not stored locally on the development machine.
Solution approach:
- Checked ~/.ssh/ and searched repos.env for key path references; the key was not found locally
- Retrieved temporary SSH credentials via the AWS Lightsail API endpoint using IAM permissions
- Validated the certificate-based SSH connection (using the returned private key paired with its OpenSSH certificate)
- Executed remote commands to collect daemon state without permanently storing credentials on disk
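The steps above can be sketched roughly as follows. This is a minimal illustration, not the commands actually run in the session: it assumes boto3 is available for the Lightsail call (`get_instance_access_details`), and the helper names, the key file name, and the region default are hypothetical. The key and certificate live only inside a temporary directory that is removed as soon as the remote command returns.

```python
import os
import subprocess
import tempfile


def write_temp_ssh_credentials(access_details, directory):
    """Write the short-lived private key and certificate into `directory`.

    `access_details` is the `accessDetails` dict returned by Lightsail.
    OpenSSH loads `<key>-cert.pub` automatically when it sits next to the key.
    """
    key_path = os.path.join(directory, "lightsail_temp_key")  # hypothetical name
    cert_path = key_path + "-cert.pub"
    # Create the key with mode 0600 from the start so it is never group/world readable.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(access_details["privateKey"])
    with open(cert_path, "w") as handle:
        handle.write(access_details["certKey"])
    return key_path, cert_path


def ssh_command(access_details, key_path, remote_command):
    """Build the argv for a one-off remote command."""
    target = f'{access_details["username"]}@{access_details["ipAddress"]}'
    return ["ssh", "-i", key_path, "-o", "IdentitiesOnly=yes", target, remote_command]


def run_remote(access_details, remote_command, runner=subprocess.run):
    """Run one command; the credentials are deleted when the block exits."""
    with tempfile.TemporaryDirectory() as tmp:
        key_path, _ = write_temp_ssh_credentials(access_details, tmp)
        argv = ssh_command(access_details, key_path, remote_command)
        return runner(argv, capture_output=True, text=True, check=False)


def fetch_access_details(instance_name, region="us-east-1"):
    """Ask Lightsail for temporary SSH access (needs boto3 and IAM permissions)."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("lightsail", region_name=region)
    response = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    return response["accessDetails"]
```

A call such as `run_remote(fetch_access_details("<instance-name>"), "systemctl is-active jada-agent.service")` would then collect daemon state without leaving a key behind; the instance name is a placeholder because the source does not give it.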
Key findings:
- jada-agent.service: Active and running for 3+ days with no crashes
- System load: 0.00 (idle between tasks), CPU 0.65% average
- Today's session history: 3 of 5 available sessions used; 2 hit the 30-turn Claude limit (exit code 1), 1 completed successfully
- No CPU spikes or memory pressure
- CloudWatch status checks: 0 failures in last 2 hours
The daemon is operationally healthy. The two max-turn exits are not crashes but expected behavior when task complexity exceeds the 30-turn window.
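The distinction between a healthy daemon, an expected max-turn exit, and a genuine failure can be made explicit in a small check. The sketch below parses the `Key=Value` output of `systemctl show jada-agent.service --property=ActiveState,SubState,NRestarts`; the session classification is an assumption about how the daemon records exit codes and turn counts, since the source does not show its log format.

```python
def parse_systemctl_show(output):
    """Parse the `Key=Value` lines printed by `systemctl show`."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def is_healthy(props):
    """A unit counts as healthy when it is active and its main process is running."""
    return props.get("ActiveState") == "active" and props.get("SubState") == "running"


def classify_session_exit(exit_code, turns_used, max_turns=30):
    """Separate expected max-turn exits from real failures (hypothetical record shape)."""
    if exit_code == 0:
        return "completed"
    if turns_used >= max_turns:
        return "max_turns"
    return "failed"
```

Under these assumptions, a session that exits with code 1 after 30 turns is reported as `max_turns` rather than as a crash, which matches the reading of the findings above.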
2. Google OAuth Token Expiration and Rotation
Logs collected over SSH revealed a critical failure: on every 30-minute run, the port_sheet_sync.py script's Google OAuth authentication was failing with HTTP 400 Bad Request. This prevented port sheet syncs from running.
Root cause: OAuth access tokens are short-lived and must be renewed with a refresh token. The token stored for the port_sheet_sync service had expired, and the automatic refresh was not succeeding.
Investigation steps:
- Located the auth script at /Users/cb/Documents/repos/tools/auth_ga.py
- Verified that the google-auth-oauthlib library was installed and accessible
- Confirmed that existing credentials for the dangerouscentaur@gmail.com account contained valid client_id and client_secret (reusable across multiple service authentications)
- Checked the jada-agent secrets directory for stored token state
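Resolution: Re-authentication was queued as a priority task. The auth_ga.py script uses the OAuth 2.0 authorization code flow to obtain a refresh token, which is then used to renew short-lived access tokens automatically. A Google refresh token does not have a fixed lifetime, but it can stop working, for example after six months without use or when the user revokes access, and at that point the flow must be run again. The script stores the refresh token in the secrets directory with restricted file permissions (0600).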
Why this matters: Long-running unattended jobs that authenticate as a Google user must be prepared to re-authenticate whenever the refresh token expires or is revoked. Hard-coding tokens guarantees eventual failure; once configured, the auth script renews access tokens transparently, and manual intervention is needed only when the refresh token itself is rejected.
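A sketch of how a sync job might tell "re-authentication required" apart from a transient error, and how a token file can be written with 0600 permissions. This is not the contents of auth_ga.py, which the source does not show: the function names are hypothetical, and `load_and_refresh` assumes the google-auth package and a token file in its authorized-user JSON format.

```python
import json
import os


def needs_reauth(status_code, body):
    """Return True when a token-endpoint error cannot be fixed by retrying.

    Google's token endpoint answers HTTP 400 with {"error": "invalid_grant"}
    when a refresh token has expired or been revoked.
    """
    if status_code != 400:
        return False
    try:
        error = json.loads(body).get("error")
    except (ValueError, TypeError, AttributeError):
        return False  # not a JSON object, so we cannot conclude anything
    return error == "invalid_grant"


def save_token(path, token_json):
    """Write the token file so only the owner can read it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token_json)
    os.chmod(path, 0o600)  # also tightens a file that already existed


def load_and_refresh(path, scopes):
    """Load stored credentials and renew the access token if needed."""
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials

    creds = Credentials.from_authorized_user_file(path, scopes)
    if not creds.valid and creds.refresh_token:
        # Raises google.auth.exceptions.RefreshError if the refresh token is rejected.
        creds.refresh(Request())
        save_token(path, creds.to_json())
    return creds
```

Treating `invalid_grant` as a signal to queue re-authentication, rather than retrying every 30 minutes, would avoid the repeating failure seen in the logs.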
3. SEO Content and Booking Widget Deployment
While the daemon health investigation was underway, work proceeded on the 86from.com SEO project. This site was migrated from 86dfrom.com and required content updates and booking widget fixes.
File structure:
- Original directory: /Users/cb/Documents/repos/sites/86dfrom.com/
- Renamed to: /Users/cb/Documents/repos/sites/86from.com/
- Index file: /Users/cb/Documents/repos/sites/86from.com/site/index.html
- New SEO page: /Users/cb/Documents/repos/sites/86from.com/site/what-does-86d-mean
Booking widget issue: The embedded JavaScript booking widget used double-brace template syntax {{ }}, which is not valid JavaScript where an object literal is expected. A systematic search (grep with a regex) showed that these braces appeared only within the widget's <script> section, not elsewhere in the HTML.
Fix applied:
// Before (causing parse errors):
var bookingConfig = {{ propertyId: "86d-main", ... }};
// After (valid JavaScript):
var bookingConfig = { propertyId: "86d-main", ... };
This is a common issue when mixing server-side template languages (which often use {{ }}) with client-side JavaScript. The solution was to replace all double-brace occurrences within the widget block only, then syntax-check the extracted JavaScript block before deployment.
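The scoped replacement can be sketched as below. It is an illustration rather than the exact commands used: the function name and the `marker` parameter are hypothetical, and the approach assumes the widget block contains no legitimate `}}` (for example, from nested object literals), which a blanket replacement would corrupt.

```python
import re

# Matches each <script>...</script> element, capturing the tag, body, and close.
SCRIPT_BLOCK = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script>)", re.DOTALL | re.IGNORECASE
)


def fix_double_braces(html, marker="bookingConfig"):
    """Collapse {{ and }} only inside script blocks that mention `marker`."""

    def repair(match):
        open_tag, body, close_tag = match.groups()
        if marker not in body:
            return match.group(0)  # leave unrelated scripts untouched
        body = body.replace("{{", "{").replace("}}", "}")
        return open_tag + body + close_tag

    return SCRIPT_BLOCK.sub(repair, html)
```

Text outside `<script>` elements is never modified, so any template placeholders elsewhere in the page survive. The repaired script body can then be written to a file and syntax-checked before deployment, for instance with `node --check`.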
Deployment:
- Fixed index.html pushed to staging S3 bucket: s3://queenofsandiego-staging/
- CloudFront cache invalidated for the staging distribution
- New SEO content page created and deployed to production S3 bucket
- Production CloudFront distribution cache invalidated
- Version tag with model ID embedded in booking widget comments for tracking
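An upload-then-invalidate step like the ones listed above might look as follows with boto3. The helper names and the distribution ID parameter are placeholders, and treating extensionless keys such as what-does-86d-mean as HTML is an assumption about how the site is served, not something the source states.

```python
import mimetypes
import time


def invalidation_batch(paths, caller_reference=None):
    """Build the InvalidationBatch structure CloudFront expects."""
    # CloudFront paths must start with "/"; the set removes duplicates.
    items = sorted({p if p.startswith("/") else "/" + p for p in paths})
    return {
        "Paths": {"Quantity": len(items), "Items": items},
        # CallerReference must be unique per request.
        "CallerReference": caller_reference or f"deploy-{time.time_ns()}",
    }


def content_type_for(key):
    """Guess a Content-Type; extensionless pages are assumed to be HTML."""
    guessed, _ = mimetypes.guess_type(key)
    return guessed or "text/html"


def deploy(local_path, bucket, key, distribution_id):
    """Upload one file to S3, then invalidate its CloudFront path."""
    import boto3  # imported lazily; needs AWS credentials at call time

    boto3.client("s3").upload_file(
        local_path, bucket, key, ExtraArgs={"ContentType": content_type_for(key)}
    )
    boto3.client("cloudfront").create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch=invalidation_batch([key]),
    )
```

Setting the Content-Type explicitly matters for extensionless objects, because S3 would otherwise serve them with a generic binary type and browsers may download the page instead of rendering it.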
Key Architectural Decisions
Why use Lightsail + Agent daemon instead of Lambda? The jada-agent daemon requires persistent state (session history, pending task queue, rate limiting). Lambda's ephemeral nature would lose state between invocations. A continuously running daemon on Lightsail (t2.small instance, minimal cost) allows stateful orchestration of multi-turn AI reasoning without rebuilding context.
Why query the instance directly over SSH instead of relying on CloudWatch alone? Real-time daemon health requires checking service status, log state, and process counts. CloudWatch metrics are aggregated at a granularity of a minute or coarser; SSH let us query the exact state synchronously, which is critical for diagnosing transient failures.
Why separate staging and production S3 buckets? Staging allows testing booking widget changes without impacting live bookings. The split also provides a rollback point if a deployment introduces errors.
What's Next
- OAuth re-authentication: Execute auth_ga.py to refresh the port_sheet_sync token and verify 30-minute sync