Diagnosing and Remediating the JADA Agent Daemon: Infrastructure Health Check and OAuth Token Recovery
During a routine health check of our orchestrator daemon running on AWS Lightsail (34.239.233.28), we discovered a critical OAuth token failure in the port sheet sync process, alongside expected (but worth documenting) Claude API turn-limit exits during complex agent sessions. This post walks through the diagnostic approach, infrastructure findings, and remediation strategy.
What We Did
We performed a comprehensive health audit of the jada-agent.service daemon, including:
- Service status and uptime verification
- System resource utilization (CPU, memory, disk)
- AWS Lightsail status check metrics (last 2 hours)
- Recent daemon logs and session activity
- OAuth token validation for dependent services
- Agent session turn-limit analysis
Technical Details: Lightsail Instance Health
The Lightsail instance running the daemon showed excellent baseline health:
Service Status: Active (running)
Uptime: 11 days
Load Average: 0.00 (idle)
CPU: 0.65% average
Memory: 144 MB / 914 MB (15.8% utilization)
Disk: 6.2 GB / 39 GB (17% used)
Status Checks (last 2 hours): 0 failures
The low CPU and memory footprint indicate the daemon's 60-second polling loop is lightweight. The instance has been stable for over 10 days without resource contention or status check failures, suggesting the infrastructure layer is solid.
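The utilization percentages above are simple used/total ratios. As a minimal sketch (not the tooling used for this audit), the following shows how such figures could be computed with the Python standard library. The function names `utilization_pct` and `local_snapshot` are hypothetical, and `os.getloadavg` is only available on Unix-like hosts.

```python
import os
import shutil


def utilization_pct(used: float, total: float) -> float:
    """Return used/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * used / total, 1)


def local_snapshot(path: str = "/") -> dict:
    """Collect the 1-minute load average and disk usage for this host."""
    load_1m, _, _ = os.getloadavg()      # Unix only
    disk = shutil.disk_usage(path)       # named tuple: total, used, free (bytes)
    return {
        "load_1m": load_1m,
        "disk_used_pct": utilization_pct(disk.used, disk.total),
    }


# Memory figure reported above: 144 MB used of 914 MB.
print(utilization_pct(144, 914))  # 15.8
```

The same ratio applied to the disk figures (6.2 GB of 39 GB) gives roughly 16% rather than the reported 17%. One possible explanation is that tools such as df compute the percentage against used plus available space, which excludes reserved blocks.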
Session Activity Analysis
Over the past 24 hours (UTC), the daemon executed three agent sessions within its daily quota of 5:
- Session 1 (00:00 UTC): Hit Claude API turn limit (30 turns). Exit code 1. Task complexity exceeded allocated turns.
- Session 2 (00:02 UTC): Completed successfully. Processed e-signature page blockers and crew page generator code. Created a needs-you task for manual review.
- Session 3 (00:05 UTC): Hit Claude API turn limit (30 turns). Exit code 1.
- Post-Session 3: No new tasks queued; daemon idling normally.
The exit code 1 from hitting turn limits is logged as an error but does not crash the daemon. This is expected behavior when task complexity exceeds the 30-turn budget. Session 2's successful completion demonstrates the daemon recovers and continues processing normally between sessions.
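The behavior described here (log a non-zero exit, keep going, respect the daily quota) can be sketched as a small loop. This is an illustration under stated assumptions, not the daemon's actual code: `run_sessions` and the `run_session` callable are hypothetical names, and the real daemon may distinguish turn-limit exits from other failures in ways this sketch does not.

```python
from typing import Callable, Iterable, List, Tuple

DAILY_QUOTA = 5  # from the post: daily quota of 5 sessions


def run_sessions(
    tasks: Iterable[str],
    run_session: Callable[[str], int],
    quota: int = DAILY_QUOTA,
) -> List[Tuple[str, int]]:
    """Run up to `quota` sessions and record each exit code.

    A non-zero exit code is logged and the loop moves on to the next task,
    so a single failed session does not stop the daemon.
    """
    results: List[Tuple[str, int]] = []
    for task in tasks:
        if len(results) >= quota:
            break  # quota reached; remaining tasks wait for the next day
        code = run_session(task)
        if code != 0:
            print(f"[session] {task}: exit code {code} (logged, continuing)")
        results.append((task, code))
    return results


# Hypothetical stand-in for a real agent session, mirroring the three
# sessions above: the first and third hit the turn limit (exit code 1).
fake_exit_codes = {"session-1": 1, "session-2": 0, "session-3": 1}
history = run_sessions(fake_exit_codes, fake_exit_codes.__getitem__)
```

Because exit code 1 may also come from unrelated failures, a real implementation would probably want a more specific signal for "turn limit reached" than the exit code alone.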
Critical Issue: Port Sheet Sync Token Failure
The most significant finding: every 30-minute sync attempt since at least the afternoon of May 13 has failed with a Google OAuth token error.
[port-sheet] token error: HTTP Error 400: Bad Request
This indicates the Google OAuth token stored for port_sheet_sync.py (located in /Users/cb/Documents/repos/tools/port_sheet_sync.py) is either expired or revoked. The token is likely a refresh token that wasn't renewed, or the underlying Google API credentials (client ID and secret) were rotated without re-authorization.
Impact: Port sheet synchronization is currently non-functional. Any upstream systems depending on real-time port sheet data will be receiving stale information.
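The log line only records the HTTP status, which does not distinguish an expired or revoked refresh token from a credentials problem. OAuth 2.0 token endpoints return a JSON body with an `error` field (RFC 6749, section 5.2), so logging that body would narrow things down. The sketch below assumes the sync script can get at the response body (for example via `HTTPError.read()` if it uses `urllib`); the function name and hint wording are hypothetical.

```python
import json

# Error codes defined by the OAuth 2.0 spec (RFC 6749, section 5.2).
# The hints are our own suggestions, not text returned by Google.
HINTS = {
    "invalid_grant": "refresh token expired or revoked; re-authorize the account",
    "invalid_client": "client ID or secret not accepted; check stored credentials",
    "invalid_request": "malformed token request; inspect the request parameters",
}


def explain_token_error(body: str) -> str:
    """Turn a token-endpoint error body into a one-line diagnosis."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unparseable error body"
    if not isinstance(payload, dict):
        return "unparseable error body"
    code = payload.get("error", "unknown")
    hint = HINTS.get(code, "no specific hint; see error_description")
    return f"{code}: {hint}"


print(explain_token_error('{"error": "invalid_grant"}'))
```

Writing this diagnosis next to the existing `[port-sheet] token error` line would make the cause visible in the journal without reproducing the failure by hand.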
Infrastructure and Authentication Architecture
The daemon's architecture relies on several credential and secret stores:
- SSH Access: AWS Lightsail API provides temporary SSH certificates paired with the private key from the jada-key key pair. The key itself is not stored locally; we retrieve it on-demand via the Lightsail API.
- Google OAuth Tokens: Stored in a secrets directory (location: ~/.config/jada-agent/secrets/ or equivalent on the Lightsail instance). Structured as JSON with client_id, client_secret, and access_token/refresh_token.
- Service Logging: Systemd journal accessible via journalctl -u jada-agent.service.
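Given the token file layout described above, a quick structural check can rule out a truncated or partially written file before digging into OAuth itself. This is a sketch only: the field names come from the description above, while the helper names and the choice to treat `refresh_token` as required are assumptions.

```python
import json
from pathlib import Path
from typing import List

# Fields named in the description above. Treating refresh_token as
# required is an assumption: without it the token cannot be renewed.
REQUIRED_FIELDS = ("client_id", "client_secret", "refresh_token")


def missing_token_fields(token: dict) -> List[str]:
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not token.get(name)]


def load_token(path: str) -> dict:
    """Read a token JSON file; raises if the file is missing or invalid."""
    return json.loads(Path(path).expanduser().read_text())


# Example with an in-memory token that lacks a refresh token.
sample = {"client_id": "abc", "client_secret": "xyz", "access_token": "t"}
print(missing_token_fields(sample))  # ['refresh_token']
```

An empty result only shows that the file is structurally complete; it says nothing about whether Google still accepts the refresh token.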
The auth_ga.py utility (located at /Users/cb/Documents/repos/tools/auth_ga.py) is responsible for refreshing Google Analytics OAuth tokens via the google-auth-oauthlib library. It accepts an account email parameter (e.g., dangerouscentaur@gmail.com) and updates the stored token.
Remediation Steps
To restore port sheet synchronization:
- Re-authenticate the Google OAuth token:
    python3 ~/Documents/repos/tools/auth_ga.py --account dangerouscentaur@gmail.com
  This will trigger a browser-based OAuth flow (or device code flow if headless) and update the stored token.
- Verify the token is written to the correct secrets path:
  Confirm the new token is stored in the same location port_sheet_sync.py reads from.
- Manually trigger a sync test:
  Run port_sheet_sync.py directly to ensure it can authenticate and fetch data before the next scheduled 30-minute interval.
- Monitor the daemon logs:
    journalctl -u jada-agent.service -f
  Watch for the next sync attempt to confirm the error is resolved.
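As an alternative to watching the journal by hand, the final check could be scripted. The sketch below filters journal lines for the error prefix shown earlier; the function names are hypothetical, and the 35-minute window is simply an assumption chosen to cover one 30-minute sync interval.

```python
import subprocess
from typing import Iterable, List

# Prefix of the error line quoted earlier in this post.
TOKEN_ERROR_MARKER = "[port-sheet] token error"


def token_error_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the port sheet token error."""
    return [line.rstrip("\n") for line in lines if TOKEN_ERROR_MARKER in line]


def recent_journal(unit: str = "jada-agent.service",
                   since: str = "35 minutes ago") -> List[str]:
    """Fetch recent journal lines for the unit (requires journalctl)."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "--since", since, "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Offline example using the error line from the post plus a made-up line.
sample_lines = [
    "[port-sheet] token error: HTTP Error 400: Bad Request",
    "[daemon] idle",  # hypothetical line, for illustration only
]
errors = token_error_lines(sample_lines)
```

On the instance, `token_error_lines(recent_journal())` returning an empty list after a scheduled sync has run would suggest the token error is gone, though it does not by itself prove the sync succeeded.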
Addressing Turn Limit Exits
The two sessions that hit the 30-turn Claude API limit (exit code 1) are a pattern worth monitoring. This is not a bug: the daemon correctly exits when the budget is exhausted. However, if complex tasks regularly require more than 30 turns, consider the following options:
- Task decomposition: Break larger tasks into smaller, independently-completable subtasks to fit within turn budgets.
- Increase turn limit: If turn budgets are too conservative, update the daemon configuration to allow up to 50 or 75 turns for complex tasks (trade-off: higher API costs).
- Implement checkpointing: Resume interrupted sessions from their final state rather than restarting from scratch.
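Task decomposition and checkpointing work well together: if a task is split into subtasks and completed subtasks are recorded, a session that runs out of turns can be resumed without repeating work. The sketch below illustrates the idea under strong simplifying assumptions. All names are hypothetical, and the per-subtask turn costs are made-up inputs, since a real session would not know them in advance.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def load_checkpoint(path: str) -> dict:
    """Load saved state, or start fresh if no checkpoint exists."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {"done": []}


def save_checkpoint(path: str, state: dict) -> None:
    """Persist state so a later session can pick up where this one stopped."""
    Path(path).write_text(json.dumps(state))


def run_with_budget(
    subtasks: List[str],
    turn_cost: Dict[str, int],
    state: dict,
    turn_budget: int = 30,
) -> Tuple[dict, bool]:
    """Complete pending subtasks until the turn budget would be exceeded.

    Returns the updated state and True if the budget ran out early.
    """
    used = 0
    for name in subtasks:
        if name in state["done"]:
            continue  # finished in an earlier session
        if used + turn_cost[name] > turn_budget:
            return state, True
        used += turn_cost[name]
        state["done"].append(name)
    return state, False


# Two sessions over the same task list, sharing state in memory.
costs = {"a": 20, "b": 20, "c": 5}
state, exhausted_first = run_with_budget(["a", "b", "c"], costs, {"done": []})
state, exhausted_second = run_with_budget(["a", "b", "c"], costs, state)
```

In this example the first session finishes only `a` before the budget runs out, and the second finishes `b` and `c`. A subtask that costs more than the whole budget would never complete, which is where raising the turn limit becomes the relevant option.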
Key Decisions and Rationale
Why we didn't store SSH keys locally: Storing private keys on disk increases attack surface. The Lightsail API provides temporary, time-limited SSH certificates, eliminating the need for persistent key storage.
Why Google OAuth tokens are centralized: Multiple scripts (auth_ga.py, port_sheet_sync.py, and others) share the same Google workspace credentials. Centralizing tokens in a single secrets directory simplifies rotation and reduces duplication.
Why exit code 1 on turn limits is acceptable: The daemon logs these exits but continues running. The intent is to fail fast and allow operators to decide whether to rerun the task with higher budgets or decompose it. This is safer than silently truncating