Diagnosing and Resolving OAuth Token Failures in Distributed Agent Infrastructure
This post documents the investigation and findings from a health check on the jada-agent orchestrator daemon running on AWS Lightsail (34.239.233.28). During routine monitoring, we discovered a critical OAuth authentication failure affecting Google Sheets sync operations, alongside some architectural patterns worth examining for scalability.
What Was Done
We performed a comprehensive daemon health audit including:
- SSH access via AWS Lightsail temporary credentials API (replacing unavailable local key pair)
- Service status verification of the jada-agent.service systemd unit
- Log analysis from port_sheet_sync.py covering the last 24 hours
- CPU/memory/disk utilization metrics via Lightsail CloudWatch API
- Task queue inspection against the progress dashboard
- Session counter validation across today's three agent invocations
Technical Details: What We Found
Service Health (Good)
The jada-agent.service unit has been running continuously since May 10, 2026—3 days of uptime. Systemd logs show clean restarts with no segfaults or OOM kills. The daemon implements a 60-second polling loop on the task queue, which means:
- Load average of 0.00 between tasks (expected for I/O-bound work)
- CPU baseline ~0.65% during the 60s sleep cycle
- Memory footprint 144MB / 914MB available—no pressure
- Disk usage 6.2GB / 39GB (17%)—well under any thresholds
AWS Lightsail status checks reported zero failures in the last 2 hours, indicating stable network and hypervisor health.
Session Management and Task Processing (Mixed Results)
We analyzed the session counter stored in the daemon's local state file. Today (UTC) shows:
Session 1 (00:00-00:04): Exit code 1 (max turns reached)
Session 2 (00:02-00:05): Exit code 0 (success)
Session 3 (00:05-00:12): Exit code 1 (max turns reached)
The daemon respects a hardcoded 30-turn limit per invocation—a safety measure to prevent runaway costs on Claude API calls. Sessions 1 and 3 hit this ceiling. Session 2, which completed within the limit, successfully processed e-signature and crew page generation tasks, creating a task marked "needs-you" for human review.
This pattern suggests task complexity is sometimes exceeding the turn budget. While the daemon doesn't crash on exit code 1, incomplete tasks remain queued until the next session. For workloads with consistent turn overruns, we should consider either:
- Increasing the turn limit (higher API cost, longer execution time)
- Breaking complex tasks into smaller sub-tasks (architectural refactor)
- Implementing a retry mechanism that resumes from the last turn checkpoint
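The third option can be sketched as follows. This is a minimal illustration, not the daemon's actual code: `run_with_checkpoint`, the `step` callable, and the JSON checkpoint file are all hypothetical names, and the sketch assumes the agent's state can be serialized to JSON.

```python
import json
from pathlib import Path

MAX_TURNS = 30  # mirrors the per-invocation limit described above


def run_with_checkpoint(task_id, step, checkpoint_dir, max_turns=MAX_TURNS):
    """Run `step` until it reports completion or the turn budget is spent.

    `step(state)` is a hypothetical callable that performs one agent turn
    and returns (new_state, done). `new_state` must be a JSON-serializable
    dict that carries forward the keys of the state it was given.
    """
    path = Path(checkpoint_dir) / f"{task_id}.json"
    # Resume from the previous session's state if a checkpoint exists.
    state = json.loads(path.read_text()) if path.exists() else {"turns_used": 0}

    for _ in range(max_turns):
        state, done = step(state)
        state["turns_used"] = state.get("turns_used", 0) + 1
        if done:
            path.unlink(missing_ok=True)  # task finished; drop the checkpoint
            return True, state

    # Budget exhausted: persist state so the next session can continue.
    path.write_text(json.dumps(state))
    return False, state
```

With this shape, a task that needs 45 turns would stop after 30 in one session and finish after 15 more in the next, instead of restarting from zero.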
Critical Issue: Google OAuth Token Expiration in port_sheet_sync.py
Every 30-minute sync invocation since at least this afternoon has failed with:
[port-sheet] token error: HTTP Error 400: Bad Request
The port_sheet_sync.py script, located at /opt/jada-agent/scripts/port_sheet_sync.py, maintains a Google OAuth token file (path obfuscated for security, but stored in the service's credential directory). This token has either expired or been revoked at the Google OAuth provider.
Google OAuth 2.0 refresh tokens typically expire after 6 months of inactivity or if the user revokes access via their Google Account settings. The daemon logs show no successful sync since approximately 2026-05-13 14:30 UTC, suggesting the token became invalid sometime this afternoon.
Impact: Port sheet syncs have been non-functional for ~4 hours. Any downstream systems relying on up-to-date port data are stale.
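The logged message only carries the HTTP status line, which makes it hard to tell an expired or revoked refresh token from a misconfigured client. The sketch below shows one way to surface the OAuth error code from the response body. It assumes the script calls Google's token endpoint directly with urllib (the log format matches urllib's HTTPError message, but the real implementation is not shown in this post); `describe_token_error` and `refresh_access_token` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def describe_token_error(body):
    """Turn an OAuth 2.0 error response body into an actionable log line."""
    try:
        payload = json.loads(body)
    except ValueError:
        return "unparseable error body"
    if not isinstance(payload, dict):
        return "unparseable error body"
    code = payload.get("error", "unknown")
    detail = payload.get("error_description", "")
    if code == "invalid_grant":
        # Standard OAuth 2.0 code for an expired or revoked refresh token.
        return f"invalid_grant: refresh token expired or revoked; re-authenticate ({detail})"
    return f"{code}: {detail}" if detail else code


def refresh_access_token(client_id, client_secret, refresh_token):
    """Exchange a refresh token for an access token, with readable errors."""
    data = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode()
    request = urllib.request.Request(TOKEN_URL, data=data)
    try:
        with urllib.request.urlopen(request, timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # The body, not the status line, says why the request was rejected.
        raise RuntimeError(describe_token_error(err.read())) from err
```

Logging the parsed error code on each failed sync would turn the generic 400 into a direct prompt to re-authenticate.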
Infrastructure and Architecture
Access Pattern: Lightsail Temporary Credentials
The jada SSH private key was not stored in the expected locations on the developer workstation. Rather than reconstructing key distribution, we leveraged the AWS Lightsail GetInstanceAccessDetails API endpoint, which generates temporary OpenSSH certificates valid for 60 minutes. This approach:
- Eliminates persistent key management burden on local machines
- Provides audit trail via CloudTrail (who accessed which instance, when)
- Reduces exposure window: credentials are short-lived and expire on their own
- Scales better: no SSH key rotation ceremonies needed
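Example command pattern (a bash sketch; assumes the AWS CLI and jq are installed):
# Request temporary SSH credentials for the instance
details=$(aws lightsail get-instance-access-details \
  --instance-name jada-orchestrator \
  --protocol ssh \
  --region us-east-1)
# Write the temporary private key (mktemp creates the file with 0600 permissions)
key=$(mktemp)
jq -r '.accessDetails.privateKey' <<<"$details" > "$key"
# ssh automatically loads the certificate from <identity>-cert.pub
jq -r '.accessDetails.certKey' <<<"$details" > "$key-cert.pub"
ssh -i "$key" ec2-user@34.239.233.28
# Clean up the temporary files after the session
rm -f "$key" "$key-cert.pub"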
Monitoring: CloudWatch Metrics via Lightsail API
Lightsail exposes CPU, network, and status check metrics through its native API. We pulled 2 hours of historical data without installing additional agents; Lightsail's built-in hypervisor monitoring is sufficient for this use case. Per the Lightsail API documentation, status check metrics are available at 1-minute granularity and other instance metrics, including CPU, at 5-minute granularity, which is still enough to detect short-lived spikes.
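A pull like this can be sketched with boto3's Lightsail client. The helper names, the default region, and the spike threshold are assumptions for illustration. The `period` default of 300 seconds follows the Lightsail documentation for CPU metrics, and status check metrics accept 60.

```python
from datetime import datetime, timedelta, timezone


def fetch_cpu_datapoints(instance_name, hours=2, period=300, region="us-east-1"):
    """Fetch average CPU utilization via the Lightsail API.

    Requires boto3 and AWS credentials; imported lazily so that
    find_spikes() below can be used without either.
    """
    import boto3

    end = datetime.now(timezone.utc)
    client = boto3.client("lightsail", region_name=region)
    resp = client.get_instance_metric_data(
        instanceName=instance_name,
        metricName="CPUUtilization",
        period=period,
        startTime=end - timedelta(hours=hours),
        endTime=end,
        unit="Percent",
        statistics=["Average"],
    )
    return resp["metricData"]


def find_spikes(datapoints, threshold_pct):
    """Return (timestamp, average) pairs above the threshold, oldest first."""
    spikes = [
        (p["timestamp"], p["average"])
        for p in datapoints
        if p["average"] > threshold_pct
    ]
    return sorted(spikes)
```

For example, `find_spikes(fetch_cpu_datapoints("jada-orchestrator"), 10.0)` would list every interval in the last two hours where average CPU exceeded 10%.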
Task Queue Architecture
The daemon consumes tasks from a persistent queue (likely DynamoDB or a similar service, though the exact backend is abstracted). The progress dashboard provides human visibility. The 5-session-per-day limit is enforced by a counter that resets at UTC midnight, preventing unbounded costs.
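A daily counter with a UTC-midnight reset can be sketched as below. The state file layout and the `try_start_session` name are hypothetical; the daemon's actual state format is not shown in this post.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

MAX_SESSIONS_PER_DAY = 5  # the daily limit described above


def try_start_session(state_path, now=None, limit=MAX_SESSIONS_PER_DAY):
    """Increment today's counter and return True, or return False at the limit."""
    now = now or datetime.now(timezone.utc)
    today = now.astimezone(timezone.utc).date().isoformat()
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    # A changed date means UTC midnight has passed, so the count starts over.
    count = state.get("count", 0) if state.get("date") == today else 0
    if count >= limit:
        return False
    path.write_text(json.dumps({"date": today, "count": count + 1}))
    return True
```

Storing the date next to the count means no separate reset job is needed: the first call after midnight UTC sees a stale date and starts again from zero.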
Key Decisions and Rationale
- Why temporary creds instead of key retrieval: Simpler than recovering a lost key pair; provides better audit trail; acceptable latency for interactive debugging.
- Why the 30-turn limit matters: Claude API pricing scales linearly with tokens. Unconstrained agentic loops can exhaust budgets rapidly. The limit is a cost-control measure, but it creates a trade-off: some complex tasks get truncated. This is acceptable for now but becomes a bottleneck as task complexity grows.
- Why OAuth token failure is critical: Port sheet sync is a dependency for downstream reporting. The 4-hour outage likely affected any reports or dashboards referencing port data. Re-authentication must happen quickly.
What's Next
- Re-authenticate Google OAuth: Run the auth_ga.py tool with the dangerouscentaur@gmail.com account to refresh the Google OAuth token for port_sheet_sync.py. Verify that the new token is persisted to the credential store and that the next 30-minute sync succeeds.
- Evaluate turn limit: Collect metrics on session completion rates. If >30% of sessions are hitting the turn limit, consider bumping it to 40-50 and monitoring cost impact.
- Implement checkpoint/resume: For tasks that exceed the turn limit, implement a mechanism to save Claude conversation state and resume in the next session. This allows complex