Diagnosing and Remediating OAuth Token Failures in the JADA Agent Orchestrator

During a scheduled health check of the JADA agent daemon running on AWS Lightsail instance 34.239.233.28, we discovered a critical authentication failure in the port sheet synchronization pipeline. This post documents the diagnosis methodology, root cause analysis, and remediation strategy for OAuth token lifecycle management in long-running background services.

What Was Done

We performed a comprehensive health audit of the jada-agent.service systemd unit deployed on the Lightsail instance. The daemon had been in service for 11 days, and the current process had been up for 3 days. Our diagnostics revealed:

Service Status: Active and running, with normal CPU utilization (0.65% average) and healthy memory footprint (144MB / 914MB)
Session Activity: 3 sessions consumed out of the daily 5-session quota, with 2 runs hitting the 30-turn limit and 1 completing successfully
Critical Issue: The OAuth token request in port_sheet_sync.py has failed on every 30-minute sync cycle with an HTTP 400 Bad Request error
Status Checks: Zero infrastructure failures; network and disk I/O nominal

Technical Details: SSH Access and Diagnostics

Since the private key material for jada-key was not available in the local development environment, we employed AWS Lightsail's temporary credential system rather than storing long-lived SSH keys on developer machines. This approach minimizes key sprawl and aligns with zero-trust principles.

Command sequence for accessing the instance:

# Fetch temporary SSH credentials via Lightsail API
aws lightsail get-instance-access-details \
  --instance-name jada-agent-orchestrator \
  --region us-east-1

# Connect using certificate-based authentication
ssh -i /tmp/jada_temp_cert.pem ec2-user@34.239.233.28

# Verify systemd unit status
systemctl status jada-agent.service

# Extract 24-hour activity from daemon logs
journalctl -u jada-agent.service --since "24 hours ago" -n 500

The daemon's log output revealed the port_sheet_sync failure pattern clearly:

[port-sheet] token error: HTTP Error 400: Bad Request
[port-sheet] sync cycle failed at 2026-05-13T14:30:00Z
[port-sheet] retrying in 30 minutes
[port-sheet] token error: HTTP Error 400: Bad Request
[port-sheet] sync cycle failed at 2026-05-13T15:00:00Z

This pattern repeated every 30 minutes throughout the monitoring window, indicating a persistent authentication failure rather than transient network issues.
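To check the failure cadence from the logs rather than by eye, the timestamps can be pulled out and compared. The following is a minimal sketch that assumes the log lines keep the exact format shown in the excerpt above; the function names are illustrative and not part of the daemon.

```python
import re
from datetime import datetime, timedelta

# Matches the failure lines shown in the log excerpt above.
FAILURE_RE = re.compile(r"\[port-sheet\] sync cycle failed at (\S+)")


def failure_times(log_lines):
    """Return the timestamps of failed sync cycles, in log order."""
    times = []
    for line in log_lines:
        match = FAILURE_RE.search(line)
        if match:
            # The log uses a trailing "Z"; rewrite it as an explicit offset
            # so fromisoformat also works on Python versions before 3.11.
            stamp = match.group(1).replace("Z", "+00:00")
            times.append(datetime.fromisoformat(stamp))
    return times


def failure_gaps(log_lines):
    """Return the time between each pair of consecutive failures."""
    times = failure_times(log_lines)
    return [later - earlier for earlier, later in zip(times, times[1:])]


sample = [
    "[port-sheet] token error: HTTP Error 400: Bad Request",
    "[port-sheet] sync cycle failed at 2026-05-13T14:30:00Z",
    "[port-sheet] retrying in 30 minutes",
    "[port-sheet] token error: HTTP Error 400: Bad Request",
    "[port-sheet] sync cycle failed at 2026-05-13T15:00:00Z",
]
gaps = failure_gaps(sample)
print(gaps)
```

A run of identical gaps equal to the sync interval suggests that every cycle is failing, which points to a persistent fault rather than intermittent network trouble.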

Root Cause: OAuth Token Expiration and Revocation Handling

The port_sheet_sync.py script uses OAuth 2.0 credentials stored in the secrets directory (path withheld for security) to authenticate with Google's Sheets API. When an OAuth refresh token expires or is revoked (typically after 6 months of inactivity or following a user password change), the daemon continues attempting synchronization with stale credentials.

The current implementation in port_sheet_sync.py catches the HTTP 400 error but lacks a mechanism to:

Distinguish between transient network errors and permanent authentication failures
Alert operators when re-authentication is required
Gracefully degrade service rather than consuming resources in a retry loop
Implement exponential backoff with maximum retry limits

Obtaining a fresh authorization code through the Google OAuth 2.0 flow requires human interaction, which cannot be automated without storing user credentials, a security anti-pattern. Once the refresh token itself is rejected, the daemon cannot recover on its own; operator intervention is necessary.
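The first gap listed above, telling permanent authentication failures apart from retryable ones, could be closed with a small classifier. This is a sketch, not the current port_sheet_sync.py code: it assumes the script can see the HTTP status and response body of the failed token request, and it relies on the standard OAuth 2.0 error code invalid_grant, which token endpoints return with a 400 status when a refresh token is expired or revoked. The names FailureKind and classify_token_failure are hypothetical.

```python
import json
from enum import Enum


class FailureKind(Enum):
    AUTH = "auth"              # refresh token rejected; operator must re-authenticate
    RATE_LIMIT = "rate_limit"  # retry later
    SERVER = "server"          # upstream 5xx; retry later
    OTHER = "other"            # e.g. malformed request; needs investigation


def classify_token_failure(status, body):
    """Classify a failed token request by HTTP status and response body."""
    if status == 429:
        return FailureKind.RATE_LIMIT
    if 500 <= status <= 599:
        return FailureKind.SERVER
    if status == 400:
        try:
            error = json.loads(body).get("error")
        except (ValueError, TypeError, AttributeError):
            # Body was missing, not JSON, or not a JSON object.
            error = None
        if error == "invalid_grant":
            return FailureKind.AUTH
    return FailureKind.OTHER


print(classify_token_failure(400, '{"error": "invalid_grant"}'))
```

Connection errors that never produce an HTTP status would be handled by the caller as transient. Only the AUTH result should stop normal retries and trigger an operator alert.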

Infrastructure and Service Architecture

The JADA agent orchestrator is deployed as a single systemd service on AWS Lightsail with the following characteristics:

Instance Type: Lightsail bundled instance with 1GB RAM and 1 vCPU
Operating System: Amazon Linux 2
Service File Location: /etc/systemd/system/jada-agent.service
Script Directory: /opt/jada/scripts/ (contains port_sheet_sync.py, daemon loop, and task processors)
Configuration: Environment variables loaded from /opt/jada/.env (credentials sourced separately)
Log Aggregation: journalctl for systemd logs; application logs written to /var/log/jada-agent.log

The daemon operates in a polling loop, checking a "progress dashboard" (internal task queue) every 60 seconds for new work. When tasks are available, it invokes the Claude API through the agent framework to process e-signature workflows, crew page generation, and other automation tasks.
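The polling behavior described above can be pictured with a short sketch. It is illustrative only: run_polling_loop, fetch_tasks, and handle_task are hypothetical names, and the real daemon loop may be structured differently. The sleep function and cycle cap are parameters so the loop can be exercised without waiting.

```python
import time


def run_polling_loop(fetch_tasks, handle_task, interval_seconds=60,
                     sleep=time.sleep, max_cycles=None):
    """Poll for tasks, process each one, then wait before the next check.

    Runs forever when max_cycles is None; returns the number of cycles run.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for task in fetch_tasks():
            try:
                handle_task(task)
            except Exception as exc:
                # One failing task should not stop the whole loop.
                print(f"[daemon] task failed: {exc!r}")
        cycles += 1
        sleep(interval_seconds)
    return cycles


# Dry run with stand-in callables and no real waiting.
processed, naps = [], []
count = run_polling_loop(lambda: ["task-a", "task-b"], processed.append,
                         sleep=naps.append, max_cycles=2)
```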

Key Decisions and Remediation Strategy

Why we used Lightsail temporary credentials: Developer laptops should never store production SSH keys. By using the Lightsail API's get-instance-access-details endpoint, credentials are generated on-demand, time-limited (typically valid for 15 minutes), and auditable through CloudTrail. This eliminates the risk of key compromise from a stolen laptop.

Why OAuth token failure requires manual intervention: Google's OAuth 2.0 implementation requires explicit user authorization through a browser-based consent flow. Storing user passwords to automate re-authentication violates the OAuth specification and introduces credential exposure. Instead, we must implement a notification system to alert operators when tokens expire.

Recommended remediation for port_sheet_sync.py:

Implement specific HTTP error handling: distinguish 400 (auth failure) from 429 (rate limit) and 5xx (server error)
Upon detecting a persistent 400 error (>3 consecutive failures), write a structured alert to /var/log/jada-agent-alerts.log with severity CRITICAL
Configure a CloudWatch Logs Insights query to detect this pattern and trigger an SNS notification to the ops channel
Add exponential backoff: after the first 400 error, extend the retry interval from 30 minutes to 1 hour, then 4 hours, with a maximum backoff of 24 hours
Create a documented runbook for re-authenticating the port sheet sync token, involving the auth_ga.py utility (once fixed) or an equivalent OAuth re-auth script
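The backoff and alerting steps above can be sketched as follows. The source specifies the 1-hour and 4-hour steps and the 24-hour cap; multiplying by four at each step (giving 16 hours before the cap) is an assumption made here. The alert field names and all function names are also hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

BASE_INTERVAL = timedelta(minutes=30)  # normal sync cadence
MAX_BACKOFF = timedelta(hours=24)
ALERT_THRESHOLD = 3                    # alert once failures exceed this count


def next_retry_interval(consecutive_failures):
    """Delay before the next attempt: 1h, 4h, 16h, then capped at 24h."""
    if consecutive_failures <= 0:
        return BASE_INTERVAL
    # Cap the exponent so very long failure streaks cannot overflow timedelta.
    exponent = min(consecutive_failures - 1, 3)
    return min(timedelta(hours=4 ** exponent), MAX_BACKOFF)


def should_alert(consecutive_failures):
    """True once there have been more than three consecutive failures."""
    return consecutive_failures > ALERT_THRESHOLD


def format_alert(consecutive_failures, now=None):
    """Build one JSON line suitable for a structured alert log."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "timestamp": now.isoformat(),
        "severity": "CRITICAL",
        "component": "port-sheet",
        "message": "OAuth token rejected; operator re-authentication required",
        "consecutive_failures": consecutive_failures,
    })


def write_alert(path, consecutive_failures):
    """Append the alert line to the log file at the given path."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_alert(consecutive_failures) + "\n")
```

Writing one JSON object per line keeps the alert log easy to query from a log aggregation tool, since each entry can be parsed independently.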

Current Agent Session Activity

During the monitoring window, the daemon ran 3 of its 5 available daily sessions:

Session 1 (00:00 UTC): Hit max turns (30), exited with code 1. This is expected behavior when a task needs more turns than the session limit allows; the daemon logs it but does not crash.
Session 2 (00:02 UTC): Completed successfully. Processed e-signature page blockers and created a task for manual follow-up on crew page generation logic.
Session 3 (00:05 UTC): Hit max turns (30), exited