Automating OAuth Token Refresh and Credential Management for Multi-Service CI/CD Pipelines


During a recent infrastructure maintenance session, we encountered a critical blocker: Google OAuth tokens expiring in automated workflows that depend on Gmail API access, Google Sheets integration, and DynamoDB queries across multiple AWS regions. Rather than manually refreshing credentials, we implemented a systematic approach to token lifecycle management, credential isolation, and safe secrets storage on our EC2 deployment host.

The Problem: Token Expiration in Production Workflows

Our JADA operations stack orchestrates booking workflows, crew scheduling, and charter coordination through scripts that query Gmail, Google Sheets, and DynamoDB. These scripts run on a dedicated EC2 instance and authenticate using OAuth 2.0 refresh tokens stored locally. When tokens expire, the entire automation chain fails silently—reports don't generate, crew notifications don't send, and booking data doesn't sync.

There were two root causes: our reauth_google.py script wasn't properly handling token refresh cycles, and the credential storage strategy mixed secrets with application code, making rotation and auditing difficult.

Technical Architecture: Credential Isolation and Token Refresh

We restructured credential management around three core principles:

Secrets directory isolation: All OAuth tokens, API keys, and service credentials moved from scattered config files into ~/.secrets/ with strict 0700 permissions
Token refresh automation: A dedicated reauth_google.py script handles OAuth 2.0 refresh token exchange without manual intervention
Unified token lifecycle: Multiple service scripts (gmail_jennifer.py, build_sheet.py, crew dispatch) share a single cached token at ~/.secrets/google_token.json

The OAuth refresh flow works like this:


1. Script attempts Gmail API call with cached token
2. Token expired → 401 response
3. Exception handler calls reauth_google.py
4. reauth_google.py reads refresh_token from ~/.secrets/
5. Exchanges refresh_token for new access_token via Google's token endpoint
6. Writes new access_token to ~/.secrets/google_token.json (atomic write)
7. Retries original API call with fresh token
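
Steps 4 through 6 can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the actual reauth_google.py: the function names and the expires_at field are hypothetical, while the token endpoint URL and form parameters follow Google's OAuth 2.0 documentation.

```python
import json
import os
import tempfile
import time
import urllib.parse
import urllib.request

TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build the POST request that trades a refresh token for an access token."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(TOKEN_ENDPOINT, data=form, method="POST")


def atomic_write_json(path, payload):
    """Write JSON to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")  # created with mode 0600
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


def refresh_access_token(client_id, client_secret, refresh_token, cache_path):
    """Exchange the refresh token and cache the result (performs a network call)."""
    request = build_refresh_request(client_id, client_secret, refresh_token)
    # urlopen raises HTTPError on a non-2xx response, so nothing is cached on failure.
    with urllib.request.urlopen(request, timeout=30) as response:
        token = json.load(response)
    # Hypothetical convenience field: absolute expiry derived from expires_in.
    token["expires_at"] = time.time() + token["expires_in"]
    atomic_write_json(cache_path, token)
    return token
```

Because the temp file is renamed over the target, a script reading the cache at the same moment sees either the old token or the new one, never a half-written file.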

Implementation Details: The Patched Reauth Script

The original reauth_google.py on the EC2 instance had a hardcoded path that assumed the script lived in a specific repo directory. We patched it to use environment-aware paths:


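import os

# Old (fragile): absolute path tied to one checkout location
secrets_file = "/home/ubuntu/repos/jada-ops/.secrets/refresh_token.json"

# New (portable): resolved against the invoking user's home directory
secrets_file = os.path.expanduser("~/.secrets/refresh_token.json")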

This change allows the script to work regardless of where it's invoked from—whether called directly, via cron, or imported as a module by other services. We also added explicit error handling for missing secrets files and invalid JSON, so token refresh failures surface with clear diagnostics rather than cryptic token errors downstream.

The patched script validates:

Secrets directory exists and has correct permissions (0700)
Refresh token file contains valid JSON
OAuth client credentials are available (loaded from environment or config)
Token endpoint response is successful before caching the new token
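
The directory and file checks in this list might look roughly like the following. It is a sketch only: SecretsError, load_refresh_token, and the refresh_token key name are assumptions, since the post does not show the real file layout, and the client-credential and endpoint-response checks are left out.

```python
import json
import os
import stat


class SecretsError(RuntimeError):
    """Raised when the secrets directory or a secrets file is unusable."""


def load_refresh_token(secrets_dir="~/.secrets", filename="refresh_token.json"):
    """Return the parsed refresh-token file, or raise SecretsError with a clear reason."""
    directory = os.path.expanduser(secrets_dir)
    if not os.path.isdir(directory):
        raise SecretsError(f"secrets directory not found: {directory}")

    # Compare only the permission bits, ignoring the file-type bits.
    mode = stat.S_IMODE(os.stat(directory).st_mode)
    if mode != 0o700:
        raise SecretsError(f"{directory} has mode {mode:o}, expected 700")

    path = os.path.join(directory, filename)
    try:
        with open(path) as handle:
            data = json.load(handle)
    except FileNotFoundError:
        raise SecretsError(f"refresh token file missing: {path}") from None
    except json.JSONDecodeError as exc:
        raise SecretsError(f"{path} is not valid JSON: {exc}") from exc

    # Assumed key name; adjust to whatever the real file contains.
    if not isinstance(data, dict) or "refresh_token" not in data:
        raise SecretsError(f"{path} has no 'refresh_token' field")
    return data
```

Raising one dedicated exception type lets the calling script log a single, specific message instead of surfacing an unrelated error from the API client later.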

Secrets Storage Strategy

All credentials live in ~/.secrets/ on the EC2 instance with the following structure:


~/.secrets/
├── google_token.json          # Cached access_token (auto-refreshed)
├── refresh_token.json          # Long-lived refresh token (never shared)
├── gmail_credentials.json      # OAuth client ID/secret
└── aws_service_role           # AWS IAM role (credential-less, via EC2 instance profile)

Critically, we do not store AWS credentials on disk. The EC2 instance uses an IAM instance profile with policies granting access to:

DynamoDB tables: crew-dispatch, charter-chats in us-east-1 and us-west-2
S3 bucket: shipcaptaincrew-data (for charter manifest backups)
CloudFront distribution: d2x...cloudfront.net (cache invalidation after schema updates)

This approach eliminates long-term AWS key material from the instance while maintaining least-privilege access: the instance profile supplies short-lived role credentials that the AWS SDK picks up automatically.

Integration with Existing Scripts

Three critical scripts depend on this refresh mechanism:

/tmp/gmail_jennifer.py — Searches Gmail for booking confirmations (scope: gmail.readonly)
/tmp/build_sheet.py — Generates monthly revenue reports as XLSX (scope: spreadsheets.readonly)
/tmp/gmail_diag.py — Diagnostic utility to validate token state and Gmail connectivity

Each script follows this pattern when making API calls:


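import subprocess
from googleapiclient.errors import HttpError

try:
    result = service.users().messages().list(...).execute()
except HttpError as e:
    if e.resp.status == 401:
        # Access token rejected: refresh the shared cache
        subprocess.run(["/path/to/reauth_google.py"], check=True)
        # Reload credentials from ~/.secrets/google_token.json and rebuild
        # `service` here, then retry the call once
    else:
        raise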

The unified token cache means reauth happens once per expiration cycle across all services, not separately for each script.
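
Since every script repeats the same try/except shape, it could be factored into a small helper. The sketch below is hypothetical (call_with_reauth and its parameters are not part of the scripts described here) and stays library-agnostic by taking the auth-error test as a callable.

```python
def call_with_reauth(call, reauth, is_auth_error, max_retries=1):
    """Run call(); on an auth failure, run reauth() and retry up to max_retries times."""
    attempts = 0
    while True:
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not an auth failure, or once retries run out.
            if attempts >= max_retries or not is_auth_error(exc):
                raise
            attempts += 1
            reauth()
```

With the Google client, is_auth_error would check for an HttpError whose resp.status is 401, and call would need to build its service object from the freshly written cache so the retry actually uses the new token.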

Deployment and Validation

The patched reauth_google.py was deployed via SSH to the EC2 instance with syntax validation:


# Backup original
cp /path/to/reauth_google.py /path/to/reauth_google.py.backup

# Deploy patched version
# (syntax checked locally via python -m py_compile before deployment)

# Verify permissions
chmod 755 /path/to/reauth_google.py

We validated the fix by:

Running python -m py_compile on the patched script to catch syntax errors
Executing gmail_diag.py to confirm token refresh succeeds
Checking ~/.secrets/google_token.json timestamps before/after refresh
Verifying DynamoDB queries (which don't depend on OAuth) still work independently
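
The timestamp check can be scripted rather than eyeballed. This is a small sketch with a hypothetical helper name. It also compares inode numbers, on the assumption that the cache is replaced by an atomic rename, which yields a new inode even when the timestamp resolution is coarse.

```python
import os


def cache_was_rewritten(path, refresh):
    """Return True if refresh() changed the mtime or inode of the file at path."""
    before = os.stat(path)
    refresh()
    after = os.stat(path)
    return (after.st_mtime_ns, after.st_ino) != (before.st_mtime_ns, before.st_ino)
```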

Key Decisions and Trade-offs

Why use home-relative paths over hardcoded paths: Hardcoded absolute paths break when scripts move, get symlinked, or run in different deployment contexts. Using os.path.expanduser() resolves ~/.secrets/ against the home directory of whichever user runs the script, which keeps it portable across local dev, CI/CD pipelines, and production instances.

Why share a single token cache: Multiple separate token caches create sync issues and unnecessary API calls. A unified cache at a well-known path means any service can refresh tokens globally, and all subsequent calls benefit immediately.

Why not use AWS Secrets Manager: For this deployment phase, local secrets with file-based permissions (0700) provide sufficient isolation. AWS Secrets Manager would add latency and API call overhead for every token read. Once the deployment scales to multiple EC2 instances or requires audit logging, we'll migrate to Secrets Manager.