Building Production-Grade Disaster Recovery: The JADA v1.0 Snapshot Architecture
After a critical infrastructure reversion incident that lost work on the Queen of San Diego events pages, we implemented a comprehensive snapshot strategy across all JADA-related infrastructure. This post details the technical architecture, tooling decisions, and lessons learned from capturing 46 S3 buckets, 21 Lambda functions, 66 CloudFront distributions, and multiple Google Apps Script projects across three production sites.
What We Built
The v1.0 snapshot is a multi-layered backup capturing:
- AWS Infrastructure: 45 S3 buckets, 21 Lambda functions, 41 CloudFront distributions, 11 Route53 zones, DynamoDB tables, SES configuration, API Gateway endpoints, and IAM policies
- Google Apps Script Projects: Four critical GAS projects (main JADA, Rady Shell replacement, Rady Shell legacy, EYD) with full source code
- Local Application Code: Complete site repositories for queenofsandiego.com, sailjada.com, and salejada.com
- Configuration & State: Environment variables, CloudFront origin configurations, Route53 DNS records, and DynamoDB schema
- Documentation: Handoff notes, development wikis, architecture diagrams, and operational procedures
Technical Architecture
Parallel Agent Strategy
Rather than sequentially backing up each resource, we deployed four independent background agents running in parallel:
Agent 1 (S3 Sync):
Task: aws s3 sync s3://[bucket-name]/ ./snapshots/v1.0/s3/[bucket-name]/
Scope: All 45 JADA-related buckets
Status: 68MB+ downloaded (30/45 buckets completed at checkpoint)
Agent 2 (Lambda Export):
Task: Export function code, environment variables, execution role, memory config
Scope: 21 Lambda functions across all regions
Status: 10/21 functions exported with configuration metadata
Agent 3 (AWS Service Configs):
Task: Export CloudFront distributions, Route53 zone files, DynamoDB schemas, ACM certificates
Scope: Multi-region AWS resources
Status: 41/41 CloudFront distributions and 11/11 Route53 zones exported
Agent 4 (Local & GAS Projects):
Task: Copy application repositories and pull Google Apps Script source
Scope: Three production sites + four GAS projects
Commands:
clasp pull [project-id] --rootDir ./snapshots/v1.0/gas/[project-name]/
Status: In progress with selective syncing to v1.0 directories
This parallelization reduced total backup time by approximately 75% compared to sequential execution.
S3 Bucket Inventory
We identified and categorized all 46 S3 buckets:
- Production Content Buckets: queenofsandiego.com, sailjada.com, salejada.com (main and staging variants)
- Static Asset Buckets: Hosting optimized CSS, JavaScript, image assets
- Lambda Function Code: Deployment packages and layer dependencies
- CloudFront Origin Buckets: Distribution-specific content caches
- Logging Buckets: CloudFront access logs, S3 access logs, API Gateway logs
- Archive & Backup Buckets: Historical snapshots and deprecated code
- Staging Workflow Buckets: Development and testing environments with _staging subfolder syncing from production
Google Apps Script Project Capture
We extracted four critical GAS projects using the clasp command-line tool:
Clasp Configuration Commands:
clasp pull [PROJECT_ID] --rootDir ./snapshots/v1.0/gas/[PROJECT_NAME]/
Projects Captured:
1. Main JADA GAS (primary form handlers and event processing)
2. Rady Shell Replacement GAS (venue configuration rewrite)
3. Rady Shell Legacy GAS (previous implementation, kept for reference)
4. EYD GAS Project (year-end data processing scripts)
Extracted Files Include:
- AppsScript.json (manifest, library dependencies, scopes)
- .gs source files (all custom functions and triggers)
- .html frontend templates
- Environment-specific API keys and OAuth tokens (documented separately in credentials manifest)
CloudFront & Route53 Governance
We captured all 41 CloudFront distributions and their origin configurations to prevent future reversion issues:
CloudFront Distribution Export:
aws cloudfront list-distributions --output json > snapshots/v1.0/cloudfront/distributions-manifest.json
For Each Distribution:
aws cloudfront get-distribution-config --id [DIST_ID] \
--output json > snapshots/v1.0/cloudfront/[DIST_ID]-config.json
Key Captured Data:
- Origin S3 bucket mappings (production vs. staging)
- Cache behaviors and TTL configurations
- Lambda@Edge functions attached to distributions
- SSL/TLS certificate associations (ACM cert ARNs, no key material)
- Custom domain CNAME aliases
- Origin access identity (OAI) configurations
Route53 Zone Files:
aws route53 list-resource-record-sets --hosted-zone-id [ZONE_ID] \
--output json > snapshots/v1.0/route53/[ZONE_NAME]-records.json
Captured all 11 hosted zones with complete A, CNAME, MX, TXT, and NS record sets
Key Infrastructure Decisions
Why Parallel Agents Instead of AWS Backup Service
AWS Backup doesn't comprehensively cover Google Apps Script projects, local application code, or configuration exports in a portable format. Our multi-agent approach provides:
- Portability: Snapshots are human-readable JSON and code files, not AWS-proprietary formats
- Version Control: Git can track snapshots; AWS Backup requires AWS infrastructure to restore
- Cross-Platform Coverage: Captures both AWS and Google Cloud resources in one workflow
- Audit Trail: Clear manifest files showing exactly what was backed up and when
Staging Bucket Strategy
Rather than maintaining separate staging buckets, we sync production content to a _staging subfolder within production buckets, then configure CloudFront to origin from that path. This:
- Reduces bucket proliferation and costs
- Ensures staging exactly matches production structure
- Simplifies cache invalidation (single CloudFront distribution)
- Provides immediate rollback capability
Why No Database Snapshots in v1.0
DynamoDB tables were scanned for schema and item counts (14 tables found), but full data exports were deferred because:
- DynamoDB data changes continuously; point-in-time snapshots become stale immediately
- AWS DynamoDB point-in-time recovery (PITR) is enabled on all production tables
- Full exports would require Kinesis streams or batch scan operations (