Building a Comprehensive Infrastructure Snapshot: Lessons from a Multi-Service Rollback Recovery
When unexpected changes ripple through a production environment affecting three interconnected sites (queenofsandiego.com, sailjada.com, and salejada.com), the ability to quickly snapshot and restore becomes critical infrastructure insurance. This post details the technical approach taken to create a v1.0 snapshot encompassing 45 S3 buckets, 66 CloudFront distributions, 21 Lambda functions, four Google Apps Script projects, and local tooling across the JADA ecosystem.
What Was Done
A complete infrastructure snapshot was created to preserve state across multiple service layers. This wasn't a single backup—it was a layered capture strategy targeting different failure domains:
- Lightsail instance snapshot: Full system state of the primary compute instance
- S3 bucket synchronization: 45 distinct buckets totaling content across three sites and supporting infrastructure
- Lambda function exports: Code, environment variables, configuration, and layer information for 21 functions
- Infrastructure-as-Code exports: CloudFront distributions, Route53 DNS configurations, DynamoDB table schemas, API Gateway configurations
- Google Apps Script projects: Four GAS projects supporting JADA workflows (main JADA, Rady Shell main, Rady Shell legacy, EYD)
- Local development tooling: Python deployment scripts (update_dashboard.py, release.py), configuration files, and documentation
Technical Architecture and Parallel Strategy
Given the number of resources, capturing them one after another would take hours. Instead, four background agents were launched in parallel, each handling a distinct concern:
Agent 1 (S3 Sync):
- Task: aws s3 sync for all 45 buckets
- Target: /snapshot/v1.0/s3-buckets/
- Status tracking: Batch A and Batch B parallel execution
Agent 2 (Lambda Export):
- Task: aws lambda get-function for all 21 functions
- Capture: Function code, configuration, environment variables, layers
- Target: /snapshot/v1.0/lambda-functions/[function-name]/
Agent 3 (AWS Configuration Export):
- Task: CloudFront distributions (66 found)
- Task: Route53 hosted zones (11 zones)
- Task: DynamoDB table schemas (14 tables scanned)
- Task: API Gateway, SES, ACM certificate inventory
- Target: /snapshot/v1.0/aws-configs/
Agent 4 (Local Files and GAS):
- Task: Copy site repositories and development files
- Task: Clasp pull from all four Google Apps Script projects
- Task: Archive LaunchAgents, secrets manifest, documentation
- Target: /snapshot/v1.0/local-files/ and /snapshot/v1.0/gas-projects/
This parallel approach reduced total snapshot time from an estimated 4+ hours to approximately 45 minutes, with the Lightsail instance snapshot (AWS-managed, ~15 minutes) forming the longest single task.
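The fan-out described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling actually used for the snapshot: `run_agents` and the placeholder task names are assumptions, and real agents would shell out to `aws` or `clasp` rather than return a string.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def run_agents(agents: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run every agent concurrently and collect one status string per agent."""
    results: Dict[str, str] = {}
    # One worker per agent; max() guards against an empty agent map.
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        futures = {pool.submit(task): name for name, task in agents.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failure in one agent is recorded without hiding the others.
                results[name] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    # Placeholder tasks standing in for the four agents (hypothetical names).
    demo = {
        "s3-sync": lambda: "ok",
        "lambda-export": lambda: "ok",
        "aws-configs": lambda: "ok",
        "local-and-gas": lambda: "ok",
    }
    print(run_agents(demo))
```

Threads are sufficient here because each agent spends its time waiting on network and subprocess I/O, so total wall-clock time approaches that of the slowest agent rather than the sum of all four.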
S3 Bucket Inventory and Organization
The 45 JADA-related buckets were organized into logical categories within the snapshot:
- Production site buckets: Content distribution for queenofsandiego.com, sailjada.com, salejada.com
- Staging buckets: Dedicated staging copies with a _staging suffix in production buckets or separate CloudFront origins
- Media and asset buckets: Product images, user uploads, archive materials
- CloudFront origin buckets: Cache sources for CDN distributions
- Lambda function source buckets: Deployment packages and layer storage
- Operational buckets: Logs, monitoring data, temporary processing
Each sync used --exclude filters to skip bulky log prefixes, earlier snapshots, and stray .git directories, which reduced the volume of data transferred:
aws s3 sync s3://bucket-name /snapshot/v1.0/s3-buckets/bucket-name \
--exclude "logs/*" \
--exclude "previous-snapshots/*" \
--exclude ".git/*"
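Repeating that command for every bucket is easier to audit when the argument list is generated in one place. The sketch below assumes the same three exclude patterns and target layout shown above. The helper names and the example bucket name are hypothetical, and `--dryrun` is the AWS CLI flag that prints the planned transfers without copying anything.

```python
import subprocess
from typing import Iterable, List

EXCLUDES = ["logs/*", "previous-snapshots/*", ".git/*"]
SNAPSHOT_ROOT = "/snapshot/v1.0/s3-buckets"


def sync_command(bucket: str, dry_run: bool = False) -> List[str]:
    """Build the argument list for syncing one bucket into the snapshot tree."""
    cmd = ["aws", "s3", "sync", f"s3://{bucket}", f"{SNAPSHOT_ROOT}/{bucket}"]
    for pattern in EXCLUDES:
        cmd += ["--exclude", pattern]
    if dry_run:
        cmd.append("--dryrun")  # list what would be copied, transfer nothing
    return cmd


def sync_all(buckets: Iterable[str], dry_run: bool = True) -> None:
    """Sync each bucket in turn; raises CalledProcessError on the first failure."""
    for bucket in buckets:
        subprocess.run(sync_command(bucket, dry_run), check=True)


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not one of the real bucket names.
    print(" ".join(sync_command("example-bucket", dry_run=True)))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with the glob patterns.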
Google Apps Script Project Preservation
Four distinct GAS projects were pulled using Clasp and archived:
- Main JADA GAS: Core workflow automation and data processing
- Rady Shell (Current): Active version of Rady school shell scripts
- Rady Shell (Legacy): Previous implementation for historical reference and potential rollback
- EYD GAS: Specialized project for EYD workflows
Each project was captured with:
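cd /snapshot/v1.0/gas-projects/[project-name]/
clasp pull
# clasp reads the script ID ([project-id]) from .clasp.json in the current directory
# Captures: appsscript.json (the manifest) and all script files (.gs, saved locally as .js by default)
# Stored in: /snapshot/v1.0/gas-projects/[project-name]/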
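Pulling four projects by hand invites skipping one, so a small driver can help. The sketch below assumes each project directory under the snapshot already contains a `.clasp.json` (which is where `clasp pull` looks up the script ID). The directory names are illustrative placeholders, not the real project folder names.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence, Tuple

GAS_ROOT = Path("/snapshot/v1.0/gas-projects")
# Hypothetical directory names for the four projects described above.
PROJECTS = ["jada-main", "rady-shell-current", "rady-shell-legacy", "eyd"]

Plan = List[Tuple[List[str], Path]]


def pull_plan(projects: Sequence[str] = PROJECTS, root: Path = GAS_ROOT) -> Plan:
    """Pair the clasp command with the working directory it should run in."""
    return [(["clasp", "pull"], root / name) for name in projects]


def missing_clasp_config(plan: Plan) -> List[Path]:
    """Return the directories that lack the .clasp.json file clasp needs."""
    return [cwd for _, cwd in plan if not (cwd / ".clasp.json").is_file()]


def run_plan(plan: Plan) -> None:
    """Run clasp pull in each directory, stopping at the first failure."""
    for cmd, cwd in plan:
        subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    plan = pull_plan()
    unconfigured = missing_clasp_config(plan)
    if unconfigured:
        print("Missing .clasp.json in:", *unconfigured, sep="\n  ")
    else:
        run_plan(plan)
```

Checking for `.clasp.json` first means a misconfigured project is reported up front instead of failing partway through the run.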
Key Infrastructure Details Captured
CloudFront Distributions: 66 distributions across multiple origins. A review of the exported configurations confirmed that production and staging distributions map to the correct origins, and the cache invalidation patterns used in typical deployment workflows were documented.
Route53 Configuration: 11 hosted zones with DNS records pointing to CloudFront distributions, S3 website endpoints, and API Gateway custom domains. Zone file exports enabled off-site DNS restoration.
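To illustrate how an exported record set could feed a restore, the sketch below converts the JSON shape returned by `aws route53 list-resource-record-sets` into a change batch of the kind `aws route53 change-resource-record-sets` accepts. It assumes that is the export format in use. The function name and the choice to skip the apex NS and SOA records (which Route 53 creates for each hosted zone) are illustrative decisions, and a very large zone might need the batch split into chunks.

```python
import json
from typing import Any, Dict, List


def restore_change_batch(export: Dict[str, Any], zone_name: str) -> Dict[str, Any]:
    """Turn an exported record listing into an UPSERT change batch."""
    apex = zone_name.rstrip(".").lower() + "."
    changes: List[Dict[str, Any]] = []
    for record in export.get("ResourceRecordSets", []):
        at_apex = record["Name"].lower() == apex
        # The zone's own NS and SOA records already exist in a new hosted zone.
        if at_apex and record["Type"] in ("NS", "SOA"):
            continue
        changes.append({"Action": "UPSERT", "ResourceRecordSet": record})
    return {"Comment": f"Restore {apex} from snapshot", "Changes": changes}


if __name__ == "__main__":
    # Tiny made-up export; example.com is a placeholder zone.
    sample = {
        "ResourceRecordSets": [
            {"Name": "example.com.", "Type": "NS", "TTL": 172800,
             "ResourceRecords": [{"Value": "ns-1.example.net."}]},
            {"Name": "www.example.com.", "Type": "CNAME", "TTL": 300,
             "ResourceRecords": [{"Value": "d111111abcdef8.cloudfront.net"}]},
        ]
    }
    print(json.dumps(restore_change_batch(sample, "example.com"), indent=2))
```

UPSERT is used so the batch can be replayed against a zone that still holds some of the records without failing on duplicates.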
Lambda Environment Variables: Captured encrypted environment configuration without exposing secrets. A manifest was created documenting which Lambda functions depend on specific environment variables, enabling downstream recovery procedures.
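One way to build such a dependency manifest without copying any secret values is to record only the variable names. The sketch below assumes the input is a list of responses in the shape returned by `aws lambda get-function-configuration`. The function names and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def env_manifest(configs: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each function name to the sorted names of its environment variables."""
    manifest: Dict[str, List[str]] = {}
    for config in configs:
        # Functions without environment variables may omit the Environment key.
        variables = (config.get("Environment") or {}).get("Variables") or {}
        manifest[config["FunctionName"]] = sorted(variables)  # names only, no values
    return manifest


def functions_using(manifest: Dict[str, List[str]], variable: str) -> List[str]:
    """List the functions that depend on a given variable name."""
    return sorted(name for name, keys in manifest.items() if variable in keys)


if __name__ == "__main__":
    # Made-up configurations for illustration only.
    sample = [
        {"FunctionName": "example-fn-a",
         "Environment": {"Variables": {"API_KEY": "redacted", "STAGE": "prod"}}},
        {"FunctionName": "example-fn-b"},
    ]
    manifest = env_manifest(sample)
    print(manifest)
    print(functions_using(manifest, "API_KEY"))
```

Because the manifest holds names only, it can sit alongside the rest of the snapshot documentation while the values themselves stay in whatever secret store is in use.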
DynamoDB Tables: 14 tables identified, with schemas exported. This enables table recreation if needed, with the understanding that the data itself would have to be restored separately through AWS Backup or point-in-time recovery.
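An exported schema is not directly reusable, because `describe-table` output mixes creation parameters with read-only status fields. The sketch below shows one way to reduce it to the inputs `create-table` expects, assuming the export is raw `aws dynamodb describe-table` JSON. The helper names are hypothetical, and settings such as streams, encryption, and TTL are deliberately left out.

```python
from typing import Any, Dict


def _throughput(source: Dict[str, Any]) -> Dict[str, int]:
    """Keep only the two writable capacity fields."""
    return {
        "ReadCapacityUnits": source["ReadCapacityUnits"],
        "WriteCapacityUnits": source["WriteCapacityUnits"],
    }


def _index(index: Dict[str, Any], with_throughput: bool) -> Dict[str, Any]:
    """Strip status fields (size, ARN, item count) from a secondary index."""
    params: Dict[str, Any] = {
        "IndexName": index["IndexName"],
        "KeySchema": index["KeySchema"],
        "Projection": index["Projection"],
    }
    if with_throughput:
        params["ProvisionedThroughput"] = _throughput(index["ProvisionedThroughput"])
    return params


def create_table_params(description: Dict[str, Any]) -> Dict[str, Any]:
    """Convert describe-table output into create-table input parameters."""
    table = description["Table"]
    # BillingModeSummary can be absent on tables that were always provisioned.
    billing = (table.get("BillingModeSummary") or {}).get("BillingMode", "PROVISIONED")
    provisioned = billing == "PROVISIONED"
    params: Dict[str, Any] = {
        "TableName": table["TableName"],
        "AttributeDefinitions": table["AttributeDefinitions"],
        "KeySchema": table["KeySchema"],
        "BillingMode": billing,
    }
    if provisioned:
        params["ProvisionedThroughput"] = _throughput(table["ProvisionedThroughput"])
    gsis = table.get("GlobalSecondaryIndexes") or []
    if gsis:
        params["GlobalSecondaryIndexes"] = [_index(i, provisioned) for i in gsis]
    lsis = table.get("LocalSecondaryIndexes") or []
    if lsis:
        params["LocalSecondaryIndexes"] = [_index(i, False) for i in lsis]
    return params
```

The resulting dictionary can be written out as JSON and passed to `aws dynamodb create-table --cli-input-json`. Secondary indexes are carried over because their key attributes appear in `AttributeDefinitions`, and table creation is rejected when defined attributes are not used by any key schema.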
Staging Synchronization Verification
During snapshot creation, a critical verification step ensured staging buckets were synchronized with production. This included:
- Comparing file counts between production and staging buckets for queenofsandiego.com
- Validating dedicated staging bucket contents (e.g., bobdylan bucket staging paths)
- Checking CloudFront staging origin configurations
- Invalidating staging CloudFront caches to ensure fresh content delivery
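The file-count comparison in the checklist above can be made slightly stricter by comparing keys and sizes rather than totals alone. The sketch below assumes both listings come from `aws s3api list-objects-v2` responses. The function names and the idea of stripping a staging prefix are illustrative, not a description of the check that was actually run.

```python
from typing import Any, Dict, Iterable, List


def listing_from_pages(pages: Iterable[Dict[str, Any]], prefix: str = "") -> Dict[str, int]:
    """Flatten list-objects-v2 responses into {key: size}.

    Keys that do not start with `prefix` are dropped; the prefix itself is
    stripped so production and staging keys become directly comparable.
    """
    listing: Dict[str, int] = {}
    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.startswith(prefix):
                listing[key[len(prefix):]] = obj["Size"]
    return listing


def compare_listings(production: Dict[str, int], staging: Dict[str, int]) -> Dict[str, List[str]]:
    """Report keys missing from staging, extra in staging, or differing in size."""
    prod_keys, stage_keys = set(production), set(staging)
    shared = prod_keys & stage_keys
    return {
        "missing_in_staging": sorted(prod_keys - stage_keys),
        "extra_in_staging": sorted(stage_keys - prod_keys),
        "size_mismatch": sorted(k for k in shared if production[k] != staging[k]),
    }


if __name__ == "__main__":
    # Made-up listings for illustration.
    prod = listing_from_pages([{"Contents": [{"Key": "index.html", "Size": 120},
                                             {"Key": "app.js", "Size": 900}]}])
    stage = listing_from_pages([{"Contents": [{"Key": "_staging/index.html", "Size": 118}]}],
                               prefix="_staging/")
    print(compare_listings(prod, stage))
```

Matching counts can hide a stale file that was never re-uploaded, whereas a size mismatch on a shared key surfaces it. Comparing ETags would be stricter still, at the cost of handling multipart uploads.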
Deployment Scripts and Tooling
Critical deployment tools were included in the snapshot:
- /Users/cb/Documents/repos/tools/update_dashboard.py: Dashboard synchronization and update logic
- /Users/cb/Documents/repos/tools/release.py: Release automation script handling version management and deployment
- Memory documents: Workflow tracking and decision logs
What's Next: Recovery Procedures
With the v1.0 snapshot complete, the next phase is to document recovery procedures for each service layer. This includes:
- Step-by-step Lambda function redeployment from snapshot code
- S3 bucket restoration and cache invalidation workflows
- Route53 DNS restoration procedures
- GAS project redeployment and version rollback techniques