Preventing S3 Deployment Regressions: A Case Study in Stale Local State and the Hard Rules That Fix It
Last week, a deployment session accidentally overwrote three working features on queenofsandiego.com by pushing a stale local index.html over a newer production version in S3. The hero crossfade animation (JADA → BOOK NOW), the Stripe embedded checkout flow, and the deliberate removal of a "For Ranch & Coast readers..." headline all vanished. This post documents what went wrong, why it happened, and the hard rules we've now encoded to prevent it.
The Incident: What Happened
The session deployed to both S3 staging and S3 production in a single command, violating the staging-first principle. It used a local copy of index.html that was several commits behind what was already live in the production S3 bucket. Because aws s3 cp --recursive copies unconditionally and does not check which side is newer, the older local file overwrote the newer remote one, erasing three independent features.
Root cause: The agent ignored its own prior session summary, which warned that local files might be stale relative to S3. No diff was performed before the push. No snapshot of production was taken beforehand. The staging-only rule exists in the codebase but was not checked before executing the deploy.
Technical Details: The Deployment Flow
Our deployment pipeline for queenofsandiego.com follows this sequence:
1. Local dev: Edit files in /Users/cb/Documents/repos/sites/queenofsandiego.com/
2. Staging push: aws s3 cp --recursive ./path-to-built-files s3://staging.queenofsandiego.com/ --profile qos
3. CloudFront invalidation (staging): Invalidate staging-dist-id after push
4. Human review: Visit staging.queenofsandiego.com, verify no regressions
5. Production push: Only after staging sign-off, aws s3 cp --recursive ./path-to-built-files s3://queenofsandiego.com/ --profile qos
6. CloudFront invalidation (prod): Invalidate production distribution ID
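The incident broke this sequence in two places. First, it skipped what amounts to step 1.5, a pre-deployment diff and snapshot taken between the local edits and the staging push. Second, it skipped step 4: staging and production were pushed in the same operation, leaving no opportunity for human review in between.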
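The staging-first sequence above can be enforced mechanically rather than by convention. The following is a minimal sketch, not the site's actual tooling: deploy_one_target is a hypothetical helper, and BUILD_DIR, STAGING_APPROVED, STAGING_DIST_ID, and PROD_DIST_ID are assumed variable names. It accepts exactly one target per call and refuses a production push unless staging has been explicitly signed off.

```shell
# Sketch of a one-target-per-run deploy wrapper (hypothetical helper, not the site's tooling).
# BUILD_DIR, STAGING_APPROVED, STAGING_DIST_ID and PROD_DIST_ID are assumed variable names.
deploy_one_target() {
  local bucket dist_id
  if [ "$#" -ne 1 ]; then
    echo "usage: deploy_one_target staging|prod (exactly one target)" >&2
    return 2
  fi
  case "$1" in
    staging)
      bucket="staging.queenofsandiego.com"
      dist_id="${STAGING_DIST_ID:-}"
      ;;
    prod)
      # Production is only reachable after an explicit staging sign-off.
      if [ "${STAGING_APPROVED:-}" != "yes" ]; then
        echo "refusing prod push: staging has not been signed off" >&2
        return 1
      fi
      bucket="queenofsandiego.com"
      dist_id="${PROD_DIST_ID:-}"
      ;;
    *)
      echo "unknown target: $1" >&2
      return 2
      ;;
  esac
  if [ -z "$dist_id" ]; then
    echo "missing CloudFront distribution ID for $1" >&2
    return 2
  fi
  # Push, then invalidate only if the push succeeded.
  aws s3 cp --recursive "${BUILD_DIR:-./path-to-built-files}" "s3://$bucket/" --profile qos &&
    aws cloudfront create-invalidation --distribution-id "$dist_id" --paths "/*" --profile qos
}
```

A session would call deploy_one_target staging first, and set STAGING_APPROVED=yes for a later deploy_one_target prod call only after the human review in step 4. Because the function takes a single argument, a combined staging-and-production push cannot be expressed in one call.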
Infrastructure: S3, CloudFront, and State
Our infrastructure uses:
- S3 buckets: queenofsandiego.com (production), staging.queenofsandiego.com (staging). Both are private; traffic flows through CloudFront only.
- CloudFront distributions: One for each bucket. The cache TTL is 300 seconds for HTML, longer for assets.
- Route 53: DNS records point queenofsandiego.com and staging.queenofsandiego.com to their respective CloudFront distribution domain names. The apex domain has to use an alias record for this, since DNS does not allow a CNAME at the zone apex.
- No S3 versioning enabled. This is the critical gap: we overwrite objects in place, so old versions are unrecoverable without manual backups.
The stale-local problem occurs because a developer may pull the repo, edit a file locally, then weeks pass before deploying. Meanwhile, another session (or manual push) has updated S3 directly. A naive s3 cp --recursive then clobbers the newer remote file with the older local one.
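The stale-local condition can be detected before any push by pulling the live object and diffing it against the local copy. The sketch below assumes the bucket and profile used elsewhere in this post; check_stale is a hypothetical helper name, and the .prod suffix simply mirrors the naming in rule D1.

```shell
# Sketch: pull the live object and diff it against the local copy before editing or pushing.
# check_stale is a hypothetical helper; bucket and profile are the ones used in this post.
check_stale() {
  local file="${1:-index.html}"
  local prod_copy="$file.prod"
  # Download what is live right now; a failed download must block the deploy.
  aws s3 cp "s3://queenofsandiego.com/$file" "$prod_copy" --profile qos >&2 || return 2
  if diff -u "$prod_copy" "$file"; then
    echo "local $file matches production"
  else
    echo "local $file differs from production: review the diff before deploying" >&2
    return 1
  fi
}
```

A non-zero exit status does not say which side is newer, only that the two differ. Deciding which version wins remains a human decision.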
The Hard Rules: Preventing Regressions
We've now encoded eight hard rules into /Users/cb/Documents/repos/sites/queenofsandiego.com/CLAUDE.md and a condensed summary into the top-level CLAUDE.md for cross-site awareness:
- D1 (Pull and diff before edit): Before modifying any file destined for S3, run aws s3 cp s3://queenofsandiego.com/index.html ./index.html.prod --profile qos and diff -u index.html.prod index.html locally. Document the result in the session.
- D2 (Staging only, single target): Never deploy to both staging and production in one command. Always stage first, always in isolation: aws s3 cp ./index.html s3://staging.queenofsandiego.com/index.html --profile qos
- D3 (One logical change per deploy): If editing the hero fade, deploy only the hero fade. Don't batch unrelated fixes. This isolates regression scope.
- D4 (Obey prior session warnings): If a prior session summary says "local files may be stale," treat it as blocking. Re-pull and re-verify before proceeding.
- D5 (Snapshot production before overwriting): Run aws s3 cp s3://queenofsandiego.com/index.html ./backups/index.html.$(date +%s) --profile qos before any cp in the reverse direction.
- D6 (Six-line proof block): Before executing any deployment, print a block showing: old hash, new hash, S3 bucket name, file path, timestamp, and staging/prod target. Require explicit human confirmation in chat.
- D7 (Feature-token registry): Maintain a registry (grep-able in code comments) of major features and their unique CSS class or ID. Before prod push, grep S3-current against these tokens to confirm nothing vanished. Example: /* FEATURE_TOKEN: jada-hero-crossfade-1 */
- D8 (Escalate to CB if S3 is ahead): If S3 has a newer version than local, pause and ask CB whether to pull-and-rebase or proceed with a merge strategy. Never overwrite without decision.
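The six-line proof block from D6 is simple enough to script. The sketch below is one possible shape, not an established tool: file_hash, proof_block, and confirmed are hypothetical helpers, SHA-256 is an assumed choice of hash, and reading a typed word from standard input stands in for the chat confirmation the rule calls for.

```shell
# Sketch of the D6 proof block. file_hash, proof_block and confirmed are hypothetical helpers.
file_hash() {
  # sha256sum on Linux, shasum on macOS; both print "<hash>  <file>".
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

proof_block() {
  # Args: <prod-copy> <local-file> <bucket> <key> <staging|prod>
  echo "old hash:  $(file_hash "$1")"
  echo "new hash:  $(file_hash "$2")"
  echo "bucket:    $3"
  echo "path:      $4"
  echo "timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "target:    $5"
}

confirmed() {
  # Stand-in for explicit human confirmation: only the exact word "deploy" passes.
  local answer
  printf 'Type "deploy" to proceed: ' >&2
  read -r answer || return 1
  [ "$answer" = "deploy" ]
}
```

A deploy step would print proof_block index.html.prod index.html staging.queenofsandiego.com index.html staging and then proceed only if confirmed returns success. Identical old and new hashes are themselves a useful signal that the push would change nothing.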
Key Decisions and Rationale
Why not enable S3 versioning? Cost and complexity. With versioning, every overwrite creates a new object version; over weeks, the bill grows. Our snapshot approach (D5) gives us recovery without ongoing cost, provided we catch the regression quickly.
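Since the snapshot is the only restore point, it is worth confirming that it actually landed before anything is overwritten. A minimal sketch of D5 as a guard follows, assuming the backups directory and epoch-timestamp suffix from the rule; snapshot_prod is a hypothetical helper name.

```shell
# Sketch of rule D5 as a guard. snapshot_prod is a hypothetical helper name.
snapshot_prod() {
  local file="${1:-index.html}"
  local dest
  dest="./backups/$file.$(date +%s)"
  mkdir -p "$(dirname "$dest")" || return 1
  # Send the CLI's progress output to stderr so stdout carries only the snapshot path.
  aws s3 cp "s3://queenofsandiego.com/$file" "$dest" --profile qos >&2 || return 1
  # An empty or missing file is not a usable restore point.
  if [ ! -s "$dest" ]; then
    echo "snapshot $dest is missing or empty" >&2
    return 1
  fi
  echo "$dest"
}
```

The function prints the snapshot path on success, so a session can record it alongside the deploy and refuse to continue when the call fails.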
Why require staging review before prod? Staging is cheap to verify and is the only environment where a regression is invisible to users. Catching regressions here prevents customer-facing outages.
Why feature tokens? Visual inspection of staging works, but it is fallible, especially for animation or interaction features. A grep of production HTML against a known token set is deterministic and can be scripted.
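As a rough illustration of how the token check could be scripted, the sketch below reads a registry file with one token per line and reports every token missing from a given HTML file. The registry file format and the check_feature_tokens name are assumptions made for this example; only the FEATURE_TOKEN comment convention comes from rule D7.

```shell
# Sketch of the D7 check. check_feature_tokens and the one-token-per-line registry file
# are assumptions; a registry line might read: FEATURE_TOKEN: jada-hero-crossfade-1
check_feature_tokens() {
  local tokens_file="$1" html_file="$2" missing=0 token
  # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
  while IFS= read -r token || [ -n "$token" ]; do
    if [ -z "$token" ]; then
      continue
    fi
    # -F matches the token as a fixed string, not as a regular expression.
    if ! grep -qF -e "$token" "$html_file"; then
      echo "MISSING: $token" >&2
      missing=$((missing + 1))
    fi
  done < "$tokens_file"
  [ "$missing" -eq 0 ]
}
```

Run against the file about to be pushed, a non-zero exit status means at least one registered feature is absent from it, which is the signature of the stale-overwrite incident described above.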
Why escalate to CB instead of auto-merging? Merging stale local and new remote requires understanding intent. CB owns the decision of which version is canonical; the agent should not guess.
What's Next
These rules are now loaded automatically in every queenofsandiego.com session via the CLAUDE.md file. We're implementing a pre-deployment checklist as a markdown table in the session context so agents can cross-check