Rebuilding a 250-URL Legacy Site: Parity Pages, Wayback Archaeology, and S3 Bulk Deploy
When an Instagram link-in-bio pointed to /charter-options/ but returned a 404, we discovered a larger problem: sailjada.com had migrated from WordPress to a static S3 bucket, leaving 236 legacy URLs orphaned. This post walks through the technical approach we used to audit, categorize, and rebuild those URLs with parity pages—all without destroying existing content or breaking search rankings.
The Problem: A 250-URL Graveyard
The IG link was symptomatic. A full audit revealed:
- 250 substantive legacy URLs confirmed via Wayback CDX API
- 236 returning 404 on the current S3 bucket
- 14 still working (existing pages like /charter-types/)
- URLs spread across six categories: legacy charter permalinks, SEO landing pages, blog posts, event/calendar plugins, dated archives, and whale-watching content
Rather than let organic search traffic bleed away, we needed a systematic way to rebuild parity pages for each category.
Technical Approach: Wayback Archaeology + Batch Generation
Step 1: Audit via Wayback CDX API
We pulled the complete URL history from the Internet Archive CDX API:
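# Row 0 of the JSON output is the header (urlkey, timestamp, original, ...).
# Skip it and keep column 2, the originally captured URL.
curl -s 'https://web.archive.org/cdx/search/cdx?url=sailjada.com/*&matchType=prefix&output=json' \
| jq -r '.[1:][] | .[2]' \
| sort -u > legacy_urls.txt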
This gave us every URL the Wayback Machine had indexed. We then compared against current S3 bucket contents and spot-checked HTTP status codes on the live domain. The result was a JSON audit file at /tmp/sailjada-legacy-audit.json with snapshots, metadata, and categorization for each URL.
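The comparison step can be sketched roughly as follows. This is a minimal illustration, assuming the CDX response was saved as JSON and that live status codes were collected separately. The function names and report layout are hypothetical and do not reflect the actual structure of the audit file.

```python
import json
from urllib.parse import urlsplit


def legacy_paths(cdx_json: str) -> list[str]:
    """Return the unique URL paths found in a CDX `output=json` response."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    original = header.index("original")  # column holding the captured URL
    paths = set()
    for row in data:
        # http/https and port variants of one page collapse to a single path.
        paths.add(urlsplit(row[original]).path or "/")
    return sorted(paths)


def audit(paths: list[str], live_status: dict[str, int]) -> dict[str, list[str]]:
    """Split legacy paths into working and orphaned, given observed status codes."""
    report: dict[str, list[str]] = {"ok": [], "missing": []}
    for path in paths:
        # Assumption: a path that was never checked is treated as missing.
        bucket = "ok" if live_status.get(path, 404) == 200 else "missing"
        report[bucket].append(path)
    return report
```

Keying on the path rather than the full URL keeps scheme and port variants of the same page from being counted twice.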
Step 2: Content Extraction and Staging
For each legacy URL, we fetched the most recent Wayback snapshot, extracted the text content, and stored it locally:
python3 sailjada-legacy-audit.py \
--urls legacy_urls.txt \
--output-dir /tmp/legacy-content/ \
--wayback-fetch
Key decisions here:
- We prioritized text content only—images, JavaScript, and inline styles were stripped to avoid loading old CDN assets.
- We stored snapshots locally first, then validated them, rather than rebuilding directly to S3.
- We created a provenance log for each page showing the Wayback timestamp, ensuring we could trace the source of every restored URL.
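A minimal sketch of the extraction step, using only the Python standard library, might look like this. It is not the code from sailjada-legacy-audit.py; the class name, function names, and record fields are illustrative assumptions. The one Wayback-specific detail is the `id_` suffix on the timestamp, which requests the raw capture without the archive's injected toolbar.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def snapshot_url(timestamp: str, original: str) -> str:
    # `id_` asks the Wayback Machine for the unmodified capture.
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def extract(html: str, timestamp: str, original: str) -> dict:
    """Return extracted text plus a provenance record for one snapshot."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "source": snapshot_url(timestamp, original),
        "wayback_timestamp": timestamp,
        "text": "\n".join(parser.chunks),
    }
```

Images and inline styles drop out naturally here because they carry no text nodes; only script and style bodies need explicit skipping.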
Step 3: Parity Page Generation by Category
We wrote category-specific generators that rebuild pages in the current site design while preserving the legacy URL structure and canonical metadata:
- Tier 1 (7 SEO landing pages): build_parity_landings.py. Full HTML reconstruction with the original title, H1, and meta tags.
- Tier 2 (22 whale-watching stubs): build_parity_whale.py. Minimal parity stubs that canonicalize to /whale-watching/ while preserving the legacy URL for organic search.
- Tier 3 (charter permalink stubs): build_parity_stubs.py. Lightweight redirects with canonical tags pointing to /charter-types/ or other relevant live pages.
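To make the stub idea concrete, here is a rough sketch of what a generator for the lighter tiers could emit. It is an illustration only: `render_stub` is a hypothetical helper, the markup is deliberately bare, and the real generators also apply the current site design, which is omitted here.

```python
from html import escape


def render_stub(title: str, h1: str, canonical: str, body_text: str = "") -> str:
    """Render a minimal parity page whose canonical tag points at `canonical`."""
    paragraphs = "\n".join(
        f"    <p>{escape(line)}</p>"
        for line in body_text.split("\n")
        if line.strip()  # skip blank lines in the recovered text
    )
    return f"""<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>{escape(title)}</title>
  <link rel="canonical" href="{escape(canonical, quote=True)}">
</head>
<body>
  <main>
    <h1>{escape(h1)}</h1>
{paragraphs}
  </main>
</body>
</html>
"""
```

Escaping every recovered string matters because the text comes from archived pages and may contain characters such as `&` or `<` that would otherwise break the markup.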
Total output: 48 parity pages deployed in parallel to staging, then promoted to production in a single batch.
Infrastructure: S3 + CloudFront + Invalidation
File Organization
All parity pages were written to existing S3 directory structures:
s3://sailjada.com/charter-options/index.html # Single IG link fix
s3://sailjada.com/whale-watching/[22 legacy stubs] # Whale-watching hub
s3://sailjada.com/catamaran-charter-*/index.html # SEO landing pages
s3://sailjada.com/charters/[legacy permalink].html # Charter stubs
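The layout above implies a simple rule for turning a legacy URL path into an object key: directory-style paths get an index.html, while paths that already name a file are kept as they are. A small sketch of that rule, with `s3_key_for` as a hypothetical helper name:

```python
def s3_key_for(url_path: str) -> str:
    """Map a legacy URL path to the S3 object key that would serve it."""
    path = url_path.split("?", 1)[0].lstrip("/")  # drop query string and leading slash
    if path == "" or path.endswith("/"):
        return path + "index.html"
    last_segment = path.rsplit("/", 1)[-1]
    if "." in last_segment:
        return path  # already names a file, e.g. something.html
    return path + "/index.html"
```

For example, `s3_key_for("/charter-options/")` gives `charter-options/index.html`. Whether a directory-style URL actually resolves to that key depends on how index documents are configured for the bucket and distribution.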
Staging Verification
Before promoting to production, we deployed all 48 pages to staging.sailjada.com and verified:
- HTML rendering (title, H1, meta tags present and correct)
- Modal booking widget functionality (dark navy + gold theme)
- Canonical tag correctness (self-referential for parity pages; pointing to live pages for redirects)
- Basic auth credentials worked (staging requires authentication)
The staging domain is fronted by a CloudFront distribution with a Lambda@Edge function that enforces HTTP Basic authentication. We verified this was in place before testing.
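The markup checks in that list lend themselves to automation. The sketch below shows one way to verify title, H1, and canonical tag for a fetched page. The class and function names are illustrative assumptions, and fetching the page (including the Basic auth credentials) is left out.

```python
from html.parser import HTMLParser


class PageFacts(HTMLParser):
    """Record the title text, first-level heading text, and canonical href."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.h1 = ""
        self.canonical = None
        self._current = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
        elif tag == "link" and (attr_map.get("rel") or "").lower() == "canonical":
            self.canonical = attr_map.get("href")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1 += data


def check_page(html: str, expected_canonical: str) -> list[str]:
    """Return a list of problems; an empty list means the page passed."""
    facts = PageFacts()
    facts.feed(html)
    facts.close()
    problems = []
    if not facts.title.strip():
        problems.append("missing <title>")
    if not facts.h1.strip():
        problems.append("missing <h1>")
    if facts.canonical != expected_canonical:
        problems.append(
            f"canonical is {facts.canonical!r}, expected {expected_canonical!r}"
        )
    return problems
```

For a parity page the expected canonical is the page's own URL; for a redirect stub it is the live page the stub points to.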
Production Deployment and Cache Invalidation
Once staging was verified, we promoted to production in a single batch:
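# --cache-control sets the Cache-Control response header.
# (--metadata would only store a user-defined x-amz-meta-* value.)
aws s3 cp /tmp/parity-out/ s3://sailjada.com/ \
--recursive \
--exclude "*" \
--include "*.html" \
--cache-control "public, max-age=3600"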
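For anyone scripting the same promotion in Python, a rough boto3 equivalent is sketched below. It assumes boto3 is installed and credentials are configured; `upload_plan` and `upload` are hypothetical names. The plan is built separately so it can be inspected before anything is written to the bucket.

```python
from pathlib import Path

CACHE_CONTROL = "public, max-age=3600"


def upload_plan(root: str) -> list[tuple[str, str, dict]]:
    """List (local_path, s3_key, extra_args) for every .html file under root."""
    base = Path(root)
    plan = []
    for path in sorted(base.rglob("*.html")):
        key = path.relative_to(base).as_posix()
        extra = {
            "CacheControl": CACHE_CONTROL,  # sent as the Cache-Control header
            "ContentType": "text/html; charset=utf-8",
        }
        plan.append((str(path), key, extra))
    return plan


def upload(root: str, bucket: str) -> None:
    """Upload every planned file; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the plan can be built without AWS access

    s3 = boto3.client("s3")
    for local_path, key, extra in upload_plan(root):
        s3.upload_file(local_path, bucket, key, ExtraArgs=extra)
```

Printing the result of `upload_plan("/tmp/parity-out/")` first gives a dry run of exactly which keys would be overwritten.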
Immediately after, we invalidated the CloudFront distribution to ensure edge caches didn't serve stale 404 responses:
# Set DISTRIBUTION_ID to the ID of the production CloudFront distribution.
aws cloudfront create-invalidation \
--distribution-id "$DISTRIBUTION_ID" \
--paths "/*"
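We used a full invalidation rather than path-specific ones because some legacy URLs had non-standard directory structures. CloudFront bills a wildcard path such as /* as a single invalidation path, so the real cost is performance: every cached object is evicted and must be refetched from S3. For a small static site, the safety margin is worth it.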
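If a targeted invalidation is ever preferable, the request body is easy to build programmatically. The sketch below assumes boto3; the helper names and the `parity-deploy-` caller reference prefix are made up for illustration.

```python
import time


def invalidation_batch(paths: list[str]) -> dict:
    """Build the InvalidationBatch structure for a set of URL paths."""
    items = sorted(set(paths))  # de-duplicate; each path must start with "/"
    return {
        "Paths": {"Quantity": len(items), "Items": items},
        # CallerReference must be unique per request; a timestamp is one option.
        "CallerReference": f"parity-deploy-{int(time.time())}",
    }


def invalidate(distribution_id: str, paths: list[str]) -> str:
    """Submit the invalidation and return its ID; requires boto3 and credentials."""
    import boto3  # imported lazily so the batch can be built without AWS access

    cloudfront = boto3.client("cloudfront")
    response = cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch=invalidation_batch(paths),
    )
    return response["Invalidation"]["Id"]
```

Passing `["/*"]` reproduces the full invalidation, while a list of individual paths limits eviction to the pages that changed.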
Handling a Subtle Deployment Issue
We encountered one tricky problem: files written via Claude's file write tool carried a macOS sandbox provenance attribute (com.apple.quarantine) that made them unreadable by the AWS CLI running in certain execution contexts.
Solution: We re-generated all 48 parity pages using a Bash script that invoked Python as a subprocess, writing output to /tmp/parity-out/. Bash-native heredocs avoided the sandbox tagging entirely. This is a pragmatic workaround for sandboxed environments—in a normal CI/CD pipeline, this wouldn't be necessary.
Search Console and Sitemap Updates
After deploying the 48 parity pages, we updated the production sitemap to include all new URLs:
python3 build_sitemap.py \
--s3-bucket sailjada.com \
--output sitemap.xml \
--diff-against-current
The new sitemap contained all 48 parity URLs plus the existing live pages. We deployed it to s3://sailjada.com/sitemap.xml and submitted it through the Search Console API, so no manual submission in the web UI was needed.
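The core of a sitemap builder like this is small. The following is a sketch under stated assumptions rather than the contents of build_sitemap.py: it takes a list of object keys (however they were listed from the bucket) and emits one entry per HTML page, mapping index.html keys back to their directory-style URLs.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, keys: list[str]) -> str:
    """Build sitemap XML from S3 object keys, skipping non-HTML objects."""
    root_url = base_url.rstrip("/")
    locs = set()
    for key in keys:
        if not key.endswith(".html"):
            continue
        if key == "index.html" or key.endswith("/index.html"):
            path = key[: -len("index.html")]  # serve as a directory URL
        else:
            path = key
        locs.add(f"{root_url}/{path}")

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in sorted(locs):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Sorting the URLs keeps the output stable between runs, which makes a diff against the currently deployed sitemap meaningful.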
What's Next: Remaining Tiers
We've closed Tier 1