Recovering 250 Lost Pages: Automated Legacy URL Audit and Parity Rebuild for sailjada.com
The Problem: Dead Links from a Migrated WordPress Site
SailJada's website migration from WordPress to a static S3+CloudFront architecture left hundreds of legacy URLs orphaned. An Instagram link-in-bio campaign pointed to /charter-options/—a URL that once existed but now returned 404. More critically, a full audit revealed 236 out of 250 legacy URLs were completely inaccessible, representing lost organic traffic, broken social links, and poor user experience for anyone arriving from old bookmarks or search results.
Rather than accept that traffic loss, we built an automated system to identify, categorize, and rebuild parity pages for every legitimate legacy URL, prioritizing by business value.
Technical Approach: Wayback Machine + S3 + Python Automation
Step 1: Comprehensive Legacy URL Audit
We used the Internet Archive CDX API to pull the complete history of indexed URLs from the sailjada.com domain:
# Query Wayback CDX for all captured URLs
# Row 0 of the JSON output is the field-name header, so it is skipped below
curl -s "https://web.archive.org/cdx/search/cdx?url=sailjada.com/*&output=json&collapse=urlkey&filter=statuscode:200&fl=original" \
| jq '[.[1:][] | .[0]] | unique' > legacy-urls.json
This returned ~500 captured snapshots. We then wrote /tmp/sailjada-legacy-audit.py to:
- Test each unique legacy URL path against the live production S3 bucket (s3://sailjada.com)
- Check HTTP status codes on the live domain
- Fetch and cache Wayback snapshots for content analysis
- Classify URLs by content type (blog posts, landing pages, product pages, calendar events, etc.)
- Store metadata (original title, Wayback snapshot date, content length) in JSON
Result: 250 substantive URLs identified as legitimate business content. 236 returned 404; only 14 remained accessible. Full audit data stored in /tmp/sailjada-legacy-audit.json.
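The audit steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of /tmp/sailjada-legacy-audit.py: the function names `parse_cdx_rows`, `live_status`, and `audit_paths` are hypothetical, and the sketch covers only the path extraction and live status check, not the Wayback snapshot caching or content classification.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def parse_cdx_rows(cdx_json):
    """Turn CDX JSON output (header row + data rows) into unique URL paths."""
    if not cdx_json:
        return []
    header, *rows = cdx_json
    col = header.index("original")  # column holding the captured URL
    paths = {urlsplit(row[col]).path or "/" for row in rows}
    return sorted(paths)


def live_status(path, base="https://sailjada.com"):
    """Return the HTTP status code for a path on the live site (HEAD request)."""
    req = urllib.request.Request(base + path, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 404 and other error statuses arrive as exceptions


def audit_paths(paths, fetch_status=live_status):
    """Split paths into accessible and dead based on their live status code."""
    report = {"accessible": [], "dead": []}
    for path in paths:
        status = fetch_status(path)
        bucket = "accessible" if status == 200 else "dead"
        report[bucket].append({"path": path, "status": status})
    return report
```

Passing `fetch_status` as a parameter is a design choice made for this sketch: it lets the classification logic be exercised with a stub instead of live requests.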
Step 2: Content Categorization and Priority Ranking
We bucketed the 236 dead URLs by type and business value:
| Category | Count | Strategy |
|---|---|---|
| SEO landing pages (e.g., /catamaran-charter-captain-san-diego/) | ~10 | Rebuild full parity pages with Wayback content + new design |
| Legacy charter product permalinks | ~20 | Thin parity stubs with canonical redirect to current product |
| WordPress blog posts (/blog/2010-2024/...) | 42 | Rebuild from Wayback snapshots or consolidate to hub |
| Event/calendar plugin URLs | 32 | Redirect to current event landing pages |
| Whale watching permalinks | 17 | Dedicated /whale-watching/ hub + parity stubs |
| Low-value archives (/2010/06/, taxonomy pages) | ~30 | Stub or skip; minimal business impact |
| Other (broken assets, misc) | ~25 | Case-by-case evaluation |
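A bucketing pass like the one in the table could be driven by simple path rules. The sketch below is an assumption about how such rules might look: the regular expressions and the `classify` function are hypothetical, since the post does not show the real matching logic. SEO landing pages have no obvious path pattern, so in practice they would probably need a hand-maintained list.

```python
import re

# Hypothetical path patterns; the first matching rule wins.
RULES = [
    (re.compile(r"^/\d{4}/(\d{2}/)?$|^/(category|tag)/"),
     "low-value archive", "stub or skip"),
    (re.compile(r"^/blog/"),
     "blog post", "rebuild from Wayback or consolidate to hub"),
    (re.compile(r"^/(events?|calendar)/"),
     "event/calendar", "redirect to current event landing page"),
    (re.compile(r"whale"),
     "whale watching", "hub + parity stub"),
    (re.compile(r"^/(product|charters?)/"),
     "charter product", "thin stub with canonical to current product"),
]


def classify(path):
    """Return (category, strategy) for a legacy URL path."""
    for pattern, category, strategy in RULES:
        if pattern.search(path):
            return category, strategy
    return "other", "case-by-case evaluation"
```

For example, `classify("/2010/06/")` falls into the low-value archive bucket, while a path that matches no rule is left for manual review.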
Infrastructure: S3, CloudFront, and Deployment Pipeline
S3 Bucket Structure
All static pages are versioned in s3://sailjada.com/ with the following directory layout:
s3://sailjada.com/
├── charter-options/index.html # NEW: Parity for /charter-types/ (IG link fix)
├── charter-types/index.html # Canonical product page (2026-05-25)
├── whale-watching/index.html # NEW: Hub for 17 legacy URLs
├── whale-watching/legacy-stubs/
│ ├── whale-watching-tours/index.html
│ ├── marine-life-san-diego/index.html
│ └── ... (20 more stubs)
├── parity-landings/
│ ├── catamaran-charter-captain-san-diego/index.html
│ ├── dinner-cruises/index.html
│ └── ... (7 more landing pages)
└── sitemap.xml # Updated with all new paths
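To illustrate how a thin parity stub might be generated for this layout, here is a minimal sketch. Both `s3_key_for` and `build_stub` are hypothetical helpers, and the stub markup (a canonical link plus a visible link) is an assumption about what a "thin parity stub" contains. The sketch only handles the simple case where the URL path maps directly to a key, as with charter-options/index.html; how stubs under prefixes such as whale-watching/legacy-stubs/ are routed is not described here.

```python
import html


def s3_key_for(path):
    """Map a URL path such as /charter-options/ to charter-options/index.html."""
    trimmed = path.strip("/")
    if not trimmed:
        return "index.html"
    last_segment = trimmed.rsplit("/", 1)[-1]
    if "." in last_segment:
        return trimmed  # already names a file, e.g. sitemap.xml
    return trimmed + "/index.html"


def build_stub(title, canonical_url):
    """Return minimal HTML that points search engines and visitors at the canonical page."""
    safe_title = html.escape(title)
    safe_url = html.escape(canonical_url, quote=True)
    return (
        "<!doctype html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>{safe_title}</title>\n"
        f'  <link rel="canonical" href="{safe_url}">\n'
        "</head>\n"
        "<body>\n"
        f'  <p>This page has moved: <a href="{safe_url}">{safe_title}</a></p>\n'
        "</body>\n"
        "</html>\n"
    )
```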
CloudFront Distribution and Invalidation
SailJada uses a single CloudFront distribution (ID: D5XXXXXX, redacted for this post) fronting the S3 bucket with:
- Origin: sailjada.com.s3.us-west-2.amazonaws.com
- Cache TTL: 3600 seconds for text/html, 31536000 for static assets
- Lambda@Edge function: Basic auth for staging environment
- Origin Shield: Enabled in us-west-2 to reduce origin load
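One way to express the two TTLs listed above is through `Cache-Control` headers set when objects are uploaded. This is an assumption: the TTLs could equally be configured in a CloudFront cache policy, and the post does not say which is used. The helper name `cache_headers_for` is hypothetical.

```python
import mimetypes

HTML_MAX_AGE = 3600        # seconds, for text/html
ASSET_MAX_AGE = 31536000   # seconds (one year), for everything else


def cache_headers_for(key):
    """Return (content_type, cache_control) for an S3 object key."""
    content_type, _ = mimetypes.guess_type(key)
    max_age = HTML_MAX_AGE if content_type == "text/html" else ASSET_MAX_AGE
    return content_type or "application/octet-stream", f"public, max-age={max_age}"
```

The returned pair could then be passed as the content type and cache-control metadata of whatever upload call the deployment script uses.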
After each deployment batch, we invalidated CloudFront paths using the AWS CLI:
aws cloudfront create-invalidation \
--distribution-id