```html

Recovering 250 Lost Pages: Automated Legacy URL Audit and Parity Rebuild for sailjada.com

The Problem: Dead Links from a Migrated WordPress Site

SailJada's migration from WordPress to a static S3 + CloudFront architecture left hundreds of legacy URLs orphaned. An Instagram link-in-bio campaign pointed to /charter-options/, a URL that once existed but now returned 404. More critically, a full audit found that 236 of 250 legacy URLs returned 404, which meant lost organic traffic, broken social links, and a poor experience for anyone arriving from old bookmarks or search results.

Rather than accept that traffic loss, we built an automated system to identify, categorize, and rebuild parity pages for every legitimate legacy URL, prioritizing by business value.

Technical Approach: Wayback Machine + S3 + Python Automation

Step 1: Comprehensive Legacy URL Audit

We used the Internet Archive's Wayback CDX API to list every URL captured under the sailjada.com domain:

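# Query the Wayback CDX API for every captured URL that returned HTTP 200.
# With output=json the first row is a header of field names, so jq skips it.
curl -s "https://web.archive.org/cdx/search/cdx?url=sailjada.com/*&output=json&fl=original&filter=statuscode:200&collapse=urlkey" \
  | jq -r '.[1:][] | .[0]' | sort -u > legacy-urls.txt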

This returned ~500 captures, one per unique URL after collapsing on urlkey. We then wrote /tmp/sailjada-legacy-audit.py to:

  • Test each unique legacy URL path against the live production S3 bucket (s3://sailjada.com)
  • Check HTTP status codes on the live domain
  • Fetch and cache Wayback snapshots for content analysis
  • Classify URLs by content type (blog posts, landing pages, product pages, calendar events, etc.)
  • Store metadata (original title, Wayback snapshot date, content length) in JSON
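
The status-check part of that list can be sketched roughly as follows. This is not the actual /tmp/sailjada-legacy-audit.py; the names LIVE_ORIGIN, live_status, and audit are hypothetical, and the sketch assumes the live site answers HEAD requests at https://sailjada.com. Network errors other than HTTP error statuses are not handled here.

```python
import urllib.error
import urllib.request

LIVE_ORIGIN = "https://sailjada.com"  # assumption: the live hostname


def live_status(path, timeout=10):
    """Return the HTTP status the live site gives for `path` (HEAD request)."""
    req = urllib.request.Request(LIVE_ORIGIN + path, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return err.code


def audit(paths, fetch_status=live_status):
    """Build one record per unique legacy path.

    `fetch_status` is injectable so the loop can be exercised offline.
    """
    records = []
    for path in sorted(set(paths)):
        status = fetch_status(path)
        records.append({"path": path, "live_status": status, "dead": status != 200})
    return records


# Offline demonstration with made-up statuses (no network access needed).
fake_statuses = {"/charter-options/": 404, "/": 200}
demo_records = audit(fake_statuses, fetch_status=fake_statuses.get)
dead_paths = [r["path"] for r in demo_records if r["dead"]]
```

Passing the status lookup in as a function keeps the loop testable without touching the network; the records could then be written out with json.dump alongside the Wayback metadata.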

Result: 250 substantive URLs identified as legitimate business content. 236 returned 404; only 14 remained accessible. Full audit data stored in /tmp/sailjada-legacy-audit.json.

Step 2: Content Categorization and Priority Ranking

We bucketed the 236 dead URLs by type and business value:

| Category | Count | Strategy |
|---|---|---|
| SEO landing pages (e.g., /catamaran-charter-captain-san-diego/) | ~10 | Rebuild full parity pages with Wayback content + new design |
| Legacy charter product permalinks | ~20 | Thin parity stubs with canonical redirect to current product |
| WordPress blog posts (/blog/2010-2024/...) | 42 | Rebuild from Wayback snapshots or consolidate to hub |
| Event/calendar plugin URLs | 32 | Redirect to current event landing pages |
| Whale watching permalinks | 17 | Dedicated /whale-watching/ hub + parity stubs |
| Low-value archives (/2010/06/, taxonomy pages) | ~30 | Stub or skip; minimal business impact |
| Other (broken assets, misc) | ~25 | Case-by-case evaluation |
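
A first pass at this kind of bucketing can be done with ordered path patterns. The sketch below is illustrative only: the regular expressions and category names are assumptions, not the rules used in the audit, and ambiguous paths would still need manual review.

```python
import re
from collections import Counter

# Hypothetical rules, checked in order; the first match wins.
RULES = [
    ("low_value_archive", re.compile(r"^/(\d{4}/\d{2}/|category/|tag/)")),
    ("blog_post", re.compile(r"^/blog/")),
    ("event", re.compile(r"^/(events?|calendar)/")),
    ("whale_watching", re.compile(r"whale")),
    ("seo_landing", re.compile(r"-san-diego/?$")),
]


def classify(path):
    """Return the first matching category name, or 'other'."""
    for name, pattern in RULES:
        if pattern.search(path):
            return name
    return "other"


def tally(paths):
    """Count how many paths fall into each category."""
    return Counter(classify(p) for p in paths)


sample_counts = tally([
    "/2010/06/",
    "/blog/2014/whale-season/",
    "/whale-watching-tours/",
    "/catamaran-charter-captain-san-diego/",
])
```

Rule order matters: a blog post that mentions whales is counted as a blog post because the /blog/ rule is checked before the whale rule.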

Infrastructure: S3, CloudFront, and Deployment Pipeline

S3 Bucket Structure

All static pages are versioned in s3://sailjada.com/ with the following directory layout:

s3://sailjada.com/
├── charter-options/index.html          # NEW: Parity for /charter-types/ (IG link fix)
├── charter-types/index.html            # Canonical product page (2026-05-25)
├── whale-watching/index.html           # NEW: Hub for 17 legacy URLs
├── whale-watching/legacy-stubs/
│   ├── whale-watching-tours/index.html
│   ├── marine-life-san-diego/index.html
│   └── ... (20 more stubs)
├── parity-landings/
│   ├── catamaran-charter-captain-san-diego/index.html
│   ├── dinner-cruises/index.html
│   └── ... (7 more landing pages)
└── sitemap.xml                         # Updated with all new paths
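
To illustrate how a legacy path could map onto this layout, here is a minimal sketch of a stub generator. It assumes the simple convention visible for charter-options/index.html (path plus index.html); the function names s3_key and render_stub are hypothetical, and stubs stored under subfolders such as whale-watching/legacy-stubs/ would need their own mapping.

```python
from html import escape


def s3_key(legacy_path):
    """Map '/charter-options/' to 'charter-options/index.html' (assumed convention)."""
    trimmed = legacy_path.strip("/")
    return f"{trimmed}/index.html" if trimmed else "index.html"


def render_stub(title, canonical_url):
    """Return a thin parity page that names the current page as canonical."""
    safe_title = escape(title)
    safe_url = escape(canonical_url)  # also escapes quotes for attribute use
    return (
        "<!doctype html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>{safe_title}</title>\n"
        f'  <link rel="canonical" href="{safe_url}">\n'
        "</head>\n"
        "<body>\n"
        f'  <p>This page has moved to <a href="{safe_url}">{safe_title}</a>.</p>\n'
        "</body>\n"
        "</html>\n"
    )


stub_key = s3_key("/charter-options/")
stub_html = render_stub("Charter Types", "https://sailjada.com/charter-types/")
```

The resulting key and HTML body could then be uploaded with any S3 client, setting the content type to text/html so CloudFront serves the page correctly.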

CloudFront Distribution and Invalidation

SailJada uses a single CloudFront distribution (ID: D5XXXXXX, redacted for this post) fronting the S3 bucket with:

  • Origin: sailjada.com.s3.us-west-2.amazonaws.com
  • Cache TTL: 3600 seconds (1 hour) for text/html, 31536000 seconds (1 year) for static assets
  • Lambda@Edge function: Basic auth for staging environment
  • Origin Shield: Enabled in us-west-2 to reduce origin load
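
One way to express the two TTLs is as a Cache-Control header chosen per object at upload time. Whether SailJada sets them this way or through a CloudFront cache policy is not stated, so treat the following as a sketch under that assumption; cache_control is a hypothetical helper.

```python
import mimetypes

HTML_TTL = 3600        # seconds (1 hour), the text/html setting listed above
ASSET_TTL = 31536000   # seconds (1 year), the static-asset setting listed above


def cache_control(key):
    """Pick a Cache-Control value from the object's guessed content type."""
    content_type, _ = mimetypes.guess_type(key)
    ttl = HTML_TTL if content_type == "text/html" else ASSET_TTL
    return f"public, max-age={ttl}"


html_header = cache_control("charter-options/index.html")
asset_header = cache_control("assets/logo.png")
```

Anything that is not text/html falls into the long-lived bucket here, so frequently updated non-HTML files such as sitemap.xml would need an explicit exception.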

After each deployment batch, we invalidated CloudFront paths using the AWS CLI:

aws cloudfront create-invalidation \
  --distribution-id