
Automating GetMyBoat Lead Capture with Playwright: Infrastructure and Pipeline Design

Over the past development session, we built out an end-to-end automation pipeline to capture and analyze warm leads from GetMyBoat for the Sail JADA charter business. This post covers the architecture decisions, technical implementation, and the infrastructure patterns we chose to make lead ingestion reliable and maintainable.

What We Built

The goal was straightforward: automate login to a GetMyBoat account, capture incoming inquiries from the inbox, and feed them into a downstream analysis pipeline. The challenge was that GetMyBoat is a Single Page Application (SPA) with dynamic navigation and session-dependent state, making traditional scraping unreliable.

We created a six-module Python automation suite:

  • /tmp/gmb_session.py — Session and browser lifecycle management
  • /tmp/gmb_login.py — Headed authentication flow with persistent profile storage
  • /tmp/gmb_explore.py — SPA navigation and URL discovery
  • /tmp/gmb_inbox.py — Inbox page load and lead extraction
  • /tmp/gmb_lead_scan.py — Read-only lead enumeration and filtering
  • /tmp/gmb_watch.py — Persistent session watcher for real-time inbox URL tracking

Technical Architecture: Why Playwright Over Selenium

We chose Playwright over Selenium for three architectural reasons:

  • Native DevTools Protocol support: Playwright connects directly to Chromium's CDP, avoiding the WebDriver indirection layer. This gives us better reliability on SPAs where timing is critical.
  • Browser context isolation: Playwright's browser context model lets us run multiple isolated sessions, each with its own cookies and storage, inside a single browser process. That is cheaper than launching one browser per session and matters for scaling lead capture across multiple accounts later.
  • Built-in synchronization primitives: Methods like page.wait_for_url() and page.wait_for_load_state('networkidle') handle SPA timing issues declaratively, reducing flaky waits.

The trade-off was environment setup complexity. Playwright uses its own browser builds per OS, and they are downloaded in a separate step (playwright install) after the Python package itself is installed. We addressed this by creating a dedicated Python 3.10+ virtual environment with both Playwright and the Google API client libraries (needed downstream for Gmail integration and lead enrichment).

Implementation Details: The Four-Stage Pipeline

Stage 1: Persistent Browser Profile (gmb_login.py)

Rather than logging in fresh each run, we establish a "headed" (visible UI) Playwright session that persists authentication state to disk. The profile is stored at ~/.playwrightprofiles/getmyboat_carole. On first run, the script:

  • Launches a headed browser instance
  • Navigates to https://getmyboat.com/login
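  • Waits for human input (or automated credential injection) to complete login and, if present, 2FA
  • Persists the context state (cookies and localStorage) in the profile directory; sessionStorage is tied to the lifetime of a tab and should not be expected to survive a browser restart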

Subsequent runs reuse the profile, eliminating repeated authentication overhead. This is critical because GetMyBoat's 2FA can time out. Persistent profiles also degrade gracefully when sessions expire: we catch TimeoutError and re-authenticate.
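A minimal sketch of the persistent-profile flow described above, assuming Playwright's sync API. The profile path and login URL come from this post; the needs_login helper, the five-minute timeout, and the rule that "a path containing /login means signed out" are illustrative assumptions, not the actual contents of gmb_login.py.

```python
from pathlib import Path
from urllib.parse import urlparse

# Profile location and login URL as given in the post.
PROFILE_DIR = Path.home() / ".playwrightprofiles" / "getmyboat_carole"
LOGIN_URL = "https://getmyboat.com/login"


def needs_login(current_url: str) -> bool:
    """Hypothetical check: a URL whose path contains /login means 'not signed in'."""
    return "/login" in urlparse(current_url).path


def open_session(headless: bool = False, login_timeout_ms: int = 300_000):
    """Launch a persistent context and wait for login/2FA to finish if needed."""
    # Imported lazily so needs_login() is usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    PROFILE_DIR.mkdir(parents=True, exist_ok=True)
    pw = sync_playwright().start()
    # The user data directory keeps cookies and localStorage between runs.
    context = pw.chromium.launch_persistent_context(str(PROFILE_DIR), headless=headless)
    page = context.pages[0] if context.pages else context.new_page()
    page.goto(LOGIN_URL)
    if needs_login(page.url):
        # Blocks until the app leaves the login route; raises TimeoutError otherwise.
        page.wait_for_url(lambda url: not needs_login(url), timeout=login_timeout_ms)
    return pw, context, page
```

The caller owns cleanup: call context.close() and then pw.stop() when finished. If an already-authenticated profile gets redirected away from the login page, the wait is skipped entirely.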

Stage 2: SPA Navigation & URL Discovery (gmb_explore.py)

GetMyBoat's SPA doesn't expose the inbox URL in the DOM initially. We had to reverse-engineer navigation by:

  • Attaching a response listener to observe all XHR/fetch calls
  • Filtering for API endpoints that indicate inbox state changes
  • Using page.wait_for_url() with a regex pattern to detect when the app navigates to the owner inbox

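The key insight: SPA URL changes often happen *after* the page has painted, so neither a URL match nor a paint event alone proves the inbox is ready. We added explicit waits for both network quiescence (wait_for_load_state('networkidle')) and a visible inbox container (page.locator('.inbox-container').wait_for(state='visible')) to avoid race conditions. Note that is_visible() on its own returns immediately and does not wait, so it only works as a check after such a wait.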
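The discovery steps can be sketched roughly as follows. The URL regex and the API hint strings are guesses for illustration only, since the real inbox route and endpoints were found at runtime and are not stated here; the '.inbox-container' selector is the one mentioned above.

```python
import re
from typing import List

# Both patterns are illustrative assumptions, not GetMyBoat's real routes.
INBOX_URL_RE = re.compile(r"/(inbox|messages|inquiries)(/|\?|$)")
API_HINTS = ("inbox", "inquir", "conversation")


def is_inbox_api(url: str, resource_type: str) -> bool:
    """Keep only XHR/fetch traffic whose URL hints at inbox state."""
    return resource_type in ("xhr", "fetch") and any(h in url.lower() for h in API_HINTS)


def discover_inbox(page, timeout_ms: int = 30_000) -> List[str]:
    """Record inbox-related API calls, then wait for the inbox route and container."""
    seen: List[str] = []

    def on_response(response):
        if is_inbox_api(response.url, response.request.resource_type):
            seen.append(response.url)

    page.on("response", on_response)  # observes traffic; does not modify it
    try:
        page.wait_for_url(INBOX_URL_RE, timeout=timeout_ms)
        page.wait_for_load_state("networkidle")
        # Locator.wait_for blocks until the element is visible.
        page.locator(".inbox-container").wait_for(state="visible", timeout=timeout_ms)
    finally:
        page.remove_listener("response", on_response)
    return seen
```

The returned list of URLs is what you would inspect by hand to decide which endpoints actually carry inbox data.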

Stage 3: Lead Extraction & Filtering (gmb_inbox.py, gmb_lead_scan.py)

Once we have a stable inbox page, we extract leads by:

  • Querying the DOM for lead list items using semantic selectors (e.g., [data-testid="inquiry-row"])
  • For each lead, extracting: sender name, inquiry date, message preview, and a unique lead ID
  • Filtering for "warm" leads by cross-referencing against known email domains and keyword patterns (e.g., messages that mention specific vessel types or matching locations)
  • Storing results in a structured JSON format for downstream processing

The gmb_lead_scan.py module runs in read-only mode, meaning it never clicks "reply," marks as read, or modifies state—useful for safe exploratory runs and testing.
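Here is one way the extraction and keyword filter could look. Only the [data-testid="inquiry-row"] selector comes from this post; the child selectors, the data-lead-id attribute, and the keyword list are hypothetical placeholders, and the email-domain cross-reference is left out for brevity.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Sequence

# Hypothetical keyword list; the real warm-lead patterns are not shown in the post.
WARM_KEYWORDS = ("catamaran", "sailboat", "sunset")


@dataclass
class Lead:
    lead_id: str
    sender: str
    date: str
    preview: str


def is_warm(lead: Lead, keywords: Sequence[str] = WARM_KEYWORDS) -> bool:
    """Case-insensitive keyword match against the message preview."""
    text = lead.preview.lower()
    return any(k.lower() in text for k in keywords)


def extract_leads(page) -> List[Lead]:
    """Read-only scan: only reads attributes and text, never clicks."""
    leads = []
    for row in page.locator('[data-testid="inquiry-row"]').all():
        leads.append(Lead(
            lead_id=row.get_attribute("data-lead-id") or "",  # assumed attribute
            sender=row.locator('[data-testid="sender"]').inner_text().strip(),
            date=row.locator('[data-testid="date"]').inner_text().strip(),
            preview=row.locator('[data-testid="preview"]').inner_text().strip(),
        ))
    return leads


def to_json(leads: Iterable[Lead]) -> str:
    """Serialize leads for the downstream pipeline."""
    return json.dumps([asdict(lead) for lead in leads], indent=2)
```

Keeping is_warm and to_json free of any browser dependency means the filtering rules can be unit-tested against saved samples without opening a session.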

Stage 4: Real-Time Monitoring (gmb_watch.py)

For ongoing lead monitoring, we built a persistent watcher that:

  • Keeps the browser session alive and the inbox page open
  • Polls for new lead notifications (visual indicators, DOM updates)
  • Logs inbox URL changes and session state to ~/.logs/gmb_watch.log for audit trails
  • Gracefully handles reconnection on network interruption
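A rough sketch of such a polling loop, under the assumption that new leads can be detected by diffing row IDs between polls. The data-lead-id attribute, the poll interval, and the exponential backoff parameters are all illustrative choices rather than what gmb_watch.py necessarily does.

```python
import logging
import time
from typing import Optional, Set

log = logging.getLogger("gmb_watch")
ROW_SELECTOR = '[data-testid="inquiry-row"]'


def new_ids(current: Set[str], seen: Set[str]) -> Set[str]:
    """IDs present now that were not seen on earlier polls."""
    return current - seen


def backoff_seconds(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff: base * 2**attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def watch(page, poll_seconds: float = 60.0, max_polls: Optional[int] = None) -> None:
    """Poll the open inbox page, logging URL changes and newly seen lead IDs."""
    from playwright.sync_api import Error as PlaywrightError  # lazy import

    seen: Set[str] = set()
    failures = 0
    last_url = page.url
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            if failures:
                page.reload()  # attempt to recover after an earlier failed poll
            if page.url != last_url:
                log.info("inbox URL changed: %s -> %s", last_url, page.url)
                last_url = page.url
            rows = page.locator(ROW_SELECTOR).all()
            # "data-lead-id" is a hypothetical attribute name.
            current = {r.get_attribute("data-lead-id") or "" for r in rows}
            current.discard("")
            for lead_id in sorted(new_ids(current, seen)):
                log.info("new lead: %s", lead_id)
            seen |= current
            failures = 0
            delay = poll_seconds
        except PlaywrightError as exc:
            delay = backoff_seconds(failures)
            failures += 1
            log.warning("poll failed (%s); retrying in %.0fs", exc, delay)
        time.sleep(delay)
```

With the defaults, consecutive failures wait 5, 10, 20, ... seconds up to a five-minute ceiling, and a single successful poll resets the counter.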

Infrastructure & Integration Points

Email Filtering & Lead Enrichment

Downstream from lead capture, we planned integration with Gmail (via the Google API client in the same venv) to:

  • Pull full inquiry emails from carole@sailjada.com inbox
  • Cross-reference GetMyBoat lead IDs with email sender addresses
  • Extract structured data (vessel type, rental duration, location preferences) via templated email parsing

This required confirming the MX records for sailjada.com (which point to Zoho Mail) and verifying the shape of the Gmail token used in the OAuth2 flow.
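Templated email parsing can be as simple as one regex per field. The sketch below assumes inquiry emails contain "Label: value" lines; the labels Vessel, Duration, and Location are hypothetical, since the actual email template is not reproduced in this post.

```python
import re
from typing import Dict, Optional

_FLAGS = re.MULTILINE | re.IGNORECASE

# Hypothetical field labels; adjust to the real inquiry email template.
FIELD_PATTERNS = {
    "vessel_type": re.compile(r"^Vessel:\s*(.+)$", _FLAGS),
    "duration": re.compile(r"^Duration:\s*(.+)$", _FLAGS),
    "location": re.compile(r"^Location:\s*(.+)$", _FLAGS),
}


def parse_inquiry(body: str) -> Dict[str, Optional[str]]:
    """Extract known fields from a plain-text email body; missing fields map to None."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(body)
        result[name] = match.group(1).strip() if match else None
    return result
```

Returning None for absent fields, rather than raising, lets the enrichment step record partial matches and flag emails whose layout has drifted from the template.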

VCS & Artifact Management

All six modules live in ~/Documents/repos (presumably under version control). Configuration is externalized:

  • Credentials stored in a Git-ignored .env.local or secrets manager
  • Playwright profile paths configurable via environment variables
  • Logs written to ~/.logs/ with rotation (important for long-running watcher processes)
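As a sketch of the externalized configuration, the standard library covers both pieces. GMB_PROFILE_DIR is a made-up variable name, and the 1 MB / 5 backups rotation policy is an arbitrary example; loading credentials from .env.local or a secrets manager is not shown.

```python
import logging
import os
from logging.handlers import RotatingFileHandler
from pathlib import Path
from typing import Optional, Union


def profile_dir() -> Path:
    """Profile path from the (hypothetical) GMB_PROFILE_DIR variable, else the default."""
    default = Path.home() / ".playwrightprofiles" / "getmyboat_carole"
    return Path(os.environ.get("GMB_PROFILE_DIR", default)).expanduser()


def make_logger(log_dir: Optional[Union[str, Path]] = None,
                name: str = "gmb_watch") -> logging.Logger:
    """Logger writing to <log_dir>/<name>.log with size-based rotation."""
    directory = Path(log_dir) if log_dir else Path.home() / ".logs"
    directory.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            directory / f"{name}.log",
            maxBytes=1_000_000,  # rotate at roughly 1 MB (example value)
            backupCount=5,       # keep five rotated files (example value)
        )
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Size-based rotation suits a watcher that may run for days; a time-based handler would be an equally reasonable choice if daily log files are easier to audit.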

Key Decisions & Trade-Offs

Headed vs. Headless Browsers: We started with a headed session to debug navigation and 2FA interactively, and plan to switch to headless once the URL discovery logic stabilizes. Headed mode adds latency but provides invaluable visibility during development.

Profile Persistence vs. Fresh Auth: Persistent profiles reduce overhead but introduce state management complexity. We mitigate this with session freshness checks and automatic re-auth on cookie expiration.
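One possible shape for the session freshness check, assuming it inspects cookie expiry. It takes the list of dicts returned by Playwright's BrowserContext.cookies(), where expires is a Unix timestamp in seconds or -1 for a session cookie. The cookie name "sessionid" and the ten-minute margin are assumptions for illustration.

```python
import time
from typing import Iterable, List, Mapping, Optional


def session_is_fresh(
    cookies: List[Mapping],
    auth_cookie_names: Iterable[str],
    min_remaining_s: float = 600.0,
    now: Optional[float] = None,
) -> bool:
    """True if every named auth cookie exists and is not about to expire."""
    now = time.time() if now is None else now
    wanted = set(auth_cookie_names)
    relevant = [c for c in cookies if c.get("name") in wanted]
    if {c["name"] for c in relevant} != wanted:
        return False  # at least one auth cookie is missing entirely
    for cookie in relevant:
        expires = cookie.get("expires", -1)
        # -1 marks a session cookie with no explicit expiry; treat it as valid.
        if expires != -1 and expires - now < min_remaining_s:
            return False
    return True
```

A typical call would be session_is_fresh(context.cookies(), ["sessionid"]), with a False result triggering the headed re-authentication flow. Cookie expiry is only a proxy: the server can invalidate a session early, so catching a redirect to the login page remains the fallback.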