Automating GetMyBoat Lead Ingestion with Playwright: Building a Headless Browser Pipeline for SPA Navigation
What Was Done
We built an end-to-end lead scraping pipeline for GetMyBoat, a single-page application (SPA) that doesn't expose API endpoints for lead data. The solution uses Playwright for headless browser automation, persistent browser profiles to maintain session state, and structured data extraction from DOM elements. The pipeline captures inbox conversations, parses thread metadata, and generates markdown reports with pipeline value calculations — all saved to persistent storage and emailed for downstream analysis.
Technical Architecture
The solution is composed of modular Python scripts, each handling a distinct phase of the automation:
- /tmp/gmb_login.py: Establishes an authenticated session via Playwright headed mode, prefilling credentials and waiting for human interaction or automatic login completion
- /tmp/gmb_session.py: Manages persistent browser profile storage to preserve authentication across script executions
- /tmp/gmb_explore.py: Navigates SPA routes to locate the inbox URL by observing network requests and DOM mutations
- /tmp/gmb_inbox.py: Opens the owner inbox, waits for dynamic content to load, and captures the thread list
- /tmp/gmb_scrape.py: Extracts conversation panels, parses thread metadata (sender, message count, timestamps), and serializes it to structured data
- /tmp/gmb_watch.py: Long-running observer that monitors inbox navigation in real time to isolate true GetMyBoat platform notifications from external forwarded emails
- /tmp/gmb_manual.py: Fallback script for manual inspection of parsed conversation trees
- /tmp/gmb_lead_scan.py: Read-only lead analysis that filters the inbox for qualified opportunities without modifying state
Critical Technical Decisions & Rationale
Playwright Over Selenium
We chose Playwright instead of Selenium because GetMyBoat's SPA renders the inbox with asynchronous JavaScript. Playwright auto-waits for elements to become actionable before interacting with them, which makes it more reliable against content that appears after the initial page load. Its explicit waits for specific selectors and network-idle states are essential for SPAs where content loads asynchronously.
Persistent Browser Profiles
Rather than re-authenticating on every run, we store the Chromium profile in a durable location (e.g., /Users/cb/Documents/repos/gmb-profiles/carole-jada) to reuse authenticated sessions. This reduces login failures and respects GetMyBoat's rate limiting by avoiding repeated credential submission. The profile preserves cookies and localStorage across script invocations; sessionStorage is scoped to a single tab session and does not survive a browser restart.
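As a minimal sketch of profile reuse, the snippet below opens Chromium against a per-account directory with Playwright's launch_persistent_context. The profile_dir helper, the PROFILE_ROOT constant, and the open_with_profile function are hypothetical names introduced for illustration; the original scripts may organize this differently.

```python
from pathlib import Path

# Assumed root directory, modeled on the example path in the text.
PROFILE_ROOT = Path.home() / "Documents" / "repos" / "gmb-profiles"


def profile_dir(account: str, root: Path = PROFILE_ROOT) -> Path:
    """Return the per-account profile directory (hypothetical helper)."""
    if not account or "/" in account or account in {".", ".."}:
        raise ValueError(f"invalid account name: {account!r}")
    return root / account


async def open_with_profile(account: str, url: str, headless: bool = True) -> str:
    """Open `url` in a Chromium context backed by the account's profile."""
    # Imported lazily so the path helper works without Playwright installed.
    from playwright.async_api import async_playwright

    user_data_dir = profile_dir(account)
    user_data_dir.mkdir(parents=True, exist_ok=True)
    async with async_playwright() as p:
        # Cookies and localStorage are written to user_data_dir on disk.
        context = await p.chromium.launch_persistent_context(
            str(user_data_dir), headless=headless
        )
        page = context.pages[0] if context.pages else await context.new_page()
        await page.goto(url)
        title = await page.title()
        await context.close()
        return title
```

A run would look like asyncio.run(open_with_profile("carole-jada", "https://www.getmyboat.com/")). Because the context owns the directory, only one process should use a given profile at a time.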
Headed Mode with User Interaction Option
The login scripts run in headless=False mode to allow human login when automation fails. This hybrid approach — attempted prefill of credentials with fallback to manual entry — handles both 2FA prompts and dynamic anti-bot challenges that headless mode cannot navigate. Once logged in, the profile is persisted for subsequent headless runs.
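One way to implement the manual fallback is to wait, with a generous timeout, until the page leaves the login route, regardless of whether the script or a person completed the form. The sketch below assumes the login page's path ends in /login (as in the URL used in the session code) and that any other path means the login succeeded; both are assumptions, and left_login_page and wait_for_login are hypothetical names.

```python
from urllib.parse import urlparse


def left_login_page(url: str) -> bool:
    """True once the browser is no longer on a path ending in /login."""
    path = urlparse(url).path.rstrip("/")
    return not path.endswith("/login")


async def wait_for_login(page, timeout_ms: int = 300_000) -> None:
    """Block until login completes, whether automated or manual.

    `page` is a Playwright Page. The five-minute default leaves time
    for a human to solve a 2FA prompt or challenge in the headed window.
    """
    # wait_for_url accepts a predicate that receives the current URL string.
    await page.wait_for_url(left_login_page, timeout=timeout_ms)
```

If the site redirects to an intermediate challenge page that is not under /login, the predicate would return early, so a DOM marker for the logged-in state may be a safer signal in practice.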
DOM-Based Data Extraction Over Network Interception
We parse conversation threads directly from the rendered DOM rather than intercepting API calls. GetMyBoat's inbox does not expose a dedicated API; the SPA renders data fetched internally. By waiting for specific selectors (e.g., [data-testid="thread-item"]) and extracting text content, we ensure we capture exactly what the user sees without reverse-engineering undocumented endpoints.
Implementation Details
Session Establishment
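# Log in once in headed mode and persist the session state
from pathlib import Path
from playwright.async_api import async_playwright

STATE = Path("/path/to/gmb-profiles/carole-jada/state.json")

async def login(email: str, password: str) -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        # Load saved state only if it exists; a missing file raises on the first run
        context = await browser.new_context(
            storage_state=str(STATE) if STATE.exists() else None
        )
        page = await context.new_page()
        await page.goto("https://www.getmyboat.com/login")
        await page.fill("input[name='email']", email)
        await page.fill("input[name='password']", password)
        await page.click("button[type='submit']")
        await page.wait_for_load_state("networkidle")
        # Save cookies and localStorage for later headless runs
        await context.storage_state(path=str(STATE))
        await browser.close()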
Inbox Navigation & Thread Capture
The SPA does not display the inbox URL in the address bar until navigation completes. We watch for route changes by polling the page URL and inspecting the DOM for inbox markers:
# Runs inside an async function with an open Playwright `page`
# Wait for the inbox to render
await page.wait_for_selector("[data-testid='inbox-container']", timeout=10000)
# Extract all threads
threads_list = []
threads = await page.query_selector_all("[data-testid='thread-item']")
for thread in threads:
    sender = await thread.query_selector(".thread-sender")
    message_count = await thread.query_selector(".message-count")
    last_message_time = await thread.query_selector(".timestamp")
    # query_selector returns None when a child element is missing
    thread_data = {
        'sender': await sender.text_content() if sender else None,
        'message_count': await message_count.text_content() if message_count else None,
        'last_message_time': await last_message_time.text_content() if last_message_time else None,
    }
    threads_list.append(thread_data)
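The URL polling described at the start of this section can be sketched as a small retry loop. The poll_until and make_inbox_probe names are hypothetical, and the assumption that the inbox route contains the fragment "inbox" is a guess, since the text does not state the actual URL.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def poll_until(
    probe: Callable[[], Awaitable[Optional[str]]],
    timeout_s: float = 10.0,
    interval_s: float = 0.25,
) -> str:
    """Call `probe` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = await probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        await asyncio.sleep(interval_s)


def make_inbox_probe(
    page,
    marker: str = "[data-testid='inbox-container']",
    fragment: str = "inbox",  # assumed URL fragment, not confirmed by the source
) -> Callable[[], Awaitable[Optional[str]]]:
    """Build a probe that succeeds when the URL and the DOM both indicate the inbox."""

    async def probe() -> Optional[str]:
        # Check the cheap URL test first, then confirm with the DOM marker.
        if fragment in page.url and await page.query_selector(marker):
            return page.url
        return None

    return probe
```

Calling inbox_url = await poll_until(make_inbox_probe(page)) returns the settled URL, which can then be recorded for direct navigation on later runs.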
Data Serialization & Report Generation
Extracted threads are serialized to JSON, then converted to markdown with pipeline calculations. Each thread includes sender name, message count, and estimated lead value based on inquiry type and message volume:
# Generate markdown report
report = "# GetMyBoat Inbox Report\n\n"
total_pipeline = 0
for thread in threads_data:
    estimated_value = estimate_lead_value(thread['sender'], thread['message_count'])
    total_pipeline += estimated_value
    report += f"## {thread['sender']}\n"
    report += f"- Messages: {thread['message_count']}\n"
    report += f"- Last Activity: {thread['last_message_time']}\n"
    report += f"- Est. Value: ${estimated_value}\n\n"
report += f"\n**Total Pipeline Value: ${total_pipeline}**\n"
# Save to persistent location
with open("/Users/cb/Documents/repos/jada-ops/gmb-report.md", "w") as f:
    f.write(report)
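The report code calls estimate_lead_value without showing it, and message_count arrives as text scraped from the DOM rather than as a number. The sketch below is one possible shape for that function, matching the call signature above. The dollar constants are placeholders invented for illustration, not figures from the original pipeline, and the inquiry-type component mentioned in the text is omitted because the source does not say how it is derived.

```python
import re
from typing import Optional, Union

# Placeholder values for illustration only; not taken from the original pipeline.
BASE_VALUE = 500           # dollars assigned to any thread with at least one message
PER_MESSAGE_BONUS = 50     # extra dollars per message as a rough engagement signal
MAX_COUNTED_MESSAGES = 10  # cap so very long threads do not dominate the total


def parse_message_count(raw: Union[str, int, None]) -> int:
    """Turn scraped text such as '3 messages' into an integer (0 if absent)."""
    if isinstance(raw, int):
        return max(raw, 0)
    match = re.search(r"\d+", raw or "")
    return int(match.group()) if match else 0


def estimate_lead_value(sender: Optional[str], message_count: Union[str, int, None]) -> int:
    """Estimate a thread's value from message volume.

    `sender` is accepted to match the call site but is unused in this sketch.
    """
    count = parse_message_count(message_count)
    if count == 0:
        return 0
    return BASE_VALUE + PER_MESSAGE_BONUS * min(count, MAX_COUNTED_MESSAGES)
```

Returning an integer keeps the running total_pipeline sum and the formatted dollar amounts in the report free of floating-point noise.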
Infrastructure & Data Storage
- Browser Profiles: Stored in /Users/cb/Documents/repos/gmb-profiles/ with subdirectories per account (e.g., carole-jada). State files are JSON snapshots of cookies and storage, enabling profile reuse without re-authentication.
- Reports: Generated markdown reports saved to /Users/cb/Documents/repos/jada-ops/ with timestamped filenames for historical tracking.
- Virtual Environment: Dedicated Python venv with Playwright and Google API client libraries. Playwright installed via pip install playwright followed by playwright install chromium to download the matching Chromium binary.
- Email Integration: Reports emailed via Gmail helper scripts using OAuth2 tokens (stored securely in system keychain, not in code).
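The report code earlier writes to a fixed gmb-report.md, while the list above describes timestamped filenames. A minimal sketch of one way to build such names follows; the report_path helper, the gmb-report stem, and the UTC timestamp format are assumptions, not the original naming scheme.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def report_path(
    directory: Path,
    now: Optional[datetime] = None,
    stem: str = "gmb-report",
) -> Path:
    """Build a sortable, timestamped report filename inside `directory`."""
    # Normalize to UTC so names sort chronologically regardless of local timezone.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return directory / f"{stem}-{moment.strftime('%Y%m%dT%H%M%SZ')}.md"


def write_report(report: str, directory: Path) -> Path:
    """Write the markdown report under a fresh timestamped name and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = report_path(directory)
    path.write_text(report, encoding="utf-8")
    return path
```

With this in place, the final open(...) call in the report code could be replaced by write_report(report, Path("/Users/cb/Documents/repos/jada-ops")), leaving earlier reports untouched.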
Key Challenges & Solutions
Challenge: Playwright installation required compatible Chromium build. Different systems (Intel/