Automating GetMyBoat Lead Ingestion with Playwright: Architecture & Implementation Challenges
Over the past development session, we built out a Python-based lead scraper for GetMyBoat inquiries tied to the Sail JADA rental operation. This post documents the architecture decisions, implementation details, and the blocking issue we encountered that requires follow-up work.
What We Built
Two Python scripts were created in /tmp/ to automate GetMyBoat lead retrieval:
- /tmp/gmb_login.py: handles browser automation and account authentication
- /tmp/gmb_lead_scan.py: extracts lead data from the authenticated session
The goal: replace manual checking of the carole@sailjada.com inbox for GetMyBoat platform notifications with a scheduled, read-only scraper that classifies warm leads and queues draft replies for human review.
Technical Architecture
Browser Automation Stack
We selected Playwright over Selenium for this task because:
- Multi-browser support — Chromium, Firefox, WebKit in a single API
- Native async/await — cleaner coroutine handling than Selenium's blocking model
- Built-in wait strategies — automatic network idle detection and selector polling
- Headless + headed modes — easier debugging when selectors fail
Installation required creating a dedicated Python venv to avoid conflicts with existing Google API client libraries:
python3 -m venv /path/to/venv
source /path/to/venv/bin/activate
pip install playwright google-api-python-client
playwright install chromium
We verified Chromium availability and launch capability before writing authentication code:
python3 -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(); print('Chromium OK'); browser.close(); p.stop()"
Gmail Token & Account Routing
The system reuses the Google OAuth2 tokens already held in our existing credential store. Rather than embedding GetMyBoat credentials in plaintext, we adopted the following pattern:
- Store GetMyBoat credentials in environment variables or a secrets manager (not in version control)
- Use the existing Gmail token infrastructure to verify that incoming platform notifications originate from GetMyBoat's SMTP servers
- Cross-reference MX records for sailjada.com to ensure reply routing works correctly
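As a minimal sketch of the first point, assuming the credentials live in two environment variables (the names GMB_EMAIL and GMB_PASSWORD are hypothetical, not taken from the scripts), loading could look like this:

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GmbCredentials:
    email: str
    # repr=False keeps the password out of logs and tracebacks
    password: str = field(repr=False)


def load_credentials(env=os.environ) -> GmbCredentials:
    """Read credentials from the environment, failing loudly if any are missing."""
    # GMB_EMAIL and GMB_PASSWORD are hypothetical variable names.
    required = ("GMB_EMAIL", "GMB_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return GmbCredentials(email=env["GMB_EMAIL"], password=env["GMB_PASSWORD"])
```

Failing at startup when a variable is absent is preferable in a cron context, where a silent empty password would otherwise surface later as a confusing login timeout.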
We verified the MX record setup before building the reply logic:
nslookup -type=MX sailjada.com
# Expected: Google Workspace MX entries (aspmx.l.google.com, etc.)
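The notification-origin check from the list above could be approximated with the standard library alone. The sketch below is a heuristic, not the check we actually run: it assumes the notifications are sent from the getmyboat.com domain (unverified) and that the receiving server has added an Authentication-Results header recording a DKIM pass.

```python
from email import message_from_string
from email.utils import parseaddr


def looks_like_platform_mail(raw_headers: str, expected_domain: str = "getmyboat.com") -> bool:
    """Heuristic: From domain matches and the receiving server recorded a DKIM pass."""
    # expected_domain is an assumption; confirm it against a real notification.
    msg = message_from_string(raw_headers)
    _, address = parseaddr(msg.get("From", ""))
    domain = address.rpartition("@")[2].lower()
    from_ok = domain == expected_domain or domain.endswith("." + expected_domain)

    # Join every Authentication-Results header into one lowercase string
    auth = " ".join(msg.get_all("Authentication-Results") or []).lower()
    dkim_ok = "dkim=pass" in auth and expected_domain in auth
    return from_ok and dkim_ok
```

A From header by itself is trivially spoofable, which is why the sketch also requires the DKIM result; a stricter version would parse the header fields rather than substring-match them.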
Implementation: The Login Script
/tmp/gmb_login.py handles the critical authentication step. Key design decisions:
- Sync API — used Playwright's synchronous interface (not async) for simpler error handling in a cron/scheduled context
- Network idle waits — after login, wait for all network requests to settle before returning the authenticated context
- Headless mode default — production runs headless; headless=False for debugging
- Timeout handling — 30-second timeouts on page navigation, 10-second timeouts on selector polls
Pseudocode structure:
from playwright.sync_api import sync_playwright

def login_to_getmyboat(email, password, headless=True):
    """
    Authenticate to GetMyBoat and return (page, browser, playwright).
    The caller is responsible for calling browser.close() and playwright.stop().
    """
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(headless=headless)
    try:
        context = browser.new_context()
        page = context.new_page()
        # 10-second default for selector waits; navigation timeouts are set explicitly
        page.set_default_timeout(10000)
        # Navigate to login
        page.goto("https://www.getmyboat.com/login", wait_until="networkidle", timeout=30000)
        # Fill and submit credentials
        page.fill('input[name="email"]', email)
        page.fill('input[name="password"]', password)
        page.click('button[type="submit"]')
        # Wait for redirect to dashboard
        page.wait_for_url("**/dashboard**", timeout=30000)
    except Exception:
        # Release the browser if login fails, then re-raise for the caller
        browser.close()
        playwright.stop()
        raise
    return page, browser, playwright
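Because the function hands three objects back to the caller, cleanup is easy to forget. One possible wrapper, sketched here with a hypothetical name (gmb_session) and taking the login function as a parameter, guarantees that the browser and the Playwright driver are released once the caller is done:

```python
from contextlib import contextmanager


@contextmanager
def gmb_session(login, email, password, headless=True):
    """Yield an authenticated page and always release browser resources afterwards."""
    # `login` is expected to return (page, browser, playwright),
    # as login_to_getmyboat does.
    page, browser, playwright = login(email, password, headless=headless)
    try:
        yield page
    finally:
        # Runs whether the body succeeded or raised
        browser.close()
        playwright.stop()


# Intended usage (not executed here):
#   with gmb_session(login_to_getmyboat, email, password) as page:
#       ...scan leads using page...
```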
The Lead Scan Script
/tmp/gmb_lead_scan.py uses the authenticated page to extract lead metadata. It:
- Navigates to the inquiries/messages inbox
- Queries the DOM for lead cards (selector: .inquiry-card or similar)
- Extracts structured data: sender name, message preview, date, vessel, charter dates
- Classifies leads as "warm" (multi-message thread, booked within 30 days) or "cold"
- Returns JSON for downstream processing (auto-reply, CRM sync)
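The classification and JSON steps above can be sketched without a browser. The field names, the Lead class, and the exact rule are assumptions: here a lead counts as warm if either signal is present (more than one message in the thread, or a charter start date within the next 30 days), which is one reading of the criteria, not a confirmed rule.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Lead:
    # Hypothetical field names mirroring the data listed above
    sender: str
    preview: str
    received: date
    vessel: str
    charter_start: Optional[date]
    message_count: int


def classify(lead: Lead, today: date, window_days: int = 30) -> str:
    """Return "warm" if either warm signal is present, else "cold"."""
    multi_message = lead.message_count > 1
    soon = (
        lead.charter_start is not None
        and timedelta(0) <= lead.charter_start - today <= timedelta(days=window_days)
    )
    return "warm" if (multi_message or soon) else "cold"


def leads_to_json(leads, today: date) -> str:
    """Serialize leads with ISO dates and a status field for downstream tools."""
    rows = []
    for lead in leads:
        row = asdict(lead)
        row["received"] = lead.received.isoformat()
        row["charter_start"] = lead.charter_start.isoformat() if lead.charter_start else None
        row["status"] = classify(lead, today)
        rows.append(row)
    return json.dumps(rows, indent=2)
```

Passing today in explicitly, rather than calling date.today() inside classify, keeps the rule deterministic and easy to test.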
Where We Hit a Blocker
The authentication test timed out at the 30-second mark during the credentials submission phase. This prevented us from completing the lead extraction workflow.
Root causes under investigation:
- Cloudflare challenge or rate limiting: GetMyBoat may detect browser automation and serve a challenge page instead of the dashboard
- MFA requirement: the account may have two-factor authentication enabled, requiring a TOTP token or email verification
- Session timeout: the account may have been inactive long enough to require re-verification
- Network policy: the development machine's IP may be flagged or geofenced
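To narrow these candidates down, it helps to record what the browser was showing when the timeout fired. The sketch below is one way to do that; the function names and the text markers are assumptions, and the markers in particular are rough heuristics that would need tuning against the pages GetMyBoat actually serves. It uses only standard Playwright page methods (screenshot, content, title, inner_text) and the url property.

```python
from datetime import datetime, timezone
from pathlib import Path


def diagnose_page(title: str, body_text: str) -> str:
    """Guess the blocker from page text. The markers are heuristic assumptions."""
    text = f"{title}\n{body_text}".lower()
    if "just a moment" in text or "checking your browser" in text:
        return "bot-challenge"
    if "verification code" in text or "two-factor" in text:
        return "mfa"
    if "too many requests" in text or "rate limit" in text:
        return "rate-limited"
    return "unknown"


def capture_failure(page, out_dir="login_debug"):
    """Save a screenshot and the HTML of the current page, then return a summary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    page.screenshot(path=str(out / f"{stamp}.png"), full_page=True)
    (out / f"{stamp}.html").write_text(page.content(), encoding="utf-8")
    title = page.title()
    return {
        "url": page.url,
        "title": title,
        "guess": diagnose_page(title, page.inner_text("body")),
    }
```

Calling capture_failure(page) from the except branch around the wait_for_url call would leave an artifact on disk for each failed run, which is useful when the scraper runs headless under cron.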
Key Decisions & Rationale
- Read-only scraper — we intentionally built a passive observer, not an automated reply bot. All outgoing messages are queued for human review before sending.
- Separate venv — isolating Playwright from the main Google API environment prevents dependency conflicts and allows for easy rollback or version-pinning.
- Sync API over async — while Playwright's async API is performant, the cron-scheduled nature of this task doesn't require concurrent page operations. Sync code is easier to reason about in a scraper context.
- Playwright over Puppeteer — Python ecosystem; Puppeteer is Node.js only.
What's Next
- Debug the login timeout: run gmb_login.py with headless=False to see what page is actually being served at the 30-second mark
- Add TOTP support: if MFA is enabled, integrate a TOTP library to handle time-based one-time passwords
- Implement retry logic: exponential backoff for Cloudflare challenges
- Test with a secondary account: isolate whether the blocker is account-specific or environmental
- Wire into the warm lead
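Two of the follow-up items above, TOTP support and retry with exponential backoff, are small enough to sketch now. Neither exists in the scripts yet. The TOTP function is a standard-library implementation of RFC 6238 (HMAC-SHA1, 30-second step) instead of the third-party library the plan mentions, and the function names, retry count, and delays are all illustrative assumptions.

```python
import base64
import hashlib
import hmac
import random
import struct
import time


def totp(secret_b32, at=None, step=30, digits=6):
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) from a base32 secret."""
    secret = secret_b32.replace(" ", "").upper()
    secret += "=" * (-len(secret) % 8)  # restore base32 padding if stripped
    key = base64.b32decode(secret)
    counter = int((time.time() if at is None else at) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def retry_with_backoff(action, attempts=4, base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter avoids retrying on a fixed, easily fingerprinted schedule
            sleep(delay + random.uniform(0, delay / 2))
```

With the defaults, the waits between attempts start at 2 to 3 seconds and double each time. Whether retrying helps at all depends on what the debugging step reveals: backoff is reasonable for rate limiting, but it will not get past an interactive challenge or an MFA prompt.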