Building a Real-Time Analytics Pipeline: GA4 Data Aggregation, Deep-Link Architecture, and Orchestrator-Driven Reporting
Over the past development session, we built out a comprehensive analytics infrastructure that bridges Google Analytics 4 data collection, real-time dashboard reporting, and automated campaign orchestration across multiple platforms. This post covers the technical architecture, infrastructure decisions, and the patterns we used to make analytics actionable without manual intervention.
The Problem: Fragmented Analytics and Invisible Traffic Patterns
The Queen of San Diego ecosystem spans multiple domains (queenofsandiego.com, sailjada.com, burialsatsea.com, salejada.com, and dangerouscentaur.com) with no unified view of traffic patterns, campaign performance, or operational gaps. GA4 property IDs were scattered across repos, email campaigns lacked visibility into scheduling status, and there was no programmatic way to aggregate 30-day traffic data for reporting.
What We Built: A Three-Layer Analytics Stack
Layer 1: GA4 Property Discovery and Unified Data Pull
We created /Users/cb/Documents/repos/tools/reauth_ga.py to handle OAuth credential refresh and service account authentication for the Google Analytics Data API. The script follows a standard OAuth 2.0 pattern with refresh token handling:
```python
# Pattern: read the existing client secret, refresh the token, and request
# a new access token. This avoids re-authenticating on every run and keeps
# credentials in local secure storage. Service-account auth skips the
# refresh dance entirely:
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    '/path/to/service-account.json',
    scopes=['https://www.googleapis.com/auth/analytics.readonly'],
)
```
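The 30-day pull itself is a standard GA4 Data API `runReport` call. A minimal sketch of building the request, assuming the REST endpoint is used directly; the metrics and dimensions shown are illustrative placeholders, not the exact fields the audit pulls:

```python
GA4_DATA_API = "https://analyticsdata.googleapis.com/v1beta"

def build_30day_report(property_id: str) -> tuple[str, dict]:
    """Return the runReport URL and payload for a 30-day traffic pull.

    The metrics and dimensions here are illustrative; the real report
    requests whatever fields the dashboard cards need.
    """
    url = f"{GA4_DATA_API}/properties/{property_id}:runReport"
    payload = {
        # Relative dates keep the window rolling without recomputing dates.
        "dateRanges": [{"startDate": "30daysAgo", "endDate": "today"}],
        "dimensions": [{"name": "date"}],
        "metrics": [{"name": "activeUsers"}, {"name": "sessions"}],
    }
    return url, payload
```

The payload is POSTed with the refreshed credentials attached as a bearer token; because the dates are relative, the same request works on every scheduled run.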
The discovery phase mapped all GA4 property numeric IDs to their corresponding domains. Rather than hardcoding property IDs in multiple repos, we centralized them and queried the GA4 Admin API to list properties under each account. This ensures future domain additions don't require code changes—just a GA4 Admin console update.
Why this approach: GA4 property IDs are opaque numeric identifiers. The standard GA4 reporting URL shows the property ID in the query parameter (?property=12345678), but it's easy to mismap properties to domains if property names drift. By querying the API directly and cross-referencing with site configuration files, we created a single source of truth.
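The cross-referencing step reduces to a set comparison between what the Admin API reports and what the central site config expects. A sketch of that check, with illustrative dict shapes rather than the real config format:

```python
def diff_property_mappings(api_properties: dict[str, str],
                           site_config: dict[str, str]) -> dict[str, list[str]]:
    """Compare domain -> property-ID maps from the GA4 Admin API and from
    the central site config, and report any drift between the two.
    """
    findings: dict[str, list[str]] = {"missing": [], "mismatched": [], "orphaned": []}
    for domain, expected_id in site_config.items():
        actual_id = api_properties.get(domain)
        if actual_id is None:
            findings["missing"].append(domain)      # configured but no GA4 property
        elif actual_id != expected_id:
            findings["mismatched"].append(domain)   # property names/IDs drifted
    for domain in api_properties:
        if domain not in site_config:
            findings["orphaned"].append(domain)     # tracked but not configured
    return findings
```

Running this on every audit is what catches the drift problem described above: a renamed property or a copy-pasted ID shows up as a mismatch instead of silently feeding the wrong site's numbers into a report.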
Layer 2: Real-Time Dashboard with Deep-Link Hash Navigation
The progress dashboard at progress.queenofsandiego.com uses hash-based routing to enable direct linking to specific cards. The deep-link format is:
https://progress.queenofsandiego.com/#card-{id}
For example, the GA audit report card lives at https://progress.queenofsandiego.com/#card-t-31aa2593. The dashboard HTML loads a zero-build hash router that maps URL fragments to card element IDs and scrolls/focuses the target card on page load.
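On the orchestrator side, generating those links is a one-liner. A sketch (the base URL matches the dashboard above; the helper name is hypothetical):

```python
DASHBOARD_URL = "https://progress.queenofsandiego.com"

def card_deep_link(card_id: str) -> str:
    """Build a shareable deep link to a dashboard card via the hash router."""
    return f"{DASHBOARD_URL}/#card-{card_id}"
```

Any tool that creates a card can embed this link in its completion notice, so findings are always one click away.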
Why hash routing instead of path-based? The dashboard is a static single-page app hosted on S3 + CloudFront. Path-based routing would require either client-side rewrites or CloudFront Lambda@Edge functions. Hash routing works immediately with zero configuration—the anchor is handled entirely by the browser, and the dashboard JavaScript can respond to window.location.hash changes without a server round-trip.
Layer 3: Orchestrator-Driven Report Generation
We created /Users/cb/Documents/repos/tools/preflight_check.py as a multi-stage audit pipeline. The orchestrator:
- Scans all HTML files in the repos for GA tracking code (gtag.js snippets, property ID regex patterns) to identify coverage gaps
- Pulls 30-day traffic data from the GA4 Data API using the discovered property IDs
- Queries Constant Contact for scheduled email campaigns and delivery status
- Generates findings grouped by site and severity (e.g., "dangerouscentaur.com has no GA tracking" vs. "Mother's Day campaign unapproved with 4 days to event")
- Creates a kanban card on the dashboard with all five findings sections
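The coverage scan in the first step can be a simple regex pass over each HTML file. A sketch, assuming the sites use the standard gtag.js loader with G-prefixed measurement IDs (the real script also matches other ID formats):

```python
import re
from pathlib import Path

# Matches the gtag.js loader script and GA4 measurement IDs (G-XXXXXXXXXX).
GTAG_LOADER = re.compile(r"googletagmanager\.com/gtag/js")
MEASUREMENT_ID = re.compile(r"\bG-[A-Z0-9]{6,12}\b")

def scan_html(html: str) -> dict:
    """Report whether a page loads gtag.js and which measurement IDs it references."""
    return {
        "has_gtag": bool(GTAG_LOADER.search(html)),
        "measurement_ids": sorted(set(MEASUREMENT_ID.findall(html))),
    }

def scan_repo(root: str) -> dict[str, dict]:
    """Scan every HTML file under a repo root; pages with has_gtag=False are coverage gaps."""
    return {str(p): scan_html(p.read_text(errors="ignore"))
            for p in Path(root).rglob("*.html")}
```

Pages that load gtag.js but reference an ID missing from the central config surface as a separate finding, since that usually means a stale or copy-pasted snippet.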
The orchestrator runs as a background task and notifies you when complete, rather than dumping JSON to stdout. This keeps findings discoverable and actionable on the central dashboard.
Infrastructure: Inventory of GA Properties and Search Console Verification
We mapped the following GA4 properties:
- queenofsandiego.com → Property ID GA-QOS-MAIN
- sailjada.com → Property ID GA-SAIL-JADA
- burialsatsea.com → Property ID GA-BURIAL
- salejada.com → Property ID GA-SALE-JADA
- dangerouscentaur.com → CloudFront distribution with origin in S3 bucket dangerouscentaur-origin
For dangerouscentaur.com, we:
- Located the CloudFront distribution ID from AWS console
- Identified the S3 origin bucket (dangerouscentaur-origin)
- Generated a Search Console HTML verification file
- Uploaded the verification file to the S3 origin at the root path
- Submitted the domain + sitemap to Google Search Console for indexing
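The verification upload itself is a single S3 put. A sketch using boto3 against the bucket named above; the verification file name is a placeholder for the token-style name Search Console hands out:

```python
def verification_upload_args(bucket: str, filename: str, token_html: str) -> dict:
    """Build the put_object arguments for a Search Console verification file.

    Serving the file as text/html from the bucket root is what Google's
    crawler checks; CloudFront passes the request through to the origin.
    """
    return {
        "Bucket": bucket,
        "Key": filename,               # must sit at the site root
        "Body": token_html.encode(),
        "ContentType": "text/html",
        "CacheControl": "no-cache",    # let Google always fetch the live file
    }

# Hypothetical usage with boto3 (not run here):
# import boto3
# boto3.client("s3").put_object(**verification_upload_args(
#     "dangerouscentaur-origin", "google1234abcd.html", token_html))
```

Setting no-cache on the object avoids a stale CloudFront copy failing verification after the file is replaced.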
This ensures dangerouscentaur.com traffic is tracked in GA4 and discoverable in Google Search results.
Key Decisions and Tradeoffs
1. Centralized Analytics Configuration vs. Per-Site Config
We chose to store GA property mappings in a central file rather than per-site. This means one change in one place updates reporting for all domains. Tradeoff: new sites require an edit to the central config, but this is intentional—it prevents orphaned GA properties from being tracked without oversight.
2. Service Account OAuth for Programmatic Access
We use a service account with analytics.readonly scope, not user OAuth. This works for a small team where one person (or a CI/CD job) can manage the credentials. Tradeoff: no per-user audit trail in GA4. For a larger org, you'd federate access and use user-based OAuth with token refreshes.
3. Dashboard as Source of Truth for Findings
Rather than emailing reports or posting to Slack, all findings land as kanban cards on the progress dashboard. This creates a searchable, linkable, persistent record. Tradeoff: requires checking the dashboard instead of email notifications. Mitigation: we log task completion summaries to console as well.
Technical Patterns Used
- Single Source of Truth (SSOT): GA property IDs, campaign status, and audit findings all live on the progress dashboard, not spread across emails/Slack/docs
- Orchestrator Pattern: A central task spawns the scan, data-pull, and campaign-query subtasks, then aggregates their findings into a single dashboard card rather than leaving partial results scattered across tools