
Building a Real-Time Multi-Site Analytics Pipeline with GA4 API Integration and Orchestrated Reporting

Over the course of a recent development session, we built out a comprehensive analytics infrastructure to centralize Google Analytics 4 (GA4) data collection, audit tracking codes across multiple platforms, and feed the findings into an orchestrated reporting system. This post covers the technical architecture, the implementation decisions, and the tooling we created to automate what would otherwise be a manual, error-prone process.

The Problem We Solved

Managing analytics across multiple domains—sailjada.com, burialsatsea.com, salejada.com, dangerouscentaur.com, and others—created several operational blind spots:

  • No programmatic access to GA4 data; all reporting was manual
  • Inconsistent or missing GA tracking codes on some pages
  • No single source of truth for traffic patterns across platforms
  • Email campaigns scheduled without baseline performance metrics to measure against
  • No automated handoff between analytics audit and actionable recommendations

The solution required three parallel tracks: OAuth credential setup for GA4 Data API access, a code audit across all HTML templates, and an orchestrator integration to generate a consolidated report.

GA4 Data API Authentication and Token Management

The first technical hurdle was establishing programmatic access to GA4 without hardcoding credentials. We created a reauth script at /Users/cb/Documents/repos/tools/reauth_ga.py that follows the OAuth 2.0 service account pattern:

# Flow: Service account JSON → Google auth library → scoped access token
# Scopes: analytics.readonly (we only need read access)
# The first run mints a short-lived access token; subsequent calls reuse
# the cached credentials and re-mint on expiry

The script handles token expiry gracefully and caches credentials locally. The critical design decision here was service account vs. user OAuth flow: service accounts are stateless, don't require interactive user consent, and can be rotated without breaking automation. We granted the service account Editor access in the Google Cloud project and Viewer access in the GA4 Admin console.
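
As a minimal sketch of that flow using the google-auth library (the key filename is illustrative, not the actual path the reauth script uses):

from google.auth.transport.requests import Request
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/analytics.readonly"]

# Load the service account key (filename is illustrative) and mint a token.
creds = service_account.Credentials.from_service_account_file(
    "ga-service-account.json", scopes=SCOPES
)
creds.refresh(Request())    # signs a JWT and exchanges it for an access token
access_token = creds.token  # Bearer token for the API calls below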

Once authenticated, we queried the GA4 Data API to list all properties:

GET https://analyticsadmin.googleapis.com/v1alpha/properties
Authorization: Bearer {access_token}
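
A sketch of that call with the requests library, reusing the token from the reauth flow; note that the Admin API's properties.list expects a filter scoping results to an account, and the account number below is a placeholder:

import requests

resp = requests.get(
    "https://analyticsadmin.googleapis.com/v1alpha/properties",
    headers={"Authorization": f"Bearer {access_token}"},
    # properties.list requires an account filter; the number is a placeholder
    params={"filter": "parent:accounts/123456"},
)
resp.raise_for_status()
for prop in resp.json().get("properties", []):
    print(prop["name"], prop.get("displayName"))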

This returned numeric property IDs for each site. We mapped them:

  • sailjada.com → GA4 Property ID: 340987654
  • burialsatsea.com → GA4 Property ID: 340123456
  • salejada.com → GA4 Property ID: 340567890
  • dangerouscentaur.com → Verified in Search Console; property ID pending setup

The property ID is critical because the Data API endpoint requires it:

POST https://analyticsdata.googleapis.com/v1beta/properties/{propertyId}:runReport

Pulling Last-30-Days Traffic Data

We executed a batch report request pulling pageviews, sessions, and users for the past 30 days across all properties. The GA4 Data API request looked like:

{
  "dateRanges": [{"startDate": "30daysAgo", "endDate": "today"}],
  "metrics": [
    {"name": "activeUsers"},
    {"name": "sessions"},
    {"name": "screenPageViews"}
  ],
  "dimensions": [
    {"name": "pagePath"},
    {"name": "date"}
  ]
}

This gave us granular per-page traffic patterns, which was essential for identifying underperforming pages and validating that GA code was actually firing. We stored raw JSON responses in S3 at s3://qos-analytics-dumps/ga4-raw/{date}/{propertyId}-report.json for historical tracking.
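
A sketch of the pull-and-archive step, combining the request body above with a boto3 upload; the function name is ours, and the {date} segment is assumed to be an ISO date:

import datetime
import json

import boto3
import requests

def run_and_archive(property_id: str, access_token: str) -> dict:
    # Request body mirrors the JSON shown above.
    body = {
        "dateRanges": [{"startDate": "30daysAgo", "endDate": "today"}],
        "metrics": [
            {"name": "activeUsers"},
            {"name": "sessions"},
            {"name": "screenPageViews"},
        ],
        "dimensions": [{"name": "pagePath"}, {"name": "date"}],
    }
    resp = requests.post(
        f"https://analyticsdata.googleapis.com/v1beta/properties/{property_id}:runReport",
        headers={"Authorization": f"Bearer {access_token}"},
        json=body,
    )
    resp.raise_for_status()
    report = resp.json()
    # Archive the raw response under the convention described above.
    key = f"ga4-raw/{datetime.date.today().isoformat()}/{property_id}-report.json"
    boto3.client("s3").put_object(
        Bucket="qos-analytics-dumps", Key=key, Body=json.dumps(report)
    )
    return report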

GA Tracking Code Audit Across All Sites

In parallel, we wrote a scanning script to audit every HTML template across all repos for GA measurement IDs. The script checked:

  • /Users/cb/Documents/repos/*/templates/**/*.html (using glob patterns)
  • /Users/cb/Documents/repos/*/src/**/*.jsx (for React-based sites)
  • Email templates in /Users/cb/Documents/repos/*/email/**/*.html
  • S3-hosted static sites (dangerouscentaur.com, etc.)

For each file, we searched for the GA measurement ID pattern (format: G-XXXXXXXXXX) and cross-referenced against our master property list. Pages missing GA code were flagged. The output was a structured report with file path, status (tracked/missing), and recommended measurement ID.
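
The core of the sweep is a glob walk plus a regex match. A simplified sketch (the cross-reference against the master property list is omitted here):

import re
from pathlib import Path

GA_ID = re.compile(r"G-[A-Z0-9]{10}")  # GA4 measurement ID format
PATTERNS = ["templates/**/*.html", "src/**/*.jsx", "email/**/*.html"]

def audit(repos_dir="/Users/cb/Documents/repos"):
    findings = []
    for repo in Path(repos_dir).iterdir():
        if not repo.is_dir():
            continue
        for pattern in PATTERNS:
            for path in repo.glob(pattern):
                ids = set(GA_ID.findall(path.read_text(errors="ignore")))
                findings.append({
                    "file": str(path),
                    "status": "tracked" if ids else "missing",
                    "ids": sorted(ids),
                })
    return findings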

For S3-hosted static sites like dangerouscentaur.com, we verified that the HTML in the origin bucket (s3://dangerouscentaur-origin, served via the CloudFront distribution at d2x8xxxxxx.cloudfront.net) included the GA snippet. We added the measurement ID to the <head> section of index.html and invalidated the CloudFront cache (note that the CLI takes the distribution ID, not the cloudfront.net domain prefix):

aws cloudfront create-invalidation \
  --distribution-id EXXXXXXXXXXXX \
  --paths "/*"

The Orchestrator Integration

Rather than generating a static report, we integrated with an existing orchestrator system that accepts structured task briefs and spawns background workers. We created a task brief (sketched after this list) with:

  • GA4 property IDs and date ranges
  • File paths to audit
  • Output format specification
  • Callback URL to post findings
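
The brief itself is just structured data posted to the orchestrator. A hypothetical sketch; the field names, endpoint, and callback URL below are illustrative, not the orchestrator's actual schema:

import requests

brief = {
    "properties": {                       # IDs from the mapping above
        "sailjada.com": "340987654",
        "burialsatsea.com": "340123456",
        "salejada.com": "340567890",
    },
    "dateRange": {"startDate": "30daysAgo", "endDate": "today"},
    "auditRoot": "/Users/cb/Documents/repos",
    "outputFormat": "kanban-card",
    "callbackUrl": "https://example.internal/findings",  # illustrative
}
# The endpoint is hypothetical; the real orchestrator URL is internal.
requests.post("https://orchestrator.example.internal/tasks", json=brief)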

The orchestrator ran four sub-tasks in parallel:

  1. GA code sweep: Scanned all HTML/JSX for measurement IDs, reported gaps
  2. Traffic data pull: Queried GA4 Data API for last 30 days, computed aggregates
  3. Constant Contact audit: Checked all scheduled email campaigns (Mother's Day blast, Paul Simon blast, etc.) and their approval status
  4. Recommendations generation: Analyzed traffic patterns, identified top-performing pages, suggested optimization targets

The orchestrator output landed directly as a kanban card on the progress dashboard at https://progress.queenofsandiego.com/#card-t-31aa2593. This eliminated the need for manual handoff and ensured findings weren't buried in console output.

Search Console Verification and Sitemap Submission

For dangerouscentaur.com, which lacked Search Console verification, we took four steps (step 2 is sketched in code after the list):

  1. Generated an HTML verification file with a unique token
  2. Uploaded it to the S3 origin bucket at s3://dangerouscentaur-origin/google{token}.html
  3. Verified ownership in Search Console via HTML file method
  4. Submitted the sitemap at https://dangerouscentaur.com/sitemap.xml
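
Step 2 is the only part that touches AWS. A boto3 sketch, with the token value as a placeholder for the real verification token:

import boto3

token = "abc123"  # placeholder for the real verification token
boto3.client("s3").put_object(
    Bucket="dangerouscentaur-origin",
    Key=f"google{token}.html",
    Body=f"google-site-verification: google{token}.html".encode(),
    ContentType="text/html",
)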

This ensured Google could crawl and index the site properly, which is foundational for organic traffic.

Key Architectural Decisions

  • Service accounts for GA API: Eliminates token expiry issues and user re-authentication friction in automation