Building a GA4 Data Pipeline and Audit Framework: Automated Traffic Analysis with Orchestrator Integration
Over the last development session, we implemented an end-to-end Google Analytics 4 data collection audit, established programmatic API access, and integrated the findings into our orchestrator-driven reporting system. This post covers the technical architecture, infrastructure decisions, and lessons learned from making GA data actionable at scale.
The Problem: GA Data Silos and Missing Coverage
We had Google Analytics 4 properties instrumented across multiple domains, but no systematic way to:
- Audit which pages and platforms actually have GA tracking codes deployed
- Pull traffic data programmatically for the last 30 days
- Feed analytics insights into our dashboard and reporting workflows
- Correlate traffic patterns with email campaign performance
Manual spot-checks were fragile. We needed an automated audit pipeline that could run continuously and surface gaps as actionable dashboard cards.
Architecture: Three-Layer Pipeline
Layer 1: Code Audit (Static Analysis)
We built a scanner that crawls all HTML files across our platforms and checks for GA measurement IDs (a minimal sketch follows the list). The scanner:
- Walks directory trees across `/Users/cb/Documents/repos/` subdirectories for each domain
- Parses HTML and searches for `<script>` tags containing GA measurement IDs (format: `G-XXXXXXXXXX`)
- Logs which pages are instrumented and which have no GA tracking
- Outputs a coverage report by domain and site section
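A minimal sketch of what that scanner could look like; the report shape, filename filter, and the "measurement ID anywhere near a script tag" heuristic are our simplifications, not the production scanner:

```python
import os
import re
from pathlib import Path

# GA4 measurement IDs, per the format above: G- followed by 10 alphanumerics.
GA_ID = re.compile(r"G-[A-Z0-9]{10}")
ROOT = Path("/Users/cb/Documents/repos")

def audit_ga_coverage(root: Path = ROOT) -> dict[str, dict[str, list[str]]]:
    """Walk each domain's subdirectory and record which HTML files carry a GA tag."""
    report: dict[str, dict[str, list[str]]] = {}
    for domain_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        tagged, untagged = [], []
        for dirpath, _dirnames, filenames in os.walk(domain_dir):
            for name in filenames:
                if not name.endswith((".html", ".htm")):
                    continue
                path = Path(dirpath) / name
                text = path.read_text(errors="ignore")
                # Cheap static check: a measurement ID and a <script> tag both present.
                if "<script" in text and GA_ID.search(text):
                    tagged.append(str(path))
                else:
                    untagged.append(str(path))
        report[domain_dir.name] = {"tagged": tagged, "untagged": untagged}
    return report
```

Because it's pure file I/O, this runs in seconds across all domains and needs no network access or credentials.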
Why this approach? It's deterministic, runs offline, and catches missing codes before users encounter them. We don't need to hit the GA API; we just need to verify the code exists in the source.
Layer 2: Traffic Data Collection (GA4 Data API)
Once the code audit confirmed coverage, we established programmatic access to GA4 using service account authentication:
- Service Account Setup: Created a Google Cloud service account in the same GCP project as our GA4 properties. The account was granted the `Editor` role on the GA4 property.
- OAuth Credential Chain: Generated a service account key (JSON format), stored securely outside version control and read at runtime by `/Users/cb/Documents/repos/tools/reauth_ga.py`
- GA Data API Access: Enabled the Google Analytics Data API v1 in the GCP Console for the project, then granted the service account access via the GA Admin console (Analytics Account Settings > User Management)
The authentication flow in `reauth_ga.py` follows the standard OAuth 2.0 Service Account pattern:
1. Load service account credentials from JSON key file
2. Request access token using the service account private key
3. Use token to authenticate requests to GA Data API
4. Fetch data for property ID (extracted from GA Admin dashboard screenshots)
5. Cache token in memory; refresh on expiry
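Here is a minimal sketch of that flow using the official `google-analytics-data` Python client; the key path and property ID are placeholders, and the client library handles token acquisition, in-memory caching, and refresh-on-expiry (steps 2 and 5) internally:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)
from google.oauth2 import service_account

# Placeholder path and property ID: substitute the real key file and GA4 property.
creds = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json",
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
client = BetaAnalyticsDataClient(credentials=creds)

# 30-day daily traffic pull, matching the audit window.
request = RunReportRequest(
    property="properties/123456789",
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
    dimensions=[Dimension(name="date")],
    metrics=[Metric(name="activeUsers"), Metric(name="sessions")],
)
for row in client.run_report(request).rows:
    print(row.dimension_values[0].value,
          row.metric_values[0].value,
          row.metric_values[1].value)
```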
Why service accounts? They're ideal for background jobs—no user interaction needed, no token expiry interruptions, and permissions are locked to a single service account we control.
Layer 3: Orchestrator Integration (Report Generation)
Rather than building custom dashboards for every metric, we delegated report generation to our existing orchestrator system. The orchestrator:
- Receives a structured brief with GA property IDs, date ranges, and metrics to calculate
- Calls `reauth_ga.py` to fetch 30-day traffic data
- Queries the Constant Contact API for active email campaign metadata
- Correlates email send dates with traffic spikes (see the sketch after this list)
- Generates findings and recommendations
- Creates a kanban card on the progress dashboard with results
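As an illustration of the correlation step, here is one hedged way to flag send-day spikes; the trailing-mean baseline and the 1.5x threshold are our assumptions, not the orchestrator's actual heuristic:

```python
from datetime import date, timedelta
from statistics import mean

def spike_days(daily_sessions: dict[date, int],
               send_dates: set[date],
               window: int = 7,
               threshold: float = 1.5) -> list[date]:
    """Flag campaign send dates whose sessions exceed the trailing-window mean.

    threshold=1.5 (50% above baseline) is an illustrative cutoff,
    not a tuned value from the production orchestrator.
    """
    spikes = []
    for day in sorted(send_dates):
        baseline = [daily_sessions[day - timedelta(days=i)]
                    for i in range(1, window + 1)
                    if day - timedelta(days=i) in daily_sessions]
        if baseline and daily_sessions.get(day, 0) > threshold * mean(baseline):
            spikes.append(day)
    return spikes
```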
The orchestrator output lands on https://progress.queenofsandiego.com/ as card t-31aa2593, accessible via the deep-link format: https://progress.queenofsandiego.com/#card-t-31aa2593
Key Infrastructure Decisions
Why GA Data API Instead of GA4 Export to BigQuery?
BigQuery exports are powerful but have latency: data lands 24-48 hours after collection. For a monthly audit that's fine, but for near-real-time dashboards the Data API returns data within 24 hours and with lower operational overhead. BigQuery would also require maintaining GCP datasets, schemas, and query optimization. The Data API is simpler at our scale.
Service Account vs. OAuth User Flow
We initially considered the OAuth 2.0 Authorization Code flow (where a user grants permission), but that requires user interaction and a token-refresh UI. For an automated pipeline, a service account removes the human friction and is more resilient.
Storing Credentials Safely
The service account JSON key is stored outside version control in a secure location. The Python script reads it at runtime. Never commit credentials to git, even in private repos. (This is non-negotiable.)
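A small sketch of that runtime read, assuming the key path is supplied via a `GA_SA_KEY_PATH` environment variable (the variable name is our illustration, not necessarily what `reauth_ga.py` uses):

```python
import os
from google.oauth2 import service_account

# GA_SA_KEY_PATH is a hypothetical variable name; the key file itself lives
# outside the repo and is never committed to version control.
key_path = os.environ.get("GA_SA_KEY_PATH")
if not key_path:
    raise RuntimeError("GA_SA_KEY_PATH is not set; refusing to start.")

creds = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
```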
Dashboard Deep Links
The progress dashboard uses hash-based routing (#card-{id} format). This lets us generate shareable links to specific findings without requiring a backend redirect. The orchestrator emits card IDs; the frontend hash router handles the rest.
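Emitting the link is then a pure string operation; a trivial sketch (the helper name is ours):

```python
DASHBOARD_URL = "https://progress.queenofsandiego.com"

def card_deep_link(card_id: str) -> str:
    """Build a shareable, bookmarkable link to a specific finding card."""
    return f"{DASHBOARD_URL}/#card-{card_id}"

# e.g. card_deep_link("t-31aa2593")
# -> https://progress.queenofsandiego.com/#card-t-31aa2593
```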
Operational Findings: Three Urgent Items
The audit surfaced three action items:
- Mother's Day Email Blast (4 days out): Campaign approved but not sent. Email template and contact list are ready; the blast script at `/Users/cb/Documents/repos/tools/` awaits execution with the contact CSV path and campaign-log dedup checks.
- Paul Simon Promotional Blast (6 days out): Proof email needed by May 12. Template exists; needs review before sending to Constant Contact.
- GA Data API Access (immediate fix): The service account had no permissions on the GA4 property. Fix: grant the service account access in the GA Admin console (3 minutes). This unblocks all programmatic traffic pulls.
Technical Takeaways
- Audit Infrastructure as Code: Static analysis of HTML files is fast and deterministic. Integrate it into CI/CD to catch missing GA codes at deploy time.
- Service Accounts for Background Jobs: No user interaction, stable credentials, and audit trails in GCP logs.
- Orchestrator as a Hub: Don't build ten custom dashboards; delegate multi-step workflows to a centralized orchestrator that outputs standardized cards.
- Hash Routing for Deep Links: Shareable, bookmarkable links to specific findings without backend redirects.
- Correlated Data Narratives: Traffic spikes are meaningful only when linked to campaigns. The orchestrator finds these correlations automatically.
What's Next
Immediate priorities:
- Grant service account GA4 property access (3-minute fix)
- Re-run the orchestrator to pull 30-day traffic data and generate traffic recommendations
- Review and approve the Mother's Day blast; execute if campaign timing is still valid
- Prepare Paul Simon proof email
Medium-term:
- Add GA code audit to CI/CD