Automating Monthly Revenue Reporting: Building a Google Sheets Pipeline for Charter Operations
Charter operations generate revenue across multiple booking channels and crew configurations. Reconciling these sources into a single, auditable monthly statement requires pulling data from Gmail booking confirmations, DynamoDB charter records, and manual trip sheets—then synthesizing them into a formatted Excel workbook that matches accounting expectations. This post covers the infrastructure and automation patterns we built to eliminate manual data entry for Sheraton monthly reports.
The Problem: Multi-Source Revenue Reconciliation
Prior to this work, monthly statements were assembled manually:
- Gmail searches for booking confirmations (from Jennifer Sanderson, the booking contact)
- Manual lookups in the crew-dispatch DynamoDB table for charter details
- Cross-reference with captain trip sheets (stored as Google Sheets) for revenue figures
- Reconstruction of the prior month's statement format to match
- Manual Excel workbook creation with proper formula structure and formatting
This process took 2–3 hours and introduced transcription errors. The goal was to automate the data-gathering phase while preserving the formatted output that accounting and the charter company required.
Architecture: Multi-Service Data Pipeline
We designed a four-stage pipeline running on an EC2 instance with orchestration scripts in Python:
Stage 1: Gmail API Integration for Booking Discovery
The booking contact (Jennifer Sanderson) receives all charter confirmations. We query Gmail using the Gmail API to extract booking records:
gmail_contacts.py # Search Gmail for bookings from Jennifer Sanderson
gmail_diag.py # Diagnose token refresh and API connectivity
gmail_jennifer.py # Query Jennifer's email for booking summaries
Why Gmail API over email forwarding: Email forwarding to a structured inbox is fragile; a single misconfigured rule loses data. The Gmail API lets us query the source of truth (Jennifer's mailbox) on demand, with full search syntax (date ranges, sender filters, subject keywords). Queries are idempotent—run the same search twice and get the same results.
Token management: Google OAuth tokens expire after 1 hour. We patched reauth_google.py on the EC2 instance to handle refresh-token rotation transparently, so long-running jobs don't stall mid-execution. The patched script validates the token before each API call and refreshes silently if needed.
Stage 2: DynamoDB Record Retrieval
Charter metadata lives in the crew-dispatch DynamoDB table. Once we identify booking dates from Gmail, we query DynamoDB for the full charter record:
- Table:
crew-dispatch(us-east-1 and us-west-2 replicas) - Schema: Charter records with guest names, boat assignments, crew rosters, and total amounts
- Access pattern: Query by date range (e.g., all charters in May 2024)
build_sheet.py # Fetch charter records from DynamoDB and structure them
sheet_inspect.py # Validate schema and inspect record structure
sheet_diag.py # Diagnose DynamoDB connectivity
Why DynamoDB over a relational DB: The charter operations data was already stored in DynamoDB for high-write throughput (crew availability updates, chat logs, dispatch updates). Introducing a separate SQL database would require a secondary ETL pipeline. Querying the existing DynamoDB tables is simpler and doesn't duplicate data.
Stage 3: Captain Trip Sheet Enrichment
Captains maintain individual Google Sheets with detailed trip records (guest names, revenue figures, fuel costs, tips). These sheets live in the JADA Business folder on Google Drive. For each charter, we cross-reference the crew captain's sheet to extract verified revenue:
build_may_tab.py # Read captain trip sheets and extract revenue figures
Reportable revenue rule: A charter's revenue is "reportable" only if the captain's trip sheet includes a dollar figure in the designated revenue cell. This rule ensures we don't report estimated or pending revenue. Inspecting prior April reports showed us the exact cell placement and formula structure (e.g., row totals, footnote placement).
Stage 4: Workbook Generation and Upload
We generate a multi-tab Excel workbook using openpyxl (since the environment may not have xlsxwriter):
build_sheet.py # Generate .xlsx with Template tab and month-specific tabs
upload.py # Upload completed workbook to Google Drive
The workbook structure matches accounting expectations:
- Template tab: Read-only reference showing column headers and formula patterns
- Month tab (e.g., May): Populated with charter records, guest names, revenue figures, and row totals
- Number formatting: Currency columns formatted as USD, dates as MM/DD/YYYY
- Footnotes: Preserved from prior months (e.g., "Revenue per captain trip sheet")
Infrastructure and Deployment
EC2 host: ubuntu@34.239.233.28 (us-east-1 region, t3.medium instance)
Script locations:
/home/ubuntu/repos/queenofsandiego/— Main site and JADA ops scripts/home/ubuntu/repos/.secrets/— Google service account credentials (locked to 0700 permissions)/tmp/— Transient working files (scripts, intermediate .xlsx files)
Credentials storage: Google OAuth credentials (service account JSON and refresh tokens) are stored in /home/ubuntu/repos/.secrets/ with restrictive filesystem permissions. The SSH key used for EC2 access is stored locally as jada-key.pem in the user's SSH config, not committed to version control.
Data flow:
- Local machine: SSH into EC2, run
build_sheet.pyto pull Gmail, DynamoDB, and Google Sheets data - EC2: Script writes intermediate .xlsx to
/tmp/, uploads to Google Drive via Google Sheets API - Local machine:
scpthe .xlsx from EC2 to Mac for inspection and final delivery
Key Decisions and Trade-Offs
1. Why not fully automate delivery? The final workbook is still reviewed by hand before sending to the charter company. Automating the last-mile upload to the accounting system would require additional integration and approval workflow. Instead, we automate the data-gathering phase (80% of the work) and leave the final review and send step to a human.
2. Why pull captain trip sheets at report time instead of pre-populating? Captains update their trip sheets throughout the month, adding revenue figures and corrections. Querying at report time ensures we have the latest figures. Pre-computing a monthly snapshot early in the month would miss late updates.
3. Why use openpyxl instead of Google Sheets API directly? Google Sheets API is powerful but verbose for complex formatting (merged cells, currency formatting, multiple worksh