§ · Case study Federal · undisclosed MLOps · Geospatial ML Production Engineering

99.6% ML inference cost reduction for global geospatial workloads.

How five compounding engineering defects in a research-origin codebase were generating $26,000/month in avoidable cloud spend — and how a ground-up production rebuild brought that to $90. No new algorithms. No changed requirements.

99.6% · Cost reduction · $26,200 → $90/mo
111M · Chips processed · full program workload
35× · Throughput gain · from batching alone
50+ · Gov't validations · independent eval cycles
100% · GPU eliminated · A100 → CPU only

§ 01 — Program context and constraints
An inherited notebook. A $26,200/month billing surprise.

This engagement was a federally-funded geospatial intelligence program requiring automated detection of heavy construction activity from multispectral satellite imagery. The scope was planetary: geographic coverage 10–20× the size of Moscow, with imagery revisits every two weeks spanning 5–10 year historical windows per area of interest.

The program's primary evaluation metric was cost per unit of processed area — not accuracy alone, not throughput alone. Cost efficiency was a first-class scored deliverable, validated by the government client across 50+ independent release cycles throughout the program lifecycle. Budget overruns were not recoverable.

The baseline inference system was inherited from a university research partner and delivered as Jupyter notebooks. My mandate was to transform that research artifact into a production-grade pipeline. What the notebooks contained was a set of compounding engineering defects that, left unaddressed, would have made the program economically unviable.

§ 02 — Workload sizing
111 million chips. Every per-chip inefficiency multiplied.

Understanding the cost problem requires understanding the scale. Every per-chip inefficiency multiplied across 111 million chips.

Parameter           | Value        | Derivation
Coverage per run    | ~37,500 km²  | Midpoint of 10–20× Moscow (2,511 km²)
Image resolution    | 2 m/px       | Program specification
Chip size           | 256 × 256 px | = 512 m × 512 m ground footprint
Chips per time step | ~570,000     | 37,500 km² ÷ (0.512 km)²
Revisit cadence     | 26/year      | Every 2 weeks
Historical window   | 7.5 years    | Midpoint of 5–10 year range
Total chips         | ~111 million | 570,000 × 26 × 7.5
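
The totals in the table can be re-derived in a few lines. This is a minimal sketch with illustrative function names; `step_km`, the ground distance between adjacent chip origins, is an assumption and not a parameter the source states. The per-time-step count is sensitive to it: a 0.512 km step (non-overlapping 256 px chips at 2 m/px) gives about 143,000 chips over 37,500 km², while a 0.256 km step gives about 572,000, in line with the table's ~570,000.

```python
def chips_per_time_step(area_km2: float, step_km: float) -> int:
    """Approximate number of chips needed to tile an area on a square grid."""
    return round(area_km2 / (step_km ** 2))


def total_chips(chips_per_step: int, revisits_per_year: int, years: float) -> int:
    """Total chips across every revisit in the historical window."""
    return round(chips_per_step * revisits_per_year * years)


# Assumed step values; neither is confirmed by the source.
non_overlapping = chips_per_time_step(37_500, step_km=0.512)
half_step = chips_per_time_step(37_500, step_km=0.256)
print(non_overlapping, half_step)

# Program total, using the table's rounded per-step figure directly.
print(total_chips(570_000, revisits_per_year=26, years=7.5))
```

Whichever step applies, the total scales linearly with the per-step count, so any per-chip overhead is paid once per chip, per revisit, per year of history.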

The model itself was a mid-sized PyTorch CNN operating on multispectral 4-band imagery at 2m resolution — a pure inference workload with no training in the production loop. The problem was not model complexity. It was everything built around the model.

§ 03 — Baseline architecture and cost
$32/hr GPU instances. Regardless of AOI size.

The baseline system used high-end GPU compute instances — multi-A100 configurations. Cost figures use current AWS on-demand public rates as a conservative proxy; actual contract rates were negotiated. The structural cost drivers and proportional reduction are accurate regardless of the specific rate applied.

# BASELINE ARCHITECTURE (per-sprint)
[ Region A ] ──▶ GPU Instance ($32/hr) ──▶ Jupyter Notebook ──▶ Output
[ Region B ] ──▶ GPU Instance ($32/hr) ──▶ Jupyter Notebook ──▶ Output
[ Region C ] ──▶ GPU Instance ($32/hr) ──▶ Jupyter Notebook ──▶ Output
# same instance class for a 50,000 km² metro AND a 100×100m single site

AOI sizes varied enormously across the program — some regions covered tens of thousands of square kilometers, others covered areas as small as a single 100×100m construction site. Both triggered identical instance configurations. Small AOIs generated near-zero GPU utilization before teardown, billing GPU-hours for a handful of forward passes.

# Baseline monthly cost
  ~20 active regional containers per sprint
× $32.77/hr per instance
× ~20 billed hours per sprint
× 2 sprints/month
= ~$26,200/month
# Effective GPU utilization during billed hours: <20%

§ 04 — Root cause analysis — five defects
Each independently fixable. Together, multiplicative.

Each defect below was independently fixable. Together they were multiplicative: the waste from each one compounded the waste from the others, so the savings from each fix stacked on top of the fixes before it. Addressing all five was necessary to reach the final reduction.

DEFECT 01 · Per-Region GPU Fan-Out with No Size Scaling · Critical

One GPU container per geographic region, irrespective of AOI size. Infrastructure cost was a function of region count, not workload volume. A 50,000 km² metro and a 100×100m single-site monitoring target triggered identical $32/hr instance spin-ups.

# Cost asymmetry example
100×100m AOI: ~38 chips total
Inference time (batch=1): <1 second of GPU compute
Instance billed: 20+ hours at $32.77/hr = $655
# $655 to run 38 forward passes

Fix: Regional workloads consolidated into a single shared inference queue. AOI size determined chip count in the queue, not instance count.
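
One way to picture the consolidation: every AOI contributes work items to a single stream in proportion to its footprint, and the number of workers draining that stream is chosen independently of how many AOIs exist. The sketch below is illustrative only. The AOI tuple layout, the function names, and the non-overlapping 512 m chip grid are assumptions, and it counts chips for a single imaging pass; the production queue implementation is not shown in the source.

```python
import math
from typing import Iterable, Iterator, Tuple

# Hypothetical AOI record: (name, width in meters, height in meters).
AOI = Tuple[str, float, float]


def chips_for_aoi(aoi: AOI, chip_m: float = 512.0) -> int:
    """Chips needed to cover the AOI once, rounding partial chips up."""
    _, width_m, height_m = aoi
    return math.ceil(width_m / chip_m) * math.ceil(height_m / chip_m)


def build_shared_queue(aois: Iterable[AOI],
                       chip_m: float = 512.0) -> Iterator[Tuple[str, int]]:
    """Yield (aoi_name, chip_index) work items from every AOI into one stream."""
    for aoi in aois:
        for chip_index in range(chips_for_aoi(aoi, chip_m)):
            yield (aoi[0], chip_index)


aois = [("single_site", 100.0, 100.0), ("metro", 50_000.0, 40_000.0)]
queue = list(build_shared_queue(aois))
# Queue length tracks total chip volume, not the number of AOIs.
print(len(queue))
```

With this shape, adding a tiny AOI adds a handful of queue entries rather than a new instance.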

DEFECT 02 · Single-Sample Inference — No Batching · Critical

The model ran inference on one 256×256 chip at a time. On a modern GPU, single-sample forward pass time is approximately 8–12ms — dominated by kernel launch latency and host-device memory transfer overhead, not arithmetic throughput. The GPU's parallel compute capacity was almost entirely idle between launches.

# Throughput comparison
Batch 1:  ~100 chips/sec   → 111M chips in ~308 hours
Batch 64: ~3,500 chips/sec → 111M chips in ~8.8 hours
Throughput gain: ~35× from batching alone

On GPU utilization: Single-sample inference on a multi-A100 instance is approximately equivalent to running a desktop GPU at $32/hr. The arithmetic throughput of the hardware, the entire justification for the cost, is never reached. The instance is a very expensive memory bus.

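# Before: single-sample loop
for chip_path in chip_list:
    results.append(model(preprocess(load_chip(chip_path))))

# After: batched queue
import itertools
import torch

def iter_batches(items, batch_size):
    it = iter(items)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch  # lists of up to batch_size items

def batched_inference(chip_queue, model, batch_size=64):
    for batch in iter_batches(chip_queue, batch_size):
        tensors = torch.stack([preprocess(c) for c in batch])
        with torch.no_grad():
            outputs = model(tensors)
        # Yield outside no_grad so grad mode is not left disabled in the caller.
        yield from zip(batch, outputs)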
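
The throughput comparison for this defect converts to wall-clock hours with simple arithmetic. A minimal sketch; the helper name `hours_to_process` is illustrative, and the two rates are the approximate figures quoted in the comparison, not new measurements.

```python
def hours_to_process(total_chips: int, chips_per_second: float) -> float:
    """Wall-clock hours to push total_chips through at a sustained rate."""
    return total_chips / chips_per_second / 3600.0


TOTAL_CHIPS = 111_000_000
single = hours_to_process(TOTAL_CHIPS, 100)     # batch=1 rate from the comparison
batched = hours_to_process(TOTAL_CHIPS, 3_500)  # batch=64 rate from the comparison

print(f"batch=1:  {single:.1f} h")
print(f"batch=64: {batched:.1f} h")
print(f"speedup:  {single / batched:.0f}x")
```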

DEFECT 03 · Band Duplication — Structural Model Corruption · Critical

The model was originally architected for 8-band multispectral input. The actual sensor data was 4-band. Rather than fix the model, the research code duplicated the 4-band tensor along the channel dimension to produce a synthetic 8-band input.

# Research workaround
band_4      = load_imagery(scene)           # shape: (4, H, W)
band_8_fake = torch.cat([band_4, band_4], dim=0)
# bands 0-3 == bands 4-7 — perfectly correlated

Consequences at three levels:

  • Representation quality: first-layer filters received perfectly correlated channel pairs across every training example, degrading spectral representations from initialization.
  • Memory cost: doubled input tensor size on every forward pass.
  • Batch ceiling: compressed maximum viable batch size on any fixed VRAM budget.

8-band (duplicated): 64 × 8 × 256 × 256 × 4 bytes = ~134MB per batch
4-band (correct):    64 × 4 × 256 × 256 × 4 bytes = ~67MB per batch
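
The per-batch figures follow directly from the tensor shape. A minimal sketch, assuming float32 inputs and counting only the input tensor (activations and weights would add to this); the function names are illustrative.

```python
def input_batch_bytes(batch_size: int, bands: int, height: int = 256,
                      width: int = 256, bytes_per_value: int = 4) -> int:
    """Size in bytes of one input batch (float32 by default)."""
    return batch_size * bands * height * width * bytes_per_value


def max_batch_size(budget_bytes: int, bands: int, height: int = 256,
                   width: int = 256, bytes_per_value: int = 4) -> int:
    """Largest batch whose input tensor fits in a given memory budget."""
    return budget_bytes // (bands * height * width * bytes_per_value)


dup = input_batch_bytes(64, bands=8)     # duplicated 8-band input
native = input_batch_bytes(64, bands=4)  # native 4-band input
print(f"{dup / 1e6:.0f} MB vs {native / 1e6:.0f} MB")
```

On a fixed input-memory budget, halving the band count doubles the batch size that fits.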

Fix: Ground-up retrain on native 4-band input. I defined acceptance metrics and assembled the training dataset. The university team executed retraining under those constraints. The architectural decision — retrain rather than patch weights — was mine.

DEFECT 04 · Masked Pixels Included in Tensor Operations · High

Valid imagery chips contained substantial no-data regions — cloud cover, cloud shadow, off-nadir edge artifacts, and acquisition gaps. The pipeline converted these to zero-filled pixels and submitted full tensors. Inference executed on every pixel, including those already flagged invalid by the mask layer.

Average masked fraction per chip: ~35%
Net compute reduction from mask exclusion: ~25–35%

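def build_inference_queue(chip_paths, mask_paths, threshold=0.5):
    """Keep chips whose valid-pixel fraction is at least `threshold`."""
    queue = []
    for chip_path, mask_path in zip(chip_paths, mask_paths):
        mask = load_mask(mask_path)  # assumed convention: 1 = no-data, 0 = valid
        if 1.0 - mask.mean() >= threshold:
            # Load the chip only after it passes the mask check.
            queue.append((load_chip(chip_path), mask))
    return queue  # ~35% smaller queue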

DEFECT 05 · GPU Instance Used for CPU-Bound Raster Preprocessing · High

Image clipping — extracting 256×256 chips from large GeoTIFF source scenes — ran on the GPU instance. This is pure CPU I/O work: file reads, coordinate transforms, pixel extraction via rasterio and GDAL. No GPU instructions are issued. Preprocessing produced sustained near-zero GPU utilization while billing GPU rates for work a CPU-only instance costing 1/50th the price could handle equivalently.

# Preprocessing: ~40–60% of GPU instance wall time
On GPU instance: 0.5 × $32.77/hr × 20 hrs = ~$328/sprint
On CPU instance: 0.5 × $0.68/hr × 20 hrs = ~$6.80/sprint
# ~48× cost premium for identical work

Fix: Preprocessing decoupled to a CPU-only fleet. Airflow DAG enforced stage separation — preprocessing completed before inference instances were triggered. Inference was never idle waiting for chips.
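
Chip extraction is index arithmetic plus file reads, which is why it fits on CPU nodes. The sketch below covers only the windowing step in plain Python. `chip_windows` is a hypothetical helper that enumerates pixel offsets for 256×256 chips over a scene; in a pipeline like the one described, each window would then be passed to a raster reader (rasterio/GDAL) to pull the pixels. Skipping partial edge chips is an assumption, since the source does not say how scene edges were handled.

```python
from typing import Iterator, Tuple


def chip_windows(scene_width: int, scene_height: int,
                 chip: int = 256, stride: int = 256
                 ) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (col_off, row_off, width, height) for each full chip in a scene.

    Partial chips at the right and bottom edges are skipped (an assumption).
    """
    for row_off in range(0, scene_height - chip + 1, stride):
        for col_off in range(0, scene_width - chip + 1, stride):
            yield (col_off, row_off, chip, chip)


# A 1024 x 600 px scene holds 4 columns x 2 rows of full chips.
windows = list(chip_windows(1024, 600))
print(len(windows), windows[0], windows[-1])
```

Nothing in this step touches a tensor, so running it on a GPU instance buys no speedup.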

§ 05 — The production rebuild
Built from scratch. Stage-separated. CPU-only inference.

The research partner provided model architecture, training data, and Jupyter notebooks. The production system was built from scratch. Ownership breakdown:

  • Fully mine: classification pipeline end-to-end — chip extraction, mask filtering, queue management, batched inference, aggregation, all Python code, all Docker images, Airflow DAG
  • Architecture lead: change detection stage — overall pipeline architecture and integration contracts
  • Directed, not implemented: model retrain — I set acceptance metrics and provided training dataset; university team executed
  • Inherited: research model weights and original architecture (university partner)

Pipeline architecture

STAGE 1: Raster Preprocessing (CPU-only, decoupled)
├── Source scene ingestion (GeoTIFF)
├── Coordinate-aligned chip extraction (rasterio/GDAL)
├── Mask layer generation
└── Chip queue → S3

STAGE 2: Chip Filtering
├── Mask evaluation per chip
├── Below-threshold chips dropped (~35% reduction)
└── Valid queue → inference worker

STAGE 3: Batched Inference (shared CPU fleet, not per-region)
├── Dynamic batch assembly (batch_size=16–32)
├── 4-band corrected model forward pass
└── Per-chip result emission

STAGE 4: Result Aggregation
└── Region-level output → downstream scoring

Airflow DAG

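from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.docker.operators.docker import DockerOperator

# Airflow 2.x. '@biweekly' is not an Airflow preset, so the two-week cadence is a 14-day timedelta.
# start_date is a placeholder; both callables are assumed importable from the pipeline code,
# and build_inference_queue still needs its path arguments supplied (e.g. via op_kwargs).
with DAG('inference_pipeline', schedule_interval=timedelta(days=14),
         start_date=datetime(2024, 1, 1), catchup=False) as dag:
    preprocess   = DockerOperator(task_id='raster_preprocessing',
                                   image='inference/preprocess:slim')
    filter_queue = PythonOperator(task_id='mask_filter_chip_queue',
                                   python_callable=build_inference_queue)
    inference    = DockerOperator(task_id='batched_inference',
                                   image='inference/model:slim')
    aggregate    = PythonOperator(task_id='result_aggregation',
                                   python_callable=aggregate_regional_outputs)
    preprocess >> filter_queue >> inference >> aggregate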

Docker image slimming

FROM pytorch/pytorch:2.x-cuda-runtime AS base
# runtime CUDA libs only: no cudnn-dev, no compiler
WORKDIR /app
COPY requirements-inference.txt .
RUN pip install --no-cache-dir -r requirements-inference.txt
# pinned deps: torch, rasterio, numpy, boto3
# copy the package to /app/inference so `python -m inference.worker` resolves from /app
COPY src/inference/ /app/inference/
CMD ["python", "-m", "inference.worker"]
# ~400–600MB vs 2–3GB baseline · ~75% reduction across 50+ pull cycles

Before / after

Before
  • Per-region GPU fan-out, 20 concurrent instances
  • Single-sample inference (batch=1)
  • 4-band input duplicated to fake 8-band
  • Masked pixels included in tensor ops
  • Raster preprocessing on GPU instance
  • Jupyter notebooks, no orchestration
  • 2–3 GB Docker images
  • A100 GPU required
After
  • Single shared queue, 3-node CPU fleet
  • Dynamic batching (batch=16–32)
  • Native 4-band model, retrained from scratch
  • Mask-filtered queue, ~35% smaller
  • Preprocessing decoupled to CPU-only stage
  • Airflow DAG, stage-separated, reproducible
  • 400–600 MB slim inference images
  • CPU-only inference, GPU eliminated

§ 06 — Results
$26,200 → $90. No new algorithms. No changed requirements.

Dimension           | Baseline                           | Optimized
Compute             | ~20× GPU instances (multi-A100)    | 3× CPU inference nodes
Inference mode      | Single-sample, per-region isolated | Unified batched queue (batch 16–32)
Model input         | 4-band duplicated to 8 (corrupt)   | Native 4-band (retrained)
Masking             | Zero-filled passthrough            | Excluded pre-queue (~35% fewer chips)
Preprocessing       | Co-located on GPU instance         | Decoupled CPU-only stage
Docker image        | 2–3 GB (full CUDA + dev tools)     | ~400–600 MB (runtime only)
Orchestration       | Jupyter notebooks                  | Dockerized Airflow DAGs
Monthly cost        | ~$26,200                           | ~$90
Cost reduction      | 99.6%
Model accuracy      | Baseline (program eval)            | Within program threshold
External validation | 50+ independent government evaluations

# Cost verification
Baseline:  ~$26,200/month
Optimized: 3 × CPU nodes × $0.68/hr × ~45 active hrs/month = $91.80/month
Reduction: ($26,200 − $90) / $26,200 = 99.66%
# The "99.6%" claim is not rounded up. It's rounded down.
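
The verification arithmetic can be checked end to end. A minimal sketch using the public on-demand rates quoted in this case study as stand-ins for the negotiated contract rates; the function names are illustrative.

```python
def monthly_cost(nodes: int, hourly_rate: float, active_hours: float) -> float:
    """Monthly compute cost for a fleet billed by the hour."""
    return nodes * hourly_rate * active_hours


def reduction_pct(baseline: float, optimized: float) -> float:
    """Percentage cost reduction relative to the baseline."""
    return (baseline - optimized) / baseline * 100.0


# Baseline: 20 instances, 20 billed hours per sprint, 2 sprints per month.
baseline = monthly_cost(nodes=20, hourly_rate=32.77, active_hours=20 * 2)
# Optimized: 3 CPU nodes, about 45 active hours per month.
optimized = monthly_cost(nodes=3, hourly_rate=0.68, active_hours=45)

print(f"baseline:  ${baseline:,.2f}")
print(f"optimized: ${optimized:,.2f}")
print(f"reduction: {reduction_pct(baseline, optimized):.2f}%")
```

Using the unrounded $91.80 instead of $90 moves the result by about a hundredth of a percentage point, so the headline figure holds either way.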

The pipeline processed the full program workload within the government client's cost and accuracy constraints across 50+ independently evaluated release cycles. Cost efficiency held through the full evaluation period without architectural rework.

§ 07 — Lessons
Five things that apply to the next one.

"The first question when inheriting research code should not be 'does it produce correct outputs' but 'what will this cost at the volume we actually need.'"

Research code is not a cost baseline

Jupyter notebooks are validation artifacts, not infrastructure. The cost structure of research code — single-sample loops, monolithic compute, no stage separation — is appropriate for a lab environment and catastrophic at production scale.

Workarounds compound silently

The band duplication hack was a single line of code that simultaneously degraded model quality, doubled memory bandwidth consumption, and compressed batch size headroom. Each consequence was invisible until you looked. Research workarounds tend to solve the immediate problem (the notebook runs) while embedding structural costs that multiply at scale.

Right-sizing compute is not a tradeoff

Running CPU-bound preprocessing on a multi-A100 instance is not a "good enough for now" decision — it is a billing error. Compute class mismatches between task type and instance type produce no benefit in exchange for the excess cost. The fix is always separation.

Fan-out patterns require size awareness

Uniform infrastructure per logical unit only makes sense when logical units are uniform in size. When AOIs range from 50,000 km² to 100m×100m, the appropriate abstraction is a shared queue where cost is proportional to actual chip volume — not a per-region instance where cost is proportional to region count regardless of size.

Batch size is free throughput at scale

The transition from batch=1 to batch=64 required approximately 20 lines of code and delivered a 35× throughput improvement. Against 111 million chips, that 35× cut projected inference time from roughly 308 hours to under 9, and every hour removed was a compute hour not billed.


Production Engineering · Geospatial ML · MLOps · C·C ◆ 2025