99.6% ML Infrastructure Cost Reduction

§ 01 — Program context and constraints

An inherited notebook. A $26,200/month billing surprise.

This engagement was a federally funded geospatial intelligence program requiring automated detection of heavy construction activity from multispectral satellite imagery. The scope was planetary: geographic coverage 10–20× the size of Moscow, with imagery revisits every two weeks spanning 5–10 year historical windows per area of interest (AOI).

The program's primary evaluation metric was cost per unit of processed area — not accuracy alone, not throughput alone. Cost efficiency was a first-class scored deliverable, validated by the government client across 50+ independent release cycles throughout the program lifecycle. Budget overruns were not recoverable.

The baseline inference system was inherited from a university research partner and delivered as Jupyter notebooks. My mandate was to transform that research artifact into a production-grade pipeline. What the notebooks contained was a set of compounding engineering defects that, left unaddressed, would have made the program economically unviable.

§ 02 — Workload sizing

111 million chips. Every per-chip inefficiency multiplied.

Understanding the cost problem requires understanding the scale. Every per-chip inefficiency multiplied across 111 million chips.

Parameter	Value	Derivation
Coverage per run	~37,500 km²	Midpoint of 10–20× Moscow (2,511 km²)
Image resolution	2m/px	Program specification
Chip size	256 × 256 px	= 512m × 512m ground footprint
Chips per time step	~570,000	37,500 km² ÷ (0.512 km)²
Revisit cadence	26/year	Every 2 weeks
Historical window	7.5 years	Midpoint of 5–10 year range
Total chips	~111 million	570,000 × 26 × 7.5
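The sizing arithmetic can be reproduced in a few lines. The sketch below is illustrative only: the function names and the `stride_px` parameter are assumptions, not part of the program's code. One thing it makes visible is that non-overlapping 512 m chips over 37,500 km² give roughly 143,000 chips per time step, while the table's ~570,000 corresponds to a 256 m spacing between chip origins (for example a 50% overlap, which is an assumption here rather than something stated in the program specification).

```python
def chips_per_time_step(area_km2, gsd_m=2.0, chip_px=256, stride_px=None):
    """Approximate chip count for one acquisition over an area.

    stride_px is the spacing between chip origins in pixels; it defaults
    to chip_px (no overlap). Edge effects are ignored.
    """
    stride_px = chip_px if stride_px is None else stride_px
    step_km = stride_px * gsd_m / 1000.0
    return round(area_km2 / step_km ** 2)


def total_chips(per_step, revisits_per_year=26, years=7.5):
    """Chips across the full historical window."""
    return round(per_step * revisits_per_year * years)


no_overlap = chips_per_time_step(37_500)                   # 512 m spacing
half_overlap = chips_per_time_step(37_500, stride_px=128)  # 256 m spacing (assumed)
print(no_overlap, half_overlap, total_chips(half_overlap))
```

Whichever spacing applied in practice, the multiplier of 26 × 7.5 = 195 time steps is what turns a per-chip cost into a program-level one.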

The model itself was a mid-sized PyTorch CNN operating on multispectral 4-band imagery at 2m resolution — a pure inference workload with no training in the production loop. The problem was not model complexity. It was everything built around the model.

§ 03 — Baseline architecture and cost

$32/hr GPU instances. Regardless of AOI size.

The baseline system used high-end GPU compute instances — multi-A100 configurations. Cost figures use current AWS on-demand public rates as a conservative proxy; actual contract rates were negotiated. The structural cost drivers and proportional reduction are accurate regardless of the specific rate applied.

# BASELINE ARCHITECTURE (per-sprint)
[ Region A ] ──▶ GPU Instance ($32/hr) ──▶ Jupyter Notebook ──▶ Output
[ Region B ] ──▶ GPU Instance ($32/hr) ──▶ Jupyter Notebook ──▶ Output
[ Region C ] ──▶ GPU Instance ($32/hr) ──▶ Jupyter Notebook ──▶ Output
                        ↑
    same instance class for a 50,000 km² metro AND a 100×100m single site

AOI sizes varied enormously across the program — some regions covered tens of thousands of square kilometers, others covered areas as small as a single 100×100m construction site. Both triggered identical instance configurations. Small AOIs generated near-zero GPU utilization before teardown, billing GPU-hours for a handful of forward passes.

# Baseline monthly cost
  ~20 active regional containers per sprint
× $32.77/hr per instance
× ~20 billed hours per sprint
× 2 sprints/month
= ~$26,200/month
# Effective GPU utilization during billed hours: <20%
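The same figure can be recomputed directly, which also makes it easy to substitute a negotiated rate for the public one. This is a minimal sketch; the function and parameter names are hypothetical.

```python
def monthly_cost(instances, hourly_rate, billed_hours_per_sprint, sprints_per_month):
    """Monthly spend when every instance bills for the full sprint window."""
    return instances * hourly_rate * billed_hours_per_sprint * sprints_per_month


baseline = monthly_cost(instances=20, hourly_rate=32.77,
                        billed_hours_per_sprint=20, sprints_per_month=2)
print(f"${baseline:,.0f}/month")  # $26,216/month
```

Note that chip count does not appear anywhere in the formula, which is the structural problem the next section describes.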

§ 04 — Root cause analysis — five defects

Each independently fixable. Together, multiplicative.

Each defect below was independently fixable. Together they were multiplicative: the waste from each compounded the waste from the others, so the savings from each fix compounded as well. Addressing all five was necessary to reach the final reduction.

DEFECT 01: Per-Region GPU Fan-Out with No Size Scaling (Severity: Critical)

One GPU container per geographic region, irrespective of AOI size. Infrastructure cost was a function of region count, not workload volume. A 50,000 km² metro and a 100×100m single-site monitoring target triggered identical $32/hr instance spin-ups.

# Cost asymmetry example
100×100m AOI: ~38 chips total
Inference time (batch=1): <1 second of GPU compute
Instance billed: 20+ hours at $32.77/hr = $655
# $655 to run 38 forward passes

Fix: Regional workloads consolidated into a single shared inference queue. AOI size determined chip count in the queue, not instance count.
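To make the difference concrete, the sketch below compares the two billing models for the 38-chip example. It is an illustration under stated assumptions, not the production queue: `Chip`, `build_shared_queue`, and both cost functions are hypothetical names, the region chip counts are made up, and the ~100 chips/sec figure is the single-sample rate quoted under Defect 02.

```python
from dataclasses import dataclass
from itertools import chain


@dataclass(frozen=True)
class Chip:
    region: str
    index: int


def build_shared_queue(chips_by_region):
    """Flatten every region's chips into one queue; region is just a label."""
    return list(chain.from_iterable(
        (Chip(region, i) for i in range(count))
        for region, count in chips_by_region.items()
    ))


def per_region_cost(n_regions, billed_hours, hourly_rate):
    """Fan-out model: cost scales with region count, not chip count."""
    return n_regions * billed_hours * hourly_rate


def shared_queue_cost(n_chips, chips_per_sec, hourly_rate):
    """Queue model: cost scales with chips actually processed."""
    return n_chips / chips_per_sec / 3600 * hourly_rate


queue = build_shared_queue({"metro": 1000, "single_site": 38})  # illustrative counts
print(len(queue))
print(per_region_cost(1, 20, 32.77))       # one dedicated instance for the small AOI
print(shared_queue_cost(38, 100, 32.77))   # the same 38 chips in a shared queue
```

In the queue model a small AOI contributes a few extra entries rather than an extra instance.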

DEFECT 02: Single-Sample Inference, No Batching (Severity: Critical)

The model ran inference on one 256×256 chip at a time. On a modern GPU, single-sample forward pass time is approximately 8–12ms — dominated by kernel launch latency and host-device memory transfer overhead, not arithmetic throughput. The GPU's parallel compute capacity was almost entirely idle between launches.

# Throughput comparison
Batch 1:  ~100 chips/sec → 111M chips in ~308 hours
Batch 64: ~3,500 chips/sec → 111M chips in ~8.8 hours
Throughput gain: ~35× from batching alone
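These wall-clock figures follow directly from the chip count and the sustained rate. A minimal sketch of the conversion, with a hypothetical helper name:

```python
def hours_to_process(n_chips, chips_per_sec):
    """Wall-clock hours on one device at a sustained throughput."""
    return n_chips / chips_per_sec / 3600


TOTAL_CHIPS = 111_000_000
single = hours_to_process(TOTAL_CHIPS, 100)     # batch=1
batched = hours_to_process(TOTAL_CHIPS, 3_500)  # batch=64
print(f"{single:.0f} h vs {batched:.1f} h, {single / batched:.0f}x faster")
# 308 h vs 8.8 h, 35x faster
```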

On GPU utilization: Single-sample inference on a multi-A100 instance is approximately equivalent to running a desktop GPU at $32/hr. The arithmetic throughput of the hardware — the entire justification for the cost — is never reached. The instance is a very expensive memory bus.

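import itertools
import torch

# Before: single-sample loop (one forward pass per chip)
for chip_path in chip_list:
    results.append(model(preprocess(load_chip(chip_path))))

# After: batched queue
def iter_batches(items, batch_size):
    """Yield lists of up to batch_size items; the last one may be shorter."""
    it = iter(items)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch

def batched_inference(chip_queue, model, batch_size=64):
    for batch in iter_batches(chip_queue, batch_size):
        # preprocess() is assumed to return a (C, H, W) tensor per chip
        tensors = torch.stack([preprocess(c) for c in batch])
        with torch.no_grad():
            yield from zip(batch, model(tensors))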

DEFECT 03: Band Duplication, Structural Model Corruption (Severity: Critical)

The model was originally architected for 8-band multispectral input. The actual sensor data was 4-band. Rather than fix the model, the research code duplicated the 4-band tensor along the channel dimension to produce a synthetic 8-band input.

# Research workaround
band_4      = load_imagery(scene)           # shape: (4, H, W)
band_8_fake = torch.cat([band_4, band_4], dim=0)
# bands 0-3 == bands 4-7 — perfectly correlated

Consequences at three levels:

Representation quality: first-layer filters received perfectly correlated channel pairs across every training example, degrading spectral representations from initialization.
Memory cost: doubled input tensor size on every forward pass.
Batch ceiling: compressed maximum viable batch size on any fixed VRAM budget.

8-band (duplicated): 64 × 8 × 256 × 256 × 4 bytes = ~134MB per batch
4-band (correct):    64 × 4 × 256 × 256 × 4 bytes = ~67MB per batch
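The per-batch figures come from multiplying out the tensor shape. The sketch below does that for float32 input; `batch_bytes` is a hypothetical helper, and it counts the input tensor only, not the activations the model allocates downstream.

```python
def batch_bytes(batch_size, bands, height=256, width=256, bytes_per_value=4):
    """Size in bytes of one input batch (4 bytes per value = float32)."""
    return batch_size * bands * height * width * bytes_per_value


duplicated = batch_bytes(64, 8)
native = batch_bytes(64, 4)
print(duplicated / 1e6, native / 1e6)  # megabytes per batch
```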

Fix: Ground-up retrain on native 4-band input. I defined acceptance metrics and assembled the training dataset. The university team executed retraining under those constraints. The architectural decision — retrain rather than patch weights — was mine.
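For readers wondering what patching weights would involve: a layer that is linear in its input channels cannot distinguish a duplicated band from a single band with a combined weight. The sketch below shows this identity for one output value of a 1×1 (per-pixel) filter in plain Python. The function names and numbers are made up, and this illustrates the algebra only; it is not a description of what either team implemented.

```python
def conv1x1(weights, pixel):
    """One output value of a 1x1 filter: dot product over channels."""
    return sum(w * x for w, x in zip(weights, pixel))


def fold_duplicated_weights(weights_8):
    """Collapse an 8-channel filter to 4 channels by summing the two halves."""
    half = len(weights_8) // 2
    return [a + b for a, b in zip(weights_8[:half], weights_8[half:])]


pixel_4 = [3, 1, 4, 1]            # one 4-band pixel (made-up values)
pixel_8 = pixel_4 + pixel_4       # the duplicated 8-band input
w8 = [2, -1, 0, 5, 1, 1, -3, 2]   # made-up filter weights

# The 8-band filter on duplicated input equals the folded filter on native input.
assert conv1x1(w8, pixel_8) == conv1x1(fold_duplicated_weights(w8), pixel_4)
```

Folding would remove the duplicated input but keep whatever the filters learned from it, whereas a retrain on native 4-band input starts from clean inputs.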

DEFECT 04: Masked Pixels Included in Tensor Operations (Severity: High)

Valid imagery chips contained substantial no-data regions — cloud cover, cloud shadow, off-nadir edge artifacts, and acquisition gaps. The pipeline converted these to zero-filled pixels and submitted full tensors. Inference executed on every pixel, including those already flagged invalid by the mask layer.

Average masked fraction per chip: ~35%
Net compute reduction from mask exclusion: ~25–35%

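def build_inference_queue(chip_paths, mask_paths, threshold=0.5):
    """Keep chips whose valid-pixel fraction is at least `threshold`.

    Assumes load_mask() returns an array in which 1 marks an invalid pixel.
    """
    queue = []
    for chip_path, mask_path in zip(chip_paths, mask_paths):
        mask = load_mask(mask_path)
        if 1.0 - mask.mean() >= threshold:
            # Load pixel data only for chips that pass the filter
            queue.append((load_chip(chip_path), mask))
    return queue  # ~35% smaller queue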
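The filtering rule can be exercised without any imagery. The sketch below uses nested lists in place of raster masks and assumes, as the snippet above implies, that a mask value of 1 marks an invalid pixel. `valid_fraction` and `keep_chip` are hypothetical helpers.

```python
def valid_fraction(mask):
    """Share of pixels not flagged invalid (mask value 1 = invalid)."""
    flat = [value for row in mask for value in row]
    return 1.0 - sum(flat) / len(flat)


def keep_chip(mask, threshold=0.5):
    """True when enough of the chip is usable to be worth a forward pass."""
    return valid_fraction(mask) >= threshold


masks = {
    "clear": [[0, 0], [0, 0]],    # fully valid
    "partly": [[1, 0], [0, 0]],   # 25% masked
    "cloudy": [[1, 1], [1, 0]],   # 75% masked
}
kept = [name for name, mask in masks.items() if keep_chip(mask)]
print(kept)  # ['clear', 'partly']
```

The threshold is a tuning choice: a higher value drops more chips and saves more compute, at the risk of discarding chips with usable pixels.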

DEFECT 05: GPU Instance Used for CPU-Bound Raster Preprocessing (Severity: High)

Image clipping — extracting 256×256 chips from large GeoTIFF source scenes — ran on the GPU instance. This is pure CPU I/O work: file reads, coordinate transforms, pixel extraction via rasterio and GDAL. No GPU instructions are issued. Preprocessing produced sustained near-zero GPU utilization while billing GPU rates for work a CPU-only instance costing 1/50th the price could handle equivalently.

# Preprocessing: ~40-60% of GPU instance wall time
On GPU instance: 0.5 × $32.77/hr × 20 hrs = ~$328/sprint
On CPU instance: 0.5 × $0.68/hr × 20 hrs = ~$6.80/sprint
# ~48× cost premium for identical work

Fix: Preprocessing decoupled to a CPU-only fleet. Airflow DAG enforced stage separation — preprocessing completed before inference instances were triggered. Inference was never idle waiting for chips.
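Chip extraction itself is index arithmetic plus file reads, which is why it needs no GPU. A minimal sketch of the tiling step in plain Python follows. `chip_windows` is a hypothetical helper, and dropping partial edge tiles is an assumption made for brevity. With rasterio, each tuple could be passed as `rasterio.windows.Window(col_off, row_off, width, height)` to `dataset.read(window=...)`.

```python
def chip_windows(scene_width, scene_height, chip_px=256, stride_px=None):
    """Yield (col_off, row_off, width, height) for each full chip in a scene."""
    stride_px = chip_px if stride_px is None else stride_px
    for row_off in range(0, scene_height - chip_px + 1, stride_px):
        for col_off in range(0, scene_width - chip_px + 1, stride_px):
            yield (col_off, row_off, chip_px, chip_px)


windows = list(chip_windows(1024, 600))
print(len(windows), windows[0], windows[-1])
# 8 (0, 0, 256, 256) (768, 256, 256, 256)
```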

§ 05 — The production rebuild

Built from scratch. Stage-separated. CPU-only inference.

The research partner provided model architecture, training data, and Jupyter notebooks. The production system was built from scratch. Ownership breakdown:

Fully mine: classification pipeline end-to-end — chip extraction, mask filtering, queue management, batched inference, aggregation, all Python code, all Docker images, Airflow DAG
Architecture lead: change detection stage — overall pipeline architecture and integration contracts
Directed, not implemented: model retrain — I set acceptance metrics and provided training dataset; university team executed
Inherited: research model weights and original architecture (university partner)

Pipeline architecture

STAGE 1: Raster Preprocessing (CPU-only, decoupled)
├── Source scene ingestion (GeoTIFF)
├── Coordinate-aligned chip extraction (rasterio/GDAL)
├── Mask layer generation
└── Chip queue → S3
        ▼
STAGE 2: Chip Filtering
├── Mask evaluation per chip
├── Below-threshold chips dropped (~35% reduction)
└── Valid queue → inference worker
        ▼
STAGE 3: Batched Inference (shared CPU fleet, not per-region)
├── Dynamic batch assembly (batch_size=16–32)
├── 4-band corrected model forward pass
└── Per-chip result emission
        ▼
STAGE 4: Result Aggregation
└── Region-level output → downstream scoring
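This write-up does not describe how a batch size was picked within the 16–32 range. One plausible policy, shown purely as an assumption, is to choose the size that needs the fewest forward passes for the current queue and to break ties toward the smaller size to keep peak memory down. `choose_batch_size` is a hypothetical name.

```python
import math


def choose_batch_size(queue_len, lo=16, hi=32):
    """Pick the batch size in [lo, hi] that needs the fewest forward passes.

    Ties go to the smaller size, which keeps peak memory lower.
    """
    if queue_len <= 0:
        raise ValueError("queue_len must be positive")
    return min(range(lo, hi + 1),
               key=lambda size: (math.ceil(queue_len / size), size))


for n in (10, 33, 64, 100):
    print(n, choose_batch_size(n))
```

For a 100-chip queue this returns 25: four passes of 25 rather than four passes of 32 with a ragged tail.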

Airflow DAG

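from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.docker.operators.docker import DockerOperator

# Airflow has no '@biweekly' preset, so the two-week cadence is a timedelta.
# start_date is a placeholder; the callables come from the pipeline package.
with DAG('inference_pipeline',
         schedule_interval=timedelta(weeks=2),
         start_date=datetime(2024, 1, 1)) as dag:
    preprocess   = DockerOperator(task_id='raster_preprocessing',
                                   image='inference/preprocess:slim')
    filter_queue = PythonOperator(task_id='mask_filter_chip_queue',
                                   python_callable=build_inference_queue)
    inference    = DockerOperator(task_id='batched_inference',
                                   image='inference/model:slim')
    aggregate    = PythonOperator(task_id='result_aggregation',
                                   python_callable=aggregate_regional_outputs)
    preprocess >> filter_queue >> inference >> aggregate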

Docker image slimming

FROM pytorch/pytorch:2.x-cuda-runtime AS base
# runtime CUDA libs only: no cudnn-dev, no compiler
WORKDIR /app
COPY requirements-inference.txt .
RUN pip install --no-cache-dir -r requirements-inference.txt
# pinned deps: torch, rasterio, numpy, boto3
# Copy as a package directory so `python -m inference.worker` resolves from /app
COPY src/inference/ /app/inference/
CMD ["python", "-m", "inference.worker"]
# ~400–600MB vs 2–3GB baseline · ~75% reduction across 50+ pull cycles

Before / after

Before

Per-region GPU fan-out, 20 concurrent instances
Single-sample inference (batch=1)
4-band input duplicated to fake 8-band
Masked pixels included in tensor ops
Raster preprocessing on GPU instance
Jupyter notebooks, no orchestration
2–3 GB Docker images
A100 GPU required

After

Single shared queue, 3-node CPU fleet
Dynamic batching (batch=16–32)
Native 4-band model, retrained from scratch
Mask-filtered queue, ~35% smaller
Preprocessing decoupled to CPU-only stage
Airflow DAG, stage-separated, reproducible
400–600 MB slim inference images
CPU-only inference, GPU eliminated

§ 06 — Results

$26,200 → $90. No new algorithms. No changed requirements.

Dimension	Baseline	Optimized
Compute	~20× GPU instances (multi-A100)	3× CPU inference nodes
Inference mode	Single-sample, per-region isolated	Unified batched queue (batch 16–32)
Model input	4-band duplicated to 8 (corrupt)	Native 4-band (retrained)
Masking	Zero-filled passthrough	Excluded pre-queue (~35% fewer chips)
Preprocessing	Co-located on GPU instance	Decoupled CPU-only stage
Docker image	2–3 GB (full CUDA + dev tools)	~400–600 MB (runtime only)
Orchestration	Jupyter notebooks	Dockerized Airflow DAGs
Monthly cost	~$26,200	~$90
Cost reduction	—	99.6%
Model accuracy	Baseline (program eval)	Within program threshold
External validation	—	50+ independent government evaluations

# Cost verification
Baseline:  ~$26,200/month
Optimized: 3 × CPU nodes × $0.68/hr × ~45 active hrs/month = $91.80/month
Reduction: ($26,200 - $90) / $26,200 = 99.66%
# The 99.6% headline is rounded down, not up.
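The verification can be rerun with either the rounded $90 or the unrounded $91.80. A minimal sketch, with a hypothetical helper name:

```python
def reduction_pct(before, after):
    """Percentage cost reduction from `before` to `after`."""
    return (before - after) / before * 100


optimized = 3 * 0.68 * 45  # nodes x $/hr x active hrs/month
print(f"${optimized:.2f}/month")
print(f"{reduction_pct(26_200, optimized):.2f}% using the unrounded figure")
print(f"{reduction_pct(26_200, 90):.2f}% using the rounded $90")
```

Both inputs land above 99.6%, so the headline figure holds either way.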

The pipeline processed the full program workload within the government client's cost and accuracy constraints across 50+ independently evaluated release cycles. Cost efficiency held through the full evaluation period without architectural rework.

§ 07 — Lessons

Five things that apply to the next one.

"The first question when inheriting research code should not be 'does it produce correct outputs' but 'what will this cost at the volume we actually need.'"

Research code is not a cost baseline

Jupyter notebooks are validation artifacts, not infrastructure. The cost structure of research code — single-sample loops, monolithic compute, no stage separation — is appropriate for a lab environment and catastrophic at production scale.

Workarounds compound silently

The band duplication hack was a single line of code that simultaneously degraded model quality, doubled memory bandwidth consumption, and compressed batch size headroom. Each consequence was invisible until you looked. Research workarounds tend to solve the immediate problem (the notebook runs) while embedding structural costs that multiply at scale.

Right-sizing compute is not a tradeoff

Running CPU-bound preprocessing on a multi-A100 instance is not a "good enough for now" decision — it is a billing error. Compute class mismatches between task type and instance type produce no benefit in exchange for the excess cost. The fix is always separation.

Fan-out patterns require size awareness

Uniform infrastructure per logical unit only makes sense when logical units are uniform in size. When AOIs range from 50,000 km² to 100m×100m, the appropriate abstraction is a shared queue where cost is proportional to actual chip volume — not a per-region instance where cost is proportional to region count regardless of size.

Batch size is free throughput at scale

The transition from batch=1 to batch=64 required approximately 20 lines of code and delivered a 35× throughput improvement. Against 111 million chips, that 35× translated directly into fewer compute hours billed.
