§ 01 — Program context and constraintsAn inherited notebook. A $26,200/month billing surprise.
This engagement was a federally-funded geospatial intelligence program requiring automated detection of heavy construction activity from multispectral satellite imagery. The scope was planetary: geographic coverage 10–20× the size of Moscow, with imagery revisits every two weeks spanning 5–10 year historical windows per area of interest.
The program's primary evaluation metric was cost per unit of processed area — not accuracy alone, not throughput alone. Cost efficiency was a first-class scored deliverable, validated by the government client across 50+ independent release cycles throughout the program lifecycle. Budget overruns were not recoverable.
The baseline inference system was inherited from a university research partner and delivered as Jupyter notebooks. My mandate was to transform that research artifact into a production-grade pipeline. What the notebooks contained was a set of compounding engineering defects that, left unaddressed, would have made the program economically unviable.
§ 02 — Workload sizing111 million chips. Every per-chip inefficiency multiplied.
Understanding the cost problem requires understanding the scale. Every per-chip inefficiency multiplied across 111 million chips.
| Parameter | Value | Derivation |
|---|---|---|
| Coverage per run | ~37,500 km² | Midpoint of 10–20× Moscow (2,511 km²) |
| Image resolution | 2m/px | Program specification |
| Chip size | 256 × 256 px | = 512m × 512m ground footprint |
| Chips per time step | ~570,000 | 37,500 km² ÷ (0.512 km)² |
| Revisit cadence | 26/year | Every 2 weeks |
| Historical window | 7.5 years | Midpoint of 5–10 year range |
| Total chips | ~111 million | 570,000 × 26 × 7.5 |
The model itself was a mid-sized PyTorch CNN operating on multispectral 4-band imagery at 2m resolution — a pure inference workload with no training in the production loop. The problem was not model complexity. It was everything built around the model.
§ 03 — Baseline architecture and cost$32/hr GPU instances. Regardless of AOI size.
The baseline system used high-end GPU compute instances — multi-A100 configurations. Cost figures use current AWS on-demand public rates as a conservative proxy; actual contract rates were negotiated. The structural cost drivers and proportional reduction are accurate regardless of the specific rate applied.
AOI sizes varied enormously across the program — some regions covered tens of thousands of square kilometers, others covered areas as small as a single 100×100m construction site. Both triggered identical instance configurations. Small AOIs generated near-zero GPU utilization before teardown, billing GPU-hours for a handful of forward passes.
§ 04 — Root cause analysis — five defectsEach independently fixable. Together, multiplicative.
Each defect below was independently fixable. Together they were multiplicative: the waste from each compounded the savings from the one before it. Addressing all five was necessary to reach the final reduction.
One GPU container per geographic region, irrespective of AOI size. Infrastructure cost was a function of region count, not workload volume. A 50,000 km² metro and a 100×100m single-site monitoring target triggered identical $32/hr instance spin-ups.
Fix: Regional workloads consolidated into a single shared inference queue. AOI size determined chip count in the queue, not instance count.
The model ran inference on one 256×256 chip at a time. On a modern GPU, single-sample forward pass time is approximately 8–12ms — dominated by kernel launch latency and host-device memory transfer overhead, not arithmetic throughput. The GPU's parallel compute capacity was almost entirely idle between launches.
# Before: single-sample loop for chip_path in chip_list: results.append(model(preprocess(load_chip(chip_path)))) # After: batched queue def batched_inference(chip_queue, model, batch_size=64): for batch in iter_batches(chip_queue, batch_size): tensors = torch.stack([preprocess(c) for c in batch]) with torch.no_grad(): yield from zip(batch, model(tensors))
The model was originally architected for 8-band multispectral input. The actual sensor data was 4-band. Rather than fix the model, the research code duplicated the 4-band tensor along the channel dimension to produce a synthetic 8-band input.
# Research workaround band_4 = load_imagery(scene) # shape: (4, H, W) band_8_fake = torch.cat([band_4, band_4], dim=0) # bands 0-3 == bands 4-7 — perfectly correlated
Consequences at three levels. Representation quality: first-layer filters received perfectly correlated channel pairs across every training example, degrading spectral representations from initialization. Memory cost: doubled input tensor size on every forward pass. Batch ceiling: compressed maximum viable batch size on any fixed VRAM budget.
Fix: Ground-up retrain on native 4-band input. I defined acceptance metrics and assembled the training dataset. The university team executed retraining under those constraints. The architectural decision — retrain rather than patch weights — was mine.
Valid imagery chips contained substantial no-data regions — cloud cover, cloud shadow, off-nadir edge artifacts, and acquisition gaps. The pipeline converted these to zero-filled pixels and submitted full tensors. Inference executed on every pixel, including those already flagged invalid by the mask layer.
def build_inference_queue(chip_paths, mask_paths, threshold=0.5): return [ (chip, mask) for c, m in zip(chip_paths, mask_paths) for chip, mask in [(load_chip(c), load_mask(m))] if 1.0 - mask.mean() >= threshold ] # ~35% smaller queue
Image clipping — extracting 256×256 chips from large GeoTIFF source scenes — ran on the GPU instance. This is pure CPU I/O work: file reads, coordinate transforms, pixel extraction via rasterio and GDAL. No GPU instructions are issued. Preprocessing produced sustained near-zero GPU utilization while billing GPU rates for work a CPU-only instance costing 1/50th the price could handle equivalently.
Fix: Preprocessing decoupled to a CPU-only fleet. Airflow DAG enforced stage separation — preprocessing completed before inference instances were triggered. Inference was never idle waiting for chips.
§ 05 — The production rebuildBuilt from scratch. Stage-separated. CPU-only inference.
The research partner provided model architecture, training data, and Jupyter notebooks. The production system was built from scratch. Ownership breakdown:
- Fully mine: classification pipeline end-to-end — chip extraction, mask filtering, queue management, batched inference, aggregation, all Python code, all Docker images, Airflow DAG
- Architecture lead: change detection stage — overall pipeline architecture and integration contracts
- Directed, not implemented: model retrain — I set acceptance metrics and provided training dataset; university team executed
- Inherited: research model weights and original architecture (university partner)
Pipeline architecture
Airflow DAG
with DAG('inference_pipeline', schedule_interval='@biweekly') as dag: preprocess = DockerOperator(task_id='raster_preprocessing', image='inference/preprocess:slim') filter_queue = PythonOperator(task_id='mask_filter_chip_queue', python_callable=build_inference_queue) inference = DockerOperator(task_id='batched_inference', image='inference/model:slim') aggregate = PythonOperator(task_id='result_aggregation', python_callable=aggregate_regional_outputs) preprocess >> filter_queue >> inference >> aggregate
Docker image slimming
FROM pytorch/pytorch:2.x-cuda-runtime AS base # runtime CUDA libs only — no cudnn-dev, no compiler COPY requirements-inference.txt . RUN pip install --no-cache-dir -r requirements-inference.txt # pinned deps: torch, rasterio, numpy, boto3 COPY src/inference/ /app/ CMD ["python", "-m", "inference.worker"] # ~400–600MB vs 2–3GB baseline · ~75% reduction across 50+ pull cycles
Before / after
- Per-region GPU fan-out, 20 concurrent instances
- Single-sample inference (batch=1)
- 4-band input duplicated to fake 8-band
- Masked pixels included in tensor ops
- Raster preprocessing on GPU instance
- Jupyter notebooks, no orchestration
- 2–3 GB Docker images
- A100 GPU required
- Single shared queue, 3-node CPU fleet
- Dynamic batching (batch=16–32)
- Native 4-band model, retrained from scratch
- Mask-filtered queue, ~35% smaller
- Preprocessing decoupled to CPU-only stage
- Airflow DAG, stage-separated, reproducible
- 400–600 MB slim inference images
- CPU-only inference, GPU eliminated
§ 06 — Results$26,200 → $90. No new algorithms. No changed requirements.
| Dimension | Baseline | Optimized |
|---|---|---|
| Compute | ~20× GPU instances (multi-A100) | 3× CPU inference nodes |
| Inference mode | Single-sample, per-region isolated | Unified batched queue (batch 16–32) |
| Model input | 4-band duplicated to 8 (corrupt) | Native 4-band (retrained) |
| Masking | Zero-filled passthrough | Excluded pre-queue (~35% fewer chips) |
| Preprocessing | Co-located on GPU instance | Decoupled CPU-only stage |
| Docker image | 2–3 GB (full CUDA + dev tools) | ~400–600 MB (runtime only) |
| Orchestration | Jupyter notebooks | Dockerized Airflow DAGs |
| Monthly cost | ~$26,200 | ~$90 |
| Cost reduction | — | 99.6% |
| Model accuracy | Baseline (program eval) | Within program threshold |
| External validation | — | 50+ independent government evaluations |
The pipeline processed the full program workload within the government client's cost and accuracy constraints across 50+ independently evaluated release cycles. Cost efficiency held through the full evaluation period without architectural rework.
§ 07 — LessonsFive things that apply to the next one.
"The first question when inheriting research code should not be 'does it produce correct outputs' but 'what will this cost at the volume we actually need.'"
Research code is not a cost baseline
Jupyter notebooks are validation artifacts, not infrastructure. The cost structure of research code — single-sample loops, monolithic compute, no stage separation — is appropriate for a lab environment and catastrophic at production scale.
Workarounds compound silently
The band duplication hack was a single line of code that simultaneously degraded model quality, doubled memory bandwidth consumption, and compressed batch size headroom. Each consequence was invisible until you looked. Research workarounds tend to solve the immediate problem (the notebook runs) while embedding structural costs that multiply at scale.
Right-sizing compute is not a tradeoff
Running CPU-bound preprocessing on a multi-A100 instance is not a "good enough for now" decision — it is a billing error. Compute class mismatches between task type and instance type produce no benefit in exchange for the excess cost. The fix is always separation.
Fan-out patterns require size awareness
Uniform infrastructure per logical unit only makes sense when logical units are uniform in size. When AOIs range from 50,000 km² to 100m×100m, the appropriate abstraction is a shared queue where cost is proportional to actual chip volume — not a per-region instance where cost is proportional to region count regardless of size.
Batch size is free throughput at scale
The transition from batch=1 to batch=64 required approximately 20 lines of code and delivered a 35× throughput improvement. Against 111 million chips, that 35× translated directly into compute hours billed.