April 10, 2026 · 6 min read

Why Fine-tuning Beats Prompting for Satellite AI

Everyone reaches for prompt engineering first. For general language tasks, that's often enough. For satellite imagery, it almost never is — and here's why the gap exists.

Tags: Fine-tuning · Satellite AI · Foundation Models · Prithvi

The Problem with Prompting Satellite Models

When I started working on flood detection, the natural first move was to try zero-shot inference with a large foundation model. The results were bad — not "needs tuning" bad, but "completely useless for emergency response" bad.

The reason isn't that foundation models are weak. It's that satellite imagery is genuinely different from the data these models were trained on.

Sentinel-2 imagery carries thirteen spectral bands, including near-infrared (NIR) and short-wave infrared (SWIR) wavelengths that are invisible to the human eye and entirely absent from the natural-image datasets most vision models train on. When you ask a model to "find flooded areas" in a false-color composite where water appears in a hue with no natural-language equivalent, prompting does nothing: the model has no concept to attach to your instruction.
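To see why those extra bands matter, consider a standard water index like MNDWI, which depends entirely on a SWIR measurement that an RGB-trained model has never seen. A minimal sketch (the reflectance values below are invented for illustration, not real Sentinel-2 data):

```python
import numpy as np

def mndwi(green: np.ndarray, swir: np.ndarray) -> np.ndarray:
    """Modified Normalized Difference Water Index: high over open water."""
    return (green - swir) / (green + swir + 1e-9)

# Made-up surface reflectances for three pixels: water, vegetation, bare soil
green = np.array([0.08, 0.12, 0.30])  # e.g. Sentinel-2 band B3
swir  = np.array([0.02, 0.25, 0.35])  # e.g. Sentinel-2 band B11

print(mndwi(green, swir))  # water pixel scores positive; the others negative
```

No prompt can express "threshold the B3/B11 ratio" to a model whose input space stops at RGB.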

What Fine-tuning Actually Does Here

Fine-tuning isn't about teaching the model new words — it's about teaching it a new sensory vocabulary.

When I fine-tuned Prithvi EO-2.0 on the Sen1Floods11 benchmark, the model learned:

  • That specific combinations of NIR and SWIR reflectance values correspond to open water
  • That the spatial texture of flooded rice paddies differs from flooded urban areas in ways that matter
  • That cloud shadows produce spectral signatures that look like water but aren't

None of this is expressible as a prompt. It's pattern recognition over a measurement space that has no human-readable analog.
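To make the "measurement space" point concrete, here is a toy sketch of learning a water-vs-cloud-shadow decision boundary directly from two-band measurements. The reflectance clusters are invented for illustration (they are not Sen1Floods11 statistics), and the logistic regression stands in for a full fine-tune, but the principle is the same: the boundary is learned from labeled measurements, not described in words.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented (NIR, SWIR) reflectance clusters for illustration only
water  = rng.normal(loc=[0.05, 0.03], scale=0.01, size=(200, 2))
shadow = rng.normal(loc=[0.10, 0.06], scale=0.01, size=(200, 2))
X = np.vstack([water, shadow])
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = water, 0 = shadow

# Standardize, then fit a logistic decision boundary by gradient descent
X = (X - X.mean(axis=0)) / X.std(axis=0)
w, b = np.zeros(2), 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (preds == (y == 1)).mean()
print("train accuracy:", accuracy)
```

The learned weights encode a spectral pattern with no natural-language name, which is exactly what a prompt cannot supply.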

The Numbers

The baseline Prithvi EO-2.0 model achieves 0.14 IoU on flood detection out of the box. After fine-tuning with a hybrid Dice + Focal loss function optimized for the severe class imbalance between flood and non-flood pixels, the same architecture achieves 0.72 IoU, a relative improvement of roughly 414%.
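For the curious, here is a minimal NumPy sketch of one common hybrid Dice + Focal formulation. The `alpha` weighting and `gamma` values are illustrative defaults, not the exact hyperparameters used in the experiment:

```python
import numpy as np

def dice_focal_loss(probs, targets, alpha=0.5, gamma=2.0, eps=1e-7):
    """Hybrid Dice + Focal loss for binary segmentation.

    probs:   predicted flood probabilities, any shape
    targets: binary ground-truth mask, same shape
    alpha:   blend between the two terms (illustrative, not from the post)
    """
    p = np.clip(np.asarray(probs, dtype=float).ravel(), eps, 1 - eps)
    t = np.asarray(targets, dtype=float).ravel()

    # Dice term: overlap-based, so the huge non-flood background can't
    # dominate the way it does with plain cross-entropy
    dice = 1 - (2 * np.sum(p * t) + eps) / (np.sum(p) + np.sum(t) + eps)

    # Focal term: down-weights easy pixels, concentrating gradient on the
    # hard minority (flood) pixels
    pt = np.where(t == 1, p, 1 - p)
    focal = np.mean(-((1 - pt) ** gamma) * np.log(pt))

    return alpha * dice + (1 - alpha) * focal
```

Confident, correct predictions should score a much lower loss than confident, wrong ones, which is what pushes the model toward the minority class.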

That jump isn't from better prompting. It's from showing the model 50,000 labeled satellite chips and letting it build internal representations that match the actual measurement physics.

When Prompting Does Work (and When It Doesn't)

Prompting still earns its place in earth observation workflows — just at a different layer:

Where prompting works:

  • Routing a user's natural-language query to the right analysis pipeline
  • Generating explanations of model outputs for non-technical stakeholders
  • Summarizing change detection results across a time series

Where it fails:

  • Any task that requires interpreting spectral bands beyond RGB
  • High-stakes classification, such as disaster response, where a 0.14 IoU model could cost lives
  • Sub-pixel accuracy segmentation at 10m resolution

The Practical Takeaway

If you're building on satellite data and your first instinct is to reach for a large general-purpose model with a clever system prompt, test that assumption fast. Run a simple baseline on real labeled data. If your IoU is below 0.5 on a binary task, you're probably not in prompting territory — you're in fine-tuning territory.
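That baseline check is only a few lines of code. A minimal IoU sketch for binary masks (the toy arrays below stand in for a real labeled validation chip):

```python
import numpy as np

def binary_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union for binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(inter) / union if union else 1.0

# Toy 3x3 masks: 2 pixels agree, 4 pixels are flagged by either mask
pred  = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])
print(binary_iou(pred, truth))  # 2 / 4 = 0.5
```

Run this against your zero-shot predictions on a held-out labeled set; if the number comes back near the 0.14 range, prompting is not going to close that gap.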

The good news is that fine-tuning foundation models like Prithvi EO-2.0 is cheaper and faster than it sounds. The pre-trained weights already encode useful low-level spatial features. You're not training from scratch — you're redirecting.