Overview
Implemented inference with the AnySat multimodal Earth-observation foundation model (CVPR 2025 Highlight) for flood segmentation, demonstrating experience with state-of-the-art multi-sensor fusion architectures.
Technical Implementation
Multi-Modal Fusion
Processed heterogeneous satellite data combining:
- Sentinel-1 SAR: 3 channels (VV, VH, ratio)
- Sentinel-2 Optical: 10 channels
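The Sentinel-1 stack above can be assembled with a small helper. This is a hypothetical sketch (the function name is mine, not from the project), and it assumes the ratio band is VH/VV, a common convention for flood mapping; the source only says "ratio".

```python
import numpy as np

def build_s1_input(vv: np.ndarray, vh: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Stack VV, VH, and their ratio into a 3-channel SAR array (C, H, W).

    The ratio band is assumed to be VH/VV; `eps` guards against division
    by zero. Both are illustrative choices, not values from the project.
    """
    ratio = vh / (vv + eps)
    return np.stack([vv, vh, ratio], axis=0)

vv = np.random.rand(128, 128).astype(np.float32)
vh = np.random.rand(128, 128).astype(np.float32)
s1 = build_s1_input(vv, vh)
print(s1.shape)  # (3, 128, 128)
```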
Scale-Adaptive Architecture
Ran inference with AnySat's JEPA-based architecture (125M parameters), which handles varying spatial resolutions from 10 m to 60 m through scale-adaptive patching.
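The scale adaptivity can be illustrated with simple resolution arithmetic: AnySat defines patches in metres rather than pixels, so bands at different ground sample distances map to the same patch grid. The tile and patch sizes below are illustrative values, not settings taken from this project.

```python
def pixels_per_side(tile_m: int, gsd_m: int) -> int:
    """Pixel extent of a square tile of tile_m metres at a given
    ground sample distance (GSD)."""
    return tile_m // gsd_m

def patches_per_side(tile_m: int, patch_m: int) -> int:
    """Patches are defined in metres, so the grid is the same for
    every sensor regardless of its pixel resolution."""
    return tile_m // patch_m

# The same 1280 m tile arrives at different pixel sizes per band:
print(pixels_per_side(1280, 10))   # 128 px (10 m bands)
print(pixels_per_side(1280, 60))   # 21 px  (60 m bands)
# ...but yields one shared patch grid once patching is done in metres:
print(patches_per_side(1280, 80))  # 16 patches per side either way
```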
Three Output Modes
- Tile mode: [B, 768] embeddings for scene classification
- Patch mode: [B, P, P, 768] grid for patch-level tasks
- Dense mode: [B, H, W, 1536] per-pixel features for segmentation
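The three output modes can be sanity-checked against expected tensor shapes. The helper below is a hypothetical sketch built only from the dimensions listed above (768-dim tile/patch features, 1536-dim dense features); it is not AnySat's actual API.

```python
def expected_shape(mode: str, batch: int, height: int, width: int,
                   patch_px: int, dim: int = 768) -> tuple:
    """Expected feature shape per output mode, given an input of
    (height, width) pixels and a patch size of patch_px pixels."""
    if mode == "tile":
        return (batch, dim)                                   # one vector per scene
    if mode == "patch":
        return (batch, height // patch_px, width // patch_px, dim)  # P x P grid
    if mode == "dense":
        return (batch, height, width, 2 * dim)                # per-pixel features
    raise ValueError(f"unknown output mode: {mode}")

print(expected_shape("tile", 4, 128, 128, 16))   # (4, 768)
print(expected_shape("patch", 4, 128, 128, 16))  # (4, 8, 8, 768)
print(expected_shape("dense", 4, 128, 128, 16))  # (4, 128, 128, 1536)
```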
Technical Complexity
- Handled 11 different sensor types (including Sentinel-1/2, NAIP, ALOS-2, and aerial imagery)
- Implemented temporal dimension processing for multi-date analysis
- Managed varying channel counts across sensors (2–13 channels)
- Built preprocessing for SAR-optical fusion with proper normalization
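A minimal sketch of the normalization step for SAR-optical fusion. It assumes dB-scaled Sentinel-1 backscatter clipped to a fixed window and Sentinel-2 L2A digital numbers divided by 10 000; both the [-25, 0] dB window and the 10 000 scale factor are common remote-sensing conventions, not values confirmed by the project.

```python
import numpy as np

def normalize_sar_db(band: np.ndarray, lo: float = -25.0, hi: float = 0.0) -> np.ndarray:
    """Clip SAR backscatter (in dB) to [lo, hi] and rescale to [0, 1].
    The window is an assumed convention, not a value from the repo."""
    return (np.clip(band, lo, hi) - lo) / (hi - lo)

def normalize_s2_reflectance(band: np.ndarray, scale: float = 10000.0) -> np.ndarray:
    """Convert Sentinel-2 digital numbers to [0, 1] surface reflectance
    (assumed L2A convention: DN / 10000, clipped)."""
    return np.clip(band / scale, 0.0, 1.0)

raw_s1 = np.array([-30.0, -12.5, 3.0], dtype=np.float32)
# -30 dB clips to the floor, -12.5 dB is mid-window, 3 dB clips to the ceiling:
print(normalize_sar_db(raw_s1))  # values 0.0, 0.5, 1.0
```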
Model Architecture
- Vision Transformer: 768-dim embeddings, 6+1 blocks, 12 attention heads
- Modality projectors for 11+ sensor types
- Scale-adaptive JEPA (Joint-Embedding Predictive Architecture)
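The figures above imply a 64-dimensional subspace per attention head, and a back-of-envelope count suggests the 6+1 transformer blocks account for only about 50M of the 125M parameters, with the remainder presumably in the modality projectors and other components. The arithmetic below is a rough estimate that ignores biases, layer norms, and positional embeddings.

```python
EMBED_DIM = 768
NUM_HEADS = 12

# Each attention head operates in a 768 / 12 = 64-dimensional subspace.
head_dim = EMBED_DIM // NUM_HEADS
assert EMBED_DIM % NUM_HEADS == 0 and head_dim == 64

# Rough per-block parameter count for a standard ViT block:
#   QKV + output projections: 4 * d^2
#   MLP with 4x expansion:    8 * d^2
params_per_block = 12 * EMBED_DIM ** 2
print(params_per_block)      # 7077888 (~7.1M per block)
print(7 * params_per_block)  # 49545216 (~49.5M for 6+1 blocks)
```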