Overview
Implemented inference with the AnySat multimodal Earth-observation foundation model (CVPR 2025 Highlight) for flood segmentation, demonstrating experience with state-of-the-art multi-sensor fusion architectures.
Technical Implementation
Multi-Modal Fusion
Processed heterogeneous satellite data combining:
- Sentinel-1 SAR: 3 channels (VV, VH, ratio)
- Sentinel-2 Optical: 10 channels
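The Sentinel-1 stack above can be assembled with a small helper. This is a hypothetical sketch (the function name is mine, not from the project), and it assumes the ratio band is VH/VV, a common convention for flood mapping; the source only says "ratio".

```python
import numpy as np

def build_s1_input(vv: np.ndarray, vh: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Stack VV, VH, and their ratio into a 3-channel SAR array (C, H, W).

    The ratio band is assumed to be VH/VV; `eps` guards against division
    by zero. Both are illustrative choices, not values from the project.
    """
    ratio = vh / (vv + eps)
    return np.stack([vv, vh, ratio], axis=0)

vv = np.random.rand(128, 128).astype(np.float32)
vh = np.random.rand(128, 128).astype(np.float32)
s1 = build_s1_input(vv, vh)
print(s1.shape)  # (3, 128, 128)
```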
Scale-Adaptive Architecture
Ran inference with AnySat's JEPA-based architecture (125M parameters), which handles varying spatial resolutions from 10 m to 60 m through scale-adaptive patching.
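The scale adaptivity can be illustrated with simple resolution arithmetic: AnySat defines patches in metres rather than pixels, so bands at different ground sample distances map to the same patch grid. The tile and patch sizes below are illustrative values, not settings taken from this project.

```python
def pixels_per_side(tile_m: int, gsd_m: int) -> int:
    """Pixel extent of a square tile of tile_m metres at a given
    ground sample distance (GSD)."""
    return tile_m // gsd_m

def patches_per_side(tile_m: int, patch_m: int) -> int:
    """Patches are defined in metres, so the grid is the same for
    every sensor regardless of its pixel resolution."""
    return tile_m // patch_m

# The same 1280 m tile arrives at different pixel sizes per band:
print(pixels_per_side(1280, 10))   # 128 px (10 m bands)
print(pixels_per_side(1280, 60))   # 21 px  (60 m bands)
# ...but yields one shared patch grid once patching is done in metres:
print(patches_per_side(1280, 80))  # 16 patches per side either way
```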
Three Output Modes
- Tile mode: [B, 768] embeddings for scene classification
- Patch mode: [B, P, P, 768] grid for patch-level tasks
- Dense mode: [B, H, W, 1536] per-pixel features for segmentation
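The three output modes can be sanity-checked against expected tensor shapes. The helper below is a hypothetical sketch built only from the dimensions listed above (768-dim tile/patch features, 1536-dim dense features); it is not AnySat's actual API.

```python
def expected_shape(mode: str, batch: int, height: int, width: int,
                   patch_px: int, dim: int = 768) -> tuple:
    """Expected feature shape per output mode, given an input of
    (height, width) pixels and a patch size of patch_px pixels."""
    if mode == "tile":
        return (batch, dim)                                   # one vector per scene
    if mode == "patch":
        return (batch, height // patch_px, width // patch_px, dim)  # P x P grid
    if mode == "dense":
        return (batch, height, width, 2 * dim)                # per-pixel features
    raise ValueError(f"unknown output mode: {mode}")

print(expected_shape("tile", 4, 128, 128, 16))   # (4, 768)
print(expected_shape("patch", 4, 128, 128, 16))  # (4, 8, 8, 768)
print(expected_shape("dense", 4, 128, 128, 16))  # (4, 128, 128, 1536)
```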
Technical Complexity
- Handled 11 different sensor types (including Sentinel-1/2, NAIP, ALOS-2, and aerial imagery)
- Implemented temporal dimension processing for multi-date analysis
- Managed varying channel counts across sensors (2–13 channels)
- Built preprocessing for SAR-optical fusion with proper normalization
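A minimal sketch of the normalization step for SAR-optical fusion. It assumes dB-scaled Sentinel-1 backscatter clipped to a fixed window and Sentinel-2 L2A digital numbers divided by 10 000; both the [-25, 0] dB window and the 10 000 scale factor are common remote-sensing conventions, not values confirmed by the project.

```python
import numpy as np

def normalize_sar_db(band: np.ndarray, lo: float = -25.0, hi: float = 0.0) -> np.ndarray:
    """Clip SAR backscatter (in dB) to [lo, hi] and rescale to [0, 1].
    The window is an assumed convention, not a value from the repo."""
    return (np.clip(band, lo, hi) - lo) / (hi - lo)

def normalize_s2_reflectance(band: np.ndarray, scale: float = 10000.0) -> np.ndarray:
    """Convert Sentinel-2 digital numbers to [0, 1] surface reflectance
    (assumed L2A convention: DN / 10000, clipped)."""
    return np.clip(band / scale, 0.0, 1.0)

raw_s1 = np.array([-30.0, -12.5, 3.0], dtype=np.float32)
# -30 dB clips to the floor, -12.5 dB is mid-window, 3 dB clips to the ceiling:
print(normalize_sar_db(raw_s1))  # values 0.0, 0.5, 1.0
```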
Model Architecture
- Vision Transformer: 768-dim embeddings, 6+1 blocks, 12 attention heads
- Modality projectors for 11+ sensor types
- Scale-adaptive JEPA (Joint-Embedding Predictive Architecture)
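The figures above imply a 64-dimensional subspace per attention head, and a back-of-envelope count suggests the 6+1 transformer blocks account for only about 50M of the 125M parameters, with the remainder presumably in the modality projectors and other components. The arithmetic below is a rough estimate that ignores biases, layer norms, and positional embeddings.

```python
EMBED_DIM = 768
NUM_HEADS = 12

# Each attention head operates in a 768 / 12 = 64-dimensional subspace.
head_dim = EMBED_DIM // NUM_HEADS
assert EMBED_DIM % NUM_HEADS == 0 and head_dim == 64

# Rough per-block parameter count for a standard ViT block:
#   QKV + output projections: 4 * d^2
#   MLP with 4x expansion:    8 * d^2
params_per_block = 12 * EMBED_DIM ** 2
print(params_per_block)      # 7077888 (~7.1M per block)
print(7 * params_per_block)  # 49545216 (~49.5M for 6+1 blocks)
```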