EarthSynth: Generating Informative Earth Observation with Diffusion Models

ArXiv 2025


1 Tsinghua University; 2 University of Sydney; 3 INSAIT, Sofia University "St. Kliment Ohridski"; 4 University of Chinese Academy of Sciences; 5 Wuhan University; 6 University of Science and Technology of China; 7 Shanghai Jiao Tong University



EarthSynth is a diffusion-based generative foundation model pretrained on multi-source, multi-category data. It synthesizes Earth observation imagery conditioned on a semantic mask and text, for use in downstream remote sensing image interpretation tasks.

Motivations

Remote sensing image (RSI) interpretation is fundamentally constrained by severe class imbalance and the limited availability of high-quality labeled data, both of which impede the development of robust models for downstream tasks. Labeling RSIs typically requires domain-specific expertise and substantial manual effort, so large-scale annotation is time-consuming and costly. An important research objective is therefore to exploit existing labeled Earth observation (EO) datasets more effectively by uncovering latent relationships among samples, improving data efficiency.

Contributions

🌍 EarthSynth and EarthSynth-180K

  • We propose EarthSynth, a diffusion-based generative foundation model trained on EarthSynth-180K, a dataset of 180K samples with aligned image, semantic mask, and text. It provides a unified solution for multi-task generation.

🛰️ Counterfactual Composition (CF-Comp) and R-Filter

  • EarthSynth employs the CF-Comp strategy to balance layout controllability and category diversity during training, enabling fine-grained layout control for RSI generation. It further incorporates the R-Filter to select the more informative, higher-quality synthesized samples.
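
The composition procedure behind CF-Comp is not described here. As a rough illustration only, one plausible reading is mask-guided composition: pixels of one category from a source image are pasted onto a different background, giving a scene that need not be realistic. The helper name `compose_counterfactual`, its arguments, and the use of 0 as the background label are all assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def compose_counterfactual(src_image, src_mask, bg_image, category_id):
    """Paste the pixels of one category from src_image onto bg_image.

    Hypothetical helper for illustration; not the authors' code.
    src_image, bg_image: (H, W, C) arrays of the same shape.
    src_mask: (H, W) integer label map aligned with src_image.
    """
    if src_image.shape != bg_image.shape:
        raise ValueError("images must share the same shape")
    if src_mask.shape != src_image.shape[:2]:
        raise ValueError("mask must match the image height and width")

    # Boolean region covered by the chosen category.
    region = src_mask == category_id

    # Take source pixels inside the region and background pixels elsewhere.
    composed = np.where(region[..., None], src_image, bg_image)

    # The new mask keeps only the pasted category; 0 is assumed to be background.
    composed_mask = np.where(region, category_id, 0).astype(src_mask.dtype)
    return composed, composed_mask
```

The composed image and its mask stay pixel-aligned, which is the property a mask-conditioned generator needs from its training pairs.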

EarthSynth-180K Dataset

EarthSynth-180K is derived from the OEM, LoveDA, DeepGlobe, SAMRS, and LAE-1M datasets. It is further enhanced with mask and text prompt conditions, making it suitable for training a diffusion-based generative foundation model. The dataset is constructed using the Random Cropping and Category Augmentation strategies.
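
The two construction strategies are named but not specified here. The sketch below shows one way they could look, under stated assumptions: Random Cropping takes the same random window from an image and its mask, and Category Augmentation splits a multi-class mask into one binary mask per category, each paired with a text prompt. The function names, the prompt template, and the background label 0 are hypothetical.

```python
import numpy as np

def random_crop_pair(image, mask, size, rng):
    """Crop the same random size x size window from an image and its mask."""
    h, w = mask.shape
    if size > h or size > w:
        raise ValueError("crop size exceeds image size")
    # rng.integers excludes the upper bound, so the window always fits.
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return (image[top:top + size, left:left + size],
            mask[top:top + size, left:left + size])

def split_by_category(mask, class_names, background_id=0):
    """Turn a multi-class mask into (binary mask, prompt) pairs, one per category.

    The prompt template is an assumption made for illustration.
    """
    samples = []
    for cid in np.unique(mask):
        cid = int(cid)
        if cid == background_id:
            continue
        binary = (mask == cid).astype(np.uint8)
        prompt = f"A satellite image of {class_names[cid]}"
        samples.append((binary, prompt))
    return samples
```

Cropping image and mask with the same offsets keeps the pair aligned, and splitting by category yields several single-category conditions from one annotated tile.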

EarthSynth Model

EarthSynth is trained with the CF-Comp strategy on a mixed data distribution of real samples and logically unrealistic compositions. It learns pixel-level properties of remote sensing imagery across multiple dimensions and builds a unified process for conditional diffusion training and synthesis.
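
For orientation, conditional diffusion training is commonly written as a noise-prediction objective. The form below is the standard one with the two conditions named in this document, text and semantic mask, added as inputs to the denoiser. Whether EarthSynth uses exactly this loss, and whether x_0 is an image or a latent code, is not stated here, so the symbols are assumptions: x_0 is a clean sample, t a timestep, \bar{\alpha}_t the cumulative noise schedule, and \epsilon_\theta the denoising network.

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}
\left[ \left\| \epsilon - \epsilon_\theta\!\left(x_t,\, t,\, c_{\text{text}},\, c_{\text{mask}}\right) \right\|_2^2 \right]
```

At synthesis time the same network is applied in reverse, starting from noise and denoising step by step under the chosen text and mask.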

Citation

Please consider citing us if you find our dataset or model useful.

      @misc{pan2025earthsynthgeneratinginformativeearth,
        title={EarthSynth: Generating Informative Earth Observation with Diffusion Models}, 
        author={Jiancheng Pan and Shiye Lei and Yuqian Fu and Jiahao Li and Yanxing Liu and Yuze Sun and Xiao He and Long Peng and Xiaomeng Huang and Bo Zhao},
        year={2025},
        eprint={2505.12108},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2505.12108}, 
        }