PIR: A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval

ACM MM 2023 (Oral)

Jiancheng Pan¹, Qing Ma¹, Cong Bai¹

¹Zhejiang University of Technology

Prior Instruction Representation

Figure 1 highlights why remote sensing image-text retrieval is harder than natural-scene retrieval: the shared embedding space is more entangled, and irrelevant scene content more easily contaminates the learned representation.

Entangled subspace: image and text embeddings are less cleanly separated, creating broader semantic confusion zones.
Strong semantic noise: background patterns, scale variation, and irrelevant objects can dominate visual tokens.
Core implication: the bottleneck is not only cross-modal matching, but how to form unbiased representations before alignment.
PIR's response: inject prior scene knowledge early, so the model becomes prior-guided, noise-aware, and structurally regularized.

Figure 1. Comparison of natural-scene and remote sensing subspaces, followed by the PIR pipeline for reducing semantic noise.

Framework

Figure 2 shows PIR as a dual-encoder retrieval framework augmented with a prior-guided instruction path and a structured objective. The model keeps the standard Swin Transformer plus BERT backbone, but changes where domain knowledge enters the pipeline.

The backbone remains a dual-encoder architecture, which preserves retrieval efficiency.
Prior knowledge enters through a dedicated instruction encoder rather than ad hoc feature fusion.
Optimization is coupled to the architecture so representation refinement and alignment are consistent.
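The efficiency point above can be illustrated with a minimal sketch of dual-encoder retrieval: gallery embeddings are computed once, and each query only needs a similarity lookup and a sort. The toy vectors below stand in for encoder outputs, and the helper names (`l2_normalize`, `cosine_matrix`, `top_k`) are hypothetical, not taken from the PIR codebase.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_matrix(queries, gallery):
    """Cosine similarity between every query and every gallery embedding."""
    q = [l2_normalize(v) for v in queries]
    g = [l2_normalize(v) for v in gallery]
    return [[sum(a * b for a, b in zip(qi, gj)) for gj in g] for qi in q]


def top_k(sim_row, k=5):
    """Indices of the k most similar gallery items, best first."""
    order = sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)
    return order[:k]


# Toy embeddings standing in for image and text encoder outputs (assumed values).
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embeddings = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]

# Text-to-image retrieval: each text query is ranked against all images.
sim = cosine_matrix(text_embeddings, image_embeddings)
ranked = [top_k(row, k=2) for row in sim]
print(ranked)
```

Because the two encoders never interact at query time, the image side of `sim` can be cached and reused across queries, which is the property the dual-encoder design preserves.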

Figure 2. Overview of PIR with Spatial-PAE, Temporal-PAE, VIR, LCA, contrastive loss, and affiliation loss.

Mechanism

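PIR builds its method around a reusable Progressive Attention Encoder (PAE), then specializes it into a visual refinement path, Vision Instruction Representation (VIR) built on Spatial-PAE, and a textual refinement path, Language Cycle Attention (LCA) built on Temporal-PAE.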

The encoder layer is defined as a reusable interaction mechanism rather than a one-off module.
Progressive attention separates how information is passed from what domain signal is injected.
This layered design makes the method easier to interpret as an architecture, not a collection of tricks.

Figure 3. The Transformer Encoder Layer used as the basic building block for PIR.

Figure 4. Spatial-PAE and Temporal-PAE deliver two different message-passing schemes between encoder layers.

Objective

PIR combines contrastive loss with affiliation loss so the model optimizes both pair-wise alignment and class-level structure.

Contrastive loss provides instance-level alignment between paired images and texts.
Affiliation loss adds class-level structure, so embeddings of the same scene category cluster together while different categories remain separable.
Together they optimize both local pair quality and global space geometry.
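As a rough illustration of how a pair-wise term and a cluster-wise term can be combined, the sketch below uses a symmetric InfoNCE loss for the pair-wise part and a squared distance to class centers of the other modality for the cluster-wise part. Both choices are assumptions made for illustration: the paper's exact formulations may differ, and the names `info_nce`, `affiliation_loss`, `total_loss`, and `center_scale`, as well as the additive weighting, are hypothetical.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE on a square similarity matrix; the diagonal holds true pairs."""
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            peak = max(logits)  # subtract the max for a stable log-sum-exp
            log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(columns))


def class_centers(embeddings, labels):
    """Mean embedding per scene-class label."""
    grouped = {}
    for emb, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(emb)
    return {label: [sum(dim) / len(group) for dim in zip(*group)]
            for label, group in grouped.items()}


def affiliation_loss(embeddings, other_embeddings, labels):
    """Mean squared distance from each embedding to its class center in the other modality."""
    centers = class_centers(other_embeddings, labels)
    total = 0.0
    for emb, label in zip(embeddings, labels):
        total += sum((a - b) ** 2 for a, b in zip(emb, centers[label]))
    return total / len(embeddings)


def total_loss(img, txt, labels, center_scale=1.0):
    """Pair-wise term plus a weighted cluster-wise term (the weighting scheme is assumed)."""
    sim = [[sum(a * b for a, b in zip(i, t)) for t in txt] for i in img]
    pair_term = info_nce(sim)
    cluster_term = 0.5 * (affiliation_loss(img, txt, labels)
                          + affiliation_loss(txt, img, labels))
    return pair_term + center_scale * cluster_term


# Toy batch: pair i shares one scene label across both modalities (assumed values).
img = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
txt = [[0.9, 0.0], [1.0, 0.1], [0.1, 1.0], [0.0, 0.9]]
labels = [0, 0, 1, 1]
print(total_loss(img, txt, labels))
```

The pair-wise term only looks at individual matches, while the cluster-wise term depends on the scene labels, which is why the two can shape the embedding space in complementary ways.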

Figure 5. Pair-wise loss versus cluster-wise affiliation loss.

Results

On both RSICD and RSITMD, PIR outperforms strong Transformer baselines built on Swin Transformer and BERT.

PIR improves over strong Swin Transformer plus BERT baselines on both benchmarks.
The gains appear in mean recall and remain consistent across bidirectional retrieval settings.
The result pattern supports the paper's claim that denoising and structure regularization are complementary.

RSICD mR: 24.46
RSITMD mR: 38.24
Improvement on RSICD: +4.0%
Improvement on RSITMD: +4.1%

Table 1. Comparison results on RSICD and RSITMD

PIR improves mean recall by 4.0% on RSICD and 4.1% on RSITMD over the strongest Transformer baseline reported in the paper.
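For readers unfamiliar with the metric, the sketch below shows one common way mean recall (mR) is computed in image-text retrieval: the average of Recall@K over several K values in both retrieval directions. It assumes the usual K values of 1, 5, and 10 and a one-to-one pairing where query i matches gallery item i. The benchmarks pair several captions with each image, so an actual evaluation script would need to handle that mapping. The function names are hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries (rows) whose true match (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in order[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


def mean_recall(sim_i2t, ks=(1, 5, 10)):
    """Average Recall@K over both directions; columns of sim_i2t give text-to-image."""
    sim_t2i = [list(col) for col in zip(*sim_i2t)]
    scores = [recall_at_k(s, k) for s in (sim_i2t, sim_t2i) for k in ks]
    return sum(scores) / len(scores)


# Toy 3x3 similarity matrix (assumed values); rows are images, columns are texts.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]
print(mean_recall(sim, ks=(1, 2)))
```

Averaging over both directions and several cutoffs is what makes mR a single summary number for bidirectional retrieval quality.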

Retrieval

Figure 6 shows that PIR returns semantically closer matches and ranks correct results earlier in both retrieval directions.

Correct matches appear earlier in the ranked list.
Returned negatives are often semantically close, showing the task remains challenging.
The figure provides qualitative evidence that the learned space is better organized.

Figure 6. Top-5 retrieval visualization for image-to-text and text-to-image queries.

Analysis

The ablations test whether gains come from VIR, LCA, affiliation loss, or their combination.
Filter-size experiments clarify how aggressively the model should suppress noisy visual features.
Together, the tables explain where the gains come from and how stable the visual filtering strategy is.

Table 2. Module ablations on RSITMD

Table 3. VIR filter size on RSITMD

Table 4. Instruction encoder strategy on RSITMD

Sensitivity

Figure 7 shows that structural regularization must be balanced: too weak leaves ambiguity, and too strong hurts retrieval.

The center scale directly changes how tightly class structure is enforced in the shared space.
Moderate regularization improves retrieval, while overly strong regularization can over-constrain the embeddings.
The best setting reported in the paper is a center scale of lambda = 1.

Figure 7. Center-scale analysis and t-SNE visualization of the learned embedding space.

Acknowledgments

This project builds upon the open-source X-VLM codebase by Zeng et al. and uses the RSICD and RSITMD datasets. We thank the authors and maintainers of these open resources for making this work possible.

Citation


@inproceedings{pan2023prior,
  title={A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval},
  author={Pan, Jiancheng and Ma, Qing and Bai, Cong},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={611--620},
  year={2023}
}