PIR: A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval

ACM MM 2023 (Oral)

Jiancheng Pan¹, Qing Ma¹, Cong Bai¹

¹Zhejiang University of Technology


Prior Instruction Representation

Figure 1 highlights why remote sensing image-text retrieval is harder than natural-scene retrieval: the shared embedding space is more entangled, and irrelevant scene content more easily contaminates the learned representation.

  • Entangled subspace: image and text embeddings are less cleanly separated, creating broader semantic confusion zones.
  • Strong semantic noise: background patterns, scale variation, and irrelevant objects can dominate visual tokens.
  • Core implication: the bottleneck is not only cross-modal matching, but how to form unbiased representations before alignment.
  • PIR's response: inject prior scene knowledge early, so the model becomes prior-guided, noise-aware, and structurally regularized.

Figure 1. Comparison of natural-scene and remote sensing subspaces, followed by the PIR pipeline for reducing semantic noise.

Framework

Figure 2 shows PIR as a dual-encoder retrieval framework augmented with a prior-guided instruction path and a structured objective. The model keeps the standard Swin Transformer plus BERT backbone, but changes where domain knowledge enters the pipeline.

  • The backbone remains a dual-encoder architecture, which preserves retrieval efficiency.
  • Prior knowledge enters through a dedicated instruction encoder rather than ad hoc feature fusion.
  • Optimization is coupled to the architecture so representation refinement and alignment are consistent.

Figure 2. Overview of PIR with Spatial-PAE, Temporal-PAE, Vision Instruction Representation (VIR), Language Cycle Attention (LCA), contrastive loss, and affiliation loss.
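The efficiency argument for a dual encoder can be illustrated with a minimal sketch: because images and texts are embedded independently, retrieval reduces to a similarity lookup over precomputed vectors. The function names and toy embeddings below are hypothetical stand-ins for the encoder outputs, not part of the PIR codebase.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]

def rank_candidates(scores_row, k=5):
    """Indices of the k highest-scoring candidates, best first."""
    order = sorted(range(len(scores_row)), key=lambda j: scores_row[j], reverse=True)
    return order[:k]

# Toy 2-D embeddings (hypothetical values) standing in for encoder outputs.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]

sims = similarity_matrix(image_embs, text_embs)
top_texts_for_image0 = rank_candidates(sims[0], k=2)
```

In this setup the text embeddings can be computed once and reused for every image query, which is the property the dual-encoder design preserves.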

Mechanism

PIR builds its method around a reusable Progressive Attention Encoder (PAE), then specializes it into visual and textual refinement paths.

  • The encoder layer is defined as a reusable interaction mechanism rather than a one-off module.
  • Progressive attention separates how information is passed from what domain signal is injected.
  • This layered design makes the method easier to interpret as an architecture, not a collection of tricks.

Figure 3. The Transformer Encoder Layer used as the basic building block for PIR.

Figure 4. Spatial-PAE and Temporal-PAE deliver two different message-passing schemes between encoder layers.

Objective

PIR combines contrastive loss with affiliation loss so the model optimizes both pair-wise alignment and class-level structure.

  • Contrastive loss provides instance-level alignment between paired images and texts.
  • Affiliation loss adds class-level structure so embeddings of similar scenes remain separable.
  • Together they optimize both local pair quality and global space geometry.

Figure 5. Pair-wise loss versus cluster-wise affiliation loss.
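To make the pair-wise versus cluster-wise distinction concrete, here is a rough sketch of the two kinds of terms. The contrastive term is a standard symmetric InfoNCE loss. The affiliation term is only an assumed stand-in (mean squared distance to a class center) and may differ from the formulation in the paper. All function names, and the `affiliation_weight` parameter, are hypothetical.

```python
import math

def log_softmax_at(row, target, temperature):
    """Log-probability of `target` under a temperature-scaled softmax of `row`."""
    scaled = [s / temperature for s in row]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return scaled[target] - log_norm

def contrastive_loss(sims, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix; image i matches text i."""
    n = len(sims)
    cols = [list(col) for col in zip(*sims)]
    i2t = -sum(log_softmax_at(sims[i], i, temperature) for i in range(n)) / n
    t2i = -sum(log_softmax_at(cols[i], i, temperature) for i in range(n)) / n
    return 0.5 * (i2t + t2i)

def class_centers(embs, labels):
    """Mean embedding per scene-class label."""
    sums, counts = {}, {}
    for vec, lab in zip(embs, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for d, x in enumerate(vec):
            acc[d] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def affiliation_loss(embs, labels, centers):
    """Assumed form: mean squared distance from each embedding to its class center."""
    total = 0.0
    for vec, lab in zip(embs, labels):
        total += sum((x - c) ** 2 for x, c in zip(vec, centers[lab]))
    return total / len(embs)

def combined_loss(sims, embs, labels, affiliation_weight=1.0):
    """Instance-level term plus a weighted class-level term (weighting is an assumption)."""
    centers = class_centers(embs, labels)
    return contrastive_loss(sims) + affiliation_weight * affiliation_loss(embs, labels, centers)
```

The contrastive term only sees matched pairs within a batch, while the center-based term sees class labels, which is why the two can shape the embedding space at different granularities.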

Results

On both RSICD and RSITMD, PIR outperforms strong Transformer baselines built on Swin Transformer and BERT.

  • The gains appear in mean recall and remain consistent across bidirectional retrieval settings.
  • The result pattern supports the paper's claim that denoising and structure regularization are complementary.

  • RSICD mR: 24.46 (+4.0% improvement)
  • RSITMD mR: 38.24 (+4.1% improvement)

Table 1. Comparison results on RSICD and RSITMD

PIR improves mean recall by 4.0% on RSICD and 4.1% on RSITMD over the strongest Transformer baseline reported in the paper.
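For readers unfamiliar with the metric, mean recall (mR) in image-text retrieval is commonly computed as the average of R@1, R@5, and R@10 over both retrieval directions. The sketch below assumes a square similarity matrix in which image i matches text i. The benchmarks may pair each image with several captions, so the function names and the one-to-one simplification here are assumptions for illustration.

```python
def recall_at_k(sims, k):
    """Percentage of queries (rows) whose matching candidate ranks in the top k."""
    hits = 0
    for i, row in enumerate(sims):
        # Zero-based rank of the ground truth: how many candidates score strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sims)

def mean_recall(sims, ks=(1, 5, 10)):
    """Average of R@k over both directions (rows as queries, then columns as queries)."""
    transposed = [list(col) for col in zip(*sims)]
    scores = [recall_at_k(matrix, k) for matrix in (sims, transposed) for k in ks]
    return sum(scores) / len(scores)

# Toy similarity matrix (hypothetical values): rows are images, columns are texts.
toy_sims = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
image_to_text_r1 = recall_at_k(toy_sims, 1)
toy_mr = mean_recall(toy_sims, ks=(1, 2))
```

Because mR averages six recall values, a gain in mR reflects improvement across ranking depths and directions rather than at a single cutoff.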

Retrieval

Figure 6 shows that PIR returns semantically closer matches and ranks correct results earlier in both retrieval directions.

  • Correct matches appear earlier in the ranked list.
  • Returned negatives are often semantically close, showing the task remains challenging.
  • The figure provides qualitative evidence that the learned space is better organized.

Figure 6. Top-5 retrieval visualization for image-to-text and text-to-image queries.

Analysis

  • The ablations test whether gains come from VIR, LCA, affiliation loss, or their combination.
  • Filter-size experiments clarify how aggressively the model should suppress noisy visual features.
  • Together, Tables 2 and 3 explain both where the gains come from and how stable the visual filtering strategy is.

Table 2. Module ablations on RSITMD

Table 3. VIR filter size on RSITMD

Table 4. Instruction encoder strategy on RSITMD


Sensitivity

Figure 7 shows that structural regularization must be balanced: too weak leaves ambiguity, and too strong hurts retrieval.

  • The center scale directly changes how tightly class structure is enforced in the shared space.
  • Moderate regularization improves retrieval, while overly strong regularization can over-constrain the embeddings.
  • The best setting reported in the paper is λ = 1.

Figure 7. Center-scale analysis and t-SNE visualization of the learned embedding space.

Acknowledgments

This project builds upon the open-source X-VLM codebase by Zeng et al. and uses the RSICD and RSITMD datasets. We thank the authors and maintainers of these open resources for making this work possible.

Citation


@inproceedings{pan2023prior,
  title={A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval},
  author={Pan, Jiancheng and Ma, Qing and Bai, Cong},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={611--620},
  year={2023}
}