Figure 2 shows PIR as a dual-encoder retrieval framework augmented with a prior-guided instruction path and a structured objective. The model keeps the standard Swin Transformer plus BERT backbone, but changes where domain knowledge enters the pipeline.
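To make the dual-encoder setup concrete, here is a minimal toy sketch of how such a framework scores image-text pairs: each modality is embedded independently into a shared space, then matched by cosine similarity. The projection matrices and dimensions below are random stand-ins, not PIR's actual Swin/BERT encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(features, projection):
    """Project features into the joint space and L2-normalize,
    as dual-encoder retrievers typically do."""
    z = features @ projection
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Hypothetical feature sizes; real PIR uses Swin and BERT outputs.
dim_img, dim_txt, dim_joint = 768, 512, 256
W_img = rng.normal(size=(dim_img, dim_joint))
W_txt = rng.normal(size=(dim_txt, dim_joint))

images = rng.normal(size=(4, dim_img))   # 4 image feature vectors
texts = rng.normal(size=(6, dim_txt))    # 6 caption feature vectors

img_emb = encode(images, W_img)
txt_emb = encode(texts, W_txt)

# Similarity matrix: rows are images, columns are texts.
sim = img_emb @ txt_emb.T                # shape (4, 6)

# Text-to-image retrieval: rank all images for the first caption.
ranking = np.argsort(-sim[:, 0])
print(sim.shape, ranking)
```

Because both encoders run independently, all embeddings can be precomputed and retrieval reduces to one matrix multiplication, which is what makes the dual-encoder design attractive for large galleries.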
Figure 2. Overview of PIR with Spatial-PAE, Temporal-PAE, VIR, LCA, contrastive loss, and affiliation loss.
PIR builds its method around a reusable Progressive Attention Encoder, then specializes it into visual and textual refinement paths.
Figure 3. The Transformer Encoder Layer used as the basic building block for PIR.
Figure 4. Spatial-PAE and Temporal-PAE deliver two different message-passing schemes between encoder layers.
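The two schemes can be contrasted with a toy illustration (not the paper's exact formulation): a Spatial-PAE-style path lets every layer cross-attend over a fixed bank of external prior/instruction features, while a Temporal-PAE-style path feeds each layer's output forward so that adjacent layers exchange messages. All weights and shapes below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(q, k, v):
    """Plain scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d, layers = 32, 3
tokens = rng.normal(size=(5, d))   # toy token features entering the stack
prior = rng.normal(size=(7, d))    # toy external prior/instruction features

# Spatial-PAE style: every layer cross-attends to the same prior bank,
# injecting external knowledge at each depth.
x = tokens
for _ in range(layers):
    x = x + attention(x, prior, prior)

# Temporal-PAE style: each layer attends to the previous layer's output,
# so messages flow between adjacent layers along the depth axis.
prev, cur = tokens, tokens
for _ in range(layers):
    nxt = cur + attention(cur, prev, prev)
    prev, cur = cur, nxt

print(x.shape, cur.shape)
```

The point of the contrast is where information comes from: the first variant keeps pulling from an external source, the second refines purely from the stack's own history.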
PIR combines contrastive loss with affiliation loss so the model optimizes both pair-wise alignment and class-level structure.
Figure 5. Pair-wise loss versus cluster-wise affiliation loss.
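The pair-wise versus cluster-wise distinction can be sketched as follows: a symmetric InfoNCE-style contrastive term aligns matched image-text pairs, while a simple "affiliation"-style term pulls embeddings toward their class centers. PIR's actual affiliation loss may differ in form; this sketch, with a hypothetical weighting `lam`, only illustrates the two levels of supervision.

```python
import numpy as np

rng = np.random.default_rng(2)

def info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE over matched image/text pairs (diagonal positives)."""
    sim = img @ txt.T / tau
    logp_i2t = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    logp_t2i = sim.T - np.log(np.exp(sim.T).sum(axis=1, keepdims=True))
    n = len(img)
    return -(np.trace(logp_i2t) + np.trace(logp_t2i)) / (2 * n)

def affiliation(emb, labels):
    """Mean squared distance of each embedding to its class center."""
    loss = 0.0
    for c in np.unique(labels):
        members = emb[labels == c]
        center = members.mean(axis=0)
        loss += ((members - center) ** 2).sum()
    return loss / len(emb)

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img = normalize(rng.normal(size=(8, 16)))
txt = normalize(rng.normal(size=(8, 16)))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # toy scene-class labels

lam = 0.5  # hypothetical weight between pair-wise and cluster-wise terms
total = info_nce(img, txt) + lam * (affiliation(img, labels) + affiliation(txt, labels))
print(total)
```

The contrastive term only cares that each image is closer to its own caption than to others; the affiliation term additionally shapes the embedding space so that samples of the same scene class form tight clusters.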
On both RSICD and RSITMD, PIR outperforms strong Transformer baselines built on Swin Transformer and BERT.
Table 1. Comparison results on RSICD and RSITMD.
Figure 6 shows that PIR returns semantically closer matches and ranks correct results earlier in both retrieval directions.
Figure 6. Top-5 retrieval visualization for image-to-text and text-to-image queries.
Table 2. Module ablations on RSITMD.
Table 3. VIR filter size on RSITMD.
Table 4. Instruction encoder strategy on RSITMD.
Figure 7 shows that structural regularization must be balanced: too weak a constraint leaves class boundaries ambiguous in the embedding space, while too strong a constraint hurts retrieval accuracy.
Figure 7. Center-scale analysis and t-SNE visualization of the learned embedding space.
This work was partially supported by the Natural Science Foundation of China under Grant Nos. 61976192 and 62102365, and by the Zhejiang Provincial Natural Science Foundation of China under Grant No. LR21F020002.
@inproceedings{pan2023prior,
title={A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval},
author={Pan, Jiancheng and Ma, Qing and Bai, Cong},
booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
pages={611--620},
year={2023}
}