Figure 2 shows PIR as a dual-encoder retrieval framework augmented with a prior-guided instruction path and a structured objective. The model keeps the standard Swin Transformer plus BERT backbone, but changes where domain knowledge enters the pipeline.
Figure 2. Overview of PIR with Spatial-PAE, Temporal-PAE, VIR, LCA, contrastive loss, and affiliation loss.
PIR builds its method around a reusable Progressive Attention Encoder, then specializes it into visual and textual refinement paths.
Figure 3. The Transformer Encoder Layer used as the basic building block for PIR.
Figure 4. Spatial-PAE and Temporal-PAE use two different message-passing schemes between encoder layers.
PIR combines contrastive loss with affiliation loss so the model optimizes both pair-wise alignment and class-level structure.
Figure 5. Pair-wise loss versus cluster-wise affiliation loss.
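As a rough illustration of how a pair-wise term and a cluster-wise term can be combined, the sketch below adds a weighted affiliation term to a contrastive term. It is a minimal sketch under stated assumptions, not the paper's formulation: the contrastive term is a symmetric InfoNCE stand-in, the affiliation term is a mean squared distance to per-class centers, and the names `contrastive_loss`, `affiliation_loss`, `total_loss`, `temperature`, and `center_scale` are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize so dot products behave like cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss(img, txt, temperature=0.07):
    """Pair-wise term (assumed: symmetric InfoNCE over matched pairs).

    img[i] and txt[i] are treated as the positive pair; every other
    combination in the batch is treated as a negative.
    """
    n = len(img)
    sims = [[dot(img[i], txt[j]) / temperature for j in range(n)]
            for i in range(n)]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*sims)]  # text-to-image direction
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))


def affiliation_loss(emb, labels):
    """Cluster-wise term (assumed: mean squared distance to class center)."""
    centers = {}
    for c in set(labels):
        members = [e for e, l in zip(emb, labels) if l == c]
        centers[c] = [sum(col) / len(members) for col in zip(*members)]
    sq_dists = [sum((a - b) ** 2 for a, b in zip(e, centers[l]))
                for e, l in zip(emb, labels)]
    return sum(sq_dists) / len(emb)


def total_loss(img, txt, labels, center_scale=1.0):
    # center_scale is a hypothetical weight on the cluster-wise term.
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    pair_term = contrastive_loss(img, txt)
    cluster_term = affiliation_loss(img + txt, labels + labels)
    return pair_term + center_scale * cluster_term
```

In this sketch, setting `center_scale` to zero recovers a purely pair-wise objective, while larger values pull image and text embeddings of the same class toward a shared center.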
On both RSICD and RSITMD, PIR outperforms strong Transformer baselines built on Swin Transformer and BERT.
Table 1. Comparison results on RSICD and RSITMD
Figure 6 shows that PIR returns semantically closer matches and ranks correct results earlier in both retrieval directions.
Figure 6. Top-5 retrieval visualization for image-to-text and text-to-image queries.
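Ranking correct results earlier is commonly quantified with Recall@K, the percentage of queries whose top-K list contains at least one ground-truth match. The sketch below assumes that metric and a precomputed query-by-candidate similarity matrix; the function name `recall_at_k` and its arguments are hypothetical and are not taken from the PIR code.

```python
def recall_at_k(sim, relevant, k):
    """Percentage of queries with a relevant candidate in the top k.

    sim[q][c]   : similarity between query q and candidate c
    relevant[q] : set of candidate indices that are correct for query q
    """
    hits = 0
    for q, row in enumerate(sim):
        # Sort candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if any(c in relevant[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)
```

The same function covers both directions: pass an image-by-text matrix for image-to-text retrieval and its transpose for text-to-image retrieval, with `relevant` built accordingly.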
Table 2. Module ablations on RSITMD
Table 3. VIR filter size on RSITMD
Table 4. Instruction encoder strategy on RSITMD
Figure 7 shows that structural regularization must be balanced: if it is too weak, ambiguity remains, and if it is too strong, retrieval performance suffers.
Figure 7. Center-scale analysis and t-SNE visualization of the learned embedding space.
@inproceedings{pan2023prior,
title={A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval},
author={Pan, Jiancheng and Ma, Qing and Bai, Cong},
booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
pages={611--620},
year={2023}
}