Fine-grained label learning via siamese network for cross-modal information retrieval

Yiming Xu, Jing Yu, Jingjing Guo, Yue Hu, Jianlong Tan · ICCS (2019)

Problem

Cross-modal information retrieval searches for semantically relevant data in one modality given a query in another. For text–image retrieval, the common solution maps both modalities into a shared semantic space and measures similarity there directly, training on positive and negative pairs. Existing work treats every positive pair as equally positive and every negative pair as equally negative, yet many positives resemble negatives to some degree, and vice versa. These “hard examples” are exactly what existing models handle poorly.

Approach

We assign fine-grained labels that capture each example’s degree of hardness. A siamese network operates on both positive and negative examples to obtain their semantic similarities: for each text–image pair, it compares the image’s text description with the text in the pair, and the resulting similarity determines the pair’s fine-grained label. The labels feed a pairwise similarity loss that increases the influence of hard examples while maximizing similarity for relevant text–image pairs and minimizing it for irrelevant ones.
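
The idea of weighting a pairwise loss by hardness can be illustrated with a minimal sketch. It is not the formulation used in the paper: the function names, the linear mapping from text similarity to weight, the squared-error form of the loss, and the margin value are all assumptions chosen for illustration. Here `text_sim` stands for the similarity between the image’s text description and the text in the pair (assumed to come from the siamese network), and `cross_sim` stands for the similarity the retrieval model predicts for the text–image pair.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hardness_weight(text_sim, is_positive):
    """Hypothetical fine-grained label, returned as a weight in [1, 2].

    Assumption: a positive pair is hard when its texts look dissimilar,
    and a negative pair is hard when its texts look similar.
    """
    s = min(max(text_sim, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 + ((1.0 - s) if is_positive else s)


def weighted_pair_loss(cross_sim, text_sim, is_positive, margin=0.2):
    """Pairwise similarity loss scaled by the hardness weight.

    Positive pairs are pulled toward similarity 1; negative pairs are
    penalized only when their similarity exceeds the (assumed) margin.
    """
    w = hardness_weight(text_sim, is_positive)
    if is_positive:
        return w * (1.0 - cross_sim) ** 2
    return w * max(0.0, cross_sim - margin) ** 2


# Two negative pairs with the same predicted cross-modal similarity:
# the one whose texts are more alike (harder) contributes more loss.
hard_negative = weighted_pair_loss(cross_sim=0.5, text_sim=0.9, is_positive=False)
easy_negative = weighted_pair_loss(cross_sim=0.5, text_sim=0.1, is_positive=False)
```

Under these assumptions, hard examples receive a weight closer to 2 and easy examples a weight closer to 1, so a hard pair contributes more to the loss than an easy pair with the same predicted similarity.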

Results

Across the English Wikipedia, Chinese Wikipedia, and TVGraz datasets, incorporating fine-grained labels significantly improves retrieval performance over state-of-the-art models, supporting the value of difficulty-aware supervision for cross-modal alignment.