ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

1Washington University in St. Louis, 2Saint Louis University, 3George Washington University
Teaser Image

Abstract

Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing CIR methods struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state of the art in both the supervised and zero-shot settings on the CIRR and CIRCO datasets.

Method

pipeline diagram

Model Design

ConText-CIR extracts and learns from concepts (noun phrases) in text to improve the fusion of text and image representations for composed image retrieval.
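As a concrete illustration of the concept-extraction step, the sketch below pulls noun phrases out of a relative text with a hand-rolled regex over a tiny illustrative vocabulary. This is a toy stand-in only: a real system would use a proper parser (e.g., a dependency parser's noun chunks) rather than a fixed word list.

```python
import re

# Tiny illustrative vocabulary; a real pipeline would use a parser's
# noun chunks instead of this hand-rolled pattern.
DET = r"(?:a|an|the)"
ADJ = r"(?:red|small|wooden|striped)"
NOUN = r"(?:dog|chair|shirt|table)"
NP = re.compile(rf"\b(?:{DET}\s+)?(?:{ADJ}\s+)*{NOUN}\b")

def extract_concepts(text):
    """Return the noun-phrase spans found in the (lowercased) text."""
    return [m.group(0) for m in NP.finditer(text.lower())]
```

For example, `extract_concepts("The dog sits on a small wooden chair")` yields the two concept spans `"the dog"` and `"a small wooden chair"`, which would each be aligned to image regions during training.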

Progression of attention with training.

With training, our novel Text Concept-Consistency loss guides the attention of text concepts to their relevant parts in the image. A standard model based on text-image cross attention for CIR (middle) struggles to learn this alignment.

Comparison of attention with and without Text-CC.
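One plausible form of such a consistency objective can be sketched as follows. This is not the paper's exact loss: it simply penalizes disagreement among the cross-attention maps of tokens belonging to the same noun phrase (KL divergence of each token's attention distribution over image patches from the phrase's mean map), so that a concept attends to a coherent image region.

```python
import numpy as np

def text_cc_loss(attn, phrase_spans, eps=1e-8):
    """Toy concept-consistency loss (illustrative, not the paper's exact form).

    attn: (num_tokens, num_patches) row-stochastic cross-attention weights.
    phrase_spans: list of (start, end) token-index ranges, one per noun phrase.
    Returns the mean KL divergence of each token's attention map from its
    phrase's consensus (mean) map, averaged over phrases.
    """
    loss, count = 0.0, 0
    for start, end in phrase_spans:
        span = attn[start:end]            # attention maps of one phrase's tokens
        mean_map = span.mean(axis=0)      # phrase-level consensus map
        kl = (span * (np.log(span + eps) - np.log(mean_map + eps))).sum(axis=1)
        loss += kl.mean()
        count += 1
    return loss / max(count, 1)
```

With identical token maps inside a phrase the loss is zero; the more the tokens of a phrase scatter their attention across different patches, the larger the penalty.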

BibTeX

@inproceedings{xing2025contextcir,
  author    = {Xing, Eric and Kolouju, Pranavi and Pless, Robert and Stylianou, Abby and Jacobs, Nathan},
  title     = {ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
}