CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

Authors: Aneeshan Sain, Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Subhadeep Koley, Tao Xiang, Yi-Zhe Song


In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior art by a large margin (24.8%), a strong testimony to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure that the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch-shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains, in the region of 26.9%, over the previous state-of-the-art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise for tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.
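To make the patch-shuffling idea concrete, here is a minimal, hypothetical sketch (not the authors' code): an image is cut into an n x n grid of patches and reassembled under a random permutation, with the SAME permutation applied to a sketch and its paired photo so that patch-level correspondence between the two modalities is preserved. The function name, grid size, and numpy implementation are illustrative assumptions.

```python
import numpy as np

def shuffle_patches(img, n=3, perm=None, rng=None):
    """Split a square image into an n x n grid of patches and permute them.

    Applying the same permutation to a sketch and its paired photo forces a
    model to rely on local structural correspondences rather than global
    layout. This is an illustrative sketch, not the paper's implementation.
    """
    h, w = img.shape[:2]
    assert h % n == 0 and w % n == 0, "image must divide evenly into patches"
    ph, pw = h // n, w // n
    # Cut into patches, row-major grid order.
    patches = [img[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(n) for j in range(n)]
    if perm is None:
        rng = rng or np.random.default_rng()
        perm = rng.permutation(n * n)
    # Reassemble the grid in permuted order.
    rows = [np.concatenate([patches[perm[i * n + j]] for j in range(n)], axis=1)
            for i in range(n)]
    return np.concatenate(rows, axis=0), perm

# Use one permutation for both modalities so correspondence is kept.
sketch = np.arange(36).reshape(6, 6)
photo = sketch * 10  # stand-in for the paired photo
s_shuf, perm = shuffle_patches(sketch, n=3)
p_shuf, _ = shuffle_patches(photo, n=3, perm=perm)
assert np.array_equal(p_shuf, s_shuf * 10)  # patch correspondence preserved
```

Because both inputs are shuffled identically, any patch in the shuffled sketch still aligns spatially with the matching patch in the shuffled photo, which is the property the abstract's design (ii) exploits.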


