TY - GEN
T1 - CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval
T2 - 2022 International Conference on Multimedia Retrieval, ICMR 2022
AU - Zhuo, Yaoxin
AU - Li, Yikang
AU - Hsiao, Jenhao
AU - Ho, Chiuman
AU - Li, Baoxin
N1 - Funding Information: Y. Zhuo and B. Li were supported in part by an ONR grant (#N00014-19-1-2119). Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of ONR. Publisher Copyright: © 2022 ACM.
PY - 2022/6/27
Y1 - 2022/6/27
AB - With the ever-increasing multimedia data on the Web, cross-modal video-text retrieval has received a lot of attention in recent years. Deep cross-modal hashing approaches utilize the Hamming space for achieving fast retrieval. However, most existing algorithms have difficulties in seeking or constructing a well-defined joint semantic space. In this paper, an unsupervised deep cross-modal video-text hashing approach (CLIP4Hashing) is proposed, which mitigates the difficulties in bridging between different modalities in the Hamming space through building a single hashing net by employing the pre-trained CLIP model. The approach is enhanced by two novel techniques, the dynamic weighting strategy and the design of the min-max hashing layer, which are found to be the main sources of the performance gain. Compared with conventional deep cross-modal hashing algorithms, CLIP4Hashing does not require data-specific hyper-parameters. With evaluation using three challenging video-text benchmark datasets, we demonstrate that CLIP4Hashing is able to significantly outperform existing state-of-the-art hashing algorithms. Additionally, with larger bit sizes (e.g., 2048 bits), CLIP4Hashing can even deliver competitive performance compared with the results based on non-hashing features.
KW - cross-modal retrieval
KW - deep learning
KW - hashing
KW - video-text retrieval
UR - http://www.scopus.com/inward/record.url?scp=85134067897&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134067897&partnerID=8YFLogxK
U2 - 10.1145/3512527.3531381
DO - 10.1145/3512527.3531381
M3 - Conference contribution
T3 - ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval
SP - 158
EP - 166
BT - ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
Y2 - 27 June 2022 through 30 June 2022
ER -