TY - CONF
T1 - CAVAN
T2 - 26th International Conference on Pattern Recognition, ICPR 2022
AU - Shao, Huiliang
AU - Fang, Zhiyuan
AU - Yang, Yezhou
N1 - Funding Information: Acknowledgement. This work was supported by the National Science Foundation under Grant IIS-2132724, IIS-1750082 and CNS-2038666. Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - A video clip carries not merely an aggregation of static entities, but also a variety of interactions and relations among these entities. It remains challenging for a video captioning system to generate descriptions that focus on the prominent interest and align with latent aspects beyond direct observation. In this work, we present a Commonsense knowledge Anchored Video cAptioNing (dubbed CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, we acquire commonsense knowledge complementing each training caption by querying a generic knowledge atlas (ATOMIC [1]), and form a commonsense-caption entailment corpus. A BERT [2] based language entailment model trained on this corpus then serves as a commonsense discriminator for the training of the video captioning model, penalizing the model for generating semantically misaligned captions. Experimental results with ablations on the MSRVTT [3], V2C [4] and VATEX [5] datasets validate the effectiveness of CAVAN and reveal that the use of commonsense knowledge benefits video caption generation.
UR - http://www.scopus.com/inward/record.url?scp=85143595985&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143595985&partnerID=8YFLogxK
U2 - https://doi.org/10.1109/ICPR56361.2022.9956241
DO - 10.1109/ICPR56361.2022.9956241
M3 - Conference contribution
T3 - Proceedings - International Conference on Pattern Recognition
SP - 4095
EP - 4102
BT - 2022 26th International Conference on Pattern Recognition, ICPR 2022
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 January 2022
ER -