| Change captioning aims to describe the difference between a pair of images with a natural language sentence. In this task, distractors such as illumination or viewpoint changes make it highly challenging to learn the difference representation. In this paper, we propose a semantic relation-aware difference representation learning network to explicitly learn the difference representation in the presence of distractors. Specifically, we introduce a self-semantic relation embedding block to explore the underlying changed objects, and design a cross-semantic relation measuring block to localize the real change and learn a discriminative difference representation. In addition, relying on the part of speech (POS) of words, we devise an attention-based visual switch to dynamically use visual information for caption generation. Extensive experiments show that our method achieves state-of-the-art performance on the CLEVR-Change and Spot-the-Diff datasets. |