Multimodal Summarization with Guidance of Multimodal Reference

1 minute read

Published:

this paper aims to guide multimodal summarization with the multimodal reference as the target and to evaluate multimodal outputs as a whole.

  • Paper Link : https://aaai.org/ojs/index.php/AAAI/article/view/6525/6381
  • Dataset : MSMO dataset
  • Model : Image Selection and Multi-Objective Function

Contributions

The main contributions are as follows:

  • Introduced a multimodal objective function to incorporate the multimodal reference into the training process, in which both the summary generation and the image selection are considered.
  • Proposed a novel evaluation method to evaluate a multimodal summary by projecting both the multimodal summary and the reference into a joint semantic space.
  • The experimental results show that the proposed model outperforms existing methods with both automatic and manual evaluation metrics. Moreover, the proposed evaluation method can effectively improve the correlation with human judgments.

Summary

The paper proposes a multimodal objective function, which considers both the text loss and image loss, to improve multimodal summarization with the guidance of multimodal reference. It introduces an image discriminator based on the multimodal attention model together with the multimodal objective function. Due to the lack of multimodal reference, the paper explores two strategies to construct the multimodal reference by extending the text reference. Finally, the paper designs a multimodal automatic evaluation metric by treating the multimodal outputs as a whole during evaluation.

The baselines which are compared are :

  • ATG
  • ATL
  • HAN
  • GR
  • MOF

The metrics used are :

  • ROUGE-{1,2,L}
  • MMAE
  • Msim
  • Img-Sum
  • IP
  • MRmax
  • MR-Summax