VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

1 minute read

Published:

In real-world applications, the input is usually a video consisting of hundreds of frames. Consequently, the temporal dependency in a video cannot be simply modeled by static encoding methods. Hence, in this work, Video-based Multimodal Summarization with Multimodal Output (VMSMO) is proposed, which selects cover frame from news video and generates textual summary of the news article in the meantime

  • Paper Link : https://arxiv.org/pdf/2010.05406.pdf
  • Code : https://github.com/iriscxy/VMSMO
  • Dataset : VMSMO dataset
  • Model : DIMS : Feature Encoders, Dual Interaction Module and the Multi Generator.

Contributions

The contributions can be summarized as follows:

  • Proposed a novel Video-based Multimodal Summarization with Multimodal Output (VMSMO) task which chooses a proper cover frame for the video and generates an appropriate textual summary of the article.
  • Proposed a Dual-Interaction-based Multimodal Summarizer (DIMS) model, which jointly models the temporal dependency of video with semantic meaning of article, and generates textual summary with video cover simultaneously.
  • Constructed a large-scale dataset for VMSMO, and experimental results demonstrate that our model outperforms other baselines in terms of both automatic and human evaluations.

Summary

DIMS consists of :

  • Feature Encoder is composed of a text encoder and a video encoder which encodes the input article and video separately.
  • Dual Interaction Module conducts deep interaction, including conditional self-attention and global-attention mechanism between video segment and article to learn different levels of representation of the two inputs.
  • Multi-Generator generates the textual summary and chooses the video cover by incorporating the fused information.

The baselines which are compared are :

  • Lead
  • TextRank
  • PG
  • Unified
  • GPG
  • How2
  • Synergistic
  • PSAC
  • MSMO
  • MOF

The metrics used are :

  • ROUGE-{1,2,L}
  • MAP - mean average precision