Multi-modal Summarization for Video-containing Documents

1 minute read


Existing models suffer from the following drawbacks:

  • Most existing approaches extract visual information from the accompanying images but ignore related videos. The paper contends that videos contain abundant content and have temporal characteristics, with events presented chronologically, both of which are crucial for text summarization.
  • Although attention mechanisms and early fusion are used extensively, they introduce noise when applied to unaligned multi-modal data, where the large gap between modalities calls for more intensive interaction.
  • Various multi-modal summarization works have focused on a single task, such as text or video summarization enriched with information from other modalities. The paper observes that both summarization tasks share the same goal of condensing long source material, so they can be performed jointly thanks to these common characteristics.

  • Paper Link : https://arxiv.org/pdf/2009.08018.pdf
  • Code : https://github.com/xiyan524/MM-AVS
  • Dataset : MM-AVS dataset
  • Model : M2SM (Feature Extraction, Bi-Hop Attention, and Feature Fusion).

Contributions

The main contributions are as follows:

  • Introduced a novel task: automatically generating a textual summary accompanied by significant images from the multi-modal data of an article and its corresponding video.
  • Proposed a bi-hop attention and an improved late fusion mechanism to refine information from multi-modal data. In addition, introduced a bi-stream summarization strategy that summarizes articles and videos simultaneously.
  • Prepared a content-rich multi-modal dataset (MM-AVS). Comprehensive experiments demonstrate that complementary information from multiple modalities is beneficial and that the proposed model exploits it more effectively than existing approaches.

Summary

M2SM is a novel multi-modal summarization model that automatically generates a multimodal summary from an article and its corresponding video. It consists of the following stages (a rough code sketch follows the list):

  • Feature Extraction : Text Feature, Video Feature.
  • Feature Alignment : Single-step attention, Bi-hop attention.
  • Feature Fusion : Early fusion, tensor fusion and late fusion.
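As a rough illustration of how the alignment and fusion stages could fit together, here is a minimal sketch (not the authors' code): it projects text and frame features into a shared space, runs two attention hops from sentences to frames, and gates the aligned video context into the text representation as a late-fusion step. The dimensions, module names, and the exact form of the second hop are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiHopAttentionFusion(nn.Module):
    """Hypothetical sketch of bi-hop attention followed by gated late fusion."""
    def __init__(self, text_dim=768, video_dim=2048, hidden=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)    # project sentence features
        self.video_proj = nn.Linear(video_dim, hidden)  # project frame features
        self.gate = nn.Linear(2 * hidden, hidden)       # late-fusion gate

    def forward(self, text_feats, video_feats):
        # text_feats: (B, T, text_dim) sentence features
        # video_feats: (B, F, video_dim) frame features
        t = self.text_proj(text_feats)                  # (B, T, H)
        v = self.video_proj(video_feats)                # (B, F, H)
        scale = t.size(-1) ** 0.5

        # Hop 1: each sentence attends over the video frames.
        attn1 = F.softmax(t @ v.transpose(1, 2) / scale, dim=-1)     # (B, T, F)
        ctx1 = attn1 @ v                                              # (B, T, H)

        # Hop 2: re-attend to the frames using the first-hop context as the
        # query -- one plausible reading of "bi-hop" alignment.
        attn2 = F.softmax(ctx1 @ v.transpose(1, 2) / scale, dim=-1)  # (B, T, F)
        ctx2 = attn2 @ v                                              # (B, T, H)

        # Late fusion: merge text and aligned video context with a learned gate
        # instead of concatenating raw, unaligned features at the input.
        g = torch.sigmoid(self.gate(torch.cat([t, ctx2], dim=-1)))   # (B, T, H)
        return g * t + (1 - g) * ctx2
```

Gating at the representation level is one way to keep the fusion "late", so unaligned raw modalities are never mixed directly, which is the noise issue raised in the drawbacks above.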

The baselines compared against are:

  • VistaNet
  • MM-ATG
  • Img+Trans
  • TFN
  • HNNattTI
  • Random
  • Uniform
  • VSUMM
  • DR-DSN
  • Lead3
  • SummaRuNNer
  • NN-SE

The metrics used are:

  • ROUGE-{1,2,L} (a short example of computing these is shown below)
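ROUGE-1/2/L can be computed with the rouge-score package (pip install rouge-score); the reference and candidate strings below are made-up examples, not taken from MM-AVS.

```python
from rouge_score import rouge_scorer

# ROUGE-1/2/L F1 with stemming, as is common for summarization evaluation.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "the model generates a multimodal summary from the article and its video"
candidate = "the model produces a multimodal summary of the article and video"

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```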