Multi-modal Summarization for Video-containing Documents

1 minute read


Existing models suffer from the following drawbacks:

  • Most existing approaches extract visual information from the accompanying images but ignore related videos. The paper contends that videos contain abundant content and have temporal characteristics, with events presented chronologically, both of which are crucial for text summarization.
  • Although attention mechanisms and early fusion are used extensively, they introduce noise when applied to unaligned multi-modal data, where the large gap between modalities calls for more intensive interaction.
  • Various multi-modal summarization works have focused on a single task, such as text or video summarization enriched with information from other modalities. The paper observes that both summarization tasks share the same goal of condensing long source material, so they can be performed jointly thanks to these common characteristics.

  • Paper Link : https://arxiv.org/pdf/2009.08018.pdf
  • Code : https://github.com/xiyan524/MM-AVS
  • Dataset : MM-AVS dataset
  • Model : M2SM (Feature Extraction, Bi-Hop Attention, and Feature Fusion).

Contributions

The main contributions are as follows:

  • Introduced a novel task: automatically generating a textual summary accompanied by significant images from the multi-modal data of an article and its corresponding video.
  • Proposed a bi-hop attention and an improved late fusion mechanism to refine information from multi-modal data. In addition, introduced a bi-stream summarization strategy that summarizes articles and videos simultaneously.
  • Prepared a content-rich multi-modal dataset (MM-AVS). Comprehensive experiments demonstrate that complementary information from multiple modalities is beneficial and that the proposed model exploits it more effectively than existing approaches.

Summary

M2SM is a novel multi-modal summarization model that automatically generates a multimodal summary from an article and its corresponding video. It consists of the following stages (a rough code sketch follows the list):

  • Feature Extraction : Text Feature, Video Feature.
  • Feature Alignment : Single-step attention, Bi-hop attention.
  • Feature Fusion : Early fusion, tensor fusion and late fusion.
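As a rough illustration of how the alignment and fusion stages could fit together, here is a minimal sketch (not the authors' code): it projects text and frame features into a shared space, runs two attention hops from sentences to frames, and gates the aligned video context into the text representation as a late-fusion step. The dimensions, module names, and the exact form of the second hop are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiHopAttentionFusion(nn.Module):
    """Hypothetical sketch of bi-hop attention followed by gated late fusion."""
    def __init__(self, text_dim=768, video_dim=2048, hidden=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)    # project sentence features
        self.video_proj = nn.Linear(video_dim, hidden)  # project frame features
        self.gate = nn.Linear(2 * hidden, hidden)       # late-fusion gate

    def forward(self, text_feats, video_feats):
        # text_feats: (B, T, text_dim) sentence features
        # video_feats: (B, F, video_dim) frame features
        t = self.text_proj(text_feats)                  # (B, T, H)
        v = self.video_proj(video_feats)                # (B, F, H)
        scale = t.size(-1) ** 0.5

        # Hop 1: each sentence attends over the video frames.
        attn1 = F.softmax(t @ v.transpose(1, 2) / scale, dim=-1)     # (B, T, F)
        ctx1 = attn1 @ v                                              # (B, T, H)

        # Hop 2: re-attend to the frames using the first-hop context as the
        # query -- one plausible reading of "bi-hop" alignment.
        attn2 = F.softmax(ctx1 @ v.transpose(1, 2) / scale, dim=-1)  # (B, T, F)
        ctx2 = attn2 @ v                                              # (B, T, H)

        # Late fusion: merge text and aligned video context with a learned gate
        # instead of concatenating raw, unaligned features at the input.
        g = torch.sigmoid(self.gate(torch.cat([t, ctx2], dim=-1)))   # (B, T, H)
        return g * t + (1 - g) * ctx2
```

Gating at the representation level is one way to keep the fusion "late", so unaligned raw modalities are never mixed directly, which is the noise issue raised in the drawbacks above.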

The baselines compared against are:

  • VistaNet
  • MM-ATG
  • Img+Trans
  • TFN
  • HNNattTI
  • Random
  • Uniform
  • VSUMM
  • DR-DSN
  • Lead3
  • SummaRuNNer
  • NN-SE

The metrics used are:

  • ROUGE-{1,2,L} (a short example of computing these is shown below)
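ROUGE-1/2/L can be computed with the rouge-score package (pip install rouge-score); the reference and candidate strings below are made-up examples, not taken from MM-AVS.

```python
from rouge_score import rouge_scorer

# ROUGE-1/2/L F1 with stemming, as is common for summarization evaluation.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "the model generates a multimodal summary from the article and its video"
candidate = "the model produces a multimodal summary of the article and video"

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```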