VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles
Published: EMNLP 2020
In real-world applications, the input is usually a video consisting of hundreds of frames, so the temporal dependency in a video cannot simply be modeled by static encoding methods. Hence, this work proposes Video-based Multimodal Summarization with Multimodal Output (VMSMO), which selects a cover frame from the news video and generates a textual summary of the news article at the same time.
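Concretely, the task maps a (video, article) pair to a (cover frame, summary) pair. A hedged Python sketch of the expected input/output layout (the field names here are illustrative, not taken from the dataset release):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VMSMOExample:
    """Illustrative I/O layout for the VMSMO task (field names are assumptions)."""
    frames: List[str]   # paths to the sampled video frames (hundreds per video)
    article: str        # full text of the news article
    cover_index: int    # target: index of the chosen cover frame
    summary: str        # target: textual summary of the article

def summarize(example: VMSMOExample) -> Tuple[int, str]:
    """A model for this task returns (predicted cover-frame index, generated summary)."""
    raise NotImplementedError
```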
- Paper Link : https://arxiv.org/pdf/2010.05406.pdf
- Code : https://github.com/iriscxy/VMSMO
- Dataset : VMSMO dataset
- Model : DIMS (Dual-Interaction-based Multimodal Summarizer), consisting of a Feature Encoder, a Dual Interaction Module, and a Multi-Generator.
Contributions
The contributions can be summarized as follows:
- Proposed a novel Video-based Multimodal Summarization with Multimodal Output (VMSMO) task which chooses a proper cover frame for the video and generates an appropriate textual summary of the article.
- Proposed a Dual-Interaction-based Multimodal Summarizer (DIMS) model, which jointly models the temporal dependency of the video and the semantic meaning of the article, and generates the textual summary and the video cover simultaneously.
- Constructed a large-scale dataset for VMSMO; experimental results demonstrate that the proposed model outperforms the baselines in both automatic and human evaluations.
Summary
DIMS consists of :
- The Feature Encoder is composed of a text encoder and a video encoder, which encode the input article and the input video separately.
- The Dual Interaction Module conducts deep interaction, using a conditional self-attention and a global-attention mechanism, between the video segments and the article to learn different levels of representations of the two inputs (see the sketch after this list).
- The Multi-Generator generates the textual summary and chooses the video cover by incorporating the fused information.
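A minimal PyTorch sketch of how such a dual interaction could be wired (an illustration, not the authors' implementation; the module names, dimensions, and the pooled-article conditioning are assumptions):

```python
# Sketch of the dual-interaction idea: article tokens attend over video segments
# (global attention), and video segments are re-encoded conditioned on the
# article (a stand-in for the paper's conditional self-attention).
import torch
import torch.nn as nn

class DualInteraction(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # article -> video: each article token gathers relevant segment information
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # video segments re-encoded with a pooled article representation as condition
        self.cond_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cond_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, article, segments):
        # article:  (batch, n_tokens, d_model)   encoded news article
        # segments: (batch, n_segments, d_model) encoded video segments
        fused_text, _ = self.global_attn(article, segments, segments)

        # condition each segment on a mean-pooled article representation
        cond = article.mean(dim=1, keepdim=True).expand_as(segments)
        seg_cond = self.cond_proj(torch.cat([segments, cond], dim=-1))
        fused_video, _ = self.cond_self_attn(seg_cond, seg_cond, seg_cond)
        return fused_text, fused_video

article = torch.randn(2, 50, 256)   # toy article token states
segments = torch.randn(2, 10, 256)  # toy video segment states
text_repr, video_repr = DualInteraction()(article, segments)
print(text_repr.shape, video_repr.shape)  # (2, 50, 256) and (2, 10, 256)
```

In this reading, the fused text states would feed the summary decoder while the fused video states would score candidate cover frames, matching the Multi-Generator's two outputs.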
The baselines compared against are :
- Lead
- TextRank
- PG
- Unified
- GPG
- How2
- Synergistic
- PSAC
- MSMO
- MOF
The metrics used are :
- ROUGE-{1,2,L}
- MAP - mean average precision
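Since cover-frame selection is evaluated as a ranking over candidate frames, MAP can be illustrated with a short, self-contained sketch (the data layout and function names below are assumptions, not taken from the released evaluation code):

```python
# Toy MAP computation for cover-frame selection: for each video, candidate
# frames are ranked by predicted score, and average precision is computed
# against the frames labeled as acceptable covers.
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags in ranked order (1 = true cover frame)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(all_scores, all_labels):
    """all_scores / all_labels: per-video lists of frame scores and 0/1 labels."""
    aps = []
    for scores, labels in zip(all_scores, all_labels):
        ranked = [lab for _, lab in sorted(zip(scores, labels), reverse=True)]
        aps.append(average_precision(ranked))
    return sum(aps) / len(aps)

# Toy example: two videos, each with four candidate frames.
scores = [[0.9, 0.2, 0.7, 0.1], [0.3, 0.8, 0.5, 0.4]]
labels = [[1, 0, 0, 0], [0, 0, 1, 0]]
print(mean_average_precision(scores, labels))  # 0.75
```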