VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles
Published: EMNLP 2020
In real-world applications, the input is usually a video consisting of hundreds of frames, so the temporal dependency in a video cannot simply be modeled by static encoding methods. Hence, this work proposes Video-based Multimodal Summarization with Multimodal Output (VMSMO), which selects a cover frame from the news video and generates a textual summary of the news article at the same time.
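Concretely, the task maps a (video, article) pair to a (cover frame, summary) pair. A hedged Python sketch of the expected input/output layout (the field names here are illustrative, not taken from the dataset release):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VMSMOExample:
    """Illustrative I/O layout for the VMSMO task (field names are assumptions)."""
    frames: List[str]   # paths to the sampled video frames (hundreds per video)
    article: str        # full text of the news article
    cover_index: int    # target: index of the chosen cover frame
    summary: str        # target: textual summary of the article

def summarize(example: VMSMOExample) -> Tuple[int, str]:
    """A model for this task returns (predicted cover-frame index, generated summary)."""
    raise NotImplementedError
```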
- Paper Link : https://arxiv.org/pdf/2010.05406.pdf
- Code : https://github.com/iriscxy/VMSMO
- Dataset : VMSMO dataset
- Model : DIMS (Dual-Interaction-based Multimodal Summarizer), consisting of a Feature Encoder, a Dual Interaction Module, and a Multi-Generator.
Contributions
The contributions can be summarized as follows:
- Proposed a novel Video-based Multimodal Summarization with Multimodal Output (VMSMO) task which chooses a proper cover frame for the video and generates an appropriate textual summary of the article.
- Proposed a Dual-Interaction-based Multimodal Summarizer (DIMS) model, which jointly models the temporal dependency of the video and the semantic meaning of the article, and generates the textual summary and the video cover simultaneously.
- Constructed a large-scale dataset for VMSMO; experimental results demonstrate that the proposed model outperforms the baselines in both automatic and human evaluations.
Summary
DIMS consists of :
- The Feature Encoder is composed of a text encoder and a video encoder, which encode the input article and the input video separately.
- The Dual Interaction Module conducts deep interaction, using a conditional self-attention and a global-attention mechanism, between the video segments and the article to learn different levels of representations of the two inputs (see the sketch after this list).
- The Multi-Generator generates the textual summary and chooses the video cover by incorporating the fused information.
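A minimal PyTorch sketch of how such a dual interaction could be wired (an illustration, not the authors' implementation; the module names, dimensions, and the pooled-article conditioning are assumptions):

```python
# Sketch of the dual-interaction idea: article tokens attend over video segments
# (global attention), and video segments are re-encoded conditioned on the
# article (a stand-in for the paper's conditional self-attention).
import torch
import torch.nn as nn

class DualInteraction(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # article -> video: each article token gathers relevant segment information
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # video segments re-encoded with a pooled article representation as condition
        self.cond_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cond_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, article, segments):
        # article:  (batch, n_tokens, d_model)   encoded news article
        # segments: (batch, n_segments, d_model) encoded video segments
        fused_text, _ = self.global_attn(article, segments, segments)

        # condition each segment on a mean-pooled article representation
        cond = article.mean(dim=1, keepdim=True).expand_as(segments)
        seg_cond = self.cond_proj(torch.cat([segments, cond], dim=-1))
        fused_video, _ = self.cond_self_attn(seg_cond, seg_cond, seg_cond)
        return fused_text, fused_video

article = torch.randn(2, 50, 256)   # toy article token states
segments = torch.randn(2, 10, 256)  # toy video segment states
text_repr, video_repr = DualInteraction()(article, segments)
print(text_repr.shape, video_repr.shape)  # (2, 50, 256) and (2, 10, 256)
```

In this reading, the fused text states would feed the summary decoder while the fused video states would score candidate cover frames, matching the Multi-Generator's two outputs.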
The baselines compared against are :
- Lead
- TextRank
- PG
- Unified
- GPG
- How2
- Synergistic
- PSAC
- MSMO
- MOF
The metrics used are :
- ROUGE-{1,2,L}
- MAP - mean average precision
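Since cover-frame selection is evaluated as a ranking over candidate frames, MAP can be illustrated with a short, self-contained sketch (the data layout and function names below are assumptions, not taken from the released evaluation code):

```python
# Toy MAP computation for cover-frame selection: for each video, candidate
# frames are ranked by predicted score, and average precision is computed
# against the frames labeled as acceptable covers.
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags in ranked order (1 = true cover frame)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(all_scores, all_labels):
    """all_scores / all_labels: per-video lists of frame scores and 0/1 labels."""
    aps = []
    for scores, labels in zip(all_scores, all_labels):
        ranked = [lab for _, lab in sorted(zip(scores, labels), reverse=True)]
        aps.append(average_precision(ranked))
    return sum(aps) / len(aps)

# Toy example: two videos, each with four candidate frames.
scores = [[0.9, 0.2, 0.7, 0.1], [0.3, 0.8, 0.5, 0.4]]
labels = [[1, 0, 0, 0], [0, 0, 1, 0]]
print(mean_average_precision(scores, labels))  # 0.75
```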