Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

Multimodal abstractive summarization (MAS) aims to take advantage of data from multiple modalities and to provide a short, concise, and readable textual summary that lets users quickly acquire the essential information.

  • Paper Link : https://arxiv.org/pdf/2109.02401.pdf
  • Code : https://github.com/HLTCHKUST/VG-GPLMs
  • Dataset : How2 Dataset
  • Model : VG-GPLMs (attention-based fusion + feed-forward network)
  • Pretrained models : BART and T5 (a minimal text-only usage sketch follows this list)
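
Since the backbone GPLMs are BART and T5, the sketch below shows how a text-only backbone could be loaded and run for summarization with the Hugging Face `transformers` API; the `facebook/bart-base` checkpoint, the example transcript, and the generation hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Text-only GPLM backbone; the checkpoint name is an assumption, not
# necessarily the one used in the paper.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

transcript = "in this video we show how to change a bicycle tire ..."
inputs = tokenizer(transcript, return_tensors="pt", truncation=True, max_length=512)

# Beam-search generation; hyperparameters are placeholders.
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```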

Contributions

The contributions in this work are threefold:

  • First work to inject visual information into text-only GPLMs and to use them for the MAS task.
  • Systematically studies two research questions:
    • how to inject visual information into GPLMs without hurting their generation ability; and
    • where is the optimal place in GPLMs to inject the visual information?
  • The model significantly outperforms the state-of-the-art model on the How2 dataset, and the injected visual guidance contributes 83.6% of the overall improvement.

Summary

The paper presents vision-guided GPLMs that extract video features alongside the textual transcript and fuse them via an attention-based fusion network (a minimal sketch of such a fusion module follows the baseline list below). Experiments are conducted on the How2 dataset. The baselines compared against are :

  • S2S (Seq2Seq)
  • PG (Pointer-Generator)
  • TF (Transformer-based Seq2Seq)
  • HA (Hierarchical attention)
  • MFFG (Multistage fusion with forget gate)
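
The attention-based fusion can be pictured as text hidden states attending over video features, followed by a feed-forward layer and a gate that controls how much visual context is merged back into the text representation. The PyTorch module below is a minimal sketch under those assumptions; the dimensions, the dot-product attention, and the gating details are illustrative rather than the authors' exact implementation (see the released code for that).

```python
import torch
import torch.nn as nn

class VisionGuidedFusion(nn.Module):
    """Illustrative text-vision fusion: text states attend over video features,
    then a feed-forward layer and a gate merge the visual context back into
    the text states (a sketch, not the authors' exact module)."""
    def __init__(self, d_text=768, d_video=2048, d_model=768):
        super().__init__()
        self.q = nn.Linear(d_text, d_model)   # queries from text states
        self.k = nn.Linear(d_video, d_model)  # keys from video features
        self.v = nn.Linear(d_video, d_model)  # values from video features
        self.ffn = nn.Sequential(nn.Linear(d_text + d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_text))
        self.gate = nn.Linear(d_text + d_model, d_text)

    def forward(self, text_states, video_feats):
        # text_states: (B, T_text, d_text); video_feats: (B, T_video, d_video)
        q, k, v = self.q(text_states), self.k(video_feats), self.v(video_feats)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        visual_ctx = attn @ v                                 # (B, T_text, d_model)
        merged = torch.cat([text_states, visual_ctx], dim=-1)
        fused = self.ffn(merged)
        g = torch.sigmoid(self.gate(merged))
        return g * fused + (1 - g) * text_states              # gated residual merge
```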

The metrics used (with a small computation sketch after the list) are :

  • ROUGE
  • BLEU
  • METEOR
  • CIDEr
  • Content F1
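
The exact evaluation toolkit is not detailed here; the snippet below is a minimal sketch of how ROUGE and BLEU could be computed with common Python packages (`rouge-score` and `nltk`), which are assumed tooling choices, not necessarily what the authors used.

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "replace the inner tube and inflate the tire"
hypothesis = "replace the tube and inflate the bicycle tire"

# ROUGE-1/2/L F-scores via the rouge-score package (assumed tooling).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(reference, hypothesis))

# Sentence-level BLEU with smoothing via nltk (also an assumed choice).
bleu = sentence_bleu([reference.split()], hypothesis.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")
```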