Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

Multimodal abstractive summarization (MAS) aims to take advantage of data from multiple modalities and to provide a short, concise, and readable textual summary that lets users quickly acquire the essential information.

  • Paper Link : https://arxiv.org/pdf/2109.02401.pdf
  • Code : https://github.com/HLTCHKUST/VG-GPLMs
  • Dataset : How2 Dataset
  • Model : VG-GPLMs (attention-based fusion + feed-forward network)
  • Pretrained models : BART and T5 (a minimal text-only usage sketch follows this list)
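
Since the backbone GPLMs are BART and T5, the sketch below shows how a text-only backbone could be loaded and run for summarization with the Hugging Face `transformers` API; the `facebook/bart-base` checkpoint, the example transcript, and the generation hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Text-only GPLM backbone; the checkpoint name is an assumption, not
# necessarily the one used in the paper.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

transcript = "in this video we show how to change a bicycle tire ..."
inputs = tokenizer(transcript, return_tensors="pt", truncation=True, max_length=512)

# Beam-search generation; hyperparameters are placeholders.
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```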

Contributions

The contributions in this work are threefold:

  • First work to inject visual information into text-only GPLMs and to use them for the MAS task.
  • Systematically studies two research questions:
    • how to inject visual information into GPLMs without hurting their generation ability; and
    • where is the optimal place in GPLMs to inject the visual information?
  • The model significantly outperforms the state-of-the-art model on the How2 dataset, and the injected visual guidance contributes 83.6% of the overall improvement.

Summary

The paper presents vision-guided GPLMs that extract video features alongside the textual transcript and fuse them via an attention-based fusion network (a minimal sketch of such a fusion module follows the baseline list below). Experiments are conducted on the How2 dataset. The baselines compared against are :

  • S2S (Seq2Seq)
  • PG (Pointer-Generator)
  • TF (Transformer-based Seq2Seq)
  • HA (Hierarchical attention)
  • MFFG (Multistage fusion with forget gate)
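
The attention-based fusion can be pictured as text hidden states attending over video features, followed by a feed-forward layer and a gate that controls how much visual context is merged back into the text representation. The PyTorch module below is a minimal sketch under those assumptions; the dimensions, the dot-product attention, and the gating details are illustrative rather than the authors' exact implementation (see the released code for that).

```python
import torch
import torch.nn as nn

class VisionGuidedFusion(nn.Module):
    """Illustrative text-vision fusion: text states attend over video features,
    then a feed-forward layer and a gate merge the visual context back into
    the text states (a sketch, not the authors' exact module)."""
    def __init__(self, d_text=768, d_video=2048, d_model=768):
        super().__init__()
        self.q = nn.Linear(d_text, d_model)   # queries from text states
        self.k = nn.Linear(d_video, d_model)  # keys from video features
        self.v = nn.Linear(d_video, d_model)  # values from video features
        self.ffn = nn.Sequential(nn.Linear(d_text + d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_text))
        self.gate = nn.Linear(d_text + d_model, d_text)

    def forward(self, text_states, video_feats):
        # text_states: (B, T_text, d_text); video_feats: (B, T_video, d_video)
        q, k, v = self.q(text_states), self.k(video_feats), self.v(video_feats)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        visual_ctx = attn @ v                                 # (B, T_text, d_model)
        merged = torch.cat([text_states, visual_ctx], dim=-1)
        fused = self.ffn(merged)
        g = torch.sigmoid(self.gate(merged))
        return g * fused + (1 - g) * text_states              # gated residual merge
```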

The metrics used (with a small computation sketch after the list) are :

  • ROUGE
  • BLEU
  • METEOR
  • CIDEr
  • Content F1
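
The exact evaluation toolkit is not detailed here; the snippet below is a minimal sketch of how ROUGE and BLEU could be computed with common Python packages (`rouge-score` and `nltk`), which are assumed tooling choices, not necessarily what the authors used.

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "replace the inner tube and inflate the tire"
hypothesis = "replace the tube and inflate the bicycle tire"

# ROUGE-1/2/L F-scores via the rouge-score package (assumed tooling).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(reference, hypothesis))

# Sentence-level BLEU with smoothing via nltk (also an assumed choice).
bleu = sentence_bleu([reference.split()], hypothesis.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")
```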