MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

Multimodal text summarization is the task of condensing information from multiple interacting modalities into an output summary. The generated summary may itself be unimodal or multimodal.

  • Paper: https://arxiv.org/pdf/2010.08021.pdf
  • Code: https://github.com/amankhullar/mast
  • Dataset: How2 dataset
  • Model: MAST, consisting of Modality Encoders, a Trimodal Hierarchical Attention Layer, and a Trimodal Decoder.

Contributions

The contributions can be summarized as follows:

  • Introduction of the audio modality for abstractive multimodal text summarization.
  • Examination of the challenges of utilizing audio information and of its contribution to the generated summary.
  • Proposal of MAST, a novel state-of-the-art model for multimodal abstractive text summarization.

Summary

MAST is a sequence-to-sequence model that uses information from all three modalities: audio, text, and video. Each modality is encoded by a Modality Encoder, and a Trimodal Hierarchical Attention Layer combines the encoded information using a three-level hierarchical attention approach: it attends to the two pairs of modalities (δ) (Audio-Text and Video-Text), then to the modalities within each pair (β and γ), and finally to the individual features within each modality (α). The decoder uses this combined representation to generate the output distribution over the vocabulary.
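The official implementation is linked above; as a rough illustration of the three-level hierarchy, here is a minimal PyTorch sketch. The module names, the scaled dot-product scoring function, and the shared hidden size are assumptions made for illustration, not the authors' exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotAttention(nn.Module):
    """Scaled dot-product attention that returns a single context vector."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, query, keys):
        # query: (B, H), keys: (B, L, H)
        q = self.query_proj(query).unsqueeze(1)              # (B, 1, H)
        scores = torch.bmm(q, keys.transpose(1, 2))          # (B, 1, L)
        weights = F.softmax(scores / keys.size(-1) ** 0.5, dim=-1)
        return torch.bmm(weights, keys).squeeze(1)           # (B, H)

class TrimodalHierarchicalAttention(nn.Module):
    """Sketch of the three levels: alpha (features within a modality),
    beta/gamma (modalities within each pair), delta (across the two pairs)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.alpha_text = DotAttention(hidden_dim)
        self.alpha_audio = DotAttention(hidden_dim)
        self.alpha_video = DotAttention(hidden_dim)
        self.beta = DotAttention(hidden_dim)    # Audio-Text pair
        self.gamma = DotAttention(hidden_dim)   # Video-Text pair
        self.delta = DotAttention(hidden_dim)   # across the two pairs

    def forward(self, dec_state, text_enc, audio_enc, video_enc):
        # Level 1 (alpha): attend over features within each modality.
        c_text = self.alpha_text(dec_state, text_enc)
        c_audio = self.alpha_audio(dec_state, audio_enc)
        c_video = self.alpha_video(dec_state, video_enc)
        # Level 2 (beta, gamma): attend over the modalities in each pair.
        c_at = self.beta(dec_state, torch.stack([c_audio, c_text], dim=1))
        c_vt = self.gamma(dec_state, torch.stack([c_video, c_text], dim=1))
        # Level 3 (delta): attend over the two pair contexts.
        return self.delta(dec_state, torch.stack([c_at, c_vt], dim=1))

# Shape check with random encoder outputs (dimensions are arbitrary).
B, H = 2, 256
attn = TrimodalHierarchicalAttention(H)
ctx = attn(torch.randn(B, H), torch.randn(B, 50, H),
           torch.randn(B, 40, H), torch.randn(B, 30, H))
print(ctx.shape)  # torch.Size([2, 256])
```

At each decoding step, the resulting trimodal context vector would be fed to the decoder to produce the next-token distribution over the vocabulary.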

The baselines compared against are:

  • Hierarchical Attention models considering Audio-Text and Video-Text modalities
  • S2S
  • BertSumAbs

The metrics used are:

  • ROUGE-{1,2,L}
  • Content F1 metric
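ROUGE scores can be reproduced with the `rouge-score` package, as in the sketch below (the summary pair is made up for illustration). Content F1, introduced with the How2 dataset, instead computes an F1 score over content words only, discounting function words, and is not shown here.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Hypothetical reference/system summary pair, for illustration only.
reference = "a guitarist demonstrates how to tune a guitar by ear"
generated = "a man shows how to tune a guitar"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, generated).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```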