MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

Multimodal text summarization is the task of condensing information from multiple interacting modalities into an output summary. The generated summary may itself be unimodal or multimodal.

  • Paper: https://arxiv.org/pdf/2010.08021.pdf
  • Code: https://github.com/amankhullar/mast
  • Dataset: How2 dataset
  • Model: MAST, consisting of Modality Encoders, a Trimodal Hierarchical Attention Layer, and a Trimodal Decoder.

Contributions

The contributions can be summarized as follows:

  • Introduction of the audio modality for abstractive multimodal text summarization.
  • Examination of the challenges of utilizing audio information and of its contribution to the generated summary.
  • Proposal of MAST, a novel state-of-the-art model for multimodal abstractive text summarization.

Summary

MAST is a sequence-to-sequence model that uses information from all three modalities: audio, text, and video. Each modality is encoded by a Modality Encoder, and a Trimodal Hierarchical Attention Layer combines the encoded information using a three-level hierarchical attention approach: it attends to the two pairs of modalities (δ) (Audio-Text and Video-Text), then to the modalities within each pair (β and γ), and finally to the individual features within each modality (α). The decoder uses this combined representation to generate the output distribution over the vocabulary.
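The official implementation is linked above; as a rough illustration of the three-level hierarchy, here is a minimal PyTorch sketch. The module names, the scaled dot-product scoring function, and the shared hidden size are assumptions made for illustration, not the authors' exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotAttention(nn.Module):
    """Scaled dot-product attention that returns a single context vector."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, query, keys):
        # query: (B, H), keys: (B, L, H)
        q = self.query_proj(query).unsqueeze(1)              # (B, 1, H)
        scores = torch.bmm(q, keys.transpose(1, 2))          # (B, 1, L)
        weights = F.softmax(scores / keys.size(-1) ** 0.5, dim=-1)
        return torch.bmm(weights, keys).squeeze(1)           # (B, H)

class TrimodalHierarchicalAttention(nn.Module):
    """Sketch of the three levels: alpha (features within a modality),
    beta/gamma (modalities within each pair), delta (across the two pairs)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.alpha_text = DotAttention(hidden_dim)
        self.alpha_audio = DotAttention(hidden_dim)
        self.alpha_video = DotAttention(hidden_dim)
        self.beta = DotAttention(hidden_dim)    # Audio-Text pair
        self.gamma = DotAttention(hidden_dim)   # Video-Text pair
        self.delta = DotAttention(hidden_dim)   # across the two pairs

    def forward(self, dec_state, text_enc, audio_enc, video_enc):
        # Level 1 (alpha): attend over features within each modality.
        c_text = self.alpha_text(dec_state, text_enc)
        c_audio = self.alpha_audio(dec_state, audio_enc)
        c_video = self.alpha_video(dec_state, video_enc)
        # Level 2 (beta, gamma): attend over the modalities in each pair.
        c_at = self.beta(dec_state, torch.stack([c_audio, c_text], dim=1))
        c_vt = self.gamma(dec_state, torch.stack([c_video, c_text], dim=1))
        # Level 3 (delta): attend over the two pair contexts.
        return self.delta(dec_state, torch.stack([c_at, c_vt], dim=1))

# Shape check with random encoder outputs (dimensions are arbitrary).
B, H = 2, 256
attn = TrimodalHierarchicalAttention(H)
ctx = attn(torch.randn(B, H), torch.randn(B, 50, H),
           torch.randn(B, 40, H), torch.randn(B, 30, H))
print(ctx.shape)  # torch.Size([2, 256])
```

At each decoding step, the resulting trimodal context vector would be fed to the decoder to produce the next-token distribution over the vocabulary.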

The baselines compared against are:

  • Hierarchical Attention models considering Audio-Text and Video-Text modalities
  • S2S
  • BertSumAbs

The metrics used are:

  • ROUGE-{1,2,L}
  • Content F1 metric
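ROUGE scores can be reproduced with the `rouge-score` package, as in the sketch below (the summary pair is made up for illustration). Content F1, introduced with the How2 dataset, instead computes an F1 score over content words only, discounting function words, and is not shown here.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Hypothetical reference/system summary pair, for illustration only.
reference = "a guitarist demonstrates how to tune a guitar by ear"
generated = "a man shows how to tune a guitar"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, generated).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```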