Multimodal Sentence Summarization via Multimodal Selective Encoding

Li et al. proposed a hierarchical attention model for the multimodal sentence summarization task, but the image is not involved in encoding the text. Intuitively, it is easier for the decoder to generate an accurate summary if the encoder can filter out trivial information while encoding the input sentence. Based on this idea, the paper proposes a multimodal selective mechanism that selects the highlights from the input text using visual signals; the decoder then generates the summary from the filtered encoding. Concretely, an encoder reads the input text and produces hidden representations. Multimodal selective gates then measure the relevance between the input words and the image to construct the selected hidden representations (a minimal sketch of such a gate follows the link list below). Finally, a decoder generates the summary from the selected hidden representations.

  • Paper Link : https://aclanthology.org/2020.coling-main.496.pdf
  • Dataset : Public multimodal sentence summarization dataset
  • Model : Encoder with selective gates
  • Pretrained models : VGG19 (image features); the text encoder and decoder are GRUs
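
The core mechanism is easy to sketch. Below is a minimal PyTorch rendition of a selective gate conditioned on a visual feature, in the spirit of SEASS-style selective encoding extended with an image term; the class, layer names, and dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MultimodalSelectiveGate(nn.Module):
    """Sketch: filter encoder states with a gate driven by text + image."""

    def __init__(self, hidden_size: int, visual_size: int):
        super().__init__()
        self.w_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w_s = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w_v = nn.Linear(visual_size, hidden_size, bias=True)

    def forward(self, enc_states, sent_repr, visual_feat):
        # enc_states:  (batch, seq_len, hidden) first-level hidden states
        # sent_repr:   (batch, hidden)          e.g. final bi-GRU state
        # visual_feat: (batch, visual)          e.g. a VGG19 image feature
        gate = torch.sigmoid(
            self.w_h(enc_states)
            + self.w_s(sent_repr).unsqueeze(1)
            + self.w_v(visual_feat).unsqueeze(1)
        )
        # Element-wise filtering yields the second-level hidden states
        # that the decoder attends over.
        return gate * enc_states
```

Each encoder state is rescaled element-wise by a sigmoid gate, so words the gate judges irrelevant to the sentence and image contribute little to the representation the decoder sees.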

Contributions

The contributions can be summarized as follows:

  • Proposed a novel multimodal selective mechanism that uses both textual and visual signals to select the important information from the source text.
  • Proposed a visual-guided modality regularization module to encourage the model to focus on the key information in the source.
  • Experimental results on a multimodal sentence summarization dataset demonstrate that the proposed system takes advantage of multimodal information and outperforms baseline methods.

Summary

The input to the multimodal sentence summarization task is a text-image pair, and the output is a textual summary. The paper proposes visual selective gates that encourage the important source information to be encoded into a second-level hidden sequence, arguing that the text can be pertinent to visual information at the level of the whole image, parts of the image, and object proposals in the image. Accordingly, three visual selective gates are designed: global-level, grid-level, and object-level (a feature-extraction sketch follows the baseline list below). The summary decoder then produces the summary from the second-level hidden states. The baselines compared against are:

  • Lead
  • Compress
  • ABS
  • SEASS
  • Multi-Source
  • Doubly-attentive
  • Seq2Seq
  • PGNet
  • MAtt
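
The paper extracts visual features at the three granularities above. As a rough illustration, here is one plausible way to obtain the global-level and grid-level features from torchvision's pretrained VGG19 (object-level features would come from a separate region detector, which is omitted here); the specific layers are assumptions, not taken from the paper.

```python
import torch
from torchvision import models

# Pretrained VGG19; eval() disables dropout in the classifier head.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

@torch.no_grad()
def visual_features(images: torch.Tensor):
    """images: (batch, 3, 224, 224), ImageNet-normalized."""
    conv = vgg.features(images)                 # (b, 512, 7, 7) conv feature map
    # Grid-level: one 512-d vector per spatial cell of the 7x7 map.
    grid = conv.flatten(2).transpose(1, 2)      # (b, 49, 512)
    # Global-level: fc activations from the classifier head.
    pooled = vgg.avgpool(conv).flatten(1)       # (b, 25088)
    global_feat = vgg.classifier[:5](pooled)    # (b, 4096)
    # Object-level features (not shown) would be one vector per
    # detected object proposal from a pretrained detector.
    return global_feat, grid
```

Each granularity would feed its own selective gate: the global vector contributes a single image-wide signal, while grid and object features allow finer-grained relevance matching between words and image regions.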

The metrics used are:

  • ROUGE-{1,2,L}
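
For completeness, ROUGE F1 can be computed with Google's rouge-score package; the sentences below are made-up examples, not drawn from the dataset.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "police arrest five anti-nuclear protesters",    # reference summary
    "police arrested five protesters at the rally",  # system output
)
for name, score in scores.items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```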