Multimodal Sentence Summarization via Multimodal Selective Encoding

Li et al. proposed a hierarchical attention model for the multimodal sentence summarization task, but the image is not involved in encoding the text. Intuitively, it is easier for the decoder to generate an accurate summary if the encoder can filter out trivial information while encoding the input sentence. Based on this idea, the paper proposes a multimodal selective mechanism that selects the highlights from the input text using visual signals; the decoder then generates the summary from the filtered encoding. Concretely, an encoder reads the input text and produces hidden representations. Multimodal selective gates then measure the relevance between the input words and the image to construct the selected hidden representations (a minimal sketch of such a gate follows the link list below). Finally, a decoder generates the summary from the selected hidden representations.

  • Paper Link : https://aclanthology.org/2020.coling-main.496.pdf
  • Dataset : Public multimodal sentence summarization dataset
  • Model : Encoder with selective gates
  • Pretrained models : VGG19 (image features); the text encoder and decoder are GRUs
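
The core mechanism is easy to sketch. Below is a minimal PyTorch rendition of a selective gate conditioned on a visual feature, in the spirit of SEASS-style selective encoding extended with an image term; the class, layer names, and dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MultimodalSelectiveGate(nn.Module):
    """Sketch: filter encoder states with a gate driven by text + image."""

    def __init__(self, hidden_size: int, visual_size: int):
        super().__init__()
        self.w_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w_s = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w_v = nn.Linear(visual_size, hidden_size, bias=True)

    def forward(self, enc_states, sent_repr, visual_feat):
        # enc_states:  (batch, seq_len, hidden) first-level hidden states
        # sent_repr:   (batch, hidden)          e.g. final bi-GRU state
        # visual_feat: (batch, visual)          e.g. a VGG19 image feature
        gate = torch.sigmoid(
            self.w_h(enc_states)
            + self.w_s(sent_repr).unsqueeze(1)
            + self.w_v(visual_feat).unsqueeze(1)
        )
        # Element-wise filtering yields the second-level hidden states
        # that the decoder attends over.
        return gate * enc_states
```

Each encoder state is rescaled element-wise by a sigmoid gate, so words the gate judges irrelevant to the sentence and image contribute little to the representation the decoder sees.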

Contributions

The contributions can be summarized as follows:

  • Proposed a novel multimodal selective mechanism that uses both textual and visual signals to select the important information from the source text.
  • Proposed a visual-guided modality regularization module to encourage the model to focus on the key information in the source.
  • Experimental results on a multimodal sentence summarization dataset demonstrate that the proposed system takes advantage of multimodal information and outperforms baseline methods.

Summary

The input to the multimodal sentence summarization task is a text-image pair, and the output is a textual summary. The paper proposes visual selective gates that encourage the important source information to be encoded into a second-level hidden sequence, arguing that the text can be pertinent to visual information at the level of the whole image, parts of the image, and object proposals in the image. Accordingly, three visual selective gates are designed: global-level, grid-level, and object-level (a feature-extraction sketch follows the baseline list below). The summary decoder then produces the summary from the second-level hidden states. The baselines compared against are:

  • Lead
  • Compress
  • ABS
  • SEASS
  • Multi-Source
  • Doubly-attentive
  • Seq2Seq
  • PGNet
  • MAtt
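
The paper extracts visual features at the three granularities above. As a rough illustration, here is one plausible way to obtain the global-level and grid-level features from torchvision's pretrained VGG19 (object-level features would come from a separate region detector, which is omitted here); the specific layers are assumptions, not taken from the paper.

```python
import torch
from torchvision import models

# Pretrained VGG19; eval() disables dropout in the classifier head.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

@torch.no_grad()
def visual_features(images: torch.Tensor):
    """images: (batch, 3, 224, 224), ImageNet-normalized."""
    conv = vgg.features(images)                 # (b, 512, 7, 7) conv feature map
    # Grid-level: one 512-d vector per spatial cell of the 7x7 map.
    grid = conv.flatten(2).transpose(1, 2)      # (b, 49, 512)
    # Global-level: fc activations from the classifier head.
    pooled = vgg.avgpool(conv).flatten(1)       # (b, 25088)
    global_feat = vgg.classifier[:5](pooled)    # (b, 4096)
    # Object-level features (not shown) would be one vector per
    # detected object proposal from a pretrained detector.
    return global_feat, grid
```

Each granularity would feed its own selective gate: the global vector contributes a single image-wide signal, while grid and object features allow finer-grained relevance matching between words and image regions.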

The metrics used are:

  • ROUGE-{1,2,L}
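
For completeness, ROUGE F1 can be computed with Google's rouge-score package; the sentences below are made-up examples, not drawn from the dataset.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "police arrest five anti-nuclear protesters",    # reference summary
    "police arrested five protesters at the rally",  # system output
)
for name, score in scores.items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```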