Multi-Modal Research Survey for Summarization

2021

Self-Supervised Multimodal Opinion Summarization

1 minute read

Opinion summarization is the task of automatically generating summaries from multiple documents containing users’ thoughts on businesses or products. Summarizing users’ opinions can provide information that helps other users with their purchasing decisions.

Multi-Modal Supplementary-Complementary Summarization using Multi-Objective Optimization

2 minute read

When dealing with multi-modal information retrieval tasks, the extent to which a particular modality contributes to the final output might differ from the other modalities. Among the modalities, there is often a preferred mode of representation, based on its significance and ability to fulfill the task. We denote these preferred modalities as key modalities or central modalities (referred to as central modalities from here on). The other modalities assist the central modalities in fulfilling the desired task and are known as adjacent modalities. The adjacent modalities can enhance the user experience either by supplementing or by complementing the information represented via the central modality. When these adjacent modalities reinforce the facts and ideas presented in the central modality, the enhancement is known as supplementary enhancement. On the other hand, when these adjacent modalities complete the central modality by providing additional or alternative information that is relevant, albeit not covered by the central modality, the enhancement is known as complementary enhancement.
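To make the supplementary/complementary distinction concrete, here is a toy Python sketch. This is not the paper's multi-objective formulation: the embeddings, dimensions, and scoring rule are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def enhancement_scores(central_emb: torch.Tensor, adjacent_emb: torch.Tensor):
    """Score an adjacent modality's embedding against the central modality's.

    High cosine overlap means the adjacent modality restates the central
    content (supplementary); low overlap means it brings new content
    (complementary). A real system would also gate the complementary score
    on task relevance so that pure noise is not rewarded.
    """
    overlap = F.cosine_similarity(central_emb, adjacent_emb, dim=0)
    supplementary = overlap.item()        # reinforces the central modality
    complementary = 1.0 - overlap.item()  # adds content the central modality lacks
    return supplementary, complementary

# Toy usage with random unit-normalized embeddings.
central = F.normalize(torch.randn(128), dim=0)
adjacent = F.normalize(torch.randn(128), dim=0)
print(enhancement_scores(central, adjacent))
```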

Multimodal Sentence Summarization via Multimodal Selective Encoding

1 minute read

Li et al. proposed a hierarchical attention model for the multimodal sentence summarization task, but the image is not involved in the process of text encoding. Intuitively, it will be easier for the decoder to generate an accurate summary if the encoder can filter out trivial information when encoding the input sentence. Based on this idea, the paper proposes a multimodal selective mechanism that aims to select the highlights from the input text using visual signals; the decoder then generates the summary from the filtered encoding information. Concretely, an encoder reads the input text and generates the hidden representations. Then, multimodal selective gates measure the relevance between the input words and the image to construct the selected hidden representation. Finally, a decoder generates the summary using the selected hidden representation.
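A minimal PyTorch sketch of such a selective gate, assuming a global image feature and encoder hidden states; the class name and dimensions are made up for illustration:

```python
import torch
import torch.nn as nn

class MultimodalSelectiveGate(nn.Module):
    """Re-weight each word's hidden state by its relevance to the image."""

    def __init__(self, hidden_dim: int, image_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim + image_dim, hidden_dim)

    def forward(self, text_hidden, image_feat):
        # text_hidden: (batch, seq_len, hidden_dim) from the text encoder
        # image_feat:  (batch, image_dim) global visual feature
        img = image_feat.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
        gate = torch.sigmoid(self.gate(torch.cat([text_hidden, img], dim=-1)))
        return gate * text_hidden  # selected hidden representation for the decoder

# Toy usage: 2 sentences of 10 words, 256-d hidden states, 512-d image feature.
hidden = torch.randn(2, 10, 256)
image = torch.randn(2, 512)
print(MultimodalSelectiveGate(256, 512)(hidden, image).shape)  # (2, 10, 256)
```

Words the gate deems irrelevant to the image get values near zero, so they contribute little to the representation the decoder attends over.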

Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos

1 minute read

Multimodal summarization for open-domain videos is an emerging task that aims to generate a summary from multisource information (video, audio, transcript). Despite the success of recent multiencoder-decoder frameworks on this task, existing methods lack fine-grained multimodality interactions among multisource inputs. Besides, unlike other multimodal tasks, this task involves longer multimodal sequences with more redundancy and noise. To address these two issues, the paper proposes a multistage fusion network with a fusion forget gate module, which models fine-grained interactions between the multisource modalities through a multistep fusion schema and controls the flow of redundant information between multimodal long sequences via a forgetting module.
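A rough sketch of the forgetting idea, assuming target and source streams that are already aligned (for example, via cross-modal attention); the gated-residual form and the names are illustrative, not the paper's exact module:

```python
import torch
import torch.nn as nn

class FusionForgetGate(nn.Module):
    """Decide, per time step, how much of the source stream to fuse."""

    def __init__(self, dim: int):
        super().__init__()
        self.forget = nn.Linear(2 * dim, dim)

    def forward(self, target, source):
        # target, source: (batch, seq_len, dim), time-aligned streams
        f = torch.sigmoid(self.forget(torch.cat([target, source], dim=-1)))
        return target + f * source  # redundant/noisy steps get f ~ 0

# Toy usage: a transcript stream fused with an attended video stream.
transcript = torch.randn(2, 50, 256)
video = torch.randn(2, 50, 256)
print(FusionForgetGate(256)(transcript, video).shape)  # (2, 50, 256)
```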

VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

1 minute read

In real-world applications, the input is usually a video consisting of hundreds of frames. Consequently, the temporal dependency in a video cannot simply be modeled by static encoding methods. Hence, this work proposes Video-based Multimodal Summarization with Multimodal Output (VMSMO), which selects a cover frame from the news video and generates a textual summary of the news article at the same time.
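A minimal sketch of the frame-selection side, assuming precomputed frame and article features; the GRU encoder, bilinear scorer, and dimensions are assumptions, and the actual VMSMO model is considerably richer:

```python
import torch
import torch.nn as nn

class CoverFrameSelector(nn.Module):
    """Score temporally encoded frames against the article to pick a cover."""

    def __init__(self, frame_dim: int, text_dim: int, hidden: int):
        super().__init__()
        self.temporal = nn.GRU(frame_dim, hidden, batch_first=True)  # temporal dependency
        self.score = nn.Bilinear(hidden, text_dim, 1)

    def forward(self, frames, article_repr):
        # frames: (batch, num_frames, frame_dim); article_repr: (batch, text_dim)
        h, _ = self.temporal(frames)
        article = article_repr.unsqueeze(1).expand(-1, h.size(1), -1).contiguous()
        scores = self.score(h, article).squeeze(-1)  # (batch, num_frames)
        return scores.argmax(dim=-1)                 # index of the cover frame

# Toy usage: 2 videos with 100 frames each.
frames = torch.randn(2, 100, 512)
article = torch.randn(2, 256)
print(CoverFrameSelector(512, 256, 256)(frames, article))
```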

Multi-modal Summarization for Video-containing Documents

1 minute read

Existing models suffer from the following drawbacks:

  • Most existing applications extract visual information from accompanying images but ignore related videos. The paper contends that videos contain abundant content and have temporal characteristics, with events represented chronologically, which are crucial for text summarization.
  • Although attention mechanisms and early fusion are used extensively, they introduce noise because they are unsuitable for unaligned multi-modal data, where a large modality gap requires intensive communication.
  • Various multi-modal summarization works have focused on a single task, such as text or video summarization with added information from other modalities. The paper observes that both summarization tasks share the same target of refining original long material, so they can be performed jointly owing to their common characteristics.

Convolutional Hierarchical Attention Network for Query-Focused Video Summarization

2 minute read

There are three differences between query-focused video summarization and generic video summarization:

  • Firstly, the video summary needs to take the subjectivity of users into account, as different user queries may receive different video summaries.
  • Secondly, trained video summarizers cannot meet all users’ preferences, and performance evaluation often measures temporal overlap, which makes it hard to capture the semantic similarity between summaries and original videos.
  • Thirdly, the textual query will bring additional semantic information to the task.

Aspect-Aware Multimodal Summarization for Chinese E-Commerce Products

1 minute read

Commercial product advertisements, as a critical component of marketing management in e-commerce platforms, aim to attract consumers’ interest and arouse their desire to purchase the products. However, most product advertisements are so miscellaneous and tedious that consumers cannot be expected to read through them patiently.

Multimodal Summarization of Complex Sentences

1 minute read

This paper introduces ROCMMS, a system that automatically converts existing text into multimodal summaries (MMS): diagrams that capture the meaning of a complex sentence through pictures and simplified text, related by structure extracted from the original sentence.

CLIP: Connecting Text and Images

less than 1 minute read

CLIP is trained on a wide variety of images paired with a wide variety of natural language supervision that is abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification tasks without directly optimizing for benchmark performance.
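As an illustration of this zero-shot usage, here is a short sketch with the public Hugging Face CLIP checkpoint; the image path and label prompts are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Class names become natural-language prompts; no benchmark-specific training.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```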