Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos

1 minute read

Published:

Multimodal summarization for open-domain videos is an emerging task, aiming to generate a summary from multisource information (video, audio, transcript). Despite the success of recent multiencoder-decoder frameworks on this task, existing methods lack finegrained multimodality interactions of multisource inputs. Besides, unlike other multimodal tasks, this task has longer multimodal sequences with more redundancy and noise. To address these two issues, the paper proposed a multistage fusion network with the fusion forget gate module, which builds upon this approach by modeling fine-grained interactions between the multisource modalities through a multistep fusion schema and controlling the flow of redundant information between multimodal long sequences via a forgetting module.

  • Paper Link : https://aclanthology.org/2020.emnlp-main.144.pdf
  • Code : https://github.com/forkarinda/MFN
  • Dataset : How2 dataset
  • Model : Encoder-Decoder with fusion, forget gates and hierarchical fusion decoder.
  • Pretrained models : ResNetXt-101, ASR

Contributions

The contributions can be summarized as follows:

  • Proposed a multistage fusion network with the fusion forget gate module for multimodal summarization in videos. The model involves multiple information fusion processes to capture the correlation between multisource modalities spontaneously, and a fusion forget gate is proposed to effectively suppress the flow of unnecessary multimodal noise.

Summary

The overall architecture of the proposed model is a multistage fusion network which consists of the cross fusion block and hierarchical fusion decoder, which aims to model the correlation and complementarity between modalities spontaneously. In addition, the fusion forget gate is applied in the cross fusion block to filter the flow of redundant information streams. The model is built based on the RNN and transformer encoder-decoder architectures, respectively.

The baselines which are compared are :

  • S2S
  • PG
  • FT
  • VideoRNN
  • MT
  • HA

The metrics used are :

  • BLEU
  • ROUGE-{1,2,L}
  • METEOR
  • CIDEr