Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos


Published: October 13, 2021

Multimodal summarization for open-domain videos is an emerging task that aims to generate a summary from multisource information (video, audio, transcript). Despite the success of recent multi-encoder-decoder frameworks on this task, existing methods lack fine-grained interactions between the multisource inputs. In addition, unlike other multimodal tasks, this task involves longer multimodal sequences with more redundancy and noise. To address these two issues, the paper proposes a multistage fusion network with a fusion forget gate module. It models fine-grained interactions between the multisource modalities through a multistep fusion schema, and it controls the flow of redundant information between long multimodal sequences via the forgetting module.

Paper Link : https://aclanthology.org/2020.emnlp-main.144.pdf
Code : https://github.com/forkarinda/MFN
Dataset : How2 dataset
Model : Encoder-Decoder with fusion, forget gates and hierarchical fusion decoder.
Pretrained models : ResNeXt-101, ASR

Contributions

The contributions can be summarized as follows:

The paper proposes a multistage fusion network with a fusion forget gate module for multimodal summarization in videos. The model applies several information fusion steps to capture the correlation between the multisource modalities, and the fusion forget gate suppresses the flow of unnecessary multimodal noise.

Summary

The proposed model is a multistage fusion network consisting of a cross fusion block and a hierarchical fusion decoder, which together model the correlation and complementarity between modalities. Within the cross fusion block, the fusion forget gate filters redundant information out of the fused streams. The paper builds two versions of the model, one on an RNN encoder-decoder architecture and one on a transformer encoder-decoder architecture.
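To make the gating idea concrete, here is a minimal sketch of a forget-style gate applied to a fused feature vector. It is a generic sigmoid gate, not the paper's exact formulation: the function name `fusion_forget_gate`, the concatenation of the two inputs, and the randomly initialised `weights` and `bias` are all assumptions made for illustration. Plain Python lists are used so the sketch has no dependencies.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fusion_forget_gate(source, fused, weights, bias):
    """Scale each dimension of `fused` by a gate value in (0, 1).

    source:  feature vector of the guiding modality (length d)
    fused:   cross-modal feature vector to be filtered (length d)
    weights: d rows of 2*d parameters (hypothetical, random here)
    bias:    d parameters (hypothetical)
    """
    concat = source + fused  # list concatenation, length 2*d
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
    # Element-wise product: a gate near 0 "forgets" that dimension.
    filtered = [g * f for g, f in zip(gate, fused)]
    return filtered, gate


random.seed(0)
d = 4
source = [random.uniform(-1, 1) for _ in range(d)]
fused = [random.uniform(-1, 1) for _ in range(d)]
weights = [[random.uniform(-1, 1) for _ in range(2 * d)] for _ in range(d)]
bias = [0.0] * d

filtered, gate = fusion_forget_gate(source, fused, weights, bias)
print("gate:    ", [round(g, 3) for g in gate])
print("filtered:", [round(v, 3) for v in filtered])
```

Because every gate value lies strictly between 0 and 1, the filtered vector never exceeds the fused input in magnitude. In a trained model the parameters would be learned so that noisy or redundant dimensions receive gate values close to 0.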

The model is compared against the following baselines:

S2S
PG
FT
VideoRNN
MT
HA

The following metrics are used:

BLEU
ROUGE-{1,2,L}
METEOR
CIDEr
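As a rough illustration of what the ROUGE-1 score in this list measures, the sketch below computes unigram precision, recall and F1 between a candidate summary and a single reference. It is a simplified version under stated assumptions: lowercased whitespace tokenization, no stemming, one reference only. It is not the official ROUGE toolkit, and the function name `rouge_1` and the example sentences are made up for this post.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) over clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge_1("how to peel an apple", "learn how to peel an apple quickly")
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "1.000 0.714 0.833"
```

Here all five candidate words appear in the seven-word reference, so precision is 5/5 and recall is 5/7. ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.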


Ashwin Pathak
