Convolutional Hierarchical Attention Network for Query-Focused Video Summarization

2 minute read

There are three differences between query-focused video summarization and generic video summarization:

  • Firstly, the video summary needs to take the subjectivity of users into account, as different user queries should yield different video summaries.
  • Secondly, a single trained video summarizer cannot meet all users’ preferences, and performance evaluation typically measures temporal overlap, which makes it hard to capture the semantic similarity between summaries and original videos.
  • Thirdly, the textual query brings additional semantic information to the task.

  • Paper link: https://arxiv.org/pdf/2002.03740.pdf
  • Dataset: UTE dataset
  • Model: CHAN (Feature Encoding Network, Fully Convolutional Block, Information Fusion Layer, Deconvolutional Layer)

Contributions

The main contributions are as follows:

  • Proposes a novel method named Convolutional Hierarchical Attention Network (CHAN), which is based on a convolutional network and a global-local attention mechanism and can generate the query-related video summary in parallel.
  • Presents a feature encoding network to learn the features of each video shot. Inside the feature encoding network, a fully convolutional network with a local self-attention mechanism and a query-aware global attention mechanism is used to obtain features with richer semantic information.
  • Employs a query-relevance computing module that takes the features of video shots and the query as input and calculates a similarity score; the query-related video summary is generated based on this score (see the sketch after this list).
  • Performs extensive experiments on the benchmark datasets to demonstrate the effectiveness and efficiency of the model.
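
How such a query-relevance computing module could be realized is sketched below. The cosine-similarity scoring and the fixed shot `budget` are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def query_relevance_scores(shot_feats, query_feat):
    """Score each shot against the query by cosine similarity.

    shot_feats: (num_shots, d) encoded shot-level features
    query_feat: (d,) encoded query feature
    Returns a (num_shots,) tensor of similarity scores.
    """
    query_feat = query_feat.unsqueeze(0)  # (1, d) so it broadcasts over shots
    return F.cosine_similarity(shot_feats, query_feat, dim=-1)

def select_summary(shot_feats, query_feat, budget=10):
    """Pick the `budget` highest-scoring shots as the query-related summary."""
    scores = query_relevance_scores(shot_feats, query_feat)
    top = torch.topk(scores, k=min(budget, scores.numel())).indices
    return top.sort().values  # return shot indices in temporal order
```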

Summary

The framework of the proposed convolutional hierarchical attention network (CHAN) works as follows. The paper first splits the original long video into several segments and extracts visual features with a pretrained network. The visual features are then sent to a feature encoding network to learn shot-level features. In the feature encoding network, the paper proposes a local self-attention module to learn the high-level semantic information within a video segment and employs a query-aware global attention module to model the semantic relationship between all segments and the given query. To reduce the number of parameters in the model, the paper employs a fully convolutional block to decrease the dimension of the visual features and shorten the length of the shot sequence. Finally, all aggregated shot-level visual features are sent to a query-relevance ranking module to obtain similarity scores, and the query-related video summary is generated based on these scores.
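
A minimal sketch of these encoding stages, assuming PyTorch and single-head attention; the feature dimensions, kernel size, query-embedding size, and mean-pooling of segment representations are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureEncodingSketch(nn.Module):
    """Illustrative sketch of the encoding stages described above."""

    def __init__(self, d_in=2048, d_model=256, d_query=300):
        super().__init__()
        # Fully convolutional block: reduce the feature dimension and shorten
        # the shot sequence with a strided 1D convolution.
        self.conv = nn.Conv1d(d_in, d_model, kernel_size=3, stride=2, padding=1)
        # Local self-attention within a segment.
        self.local_attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        # Query-aware global attention over segment-level representations.
        self.global_attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.query_proj = nn.Linear(d_query, d_model)  # d_query: assumed word-embedding size

    def forward(self, segment_feats, query_emb):
        # segment_feats: (num_segments, seg_len, d_in) visual shot features
        # query_emb:     (query_len, d_query) embedded query words
        x = self.conv(segment_feats.transpose(1, 2)).transpose(1, 2)  # (S, L', d_model)
        x, _ = self.local_attn(x, x, x)                               # within-segment context
        seg_repr = x.mean(dim=1, keepdim=True).transpose(0, 1)        # (1, S, d_model)
        q = self.query_proj(query_emb).unsqueeze(0)                   # (1, Q, d_model)
        # Attend from the query to all segments for query-aware global context.
        global_ctx, _ = self.global_attn(q, seg_repr, seg_repr)       # (1, Q, d_model)
        return x, global_ctx
```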

The baselines compared against are:

  • SeqDPP
  • SH-DPP
  • QC-DPP
  • TPAN

The evaluation metric used is:

  • F1-Score
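
F1 is the harmonic mean of precision and recall. A minimal sketch, assuming a simple set overlap between predicted and reference shot indices (the benchmark's actual matching procedure may differ):

```python
def f1_score(predicted, reference):
    """Harmonic mean of precision and recall over two sets of shot indices."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```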