Query-Biased Self-Attentive Network for Query-Focused Video Summarization
IEEE Transactions on Image Processing，2020，29（）：5889 - 589 | 2020年04月13日 | 10.1109/TIP.2020.2985868
This paper addresses the task of query-focused video summarization, which takes user queries and long videos as inputs and generates query-focused video summaries. Compared to video summarization, which mainly concentrates on finding the most diverse and representative visual contents as a summary, the task of query-focused video summarization considers the user's intent and the semantic meaning of generated summary. In this paper, we propose a method, named query-biased self-attentive network (QSAN) to tackle this challenge. Our key idea is to utilize the semantic information from video descriptions to generate a generic summary and then to combine the information from the query to generate a query-focused summary. Specifically, we first propose a hierarchical self-attentive network to model the relative relationship at three levels, which are different frames from a segment, different segments of the same video, textual information of video description and its related visual contents. We train the model on video caption dataset and employ a reinforced caption generator to generate a video description, which can help us locate important frames or shots. Then we build a query-aware scoring module to compute the query-relevant score for each shot and generate the query-focused summary. Extensive experiments on the benchmark dataset demonstrate the competitive performance of our approach compared to some methods.