Sciencepaper Online (中国科技论文在线)

Uploaded: November 12, 2020

[Journal article] Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction

IEEE Transactions on Image Processing, 2020, 29(): 3750 - 376

January 17, 2020

Moment retrieval aims to localize the most relevant moment in an untrimmed video according to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling, or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) that considers multiple crucial factors for this challenging task, including the syntactic dependencies of natural language queries, long-range semantic dependencies in video context, and sufficient cross-modal interaction. Specifically, we devise a syntactic GCN to leverage the syntactic structure of queries for fine-grained representation learning and propose a multi-head self-attention to capture long-range semantic dependencies from video context. Next, we employ a multi-stage cross-modal interaction to explore the potential relations between video and query contents, and we also consider query reconstruction from the cross-modal representations of the target moment as an auxiliary task to strengthen the cross-modal representations. Extensive experiments on ActivityNet Captions and TACoS demonstrate the effectiveness of our proposed method.
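
As a rough illustration of how self-attention lets every video segment attend to every other segment regardless of temporal distance, the following minimal sketch implements plain scaled dot-product self-attention, the building block of multi-head self-attention. It is not the CMIN implementation: the function names are hypothetical, and the sketch assumes a single head with no learned projections, so queries, keys, and values are the raw feature vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over a list of feature vectors.

    Assumption: queries, keys and values are the raw features
    (single head, no learned projection matrices).
    """
    d = len(frames[0])
    out = []
    for q in frames:
        # Similarity of this segment to every segment, near or far in time.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        # Each output is a weighted average of all segment features.
        out.append([sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)])
    return out


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended = self_attention(frames)
```

In this toy input the first and third segments are identical, so they receive identical outputs even though they are not adjacent, which is the long-range behavior the abstract refers to.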

Uploaded: November 12, 2020

[Journal article] Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks

IEEE Transactions on Image Processing, 2019, 28(12): 5939 - 595

June 17, 2019

Abstract

Open-ended long-form video question answering is a challenging task in visual information retrieval, which automatically generates a natural language answer from the referenced long-form video contents according to a given question. However, existing works mainly focus on short-form video question answering, because they do not model semantic representations of long-form video contents. In this paper, we introduce a dynamic hierarchical reinforced network for open-ended long-form video question answering, which employs an encoder-decoder architecture with a dynamic hierarchical encoder and a reinforced decoder. Concretely, we first propose a frame-level dynamic long short-term memory (LSTM) network with a binary segmentation gate to learn frame-level semantic representations according to the given question. We then develop a segment-level highway LSTM network with a question-aware highway gate for segment-level semantic modeling. Furthermore, we devise the reinforced decoder with a hierarchical attention mechanism to generate natural language answers. We construct a large-scale long-form video question answering dataset. Extensive experiments on the long-form dataset and another public short-form dataset show the effectiveness of our method.

Uploaded: November 12, 2020

[Journal article] Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks

IEEE Transactions on Image Processing, 2019, 28(8): 3860 - 387

February 27, 2019

Abstract

Multi-turn video question answering is a challenging task in visual information retrieval, which generates an accurate answer from the referenced video contents according to the visual conversation context and the given question. However, existing visual question answering methods mainly tackle single-turn video question answering and may be ineffective when applied directly to the multi-turn setting, because they do not sufficiently model the sequential conversation context. In this paper, we study the problem of multi-turn video question answering from the viewpoint of multi-stream hierarchical attention context reinforced network learning. We first propose the hierarchical attention context network for context-aware question understanding by modeling the hierarchically sequential conversation context structure. We then develop the multi-stream spatio-temporal attention network for learning the joint representation of the dynamic video contents and the context-aware question embedding. We next devise a multi-step reasoning process to enhance the multi-stream hierarchical attention context network learning method. Finally, we predict the multiple-choice answer from the candidate answer set and further develop the reinforced decoder network to generate the open-ended natural language answer for multi-turn video question answering. We construct two large-scale multi-turn video question answering datasets. Extensive experiments show the effectiveness of our method.

Uploaded: November 12, 2020

[Journal article] On the Diversity of Conditional Image Synthesis With Semantic Layouts

IEEE Transactions on Image Processing, 2019, 28(6): 2898 - 290

January 10, 2019

Abstract

Many image processing tasks, such as colorization, super-resolution, and conditional image synthesis, can be formulated as translating images between two image domains. In most of these tasks, an input image may correspond to multiple outputs. However, existing approaches show only minor stochasticity in their outputs. In this paper, we present a novel approach to synthesize diverse realistic images corresponding to a semantic layout. We introduce a diversity loss objective that maximizes the distance between synthesized image pairs and relates the input noise to the semantic segments in the synthesized images. Thus, our approach can not only produce multiple diverse images but also allow users to manipulate the output images by adjusting the noise manually. The experimental results show that images synthesized by our approach are more diverse than those of existing works and that adding our diversity loss does not degrade the realism of the base networks. Moreover, our approach can be applied to unpaired datasets.
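
To make the idea of a loss that "maximizes the distance between synthesized image pairs" concrete, here is a minimal sketch under stated assumptions. It is not the paper's objective: the mean absolute (L1) distance, the flattened-list image representation, and the function names are all hypothetical choices for illustration, and the sketch omits the part of the loss that ties the input noise to semantic segments.

```python
def l1_distance(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def diversity_loss(images):
    """Negative mean pairwise distance between synthesized images.

    Minimizing this value pushes images generated from different
    noise inputs apart. Assumption: L1 distance on raw pixel values.
    """
    pairs = [(i, j) for i in range(len(images)) for j in range(i + 1, len(images))]
    if not pairs:
        return 0.0  # fewer than two samples: nothing to compare
    total = sum(l1_distance(images[i], images[j]) for i, j in pairs)
    return -total / len(pairs)


# Two toy "images" synthesized from the same layout with different noise.
collapsed = [[0.5, 0.5], [0.5, 0.5]]  # generator ignores the noise
diverse = [[0.0, 0.0], [1.0, 1.0]]    # outputs differ strongly

loss_collapsed = diversity_loss(collapsed)
loss_diverse = diversity_loss(diverse)
```

A generator that ignores its noise input produces identical samples and receives the highest possible loss of zero, while more varied outputs lower the loss. In practice such a term would be added to the base network's realism objective with a weighting factor.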

Uploaded: November 12, 2020

[Journal article] Addressing the Item Cold-Start Problem by Attribute-Driven Active Learning

IEEE Transactions on Knowledge and Data Engineering, 2019, 32(4): 631 - 644

January 9, 2019

Abstract

In recommender systems, cold-start issues are situations where no previous events, e.g., ratings, are known for certain users or items. In this paper, we focus on the item cold-start problem. Both content information (e.g., item attributes) and initial user ratings are valuable for capturing users' preferences for a new item. However, previous methods for the item cold-start problem either (1) incorporate content information into collaborative filtering to perform hybrid recommendation, or (2) actively select users to rate the new item without considering content information and then perform collaborative filtering. We propose a novel recommendation scheme for the item cold-start problem that leverages both active learning and items' attribute information. Specifically, we design useful user selection criteria based on items' attributes and users' rating history, and combine the criteria in an optimization framework for selecting users. By exploiting the feedback ratings, users' previous ratings, and items' attributes, we then generate accurate rating predictions for the remaining unselected users. Experimental results on two real-world datasets show the superiority of our proposed method over traditional methods.
