Sciencepaper Online (中国科技论文在线)

Uploaded: November 12, 2020

[Journal article] Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks

IEEE Transactions on Image Processing, 2019, 28(8): 3860 - 387

February 27, 2019

Multi-turn video question answering is a challenging task in visual information retrieval, which generates the accurate answer from the referenced video contents according to the visual conversation context and given question. However, the existing visual question answering methods mainly tackle the problem of single-turn video question answering, which may be ineffectively applied for multi-turn video question answering directly, due to the insufficiency of modeling the sequential conversation context. In this paper, we study the problem of multi-turn video question answering from the viewpoint of multi-stream hierarchical attention context reinforced network learning. We first propose the hierarchical attention context network for context-aware question understanding by modeling the hierarchically sequential conversation context structure. We then develop the multi-stream spatio-temporal attention network for learning the joint representation of the dynamic video contents and context-aware question embedding. We next devise a multi-step reasoning process to enhance the multi-stream hierarchical attention context network learning method. We finally predict the multiple-choice answer from the candidate answer set and further develop the reinforced decoder network to generate the open-ended natural language answer for multi-turn video question answering. We construct two large-scale multi-turn video question answering datasets. The extensive experiments show the effectiveness of our method.
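The abstract describes modeling the conversation context hierarchically with attention. As a rough illustration only, the sketch below applies generic dot-product soft attention at two levels: over the words inside each conversation turn, then over the resulting turn summaries. The function names (`attend`, `hierarchical_context`) and the toy vectors are assumptions made for this example; the paper's actual network is not specified in this listing.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, vectors):
    """Average `vectors`, weighted by softmax of their dot product with `query`."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def hierarchical_context(question, turns):
    """`turns` is a list of conversation turns, each a list of word vectors."""
    # Word level: summarize each turn with respect to the question.
    turn_summaries = [attend(question, words) for words in turns]
    # Turn level: summarize the whole conversation with respect to the question.
    return attend(question, turn_summaries)


# Toy inputs (hypothetical 2-dimensional embeddings).
question = [1.0, 0.0]
turns = [
    [[1.0, 0.0], [0.0, 1.0]],  # first turn: two words
    [[0.5, 0.5]],              # second turn: one word
]
context = hierarchical_context(question, turns)
```

The same `attend` helper is reused at both levels, which is the only point the sketch is meant to convey: the context vector is built bottom-up from words to turns.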

Keywords: none


Uploaded: November 12, 2020

[Journal article] Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks

IEEE Transactions on Image Processing, 2019, 28(12): 5939 - 595

June 17, 2019

Abstract

Open-ended long-form video question answering is a challenging task in visual information retrieval, which automatically generates a natural language answer from the referenced long-form video contents according to a given question. However, the existing works mainly focus on short-form video question answering, due to the lack of modeling semantic representations from long-form video contents. In this paper, we introduce a dynamic hierarchical reinforced network for open-ended long-form video question answering, which employs an encoder-decoder architecture with a dynamic hierarchical encoder and a reinforced decoder. Concretely, we first propose a frame-level dynamic long-short term memory (LSTM) network with binary segmentation gate to learn frame-level semantic representations according to the given question. We then develop a segment-level highway LSTM network with a question-aware highway gate for segment-level semantic modeling. Furthermore, we devise the reinforced decoder with a hierarchical attention mechanism to generate natural language answers. We construct a large-scale long-form video question answering dataset. The extensive experiments on the long-form dataset and another public short-form dataset show the effectiveness of our method.
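The abstract mentions a question-aware highway gate. As a loose sketch of the general highway idea only (a gate that blends a carried state with a newly computed state), the example below derives a scalar gate from the question vector. The function `question_aware_highway`, the scalar gate, and the `gate_weights` parameter are assumptions for illustration and are not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def question_aware_highway(carried, candidate, question, gate_weights):
    """Blend `carried` and `candidate` states with a gate computed from the question.

    g close to 1 passes the candidate through; g close to 0 keeps the carried state.
    """
    g = sigmoid(dot(gate_weights, question))
    return [g * c + (1.0 - g) * s for s, c in zip(carried, candidate)]


# Toy inputs (hypothetical values).
carried = [0.0, 0.0]
candidate = [2.0, 4.0]
question = [1.0, -1.0]
blended = question_aware_highway(carried, candidate, question, gate_weights=[0.0, 0.0])
```

With all-zero gate weights the gate equals 0.5, so the result is the midpoint of the two states.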

Keywords: none


Uploaded: November 12, 2020

[Journal article] Sparse Coding Guided Spatiotemporal Feature Learning for Abnormal Event Detection in Large Videos

IEEE Transactions on Multimedia, 21(1): 246 - 255

Abstract

Abnormal event detection in large videos is an important task in research and industrial applications, which has attracted considerable attention in recent years. Existing methods usually solve this problem by extracting local features and then learning an outlier detection model on training videos. However, most previous approaches merely employ hand-crafted visual features, which is a clear disadvantage due to their limited representation capacity. In this paper, we present a novel unsupervised deep feature learning algorithm for the abnormal event detection problem. To exploit the spatiotemporal information of the inputs, we utilize the deep three-dimensional convolutional network (C3D) to perform feature extraction. Then, the key problem is how to train the C3D network without any category labels. Here, we employ the sparse coding results of the hand-crafted features generated from the inputs to guide the unsupervised feature learning. Specifically, we define a multilevel similarity relationship between these inputs according to the statistical information of the shared atoms. In the following, we introduce the quadruplet concept to model the multilevel similarity structure, which could be used to construct a generalized triplet loss for training the C3D network. Furthermore, the C3D network could be utilized to generate the features for sparse coding again, and this pipeline could be iterated for several times. By jointly optimizing between the sparse coding and the unsupervised feature learning, we can obtain robust and rich feature representations. Based on the learned representations, the sparse reconstruction error is applied to predicting the anomaly score of each testing input. Experiments on several publicly available video surveillance datasets in comparison with a number of existing works demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
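The abstract refers to a generalized triplet loss built from a multilevel similarity structure. One plausible reading, sketched below under stated assumptions, is to sum ordinary hinge-style triplet terms over adjacent similarity levels, so that samples ranked as more similar to the anchor end up closer in embedding space. The names `triplet_loss` and `multilevel_loss`, the Euclidean distance, and the margin value are assumptions for this example, not the paper's formulation.

```python
import math


def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, closer, farther, margin=1.0):
    """Hinge loss that is zero once `closer` beats `farther` by at least `margin`."""
    return max(0.0, distance(anchor, closer) - distance(anchor, farther) + margin)


def multilevel_loss(anchor, ranked, margin=1.0):
    """`ranked` holds embeddings ordered from most to least similar to `anchor`.

    Each adjacent pair of similarity levels contributes one triplet term.
    """
    return sum(
        triplet_loss(anchor, ranked[i], ranked[i + 1], margin)
        for i in range(len(ranked) - 1)
    )


# Toy quadruplet: an anchor plus three samples at decreasing similarity levels.
anchor = [0.0, 0.0]
well_ordered = [[1.0, 0.0], [3.0, 0.0], [6.0, 0.0]]
mis_ordered = [[3.0, 0.0], [1.0, 0.0], [6.0, 0.0]]

loss_ok = multilevel_loss(anchor, well_ordered)   # 0.0: ordering already satisfied
loss_bad = multilevel_loss(anchor, mis_ordered)   # 3.0: first pair violates the order
```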

Keywords: none


Uploaded: November 12, 2020

[Journal article] Deep feature based contextual model for object detection

Neurocomputing, 2018, 275: 1035-1042

January 31, 2018

Abstract

One of the most active areas in computer vision is object detection, which has made significant improvement in recent years. Current state-of-the-art object detection methods mostly adhere to the framework of the regions with convolutional neural network (R-CNN). However, they only take advantage of the local appearance features inside object bounding boxes. Since these approaches ignore the contextual information around the object proposals, the outcome of these detectors may generate a semantically incoherent interpretation of the input image. In this paper, we propose a novel object detection system which incorporates the local appearance and the contextual information. Specifically, the contextual information comprises the relationships among objects and the global scene based contextual feature generated by a convolutional neural network. The whole system is formulated as a fully connected conditional random field (CRF) defined on object proposals. Then the contextual constraints among object proposals are modeled as edges naturally. Furthermore, a fast mean field approximation method is utilized to infer in this CRF model efficiently. The experimental results demonstrate that our algorithm achieves a higher mean average precision (mAP) on PASCAL VOC 2007 datasets compared with the baseline algorithm Faster R-CNN.
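The abstract describes inference in a fully connected CRF over object proposals using a mean field approximation. The sketch below shows a generic, naive mean field update with parallel updates for a small fully connected model; it is not the fast method the paper uses. The names `unary`, `edge_weight`, and `compat`, and the toy numbers, are assumptions for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_field(unary, edge_weight, compat, iterations=10):
    """Approximate per-proposal label marginals.

    unary[i][l]       cost of label l for proposal i (lower is better)
    edge_weight[i][j] strength of the contextual edge between proposals i and j
    compat[l][m]      penalty when connected proposals take labels l and m
    """
    n, num_labels = len(unary), len(unary[0])
    q = [softmax([-u for u in unary[i]]) for i in range(n)]
    for _ in range(iterations):
        new_q = []
        for i in range(n):
            scores = []
            for l in range(num_labels):
                # Expected pairwise penalty from all other proposals.
                pairwise = sum(
                    edge_weight[i][j] * compat[l][m] * q[j][m]
                    for j in range(n) if j != i
                    for m in range(num_labels)
                )
                scores.append(-(unary[i][l] + pairwise))
            new_q.append(softmax(scores))
        q = new_q
    return q


# Toy example: proposal 0 is confident in label 0, proposal 1 is ambiguous.
unary = [[0.0, 3.0], [1.0, 1.0]]
edge_weight = [[0.0, 1.0], [1.0, 0.0]]
compat = [[0.0, 1.0], [1.0, 0.0]]  # Potts-style penalty for disagreeing labels
marginals = mean_field(unary, edge_weight, compat)
```

In this toy setup the ambiguous proposal is pulled toward the label of its confident neighbor, which is the kind of contextual constraint the edges are meant to encode.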

Keywords: Object detection, Context information, Conditional random field


Uploaded: November 12, 2020

[Journal article] Question retrieval for community-based question answering via heterogeneous social influential network

Neurocomputing, 2018, 285: 117-124

April 12, 2018

Abstract

Community-based question answering platforms have attracted substantial users to share knowledge and learn from each other. As the rapid enlargement of community-based question answering (CQA) platforms, quantities of overlapped questions emerge, which makes users confounded to select a proper reference. It is urgent for us to take effective automated algorithms to reuse historical questions with corresponding answers. In this paper, we focus on the problem with question retrieval, which aims to match historical questions that are relevant or semantically equivalent to resolve one’s query directly. The challenges in this task are the lexical gaps between questions for the word ambiguity and word mismatch problem. Furthermore, limited words in queried sentences cause sparsity of word features. To alleviate these challenges, we propose a novel framework named HSIN which encodes not only the question contents but also the asker’s social interactions to enhance the question embedding performance. More specifically, we apply random walk based learning method with recurrent neural network to match the similarities between asker’s question and historical questions proposed by other users. Extensive experiments on a large-scale dataset from a real world CQA site Quora show that employing the heterogeneous social network information outperforms the other state-of-the-art solutions in this task.
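The abstract mentions a random walk based learning method over a network that mixes askers and questions. As a minimal sketch of just the walk-sampling step, assuming a simple adjacency-list graph with both user and question nodes, the code below generates node sequences of the kind that could then be fed to a sequence model. The graph contents, node naming scheme, and `random_walk` helper are hypothetical.

```python
import random


def random_walk(graph, start, length, rng):
    """Sample a walk of up to `length` nodes by repeatedly picking a uniform neighbor."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop early
        walk.append(rng.choice(neighbors))
    return walk


# Toy heterogeneous graph: users follow each other and are linked to questions they asked.
graph = {
    "user:alice": ["q:1", "user:bob"],
    "user:bob": ["q:2", "user:alice"],
    "q:1": ["user:alice"],
    "q:2": ["user:bob"],
}

rng = random.Random(0)  # fixed seed so runs are repeatable
walks = [random_walk(graph, node, 5, rng) for node in graph]
```

Each walk interleaves user and question nodes, so nodes that co-occur in walks share both content and social neighborhood.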

Keywords: CQA, Question retrieval, Deep learning, Social network
