Shiguang Shan (山世光)
Ph.D., Professor, Doctoral Supervisor
Institute of Computing Technology, Chinese Academy of Sciences; CAS Key Laboratory of Intelligent Information Processing
Computer vision and machine learning theory, methods, and key technologies, with face recognition as a typical case; vision-based affective computing; cognitive neuroscience and brain science
Profile
- Name: Shiguang Shan
- Current status: Active researcher
- Supervisory role: Doctoral supervisor
- Degree: Ph.D.
- Academic title: Doctoral supervisor
- Professional rank: Senior (Professor)
- Discipline: Pattern Recognition
- Research interests: Computer vision and machine learning theory, methods, and key technologies, with face recognition as a typical case; vision-based affective computing; cognitive neuroscience and brain science
Shiguang Shan is a professor and doctoral supervisor at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), and currently serves as Executive Deputy Director of the CAS Key Laboratory of Intelligent Information Processing.
Education and appointments:
- 1993.08–1997.08: B.S., Harbin Institute of Technology
- 1997.09–1999.07: M.S., Harbin Institute of Technology
- 1999.09–2004.07: Ph.D., Institute of Computing Technology, Chinese Academy of Sciences
- 2013.12–2015.01: Visiting Scholar, Carnegie Mellon University
- 2010.10–present: Professor, Institute of Computing Technology, Chinese Academy of Sciences
- 2011.11–present: Deputy Director, CAS Key Laboratory of Intelligent Information Processing
- 2013.03–present: Executive Deputy Director, CAS Key Laboratory of Intelligent Information Processing
His research covers computer vision and machine learning. He has published over 300 papers in journals and at conferences in China and abroad, including more than 90 CCF Class A papers, with over 20,000 Google Scholar citations. His face recognition research won the 2005 National Science and Technology Progress Award, Second Class (3rd contributor); his work on high-dimensional, nonlinear visual pattern analysis won the 2015 National Natural Science Award, Second Class (2nd contributor); and his work on visual manifold modeling and learning received the CVPR 2008 Best Student Poster Award Runner-up. Face recognition technology developed by his team has been deployed in products and systems of public security agencies, Huawei, and many others, delivering substantial economic and social benefits. He has served as Area Chair for more than ten major international conferences, including ICCV 2011, ACCV 2012/2016/2018, ICPR 2012/2014/2020, FG 2013/2018/2020, ICASSP 2014, BTAS 2018, AAAI 2020/2021, IJCAI 2021, and CVPR 2019/2020/2021, and serves or has served as Associate Editor (AE) of IEEE TIP, CVIU, PRL, Neurocomputing, and FCS. He is a recipient of the NSFC Excellent Young Scientists Fund, an awardee of a major national talent program, a recipient of the CCF Young Scientist Award, a Beijing Nova Program awardee, and an outstanding member of the Youth Innovation Promotion Association of CAS.
His research interests center on computer vision and machine learning theory, methods, and key technologies, with face recognition as a typical case; he has over 20 years of research experience in face recognition. In recent years he has focused on vision-based deep affective computing, such as remote, contactless physiological signal estimation, psychological state estimation, and mental state assessment. On the theory and algorithm side, he and his team have extensive experience in machine learning, especially deep learning, with particular attention to knowledge-augmented machine learning theory and methods under "X-data" conditions, where X-data includes small data, unlabeled data, semi-supervised data, weakly supervised data, noisy data, augmented data, and so on.
He is a co-founder of the Vision And Learning SEminar (VALSE), the first rotating chair of the VALSE steering committee, and a co-founder and first online-committee chair of the VALSE Webinar series. VALSE 2019 (Hefei) attracted more than 5,000 attendees, and VALSE Webinar sessions have peaked at 1,800 participants, making VALSE one of the most influential academic event series in computer vision in China.
As a personal interest, he follows progress in cognitive neuroscience and brain science closely, and enjoys pondering and discussing the essence of biological vision and the inspirations that brain science brings to visual computing.
[Journal Article] A comparative study on illumination preprocessing in face recognition
Pattern Recognition,2013,46(6):1691-1699
June 1, 2013
Illumination preprocessing is an effective and efficient approach to handling lighting variations in face recognition. Despite much attention to face illumination preprocessing, there has seldom been a systematic comparative study of existing approaches that yields insights and conclusions on how to design better illumination preprocessing methods. To fill this vacancy, we provide a comparative study of 12 representative illumination preprocessing methods (HE, LT, GIC, DGD, LoG, SSR, GHP, SQI, LDCT, LTV, LN and TT) from two novel perspectives: (1) localization for holistic approaches and (2) integration of large-scale and small-scale feature bands. Experiments on public face databases with illumination variations (YaleBExt, CMU-PIE, CAS-PEAL and FRGC V2.0) suggest that localization further improves the performance of holistic illumination preprocessing methods (HE, GIC, LTV and TT). Integration of large-scale and small-scale feature bands is also found helpful for illumination-insensitive face recognition with reflectance-field-estimation-based illumination preprocessing approaches (SSR, GHP, SQI, LDCT, LTV and TT).
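Two of the holistic methods the study compares, histogram equalization (HE) and gamma intensity correction (GIC), are simple enough to sketch directly. The following is a minimal numpy illustration, not code from the paper; function names and the gamma value are our own choices.

```python
import numpy as np

def histogram_equalization(img):
    """HE: flatten the intensity histogram of an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    span = cdf.max() - cdf.min()
    cdf = (cdf - cdf.min()) / (span if span else 1.0)  # normalize to [0, 1]
    return (cdf[img] * 255).astype(np.uint8)

def gamma_intensity_correction(img, gamma=0.4):
    """GIC: power-law transform; gamma < 1 lifts dark (shadowed) regions."""
    norm = img.astype(np.float64) / 255.0
    return (np.power(norm, gamma) * 255).astype(np.uint8)
```

The "localization" perspective in the paper amounts to applying such holistic transforms to local blocks of the face image rather than to the whole image at once.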
[Journal Article] Adaptive discriminant learning for face recognition
Pattern Recognition,2013,46(9):2497-2509
September 1, 2013
Face recognition from a Single Sample per Person (SSPP) is extremely challenging because only one sample is available for each person. While many discriminant analysis methods, such as Fisherfaces and its numerous variants, have achieved great success in face recognition, these methods cannot work in this scenario, because more than one sample per person is needed to calculate the within-class scatter matrix. To address this problem, we propose Adaptive Discriminant Analysis (ADA), in which the within-class scatter matrix of each enrolled subject is inferred from his/her single sample by leveraging a generic set with multiple samples per person. Our method is motivated by the assumption that subjects who look alike generally share similar within-class variations. In ADA, a limited number of neighbors for each single sample are first determined from the generic set by using kNN regression or Lasso regression. Then, the within-class scatter matrix of this single sample is inferred as the weighted average of the within-class scatter matrices of these neighbors based on the arithmetic mean or Riemannian mean. Finally, the optimal ADA projection directions can be computed analytically by using the inferred within-class scatter matrices and the actual between-class scatter matrix. The proposed method is evaluated on three databases: the FERET database, the FRGC database, and a large real-world passport-like face database. The extensive results demonstrate the effectiveness of our ADA compared with existing solutions to the SSPP problem.
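The core inference step described above can be sketched in a few lines of numpy. This is a simplified illustration under our own assumptions (kNN by distance to each generic subject's mean face, uniform weights, arithmetic mean); the paper additionally supports Lasso-weighted neighbors and the Riemannian mean.

```python
import numpy as np

def within_scatter(samples):
    """Within-class scatter matrix of one subject's samples (rows = samples)."""
    centered = samples - samples.mean(axis=0)
    return centered.T @ centered

def ada_infer_scatter(single_sample, generic_sets, k=3):
    """Infer a within-class scatter for a single enrolled sample as the
    average of the scatters of its k nearest generic subjects, where
    nearness is the distance to each subject's mean face."""
    means = np.stack([s.mean(axis=0) for s in generic_sets])
    dists = np.linalg.norm(means - single_sample, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.mean([within_scatter(generic_sets[i]) for i in nearest], axis=0)
```

The inferred scatter then plugs into the usual Fisher criterion in place of the (uncomputable) true within-class scatter of the single sample.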
[Journal Article] CovGa: A novel descriptor based on symmetry of regions for head pose estimation
Neurocomputing,2014,143():97-108
November 2, 2014
This paper proposes a novel method to estimate head yaw rotation using the symmetry of regions. We argue that the symmetry of 2D regions located in the same horizontal row is more intrinsically relevant to the yaw rotation of the head than the symmetry of 1D signals, while at the same time being insensitive to the identity of the face. Specifically, the proposed method relies on the effective combination of Gabor filters and covariance descriptors. We first extract the multi-scale and multi-orientation Gabor representations of the input face image, and then use covariance descriptors to compute the symmetry between two regions in terms of Gabor representations under the same scale and orientation. Since the covariance matrix can alleviate the influence caused by rotations and illumination, the proposed method is robust to such variations. In addition, the proposed method is further improved by combining it with a metric learning method named KISS MEtric learning (KISSME). Experiments on four challenging databases demonstrated that the proposed method outperformed the state of the art.
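The covariance-descriptor building block is easy to illustrate. The sketch below, not taken from the paper, computes a region covariance over per-pixel feature vectors (which would be Gabor responses in CovGa) and compares two such matrices with a log-Euclidean distance, one standard way to respect the geometry of SPD matrices.

```python
import numpy as np

def region_covariance(features):
    """Covariance descriptor of a region; `features` is (n_pixels, d),
    one feature vector (e.g., Gabor responses) per pixel."""
    c = np.cov(features, rowvar=False)
    return c + 1e-6 * np.eye(c.shape[0])  # regularize to keep it SPD

def log_euclidean_distance(c1, c2):
    """Distance between two SPD matrices via the matrix logarithm."""
    def logm(c):
        w, v = np.linalg.eigh(c)
        return (v * np.log(w)) @ v.T
    return np.linalg.norm(logm(c1) - logm(c2), ord="fro")
```

In CovGa, a small distance between the descriptors of two horizontally mirrored regions signals high symmetry, which is the cue tied to yaw.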
Keywords: Head pose estimation, Covariance descriptors, Gabor filters, Symmetry
[Journal Article] Data-driven hair segmentation with isomorphic manifold inference
Image and Vision Computing,2014,32(10):739-750
October 1, 2014
Hair segmentation is challenging due to diverse appearance, irregular region boundaries, and the influence of complex backgrounds. To deal with this problem, we propose a novel data-driven method, named Isomorphic Manifold Inference (IMI). The IMI method treats the coarse probability map and the binary segmentation map as a couple of isomorphic manifolds and tries to learn hair-specific priors from manually labeled training images. For an input image, the method first calculates a coarse probability map. It then exploits regression techniques to obtain the relationship between the coarse probability map of the test image and those of the training images. Finally, this relationship, i.e., a coefficient set, is transferred to the binary segmentation maps, and a soft segmentation of the test image is obtained as a linear combination of those binary maps. Further, we employ this soft segmentation as a shape cue and integrate it with color and texture cues into a unified segmentation framework. A better segmentation is achieved by Graph Cuts optimization. Extensive experiments are conducted to validate the effectiveness of the IMI method, compare the contributions of different cues, and investigate the generalization of the IMI method. The results strongly support our method.
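The coefficient-transfer idea can be sketched with plain least squares. This is a minimal stand-in for the regression step (the paper's actual regressors may differ), assuming probability maps and binary masks are given as arrays of the same shape.

```python
import numpy as np

def imi_soft_segmentation(test_prob, train_probs, train_masks):
    """Express the test probability map as a least-squares combination of
    training probability maps, then transfer the same coefficients to the
    training binary masks -- the isomorphic-manifold assumption."""
    A = train_probs.reshape(len(train_probs), -1).T      # (pixels, n_train)
    b = test_prob.ravel()
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    soft = train_masks.reshape(len(train_masks), -1).T @ coeffs
    return np.clip(soft.reshape(test_prob.shape), 0.0, 1.0)
```

The resulting soft map then serves as the shape cue fed into the Graph Cuts stage.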
Keywords: Hair segmentation, Data driven, Shape model, Isomorphic manifold inference
[Journal Article] Maximal Likelihood Correspondence Estimation for Face Recognition Across Pose
IEEE Transactions on Image Processing,2014,23(10):4587 - 460
August 22, 2014
Due to the misalignment of image features, the performance of many conventional face recognition methods degrades considerably in the across-pose scenario. To address this problem, many image matching-based methods have been proposed to estimate semantic correspondence between faces in different poses. In this paper, we aim to solve two critical problems of previous image matching-based correspondence learning methods: 1) the failure to fully exploit face-specific structure information in correspondence estimation and 2) the failure to learn a personalized correspondence for each probe image. To this end, we first build a model, termed morphable displacement field (MDF), to encode face-specific structure information of semantic correspondence from a set of real samples of correspondences calculated from 3D face models. Then, we propose a maximal likelihood correspondence estimation (MLCE) method to learn personalized correspondence based on the maximal likelihood frontal face assumption. After obtaining the semantic correspondence encoded in the learned displacement, we can synthesize virtual frontal images of the profile faces for subsequent recognition. Using the linear discriminant analysis method with pixel-intensity features, state-of-the-art performance is achieved on three multipose benchmarks, i.e., the CMU-PIE, FERET, and MultiPIE databases. Owing to the rational MDF regularization and the use of a novel maximal likelihood objective, the proposed MLCE method can reliably learn correspondence between faces in different poses even in complex wild environments, i.e., the Labeled Faces in the Wild database.
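To make the displacement-field machinery concrete, here is a toy numpy sketch, not the paper's implementation: a "morphable" field as a convex combination of exemplar fields (which the paper derives from 3D face models; here they are arbitrary arrays), applied to an image with nearest-neighbor warping.

```python
import numpy as np

def morphable_field(basis_fields, alpha):
    """Convex combination of exemplar displacement fields.
    basis_fields: (n, h, w, 2) array of (dy, dx) fields; alpha: n weights."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / alpha.sum()
    return np.tensordot(alpha, basis_fields, axes=1)

def apply_displacement(img, field):
    """Warp a grayscale image by a per-pixel (dy, dx) field,
    nearest-neighbor sampling with border clamping."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + field[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + field[..., 1]).astype(int), 0, w - 1)
    return img[src_y, src_x]
```

MLCE's contribution is choosing the combination weights per probe image so that the warped result is maximally likely under a frontal-face model.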
[Journal Article] Domain Adaptation for Face Recognition: Targetize Source Domain Bridged by Common Subspace
International Journal of Computer Vision ,2013,109():pages94–10
December 31, 2013
In many applications, a face recognition model learned on a source domain but applied to a novel target domain degrades, sometimes significantly, due to the mismatch between the two domains. Aiming to learn a better face recognition model for the target domain, this paper proposes a simple but effective domain adaptation approach that transfers supervision knowledge from a labeled source domain to the unlabeled target domain. Our basic idea is to convert the source domain images to the target domain (termed "targetizing" the source domain hereinafter) while keeping their supervision information. For this purpose, each source domain image is simply represented as a linear combination of sparse target domain neighbors in the image space, with the combination coefficients, however, learnt in a common subspace. The principle behind this strategy is that the common knowledge is only favorable for accurate cross-domain reconstruction, but for classification in the target domain, the specific knowledge of the target domain is also essential and thus should be mostly preserved (through targetization in the image space in this work). To discover the common knowledge, specifically, a common subspace is learnt in which the structures of both domains are preserved while the disparity between the source and target domains is reduced. The proposed method is extensively evaluated under three face recognition scenarios, i.e., domain adaptation across view angle, across ethnicity, and across imaging condition. The experimental results illustrate the superiority of our method over competitive alternatives.
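The "targetize" step, reconstructing a labeled source image from target-domain neighbors, can be sketched as below. This simplified version solves the coefficients by least squares directly in the image space; the paper's key refinement is learning them in a common subspace instead.

```python
import numpy as np

def targetize(source_sample, target_samples, k=5):
    """Reconstruct one labeled source image as a linear combination of its
    k nearest unlabeled target-domain samples (rows of target_samples).
    Returns the targetized image, the neighbor indices, and coefficients."""
    dists = np.linalg.norm(target_samples - source_sample, axis=1)
    nn = np.argsort(dists)[:k]
    A = target_samples[nn].T                        # (dim, k)
    coeffs, *_ = np.linalg.lstsq(A, source_sample, rcond=None)
    return A @ coeffs, nn, coeffs
```

The targetized images keep their source labels, so a classifier trained on them sees target-style appearance with source supervision.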
Pattern Recognition,2015,48(10):3113-3124
October 1, 2015
Face recognition on large-scale video in the wild is becoming increasingly important due to the ubiquity of video data captured by surveillance cameras, handheld devices, Internet uploads, and other sources. By treating each video as one image set, set-based methods have recently achieved great success in the field of video-based face recognition. In the wild, videos often contain extremely complex data variations and thus pose a big challenge for the set modeling of set-based methods. In this paper, we propose a novel Hybrid Euclidean-and-Riemannian Metric Learning (HERML) method to fuse multiple statistics of an image set. Specifically, we represent each image set simultaneously by its mean, covariance matrix, and Gaussian distribution, which generally complement each other for set modeling. However, it is not trivial to fuse them, since the mean, covariance matrix, and Gaussian model typically lie in multiple heterogeneous spaces equipped with Euclidean or Riemannian metrics. Therefore, we first implicitly map the original statistics into high-dimensional Hilbert spaces by exploiting Euclidean and Riemannian kernels. With a LogDet divergence-based objective function, the hybrid kernels are then fused by our hybrid metric learning framework, which can efficiently perform the fusion procedure on large-scale videos. The proposed method is evaluated on four public and challenging large-scale video face datasets. Extensive experimental results demonstrate that our method has a clear superiority over state-of-the-art set-based methods for large-scale video-based face recognition.
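The heterogeneous-kernel idea can be illustrated with a toy fusion: an RBF kernel on set means (Euclidean) plus a log-Euclidean kernel on set covariances (Riemannian). This sketch uses fixed fusion weights; HERML instead learns the fusion with its LogDet-divergence objective. Function names and the gamma values are our own.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Euclidean (RBF) kernel on set means."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def log_euclidean_kernel(c1, c2, gamma=1.0):
    """Riemannian kernel on covariance matrices via the log-Euclidean map."""
    def logm(c):
        w, v = np.linalg.eigh(c)
        return (v * np.log(w)) @ v.T
    return np.exp(-gamma * np.linalg.norm(logm(c1) - logm(c2), "fro") ** 2)

def fused_similarity(set_a, set_b, weights=(0.5, 0.5)):
    """Fixed-weight fusion of the two kernels over two image sets
    (rows = frame features)."""
    mu_a, mu_b = set_a.mean(axis=0), set_b.mean(axis=0)
    eps = 1e-6
    c_a = np.cov(set_a, rowvar=False) + eps * np.eye(set_a.shape[1])
    c_b = np.cov(set_b, rowvar=False) + eps * np.eye(set_b.shape[1])
    return (weights[0] * rbf_kernel(mu_a, mu_b)
            + weights[1] * log_euclidean_kernel(c_a, c_b))
```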
Keywords: Face recognition, Large-scale video, Multiple heterogeneous statistics, Hybrid Euclidean-and-Riemannian metric learning
[Journal Article] Learning prototypes and similes on Grassmann manifold for spontaneous expression recognition
Computer Vision and Image Understanding,2016,147():95-101
June 1, 2016
Video-based spontaneous expression recognition is a challenging task due to large inter-personal variations in both the expressing manners and the executing rates of the same expression category. One key issue is to explore a robust representation method that can effectively capture the facial variations while alleviating the influence of personality. In this paper, we propose to learn a set of typical patterns that are commonly shared by different subjects when performing expressions, namely "prototypes". Specifically, we first apply a statistical model (i.e., a linear subspace) on facial regions to generate the specific expression patterns for each video. A clustering algorithm is then employed on all these expression patterns, and the cluster means are regarded as the "prototypes". Accordingly, we further design "simile" features that measure the similarities of person-specific patterns to the learned "prototypes". Both techniques are conducted on the Grassmann manifold, which enriches the feature encoding and better reveals the data structure by introducing intrinsic geodesics. Extensive experiments are conducted on both posed and spontaneous expression databases. All results show that our method outperforms the state of the art and also possesses good transferability in cross-database scenarios.
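The two Grassmann-manifold primitives the method builds on, representing a video as a linear subspace and comparing subspaces via principal angles, look roughly like this. A minimal sketch under our own naming; the projection metric used here is one standard Grassmann distance, not necessarily the paper's exact choice.

```python
import numpy as np

def video_subspace(frames, dim=3):
    """Represent a set of frame features (rows) by an orthonormal basis of
    its top principal directions: a point on the Grassmann manifold."""
    u, _, _ = np.linalg.svd(frames.T, full_matrices=False)
    return u[:, :dim]

def grassmann_distance(u1, u2):
    """Projection-metric distance between two subspaces, from the
    principal angles (singular values of U1^T U2)."""
    s = np.clip(np.linalg.svd(u1.T @ u2, compute_uv=False), -1.0, 1.0)
    return np.sqrt(max(u1.shape[1] - np.sum(s ** 2), 0.0))
```

Clustering such subspace points yields the "prototypes", and a video's "simile" feature is its vector of similarities to them.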
Keywords: Expression prototype, Simile representation, Grassmann manifold, Spontaneous expression recognition
[Journal Article] Learning Expressionlets via Universal Manifold Model for Dynamic Facial Expression Recognition
IEEE Transactions on Image Processing,2016,25(12): 5920 - 59
October 5, 2016
Facial expression is a temporally dynamic event which can be decomposed into a set of muscle motions occurring in different facial regions over various time intervals. For dynamic expression recognition, two key issues, temporal alignment and semantics-aware dynamic representation, must be taken into account. In this paper, we attempt to solve both problems via manifold modeling of videos based on a novel mid-level representation, i.e., expressionlet. Specifically, our method contains three key stages: 1) each expression video clip is characterized as a spatial-temporal manifold (STM) formed by dense low-level features; 2) a universal manifold model (UMM) is learned over all low-level features and represented as a set of local modes to statistically unify all the STMs; and 3) the local modes on each STM can be instantiated by fitting to the UMM, and the corresponding expressionlet is constructed by modeling the variations in each local mode. With the above strategy, expression videos are naturally aligned both spatially and temporally. To enhance the discriminative power, the expressionlet-based STM representation is further processed with discriminant embedding. Our method is evaluated on four public expression databases, CK+, MMI, Oulu-CASIA, and FERA. In all cases, our method outperforms the known state of the art by a large margin.
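Stages 2) and 3) can be caricatured with a tiny k-means standing in for the UMM. This is a loose illustration under our own simplifications (the actual UMM is a richer statistical model and the low-level features are dense spatio-temporal descriptors, not raw vectors).

```python
import numpy as np

def learn_umm_modes(all_features, k=4, iters=20, seed=0):
    """Toy UMM: k-means centers over the pooled low-level features of all
    videos serve as the shared 'local modes'."""
    rng = np.random.default_rng(seed)
    centers = all_features[rng.choice(len(all_features), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((all_features[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = all_features[labels == j].mean(axis=0)
    return centers

def expressionlet(video_features, centers):
    """Instantiate the modes on one video: the per-mode covariance of the
    features assigned to each mode models its local variation."""
    labels = np.argmin(((video_features[:, None] - centers) ** 2).sum(-1), axis=1)
    d = video_features.shape[1]
    blocks = []
    for j in range(len(centers)):
        pts = video_features[labels == j]
        blocks.append(np.cov(pts, rowvar=False) if len(pts) > 1 else np.zeros((d, d)))
    return np.stack(blocks)
```

Because every video is described against the same shared modes, the representation is aligned across videos both spatially and temporally, which is the point of the UMM.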
[Journal Article] Spatial Pyramid Covariance-Based Compact Video Code for Robust Face Retrieval in TV-Series
IEEE Transactions on Image Processing ,2016,25(12): 5905 - 59
October 10, 2016
We address the problem of face video retrieval in TV-series, which searches video clips for the presence of a specific character, given one of his/her face tracks. This is tremendously challenging because, on the one hand, faces in TV-series are captured in largely uncontrolled conditions with complex appearance variations, and on the other hand, the retrieval task typically needs an efficient representation with low time and space complexity. To handle this problem, we propose a compact and discriminative representation for the huge body of video data, named compact video code (CVC). Our method first models a face track by its sample (i.e., frame) covariance matrix to capture the video data variations in a statistical manner. To incorporate discriminative information and obtain a more compact video signature suitable for retrieval, the high-dimensional covariance representation is further encoded as a much lower-dimensional binary vector, which finally yields the proposed CVC. Specifically, each bit of the code, i.e., each dimension of the binary vector, is produced via supervised learning in a max-margin framework, which aims to balance the discriminability and stability of the code. Besides, we extend the descriptive granularity of the covariance matrix from the traditional pixel level to the more general patch level, and propose a novel hierarchical video representation named spatial pyramid covariance along with a fast calculation method. Face retrieval experiments on two challenging TV-series video databases, i.e., the Big Bang Theory and Prison Break, demonstrate the competitiveness of the proposed CVC over state-of-the-art retrieval methods. In addition, as a general video matching algorithm, CVC is also evaluated on a traditional video face recognition task on a standard Internet database, i.e., YouTube Celebrities, showing quite promising performance with an extremely compact code of only 128 bits.
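The encode-then-hash pipeline can be sketched as follows. This is an illustrative hashing stand-in under our own assumptions: the hyperplanes here are given at random, whereas the paper learns each bit's hyperplane in a max-margin framework.

```python
import numpy as np

def compact_video_code(cov, projections):
    """Binarize a face-track covariance descriptor: log-map the SPD matrix,
    vectorize its upper triangle, then threshold linear projections
    (one bit per hyperplane)."""
    w, v = np.linalg.eigh(cov)
    log_cov = (v * np.log(w)) @ v.T            # matrix logarithm
    feat = log_cov[np.triu_indices(cov.shape[0])]
    return (projections @ feat > 0).astype(np.uint8)

def hamming_distance(a, b):
    """Retrieval ranks tracks by Hamming distance between codes."""
    return int(np.sum(a != b))
```

With 128 projections, each track collapses to 128 bits, which is what makes exhaustive search over large video corpora cheap.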