Kai Yu (俞凯)
Ph.D., Professor, Doctoral Supervisor
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Long engaged in the research and industrialization of artificial intelligence, intelligent speech and language processing, and machine learning.
Profile
- Name: Kai Yu (俞凯)
- Current status: Active researcher
- Supervisory role: Doctoral supervisor
- Degree: Ph.D.
- Academic title: Doctoral supervisor
- Rank: Professor (senior)
- Discipline: Artificial intelligence
- Research interests: Research and industrialization of artificial intelligence, intelligent speech and language processing, and machine learning.
Kai Yu is Chief Scientist of AISpeech (苏州思必驰信息科技有限公司), a research professor in the Department of Computer Science and Engineering at Shanghai Jiao Tong University, and Executive Director of the SJTU Suzhou Institute of Artificial Intelligence.

He received his bachelor's and master's degrees from the Department of Automation at Tsinghua University and his Ph.D. from the Engineering Department of Cambridge University. He has long worked on the research and industrialization of artificial intelligence, intelligent speech and language processing, and machine learning. His research interests cover core areas of intelligent speech and language processing, including speech recognition, speech synthesis, spoken language understanding, dialogue systems, and cognitive human-machine interaction. He has published more than 120 papers in international journals and conferences and has received four international journal and conference best-paper awards, including the International Speech Communication Association (ISCA) 2008-2012 Computer Speech and Language Best Paper Award. He has been invited to serve as area chair for speech recognition and spoken dialogue systems at international conferences such as InterSpeech and EUSIPCO. Large-scale continuous speech recognition systems he built won first place in internal evaluations run by the U.S. National Institute of Standards and Technology (NIST) and the U.S. Department of Defense, and the cognitive statistical dialogue system he designed and implemented won first place in the controlled test of an international dialogue system challenge. In 2014 he received the Progress Award of the Wu Wenjun AI Science and Technology Award from the Chinese Association for Artificial Intelligence, and he was named a "2016 Scientific Chinese Person of the Year". He founded AISpeech to commercialize intelligent speech and dialogue technology. As a representative AI startup in China, AISpeech was listed among the "AI Key Players" in Goldman Sachs' 2016 global AI report and as a 2017 Gartner "Cool Vendor for AI".

He is an IEEE Senior Member and currently the only member of the IEEE Speech and Language Processing Technical Committee from a mainland-China university. He is also deputy director of the Speech Dialogue and Auditory Processing Task Group of the China Computer Federation, a member of the executive committee of the Speech, Language, Hearing and Music Branch of the Acoustical Society of China, head of the Academic and Intellectual Property Group of the China AI Industry Development Alliance, and deputy head of the Technical Working Group of the China Speech Industry Alliance.
[Journal Article] Kernel Nearest-Neighbor Algorithm
Neural Processing Letters, 2002, 15: 147–156
April 1, 2002
The ‘kernel approach’ has attracted great attention with the development of the support vector machine (SVM) and has been studied in a general way. It offers an alternative solution to increase the computational power of linear learning machines by mapping data into a high dimensional feature space. This approach is extended to the well-known nearest-neighbor algorithm in this paper. It can be realized by substituting a kernel distance metric for the original one in Hilbert space, and the corresponding algorithm is called the kernel nearest-neighbor algorithm. Three data sets, an artificial data set, the BUPA liver disorders database and the USPS database, were used for testing. The kernel nearest-neighbor algorithm was compared with the conventional nearest-neighbor algorithm and SVM. Experiments show that the kernel nearest-neighbor algorithm is more powerful than the conventional nearest-neighbor algorithm, and that it can compete with SVM.
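The kernel-distance substitution the abstract describes can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the RBF kernel choice, the `gamma` value, and the toy data below are all assumptions. The squared distance in the implicit feature space follows from expanding the inner product: ||φ(x) − φ(y)||² = K(x,x) − 2K(x,y) + K(y,y).

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def kernel_distance_sq(x, y, kernel):
    # Squared distance in the implicit feature space:
    # ||phi(x) - phi(y)||^2 = K(x,x) - 2*K(x,y) + K(y,y)
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

def kernel_nn_classify(query, data, labels, kernel=rbf_kernel):
    # 1-nearest-neighbour using the kernel-induced metric
    dists = [kernel_distance_sq(query, x, kernel) for x in data]
    return labels[dists.index(min(dists))]

# Made-up 2-D data: two clusters
data = [(0.0, 0.0), (0.1, 0.2), (3.0, 3.0), (2.9, 3.1)]
labels = ["A", "A", "B", "B"]
print(kernel_nn_classify((0.2, 0.1), data, labels))  # -> A
print(kernel_nn_classify((2.8, 3.0), data, labels))  # -> B
```

Any Mercer kernel can be dropped in for `rbf_kernel`; the classifier itself never touches the feature space explicitly.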
Cytometry, 2002, 48(4): 202–208
July 26, 2002
Background: Comparative genomic hybridization (CGH) is a relatively new molecular cytogenetic method that detects chromosomal imbalances. Automatic karyotyping is an important step in CGH analysis because the precise position of the chromosome abnormality must be located, and manual karyotyping is tedious and time-consuming. In the past, computer-aided karyotyping was done by using 4′,6-diamidino-2-phenylindole, dihydrochloride (DAPI)-inverse images, which required complex image enhancement procedures.
Methods: An innovative method, the kernel nearest-neighbor (K-NN) algorithm, is proposed to accomplish automatic karyotyping. The algorithm is an application of the "kernel approach," which offers an alternative solution to linear learning machines by mapping data into a high dimensional feature space. By implicitly calculating Euclidean or Mahalanobis distance in a high dimensional image feature space, two kinds of K-NN algorithms are obtained. New feature extraction methods concerning multicolor information in CGH images are used for the first time.
Results: Experimental results show that the feature extraction method using multicolor information in CGH images greatly improves the classification success rate. A high success rate of about 91.5% has been achieved, which shows that the K-NN classifier efficiently accomplishes automatic chromosome classification from relatively few samples.
Conclusions: The feature extraction method proposed here and K-NN classifiers offer a promising computerized intelligent system for automatic karyotyping of CGH human chromosomes.
[Journal Article] Discriminative cluster adaptive training
IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(5): 1694–170
August 21, 2006
Multiple-cluster schemes, such as cluster adaptive training (CAT) or eigenvoice systems, are a popular approach for rapid speaker and environment adaptation. Interpolation weights are used to transform a multiple-cluster, canonical, model to a standard hidden Markov model (HMM) set representative of an individual speaker or acoustic environment. Maximum likelihood training for CAT has previously been investigated. However, in state-of-the-art large vocabulary continuous speech recognition systems, discriminative training is commonly employed. This paper investigates applying discriminative training to multiple-cluster systems. In particular, minimum phone error (MPE) update formulae for CAT systems are derived. In order to use MPE in this case, modifications to the standard MPE smoothing function and the prior distribution associated with MPE training are required. A more complex adaptive training scheme combining both interpolation weights and linear transforms, a structured transform (ST), is also discussed within the MPE training framework. Discriminatively trained CAT and ST systems were evaluated on a state-of-the-art conversational telephone speech task. These multiple-cluster systems were found to outperform both standard and adaptively trained systems.
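The interpolation step at the heart of CAT can be sketched as follows. This is only an illustration of the weighted combination of cluster means (not the MPE update the paper derives); the cluster means and interpolation weights are invented values.

```python
def cat_adapted_mean(cluster_means, weights):
    # CAT: the speaker-specific mean of a Gaussian is an interpolation
    # of the canonical model's cluster means:
    #   mu_s[d] = sum_c  lambda_c * mu_c[d]
    dim = len(cluster_means[0])
    return [sum(w * mu[d] for w, mu in zip(weights, cluster_means))
            for d in range(dim)]

# Hypothetical two-cluster canonical model in 3-D, with made-up
# speaker-specific interpolation weights
clusters = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
weights = [0.25, 0.75]
print(cat_adapted_mean(clusters, weights))  # -> [2.5, 1.5, 0.5]
```

In a real system the weights are estimated per speaker (by ML, or discriminatively as in the paper), while the cluster means belong to the shared canonical model.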
[Journal Article] Bayesian Adaptive Inference and Adaptive Training
IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(6): 1932–194
July 23, 2007
Large-vocabulary speech recognition systems are often built using found data, such as broadcast news. In contrast to carefully collected data, found data normally contains multiple acoustic conditions, such as speaker or environmental noise. Adaptive training is a powerful approach to build systems on such data. Here, transforms are used to represent the different acoustic conditions, and then a canonical model is trained given this set of transforms. This paper describes a Bayesian framework for adaptive training and inference. This framework addresses some limitations of standard maximum-likelihood approaches. In contrast to the standard approach, the adaptively trained system can be directly used in unsupervised inference, rather than having to rely on initial hypotheses being present. In addition, for limited adaptation data, robust recognition performance can be obtained. The limited data problem often occurs in testing as there is no control over the amount of the adaptation data available. In contrast, for adaptive training, it is possible to control the system complexity to reflect the available data. Thus, the standard point estimates may be used. As the integral associated with Bayesian adaptive inference is intractable, various marginalization approximations are described, including a variational Bayes approximation. Both batch and incremental modes of adaptive inference are discussed. These approaches are applied to adaptive training of maximum-likelihood linear regression and evaluated on a large-vocabulary speech recognition task. Bayesian adaptive inference is shown to significantly outperform standard approaches.
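The contrast between point-estimate and Bayesian inference that the abstract draws can be shown with a toy discrete example. This is only a conceptual sketch under invented numbers: a small set of candidate transforms with hypothetical prior weights, rather than the continuous integral (and its variational approximation) used in the paper.

```python
def marginal_likelihood(obs_liks, prior):
    # Bayesian adaptive inference: rather than committing to one
    # point-estimated transform, marginalise the data likelihood over
    # candidate transforms, weighting each by its prior probability:
    #   p(O) = sum_t  p(O | t) * p(t)
    return sum(p * l for p, l in zip(prior, obs_liks))

# Hypothetical likelihoods of the adaptation data under three
# candidate transforms, with a made-up prior
liks = [0.02, 0.05, 0.01]
prior = [0.2, 0.5, 0.3]

point_estimate = max(liks)                    # ML point-estimate view
bayes = marginal_likelihood(liks, prior)      # marginalised view
print(point_estimate, bayes)
```

With little adaptation data the prior dominates and the marginalised score is robust; with ample data the posterior concentrates and the two views coincide, which matches the paper's motivation for using point estimates only in adaptive training, where data quantity can be controlled.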
[Journal Article] Unsupervised Adaptation With Discriminative Mapping Transforms
IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(4)
May 1, 2009
The most commonly used approaches to speaker adaptation are based on linear transforms, as these can be robustly estimated using limited adaptation data. Although significant gains can be obtained using discriminative criteria for training acoustic models, maximum-likelihood (ML) estimated transforms are still used for unsupervised adaptation. This is because discriminatively trained transforms are highly sensitive to errors in the adaptation supervision hypothesis. This paper describes a new framework for estimating transforms that are discriminative in nature, but are less sensitive to this hypothesis issue. A speaker-independent discriminative mapping transformation (DMT) is estimated during training. This transform is obtained after a speaker-specific ML-estimated transform of each training speaker has been applied. During recognition an ML speaker-specific transform is found for each test-set speaker and the speaker-independent DMT then applied. This allows a transform which is discriminative in nature to be indirectly estimated, while only requiring an ML speaker-specific transform to be found during recognition. The DMT technique is evaluated on an English conversational telephone speech task. Experiments showed that using DMT in unsupervised adaptation led to significant gains over both standard ML and discriminatively trained transforms.
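The recognition-time pipeline described above composes two affine transforms of the model means: first the ML speaker-specific transform, then the speaker-independent DMT on top. A minimal sketch, with all matrices and vectors being made-up 2-D examples rather than estimated transforms:

```python
def apply_affine(A, b, mu):
    # MLLR-style affine transform of a model mean: mu' = A @ mu + b
    return [sum(a * m for a, m in zip(row, mu)) + bi
            for row, bi in zip(A, b)]

# Hypothetical 2-D transforms: an ML speaker-specific transform
# estimated at test time, and a speaker-independent discriminative
# mapping transform (DMT) estimated once during training.
A_ml,  b_ml  = [[1.1, 0.0], [0.0, 0.9]], [0.2, -0.1]
A_dmt, b_dmt = [[1.0, 0.1], [0.0, 1.0]], [0.0, 0.05]

mu = [1.0, 2.0]
mu_ml  = apply_affine(A_ml, b_ml, mu)       # ML-adapted mean
mu_out = apply_affine(A_dmt, b_dmt, mu_ml)  # DMT applied on top
print(mu_ml, mu_out)
```

The point of the composition is that only the ML transform has to be estimated from the (error-prone) test hypotheses; the discriminative component is fixed and speaker independent.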
Computer Speech & Language, 2010, 24(2): 150–174
April 1, 2010
This paper explains how Partially Observable Markov Decision Processes (POMDPs) can provide a principled mathematical framework for modelling the inherent uncertainty in spoken dialogue systems. It briefly summarises the basic mathematics and explains why exact optimisation is intractable. It then describes in some detail a form of approximation called the Hidden Information State model which does scale and which can be used to build practical systems. A prototype HIS system for the tourist information domain is evaluated and compared with a baseline MDP system using both user simulations and a live user trial. The results give strong support to the central contention that the POMDP-based framework is both a tractable and powerful approach to building more robust spoken dialogue systems.
Keywords: Statistical dialogue systems, POMDP, Hidden Information State model
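The uncertainty modelling the abstract refers to rests on the POMDP belief update, which can be sketched directly. This is a generic textbook update on an invented two-state toy dialogue domain, not the Hidden Information State approximation itself (whose point is precisely to avoid enumerating states like this).

```python
def belief_update(belief, action, obs, T, O):
    # POMDP belief update:
    #   b'(s') ∝ O(obs | s', action) * sum_s T(s' | s, action) * b(s)
    new_b = {}
    for s2 in belief:
        predicted = sum(T[s][action][s2] * belief[s] for s in belief)
        new_b[s2] = O[s2][action][obs] * predicted
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}

# Toy domain: the user wants either item "a" or item "b"; the goal
# persists across turns and the recogniser observation is noisy.
states = ("want_a", "want_b")
T = {s: {"ask": {s2: 1.0 if s2 == s else 0.0 for s2 in states}}
     for s in states}
O = {"want_a": {"ask": {"say_a": 0.8, "say_b": 0.2}},
     "want_b": {"ask": {"say_a": 0.3, "say_b": 0.7}}}

b = {"want_a": 0.5, "want_b": 0.5}
b = belief_update(b, "ask", "say_a", T, O)
print(b)  # belief mass shifts toward want_a
```

Maintaining this distribution, instead of a single best hypothesis, is what lets a statistical dialogue system recover gracefully from recognition errors.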
[Journal Article] Unsupervised training and directed manual transcription for LVCSR
Speech Communication, 2010, 52(7-8): 652–663
August 1, 2010
A significant cost in obtaining acoustic training data is the generation of accurate transcriptions. When no transcription is available, unsupervised training techniques must be used. Furthermore, the use of discriminative training has become a standard feature of state-of-the-art large vocabulary continuous speech recognition (LVCSR) systems. In unsupervised training, unlabelled data are recognised using a seed model and the hypotheses from the recognition system are used as transcriptions for training. In contrast to maximum likelihood training, the performance of discriminative training is more sensitive to the quality of the transcriptions. One approach to deal with this issue is data selection, where only well recognised data are selected for training. More effectively, as the key contribution of this work, an active learning technique, directed manual transcription, can be used. Here a relatively small amount of poorly recognised data is manually transcribed to supplement the automatic transcriptions. Experiments show that using the data selection approach for discriminative training yields disappointing performance improvements on data that is mismatched to the training data type of the seed model. However, using the directed manual transcription approach can yield significant improvements in recognition accuracy on all types of data.
Keywords: Unsupervised training, Discriminative training, Automatic transcription, Data selection
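The selection step described above, splitting recogniser output into well-recognised data kept for training and poorly recognised data routed to manual transcription, can be sketched as a simple confidence threshold. The utterance records, confidence scores, and threshold below are invented for illustration.

```python
def select_for_training(utterances, threshold=0.8):
    # Directed manual transcription, schematically: utterances the seed
    # model recognises confidently keep their automatic transcriptions;
    # low-confidence utterances are queued for manual transcription.
    auto, manual = [], []
    for utt in utterances:
        if utt["confidence"] >= threshold:
            auto.append(utt["id"])
        else:
            manual.append(utt["id"])
    return auto, manual

# Hypothetical recogniser output with per-utterance confidence scores
utts = [{"id": "u1", "confidence": 0.95},
        {"id": "u2", "confidence": 0.55},
        {"id": "u3", "confidence": 0.86}]
print(select_for_training(utts))  # -> (['u1', 'u3'], ['u2'])
```

The paper's finding is that transcribing the `manual` bucket by hand, rather than simply discarding it, is what makes discriminative training effective on mismatched data.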
[Journal Article] Continuous F0 Modeling for HMM Based Statistical Parametric Speech Synthesis
IEEE Transactions on Audio, Speech, and Language Processing, 2010, 19(5): 1071–107
September 16, 2010
The modeling of fundamental frequency, or F0, in HMM-based speech synthesis is a critical factor in delivering speech which is both natural and accurately conveys all of the many nuances of the message. However, F0 modeling is difficult because F0 values are normally considered to depend on a binary voicing decision such that they are continuous in voiced regions and undefined in unvoiced regions. F0 is therefore a discontinuous function of time. Multi-space probability distribution HMM (MSDHMM) is a widely used solution to this problem. The MSDHMM essentially uses a joint distribution of discrete voicing labels and the discontinuous F0 observations. However, due to the discontinuity assumption, the MSDHMM provides a rather weak F0 trajectory model. In this paper, F0 is viewed as being a continuous function of time and this is achieved by assuming that F0 can be observed within unvoiced regions as well as voiced regions. This provides a continuous F0 data stream which can be modeled by standard HMMs. Voicing labels are modeled either implicitly or explicitly in order to perform voicing classification and a globally tied distribution (GTD) technique is used to achieve robust F0 estimation. Both objective measures and subjective listening tests demonstrate that continuous F0 modeling yields better synthesized F0 trajectories and significant improvements to the naturalness of synthesized speech compared to using the MSDHMM model.
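The core idea, treating F0 as observable even in unvoiced regions so a single continuous stream can be modelled by standard HMMs, is often realised by interpolating through the unvoiced gaps. The linear-interpolation scheme and the toy F0 track below are illustrative assumptions, not the paper's exact estimation method.

```python
def continuous_f0(f0, unvoiced=0.0):
    # Produce a continuous F0 track: frames marked unvoiced (0.0 here)
    # are filled by linear interpolation between the nearest voiced
    # frames; leading/trailing gaps copy the nearest voiced value.
    voiced = [i for i, v in enumerate(f0) if v != unvoiced]
    out = list(f0)
    for i, v in enumerate(f0):
        if v != unvoiced:
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:
            out[i] = f0[right]
        elif right is None:
            out[i] = f0[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = f0[left] + frac * (f0[right] - f0[left])
    return out

# Made-up F0 track in Hz; 0.0 marks unvoiced frames
track = [120.0, 0.0, 0.0, 150.0, 160.0, 0.0]
print(continuous_f0(track))  # unvoiced gaps filled smoothly
```

With the gaps filled, voicing becomes a separate label stream (modelled implicitly or explicitly, as the abstract notes) rather than a structural discontinuity in the F0 observations themselves.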
Speech Communication, 2011, 53(6): 914–923
July 1, 2011
To achieve natural high quality synthesized speech in HMM-based speech synthesis, the effective modelling of complex acoustic and linguistic contexts is critical. Traditional approaches use context-dependent HMMs with decision tree based parameter clustering to model the full combinatorial space of contexts. However, weak contexts, such as word-level emphasis in natural speech, are difficult to capture using this approach. Also, due to combinatorial explosion, incorporating new contexts within the traditional framework may easily lead to the problem of insufficient data coverage. To effectively model weak contexts and reduce the data sparsity problem, different types of contexts should be treated independently. Context adaptive training provides a structured framework for this whereby standard HMMs represent normal contexts and transforms represent the additional effects of weak contexts. In contrast to speaker adaptive training in speech recognition, separate decision trees have to be built for different types of context factors. This paper describes the general framework of context adaptive training and investigates three concrete forms: MLLR, CMLLR and CAT based systems. Experiments on a word-level emphasis synthesis task show that all context adaptive training approaches can outperform the standard full-context-dependent HMM approach. Among these, the MLLR based system achieved the best performance.
Keywords: HMM-based speech synthesis, Context adaptive training, Factorized decision tree, State clustering
[Journal Article] Deep feature for text-dependent speaker verification
Speech Communication, 2015, 73: 1–13
October 1, 2015
Recently deep learning has been successfully used in speech recognition; however, it has not been carefully explored and widely accepted for speaker verification. To incorporate deep learning into speaker verification, this paper proposes novel approaches of extracting and using features from deep learning models for text-dependent speaker verification. In contrast to traditional short-term spectral features, such as MFCC or PLP, in this paper outputs from the hidden layers of various deep models are employed as deep features for text-dependent speaker verification. Four types of deep models are investigated: deep Restricted Boltzmann Machines, speech-discriminant Deep Neural Network (DNN), speaker-discriminant DNN, and multi-task joint-learned DNN. Once deep features are extracted, they may be used within either the GMM-UBM framework or the identity vector (i-vector) framework. Joint linear discriminant analysis and probabilistic linear discriminant analysis are proposed as effective back-end classifiers for identity vector based deep features. These approaches were evaluated on the RSR2015 data corpus. Experiments showed that deep feature based methods can obtain significant performance improvements compared to the traditional baselines, whether they are directly applied in the GMM-UBM system or utilized as identity vectors. The EER of the best system using the proposed identity vector is 0.10%, only one fifteenth of that of the GMM-UBM baseline.
Keywords: Text-dependent speaker verification, Deep neural networks, Deep features, RSR2015
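The feature-extraction idea above, taking a hidden-layer activation vector in place of MFCC/PLP as the front-end for a GMM-UBM or i-vector back end, can be sketched with a tiny network. Everything below is a made-up illustration: a two-layer MLP with fixed toy weights, not one of the paper's trained models.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def forward(x, layers):
    # Run a small MLP and return the activations of every layer.
    acts, h = [], x
    for W, b in layers:
        h = relu([sum(w * xi for w, xi in zip(row, h)) + bi
                  for row, bi in zip(W, b)])
        acts.append(h)
    return acts

def deep_feature(x, layers, layer_index=-1):
    # "Deep feature": a hidden-layer activation vector, used as the
    # per-frame feature fed to a GMM-UBM or i-vector back end.
    return forward(x, layers)[layer_index]

# Hypothetical 2-layer network with fixed toy weights and biases
layers = [([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.0]),
          ([[1.0, 0.5], [-0.3, 0.8]], [0.2, 0.0])]
print(deep_feature([1.0, 2.0], layers))
```

In the paper the network is first trained on a discriminative task (phone, speaker, or both jointly); only then are its hidden activations reused as features, which is what distinguishes a deep feature from the raw spectral input.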