Kai Yu (俞凯)
Ph.D., Professor, Doctoral Supervisor
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Long engaged in the research and industrialization of artificial intelligence, intelligent speech and language processing, and machine learning.
Profile
- Name: Kai Yu (俞凯)
- Current status: Active researcher
- Supervisory role: Doctoral supervisor
- Degree: Ph.D.
- Academic title: Doctoral supervisor
- Rank: Professor (senior)
- Discipline: Artificial intelligence
- Research interests: Research and industrialization of artificial intelligence, intelligent speech and language processing, and machine learning.
Kai Yu is Chief Scientist of AISpeech (苏州思必驰信息科技有限公司), a research professor in the Department of Computer Science and Engineering at Shanghai Jiao Tong University, and Executive Director of the SJTU Suzhou Institute of Artificial Intelligence.

He received his bachelor's and master's degrees from the Department of Automation at Tsinghua University and his Ph.D. from the Engineering Department of Cambridge University. He has long worked on the research and industrialization of artificial intelligence, intelligent speech and language processing, and machine learning. His research interests cover core areas of intelligent speech and language processing, including speech recognition, speech synthesis, spoken language understanding, dialogue systems, and cognitive human-machine interaction. He has published more than 120 papers in international journals and conferences and has received four international journal and conference best-paper awards, including the International Speech Communication Association (ISCA) 2008-2012 Computer Speech and Language Best Paper Award. He has been invited to serve as area chair for speech recognition and spoken dialogue systems at international conferences such as InterSpeech and EUSIPCO. Large-scale continuous speech recognition systems he built won first place in internal evaluations run by the U.S. National Institute of Standards and Technology (NIST) and the U.S. Department of Defense, and the cognitive statistical dialogue system he designed and implemented won first place in the controlled test of an international dialogue system challenge. In 2014 he received the Progress Award of the Wu Wenjun AI Science and Technology Award from the Chinese Association for Artificial Intelligence, and he was named a "2016 Scientific Chinese Person of the Year". He founded AISpeech to commercialize intelligent speech and dialogue technology. As a representative AI startup in China, AISpeech was listed among the "AI Key Players" in Goldman Sachs' 2016 global AI report and as a 2017 Gartner "Cool Vendor for AI".

He is an IEEE Senior Member and currently the only member of the IEEE Speech and Language Processing Technical Committee from a mainland-China university. He is also deputy director of the Speech Dialogue and Auditory Processing Task Group of the China Computer Federation, a member of the executive committee of the Speech, Language, Hearing and Music Branch of the Acoustical Society of China, head of the Academic and Intellectual Property Group of the China AI Industry Development Alliance, and deputy head of the Technical Working Group of the China Speech Industry Alliance.
[Journal Article] Kernel Nearest-Neighbor Algorithm
Neural Processing Letters, 2002, 15: 147–156
April 1, 2002
The ‘kernel approach’ has attracted great attention with the development of the support vector machine (SVM) and has been studied in a general way. It offers an alternative solution to increase the computational power of linear learning machines by mapping data into a high dimensional feature space. This approach is extended to the well-known nearest-neighbor algorithm in this paper. It can be realized by substituting a kernel distance metric for the original one in Hilbert space, and the corresponding algorithm is called the kernel nearest-neighbor algorithm. Three data sets, an artificial data set, the BUPA liver disorders database and the USPS database, were used for testing. The kernel nearest-neighbor algorithm was compared with the conventional nearest-neighbor algorithm and SVM. Experiments show that the kernel nearest-neighbor algorithm is more powerful than the conventional nearest-neighbor algorithm, and that it can compete with SVM.
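The kernel-distance substitution the abstract describes can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the RBF kernel choice, the `gamma` value, and the toy data below are all assumptions. The squared distance in the implicit feature space follows from expanding the inner product: ||φ(x) − φ(y)||² = K(x,x) − 2K(x,y) + K(y,y).

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def kernel_distance_sq(x, y, kernel):
    # Squared distance in the implicit feature space:
    # ||phi(x) - phi(y)||^2 = K(x,x) - 2*K(x,y) + K(y,y)
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

def kernel_nn_classify(query, data, labels, kernel=rbf_kernel):
    # 1-nearest-neighbour using the kernel-induced metric
    dists = [kernel_distance_sq(query, x, kernel) for x in data]
    return labels[dists.index(min(dists))]

# Made-up 2-D data: two clusters
data = [(0.0, 0.0), (0.1, 0.2), (3.0, 3.0), (2.9, 3.1)]
labels = ["A", "A", "B", "B"]
print(kernel_nn_classify((0.2, 0.1), data, labels))  # -> A
print(kernel_nn_classify((2.8, 3.0), data, labels))  # -> B
```

Any Mercer kernel can be dropped in for `rbf_kernel`; the classifier itself never touches the feature space explicitly.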
Cytometry, 2002, 48(4): 202–208
July 26, 2002
Background: Comparative genomic hybridization (CGH) is a relatively new molecular cytogenetic method that detects chromosomal imbalances. Automatic karyotyping is an important step in CGH analysis because the precise position of the chromosome abnormality must be located, and manual karyotyping is tedious and time-consuming. In the past, computer-aided karyotyping was done by using 4′,6-diamidino-2-phenylindole, dihydrochloride (DAPI)-inverse images, which required complex image enhancement procedures.
Methods: An innovative method, the kernel nearest-neighbor (K-NN) algorithm, is proposed to accomplish automatic karyotyping. The algorithm is an application of the "kernel approach," which offers an alternative solution to linear learning machines by mapping data into a high dimensional feature space. By implicitly calculating Euclidean or Mahalanobis distance in a high dimensional image feature space, two kinds of K-NN algorithms are obtained. New feature extraction methods concerning multicolor information in CGH images are used for the first time.
Results: Experimental results show that the feature extraction method using multicolor information in CGH images greatly improves the classification success rate. A high success rate of about 91.5% has been achieved, which shows that the K-NN classifier efficiently accomplishes automatic chromosome classification from relatively few samples.
Conclusions: The feature extraction method proposed here and K-NN classifiers offer a promising computerized intelligent system for automatic karyotyping of CGH human chromosomes.
[Journal Article] Discriminative cluster adaptive training
IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(5): 1694–170
August 21, 2006
Multiple-cluster schemes, such as cluster adaptive training (CAT) or eigenvoice systems, are a popular approach for rapid speaker and environment adaptation. Interpolation weights are used to transform a multiple-cluster, canonical, model to a standard hidden Markov model (HMM) set representative of an individual speaker or acoustic environment. Maximum likelihood training for CAT has previously been investigated. However, in state-of-the-art large vocabulary continuous speech recognition systems, discriminative training is commonly employed. This paper investigates applying discriminative training to multiple-cluster systems. In particular, minimum phone error (MPE) update formulae for CAT systems are derived. In order to use MPE in this case, modifications to the standard MPE smoothing function and the prior distribution associated with MPE training are required. A more complex adaptive training scheme combining both interpolation weights and linear transforms, a structured transform (ST), is also discussed within the MPE training framework. Discriminatively trained CAT and ST systems were evaluated on a state-of-the-art conversational telephone speech task. These multiple-cluster systems were found to outperform both standard and adaptively trained systems.
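The interpolation step at the heart of CAT can be sketched as follows. This is only an illustration of the weighted combination of cluster means (not the MPE update the paper derives); the cluster means and interpolation weights are invented values.

```python
def cat_adapted_mean(cluster_means, weights):
    # CAT: the speaker-specific mean of a Gaussian is an interpolation
    # of the canonical model's cluster means:
    #   mu_s[d] = sum_c  lambda_c * mu_c[d]
    dim = len(cluster_means[0])
    return [sum(w * mu[d] for w, mu in zip(weights, cluster_means))
            for d in range(dim)]

# Hypothetical two-cluster canonical model in 3-D, with made-up
# speaker-specific interpolation weights
clusters = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
weights = [0.25, 0.75]
print(cat_adapted_mean(clusters, weights))  # -> [2.5, 1.5, 0.5]
```

In a real system the weights are estimated per speaker (by ML, or discriminatively as in the paper), while the cluster means belong to the shared canonical model.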
[Journal Article] Bayesian Adaptive Inference and Adaptive Training
IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(6): 1932–194
July 23, 2007
Large-vocabulary speech recognition systems are often built using found data, such as broadcast news. In contrast to carefully collected data, found data normally contains multiple acoustic conditions, such as speaker or environmental noise. Adaptive training is a powerful approach to build systems on such data. Here, transforms are used to represent the different acoustic conditions, and then a canonical model is trained given this set of transforms. This paper describes a Bayesian framework for adaptive training and inference. This framework addresses some limitations of standard maximum-likelihood approaches. In contrast to the standard approach, the adaptively trained system can be directly used in unsupervised inference, rather than having to rely on initial hypotheses being present. In addition, for limited adaptation data, robust recognition performance can be obtained. The limited data problem often occurs in testing as there is no control over the amount of the adaptation data available. In contrast, for adaptive training, it is possible to control the system complexity to reflect the available data. Thus, the standard point estimates may be used. As the integral associated with Bayesian adaptive inference is intractable, various marginalization approximations are described, including a variational Bayes approximation. Both batch and incremental modes of adaptive inference are discussed. These approaches are applied to adaptive training of maximum-likelihood linear regression and evaluated on a large-vocabulary speech recognition task. Bayesian adaptive inference is shown to significantly outperform standard approaches.
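The contrast between point-estimate and Bayesian inference that the abstract draws can be shown with a toy discrete example. This is only a conceptual sketch under invented numbers: a small set of candidate transforms with hypothetical prior weights, rather than the continuous integral (and its variational approximation) used in the paper.

```python
def marginal_likelihood(obs_liks, prior):
    # Bayesian adaptive inference: rather than committing to one
    # point-estimated transform, marginalise the data likelihood over
    # candidate transforms, weighting each by its prior probability:
    #   p(O) = sum_t  p(O | t) * p(t)
    return sum(p * l for p, l in zip(prior, obs_liks))

# Hypothetical likelihoods of the adaptation data under three
# candidate transforms, with a made-up prior
liks = [0.02, 0.05, 0.01]
prior = [0.2, 0.5, 0.3]

point_estimate = max(liks)                    # ML point-estimate view
bayes = marginal_likelihood(liks, prior)      # marginalised view
print(point_estimate, bayes)
```

With little adaptation data the prior dominates and the marginalised score is robust; with ample data the posterior concentrates and the two views coincide, which matches the paper's motivation for using point estimates only in adaptive training, where data quantity can be controlled.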
[Journal Article] Unsupervised Adaptation With Discriminative Mapping Transforms
IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(4)
May 1, 2009
The most commonly used approaches to speaker adaptation are based on linear transforms, as these can be robustly estimated using limited adaptation data. Although significant gains can be obtained using discriminative criteria for training acoustic models, maximum-likelihood (ML) estimated transforms are still used for unsupervised adaptation. This is because discriminatively trained transforms are highly sensitive to errors in the adaptation supervision hypothesis. This paper describes a new framework for estimating transforms that are discriminative in nature, but are less sensitive to this hypothesis issue. A speaker-independent discriminative mapping transformation (DMT) is estimated during training. This transform is obtained after a speaker-specific ML-estimated transform of each training speaker has been applied. During recognition an ML speaker-specific transform is found for each test-set speaker and the speaker-independent DMT then applied. This allows a transform which is discriminative in nature to be indirectly estimated, while only requiring an ML speaker-specific transform to be found during recognition. The DMT technique is evaluated on an English conversational telephone speech task. Experiments showed that using DMT in unsupervised adaptation led to significant gains over both standard ML and discriminatively trained transforms.
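The recognition-time pipeline described above composes two affine transforms of the model means: first the ML speaker-specific transform, then the speaker-independent DMT on top. A minimal sketch, with all matrices and vectors being made-up 2-D examples rather than estimated transforms:

```python
def apply_affine(A, b, mu):
    # MLLR-style affine transform of a model mean: mu' = A @ mu + b
    return [sum(a * m for a, m in zip(row, mu)) + bi
            for row, bi in zip(A, b)]

# Hypothetical 2-D transforms: an ML speaker-specific transform
# estimated at test time, and a speaker-independent discriminative
# mapping transform (DMT) estimated once during training.
A_ml,  b_ml  = [[1.1, 0.0], [0.0, 0.9]], [0.2, -0.1]
A_dmt, b_dmt = [[1.0, 0.1], [0.0, 1.0]], [0.0, 0.05]

mu = [1.0, 2.0]
mu_ml  = apply_affine(A_ml, b_ml, mu)       # ML-adapted mean
mu_out = apply_affine(A_dmt, b_dmt, mu_ml)  # DMT applied on top
print(mu_ml, mu_out)
```

The point of the composition is that only the ML transform has to be estimated from the (error-prone) test hypotheses; the discriminative component is fixed and speaker independent.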
Computer Speech & Language, 2010, 24(2): 150–174
April 1, 2010
This paper explains how Partially Observable Markov Decision Processes (POMDPs) can provide a principled mathematical framework for modelling the inherent uncertainty in spoken dialogue systems. It briefly summarises the basic mathematics and explains why exact optimisation is intractable. It then describes in some detail a form of approximation called the Hidden Information State model which does scale and which can be used to build practical systems. A prototype HIS system for the tourist information domain is evaluated and compared with a baseline MDP system using both user simulations and a live user trial. The results give strong support to the central contention that the POMDP-based framework is both a tractable and powerful approach to building more robust spoken dialogue systems.
Keywords: Statistical dialogue systems, POMDP, Hidden Information State model
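The uncertainty modelling the abstract refers to rests on the POMDP belief update, which can be sketched directly. This is a generic textbook update on an invented two-state toy dialogue domain, not the Hidden Information State approximation itself (whose point is precisely to avoid enumerating states like this).

```python
def belief_update(belief, action, obs, T, O):
    # POMDP belief update:
    #   b'(s') ∝ O(obs | s', action) * sum_s T(s' | s, action) * b(s)
    new_b = {}
    for s2 in belief:
        predicted = sum(T[s][action][s2] * belief[s] for s in belief)
        new_b[s2] = O[s2][action][obs] * predicted
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}

# Toy domain: the user wants either item "a" or item "b"; the goal
# persists across turns and the recogniser observation is noisy.
states = ("want_a", "want_b")
T = {s: {"ask": {s2: 1.0 if s2 == s else 0.0 for s2 in states}}
     for s in states}
O = {"want_a": {"ask": {"say_a": 0.8, "say_b": 0.2}},
     "want_b": {"ask": {"say_a": 0.3, "say_b": 0.7}}}

b = {"want_a": 0.5, "want_b": 0.5}
b = belief_update(b, "ask", "say_a", T, O)
print(b)  # belief mass shifts toward want_a
```

Maintaining this distribution, instead of a single best hypothesis, is what lets a statistical dialogue system recover gracefully from recognition errors.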
[Journal Article] Unsupervised training and directed manual transcription for LVCSR
Speech Communication, 2010, 52(7-8): 652–663
August 1, 2010
A significant cost in obtaining acoustic training data is the generation of accurate transcriptions. When no transcription is available, unsupervised training techniques must be used. Furthermore, the use of discriminative training has become a standard feature of state-of-the-art large vocabulary continuous speech recognition (LVCSR) systems. In unsupervised training, unlabelled data are recognised using a seed model and the hypotheses from the recognition system are used as transcriptions for training. In contrast to maximum likelihood training, the performance of discriminative training is more sensitive to the quality of the transcriptions. One approach to deal with this issue is data selection, where only well recognised data are selected for training. More effectively, as the key contribution of this work, an active learning technique, directed manual transcription, can be used. Here a relatively small amount of poorly recognised data is manually transcribed to supplement the automatic transcriptions. Experiments show that using the data selection approach for discriminative training yields disappointing performance improvements on data that is mismatched to the training data type of the seed model. However, using the directed manual transcription approach can yield significant improvements in recognition accuracy on all types of data.
Keywords: Unsupervised training, Discriminative training, Automatic transcription, Data selection
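The selection step described above, splitting recogniser output into well-recognised data kept for training and poorly recognised data routed to manual transcription, can be sketched as a simple confidence threshold. The utterance records, confidence scores, and threshold below are invented for illustration.

```python
def select_for_training(utterances, threshold=0.8):
    # Directed manual transcription, schematically: utterances the seed
    # model recognises confidently keep their automatic transcriptions;
    # low-confidence utterances are queued for manual transcription.
    auto, manual = [], []
    for utt in utterances:
        if utt["confidence"] >= threshold:
            auto.append(utt["id"])
        else:
            manual.append(utt["id"])
    return auto, manual

# Hypothetical recogniser output with per-utterance confidence scores
utts = [{"id": "u1", "confidence": 0.95},
        {"id": "u2", "confidence": 0.55},
        {"id": "u3", "confidence": 0.86}]
print(select_for_training(utts))  # -> (['u1', 'u3'], ['u2'])
```

The paper's finding is that transcribing the `manual` bucket by hand, rather than simply discarding it, is what makes discriminative training effective on mismatched data.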
[Journal Article] Continuous F0 Modeling for HMM Based Statistical Parametric Speech Synthesis
IEEE Transactions on Audio, Speech, and Language Processing, 2010, 19(5): 1071–107
September 16, 2010
The modeling of fundamental frequency, or F0, in HMM-based speech synthesis is a critical factor in delivering speech which is both natural and accurately conveys all of the many nuances of the message. However, F0 modeling is difficult because F0 values are normally considered to depend on a binary voicing decision such that they are continuous in voiced regions and undefined in unvoiced regions. F0 is therefore a discontinuous function of time. Multi-space probability distribution HMM (MSDHMM) is a widely used solution to this problem. The MSDHMM essentially uses a joint distribution of discrete voicing labels and the discontinuous F0 observations. However, due to the discontinuity assumption, the MSDHMM provides a rather weak F0 trajectory model. In this paper, F0 is viewed as being a continuous function of time and this is achieved by assuming that F0 can be observed within unvoiced regions as well as voiced regions. This provides a continuous F0 data stream which can be modeled by standard HMMs. Voicing labels are modeled either implicitly or explicitly in order to perform voicing classification and a globally tied distribution (GTD) technique is used to achieve robust F0 estimation. Both objective measures and subjective listening tests demonstrate that continuous F0 modeling yields better synthesized F0 trajectories and significant improvements to the naturalness of synthesized speech compared to using the MSDHMM model.
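The core idea, treating F0 as observable even in unvoiced regions so a single continuous stream can be modelled by standard HMMs, is often realised by interpolating through the unvoiced gaps. The linear-interpolation scheme and the toy F0 track below are illustrative assumptions, not the paper's exact estimation method.

```python
def continuous_f0(f0, unvoiced=0.0):
    # Produce a continuous F0 track: frames marked unvoiced (0.0 here)
    # are filled by linear interpolation between the nearest voiced
    # frames; leading/trailing gaps copy the nearest voiced value.
    voiced = [i for i, v in enumerate(f0) if v != unvoiced]
    out = list(f0)
    for i, v in enumerate(f0):
        if v != unvoiced:
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:
            out[i] = f0[right]
        elif right is None:
            out[i] = f0[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = f0[left] + frac * (f0[right] - f0[left])
    return out

# Made-up F0 track in Hz; 0.0 marks unvoiced frames
track = [120.0, 0.0, 0.0, 150.0, 160.0, 0.0]
print(continuous_f0(track))  # unvoiced gaps filled smoothly
```

With the gaps filled, voicing becomes a separate label stream (modelled implicitly or explicitly, as the abstract notes) rather than a structural discontinuity in the F0 observations themselves.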
Speech Communication, 2011, 53(6): 914–923
July 1, 2011
To achieve natural high quality synthesized speech in HMM-based speech synthesis, the effective modelling of complex acoustic and linguistic contexts is critical. Traditional approaches use context-dependent HMMs with decision tree based parameter clustering to model the full combinatorial space of contexts. However, weak contexts, such as word-level emphasis in natural speech, are difficult to capture using this approach. Also, due to combinatorial explosion, incorporating new contexts within the traditional framework may easily lead to the problem of insufficient data coverage. To effectively model weak contexts and reduce the data sparsity problem, different types of contexts should be treated independently. Context adaptive training provides a structured framework for this whereby standard HMMs represent normal contexts and transforms represent the additional effects of weak contexts. In contrast to speaker adaptive training in speech recognition, separate decision trees have to be built for different types of context factors. This paper describes the general framework of context adaptive training and investigates three concrete forms: MLLR, CMLLR and CAT based systems. Experiments on a word-level emphasis synthesis task show that all context adaptive training approaches can outperform the standard full-context-dependent HMM approach. Among these, the MLLR based system achieved the best performance.
Keywords: HMM-based speech synthesis, Context adaptive training, Factorized decision tree, State clustering
[Journal Article] Deep feature for text-dependent speaker verification
Speech Communication, 2015, 73: 1–13
October 1, 2015
Recently deep learning has been successfully used in speech recognition; however, it has not been carefully explored and widely accepted for speaker verification. To incorporate deep learning into speaker verification, this paper proposes novel approaches of extracting and using features from deep learning models for text-dependent speaker verification. In contrast to traditional short-term spectral features, such as MFCC or PLP, in this paper outputs from the hidden layers of various deep models are employed as deep features for text-dependent speaker verification. Four types of deep models are investigated: deep Restricted Boltzmann Machines, speech-discriminant Deep Neural Network (DNN), speaker-discriminant DNN, and multi-task joint-learned DNN. Once deep features are extracted, they may be used within either the GMM-UBM framework or the identity vector (i-vector) framework. Joint linear discriminant analysis and probabilistic linear discriminant analysis are proposed as effective back-end classifiers for identity vector based deep features. These approaches were evaluated on the RSR2015 data corpus. Experiments showed that deep feature based methods can obtain significant performance improvements compared to the traditional baselines, whether they are directly applied in the GMM-UBM system or utilized as identity vectors. The EER of the best system using the proposed identity vector is 0.10%, only one fifteenth of that of the GMM-UBM baseline.
Keywords: Text-dependent speaker verification, Deep neural networks, Deep features, RSR2015
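The feature-extraction idea above, taking a hidden-layer activation vector in place of MFCC/PLP as the front-end for a GMM-UBM or i-vector back end, can be sketched with a tiny network. Everything below is a made-up illustration: a two-layer MLP with fixed toy weights, not one of the paper's trained models.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def forward(x, layers):
    # Run a small MLP and return the activations of every layer.
    acts, h = [], x
    for W, b in layers:
        h = relu([sum(w * xi for w, xi in zip(row, h)) + bi
                  for row, bi in zip(W, b)])
        acts.append(h)
    return acts

def deep_feature(x, layers, layer_index=-1):
    # "Deep feature": a hidden-layer activation vector, used as the
    # per-frame feature fed to a GMM-UBM or i-vector back end.
    return forward(x, layers)[layer_index]

# Hypothetical 2-layer network with fixed toy weights and biases
layers = [([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.0]),
          ([[1.0, 0.5], [-0.3, 0.8]], [0.2, 0.0])]
print(deep_feature([1.0, 2.0], layers))
```

In the paper the network is first trained on a discriminative task (phone, speaker, or both jointly); only then are its hidden activations reused as features, which is what distinguishes a deep feature from the raw spectral input.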