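Jie Tang (唐杰)
Doctorate · Professor · Doctoral supervisor
Department of Computer Science and Technology, Tsinghua University
His research covers artificial intelligence, social networks, data mining, machine learning, and knowledge graphs. His original contributions include social influence analysis, user behavior modeling in social networks, and network behavior modeling and influence analysis.
Where there is a will, there is a way.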
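- Name: Jie Tang (唐杰)
- Current status: active researcher
- Supervisor status: doctoral supervisor
- Degree:
- Academic titles: doctoral supervisor; recipient of the National Science Fund for Distinguished Young Scholars
- Professional title: senior (professor)
- Discipline: computer software
- Research interests: artificial intelligence, social networks, data mining, machine learning, and knowledge graphs; original contributions include social influence analysis, user behavior modeling in social networks, and network behavior modeling and influence analysis.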
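Jie Tang is a professor and doctoral supervisor in the Department of Computer Science and Technology at Tsinghua University.
He received his doctorate in engineering from the Department of Computer Science at Tsinghua University in June 2006 and then joined the faculty there.
His research covers artificial intelligence, social networks, data mining, machine learning, and knowledge graphs, with original contributions in social influence analysis, user behavior modeling in social networks, and network behavior modeling and influence analysis. Building on this research, he developed AMiner, a big-data mining and service platform for science and technology intelligence with fully independent intellectual property rights. Since going online in 2006, the system has indexed 230 million papers and 130 million scholars. It offers scholars worldwide free search over researchers, publications, and other science and technology information, along with specialized services such as scholar evaluation and recommendation, technology trend analysis, and industry development reports. It has attracted more than 10 million unique IP visits from 220 countries and regions. He has led research projects funded by the National Science Fund for Distinguished Young Scholars, the joint program of the National Natural Science Foundation of China (NSFC) and the UK Royal Society, the 863 Program, the National Science Fund for Excellent Young Scholars, and the NSFC Key Program. His awards include the First Prize of the 2017 Beijing Science and Technology Award (ranked first), the 2016 Microsoft Research Asia Collaborative Research Award, a 2015 Newton Advanced Fellowship, the First Prize of the 2013 Wu Wenjun Artificial Intelligence Science and Technology (Progress) Award of the Chinese Association for Artificial Intelligence (CAAI) (ranked first), the 2012 CCF Young Scientist Award, the 2012 NSFC Excellent Young Scholar award, and the 2011 Beijing Nova Program award.
He also serves as executive editor-in-chief of TKDD; editorial board member of IEEE TKDE, ACM TIST, IEEE TBD, and Science China; vice chair of the Technical Committee on Chinese Information Technology of the China Computer Federation (CCF); vice chair and secretary-general of the Technical Committee on Social Media Processing of the Chinese Information Processing Society of China; vice chair of KDD 2018; and program committee chair of WWW 2018, CIKM 2016, WSDM 2015, and ASONAM 2015.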
[Journal article] A Constraint-Based Probabilistic Framework for Name Disambiguation
Duo Zhang, Jie Tang, Juanzi Li, and Kehong Wang
This paper is concerned with the problem of name disambiguation. By name disambiguation, we mean distinguishing persons with the same name. It is a critical problem in many knowledge management applications. Although much research has been conducted, the problem is still not resolved and has become even more serious, in particular with the popularity of Web 2.0. Previously, name disambiguation was often undertaken in either a supervised or unsupervised fashion. This paper first gives a constraint-based probabilistic model for semi-supervised name disambiguation. Specifically, we focus on investigating the problem in an academic researcher social network (http://arnetminer.org). The framework combines constraints and Euclidean distance learning, and allows the user to refine the disambiguation results. Experimental results on the researcher social network show that the proposed framework significantly outperforms the baseline method using an unsupervised hierarchical clustering algorithm.
Name Disambiguation, Social Network Analysis, Digital Library, Semi-supervised Clustering
[Journal article] A Mixture Model for Expert Finding
Jing Zhang, Jie Tang, Liu Liu, and Juanzi Li
This paper addresses the issue of identifying persons with expertise on a given topic. Traditional methods usually estimate the relevance between the query and the support documents of candidate experts using, for example, a language model. However, the language model lacks the ability to identify semantic knowledge, so some relevant experts cannot be found when the query terms do not occur in their support documents. In this paper, we propose a mixture model based on Probabilistic Latent Semantic Analysis (PLSA) to estimate a hidden semantic theme layer between the terms and the support documents. The hidden themes are used to capture the semantic relevance between the query and the experts. We evaluate our mixture model in a real-world system, ArnetMiner. Experimental results indicate that the proposed model outperforms the language models.
[Journal article] A Unified Tagging Approach to Text Normalization
Conghui Zhu, Jie Tang, Hang Li, Hwee Tou Ng, Tie-Jun Zhao
This paper addresses the issue of text normalization, an important yet often overlooked problem in natural language processing. By text normalization, we mean converting ‘informally inputted’ text into the canonical form, by eliminating ‘noises’ in the text and detecting paragraph and sentence boundaries in the text. Previously, text normalization issues were often undertaken in an ad-hoc fashion or studied separately. This paper first gives a formalization of the entire problem. It then proposes a unified tagging approach to perform the task using Conditional Random Fields (CRF). The paper shows that with the introduction of a small set of tags, most of the text normalization tasks can be performed within the approach. The accuracy of the proposed method is high, because the subtasks of normalization are interdependent and should be performed together. Experimental results on email data cleaning show that the proposed method significantly outperforms the approach of using cascaded models and that of employing independent models.
[Journal article] Arnetminer: expertise oriented search using social networks
Juanzi LI, Jie TANG, Jing ZHANG, Qiong LUO, Yunhao LIU, Mingcai HONG
Front. Comput. Sci. China
Expertise Oriented Search (EOS) aims at providing comprehensive expertise analysis on data from distributed sources. It is useful in many application domains, for example, finding experts on a given topic, detecting conflicts of interest between researchers, and assigning reviewers to proposals. In this paper, we present the design and implementation of our expertise oriented search system, Arnetminer (http://www.arnetminer.net). Arnetminer has gathered and integrated information about a half-million computer science researchers from the Web, including their profiles and publications. Moreover, Arnetminer constructs a social network among these researchers through their co-authorship, and utilizes this network information as well as the individual profiles to facilitate expertise oriented search tasks. In particular, the co-authorship information is used both in ranking the expertise of individual researchers for a given topic and in searching for associations between researchers. We have conducted initial experiments on Arnetminer. Our results demonstrate that the proposed relevancy propagation expert finding method outperforms the method that only uses person local information, and the proposed two-stage association search on a large-scale social network is orders of magnitude faster than the baseline method.
social network, expertise search, association search
Jie Tang, Hang Li, Yunbo Cao, Zhaohui Tang
Addressed in this paper is the issue of ‘email data cleaning’ for text mining. Many text mining applications need to take emails as input. Email data is usually noisy and thus it is necessary to clean it before mining. Several products offer email cleaning features; however, the types of noise that can be eliminated are restricted. Despite the importance of the problem, email cleaning has received little attention in the research community. A thorough and systematic investigation of the issue is thus needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization. In this way, email cleaning becomes independent of any specific text mining processing. A cascaded approach is proposed, which cleans up an email in four passes including non-text filtering, paragraph normalization, sentence normalization, and word normalization. As far as we know, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing the tasks on the basis of Support Vector Machines (SVM) have also been proposed in this paper. Features in the models have been defined. Experimental results indicate that the proposed SVM based methods can significantly outperform the baseline methods for email cleaning. The proposed method has been applied to term extraction, a typical text mining task. Experimental results show that the accuracy of term extraction can be significantly improved by using the data cleaning method.
Text Mining, Data Cleaning, Email Processing, Statistical Learning
[Journal article] iASA: Learning to Annotate the Semantic Web
Jie Tang, Juanzi Li, Hongjun Lu, Bangyong Liang, Xiaotong Huang, Kehong Wang
With the advent of the Semantic Web, there is a great need to upgrade existing web content to semantic web content. This can be accomplished through semantic annotations. Unfortunately, manual annotation is tedious, time consuming and error-prone. In this paper, we propose a tool, called iASA, that learns to automatically annotate web documents according to an ontology. iASA is based on the combination of information extraction (specifically, the Similarity-based Rule Learner—SRL) and machine learning techniques. Using linguistic knowledge and optimal dynamic window size, SRL produces annotation rules of better quality than comparable semantic annotation systems. Similarity-based learning efficiently reduces the search space by avoiding pseudo rule generalization. In the annotation phase, iASA exploits ontology knowledge to refine the annotation it proposes. Moreover, our annotation algorithm exploits machine learning methods to correctly select instances and to predict missing instances. Finally, iASA provides an explanation component that explains the nature of the learner and annotator to the user. Explanations can greatly help users understand the rule induction and annotation process, so that they can focus on correcting rules and annotations quickly. Experimental results show that iASA can reach high accuracy quickly.
[Journal article] Chapter I Information Extraction: Methodologies and Applications
Jie Tang, Mingcai Hong, Duo Zhang, Juanzi Li, Bangyong Liang
This chapter is concerned with the methodologies and applications of information extraction. Information is hidden in the large volume of Web pages and thus it is necessary to extract useful information from the Web content, called information extraction. In information extraction, given a sequence of instances, we identify and pull out a subsequence of the input that represents information we are interested in. In the past years, there was a rapid expansion of activities in the information extraction area. Many methods have been proposed for automating the process of extraction. However, due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge information still presents many challenging research problems. In this chapter, we will investigate the problems of information extraction and survey existing methodologies for solving these problems. Several real-world applications of information extraction will be introduced. Emerging challenges will be discussed.
[Journal article] Social Network Extraction of Academic Researchers
Jie Tang, Duo Zhang, and Limin Yao
This paper addresses the issue of extracting an academic researcher social network. By researcher social network extraction, we aim at finding, extracting, and fusing the ‘semantic’-based profiling information of a researcher from the Web. Previously, social network extraction was often undertaken separately in an ad-hoc fashion. This paper first gives a formalization of the entire problem. Specifically, it identifies the ‘relevant documents’ from the Web by a classifier. It then proposes a unified approach to perform the researcher profiling using Conditional Random Fields (CRF). It integrates publications from the existing bibliography datasets. In the integration, it proposes a constraint-based probabilistic model for name disambiguation. Experimental results on an online system show that the unified approach to researcher profiling significantly outperforms the baseline methods of using rule learning or classification. Experimental results also indicate that our method for name disambiguation performs better than the baseline method using unsupervised learning. The methods have been applied to expert finding. Experiments show that the accuracy of expert finding can be significantly improved by using the proposed methods.
[Journal article] Tree-structured Conditional Random Fields for Semantic Annotation
Jie Tang, Mingcai Hong, Juanzi Li, and Bangyong Liang
The large volume of web content needs to be annotated by ontologies (called Semantic Annotation), and our empirical study shows that strong dependencies exist across different types of information (it means that identification of one kind of information can be used for identifying the other kind of information). Conditional Random Fields (CRFs) are the state-of-the-art approaches for modeling the dependencies to do better annotation. However, as information on a Web page is not necessarily linearly laid-out, the previous linear-chain CRFs have their limitations in semantic annotation. This paper is concerned with semantic annotation on hierarchically dependent data (hierarchical semantic annotation). We propose a Tree-structured Conditional Random Field (TCRF) model to better incorporate dependencies across the hierarchically laid-out information. Methods for performing the tasks of model-parameter estimation and annotation in TCRFs have been proposed. Experimental results indicate that the proposed TCRFs for hierarchical semantic annotation can significantly outperform the existing linear-chain CRF model.
[Journal article] Using Bayesian decision for ontology mapping
Jie Tang, Juanzi Li, Bangyong Liang, Xiaotong Huang, Yi Li, Kehong Wang
Web Semantics: Science, Services and Agents on the World Wide Web 4 (2006) 243–262
Ontology mapping is the key to achieving interoperability over ontologies. In the Semantic Web environment, ontologies are usually distributed and heterogeneous, and thus it is necessary to find the mapping between them before processing across them. Many efforts have been made to automate the discovery of ontology mappings. However, some problems are still evident. In this paper, ontology mapping is formalized as a problem of decision making. In this way, discovery of the optimal mapping is cast as finding the decision with minimal risk. An approach called Risk Minimization based Ontology Mapping (RiMOM) is proposed, which automates the discovery of 1:1, n:1, 1:null and null:1 mappings. Based on the techniques of normalization and NLP, the problem of instance heterogeneity in ontology mapping is resolved to a certain extent. To deal with the problem of name conflict in the mapping process, we use a thesaurus and statistical techniques. Experimental results indicate that the proposed method can significantly outperform the baseline methods, and also obtains improvement over the existing methods.
Ontology mapping, Semantic web, Bayesian decision, Ontology interoperability