Sciencepaper Online (中国科技论文在线)

Uploaded: March 24, 2008

[Journal article] A Unified Tagging Approach to Text Normalization

Conghui Zhu, Jie Tang (唐杰), Hang Li, Hwee Tou Ng, Tie-Jun Zhao

This paper addresses the issue of text normalization, an important yet often overlooked problem in natural language processing. By text normalization, we mean converting ‘informally inputted’ text into canonical form by eliminating ‘noise’ in the text and detecting paragraph and sentence boundaries. Previously, text normalization issues were often handled in an ad-hoc fashion or studied separately. This paper first gives a formalization of the entire problem. It then proposes a unified tagging approach that performs the task using Conditional Random Fields (CRF). The paper shows that, with the introduction of a small set of tags, most text normalization tasks can be performed within the approach. The accuracy of the proposed method is high because the subtasks of normalization are interdependent and should be performed together. Experimental results on email data cleaning show that the proposed method significantly outperforms both the approach of using cascaded models and that of employing independent models.
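
The tagging idea in the abstract above can be illustrated with a minimal Python sketch. It only shows the final step, turning per-token tags into normalized text. The tag names (`KEEP`, `DEL`, `LOWER`, `CAP`, `SENT_END`) and the helper functions are hypothetical and chosen for illustration; the tag set in the paper may differ, and in the paper the tags are predicted jointly by a CRF, whereas here they are written by hand.

```python
# Hypothetical tag set for illustration only; not the paper's actual tags.
KEEP, DELETE, LOWER, CAPITALIZE, SENT_END = "KEEP", "DEL", "LOWER", "CAP", "SENT_END"

PUNCTUATION = set(".,!?;:")


def join_tokens(tokens):
    """Join tokens with spaces, attaching punctuation to the previous token."""
    out = ""
    for tok in tokens:
        if out and tok not in PUNCTUATION:
            out += " "
        out += tok
    return out


def apply_tags(tokens, tags):
    """Rebuild normalized sentences from tokens and their per-token tags."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    sentences, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == DELETE:
            continue  # noise token: drop it
        if tag == LOWER:
            current.append(token.lower())
        elif tag == CAPITALIZE:
            current.append(token.capitalize())
        elif tag == SENT_END:
            current.append(token)
            sentences.append(current)  # close the sentence at this token
            current = []
        else:  # KEEP
            current.append(token)
    if current:
        sentences.append(current)
    return [join_tokens(s) for s in sentences]


tokens = ["i", "LOVE", "it", "!", "!!", "thanks", "."]
tags = [CAPITALIZE, LOWER, KEEP, SENT_END, DELETE, CAPITALIZE, SENT_END]
print(apply_tags(tokens, tags))
```

Because noise removal, case restoration, and sentence boundaries are all encoded in a single tag sequence, one sequence model can decide them together, which is the interdependence the abstract refers to.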

Uploaded: March 24, 2008

[Journal article] Social Network Extraction of Academic Researchers

Jie Tang (唐杰), Duo Zhang, and Limin Yao

Abstract

This paper addresses the extraction of an academic researcher social network. By researcher social network extraction, we mean finding, extracting, and fusing the ‘semantic’-based profiling information of a researcher from the Web. Previously, social network extraction was often undertaken separately in an ad-hoc fashion. This paper first gives a formalization of the entire problem. Specifically, it identifies the ‘relevant documents’ from the Web using a classifier. It then proposes a unified approach that performs researcher profiling using Conditional Random Fields (CRF). It integrates publications from existing bibliography datasets. For the integration, it proposes a constraint-based probabilistic model for name disambiguation. Experimental results on an online system show that the unified approach to researcher profiling significantly outperforms baseline methods that use rule learning or classification. Experimental results also indicate that our method for name disambiguation performs better than a baseline method using unsupervised learning. The methods have been applied to expert finding. Experiments show that the accuracy of expert finding can be significantly improved by using the proposed methods.

Uploaded: March 24, 2008

[Journal article] Email Data Cleaning

Jie Tang (唐杰), Hang Li, Yunbo Cao, Zhaohui Tang

Abstract

This paper addresses the issue of ‘email data cleaning’ for text mining. Many text mining applications need to take emails as input. Email data is usually noisy, so it is necessary to clean it before mining. Several products offer email cleaning features; however, the types of noise they can eliminate are restricted. Despite the importance of the problem, email cleaning has received little attention in the research community. A thorough and systematic investigation of the issue is thus needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization. In this way, email cleaning becomes independent of any specific text mining task. A cascaded approach is proposed, which cleans up an email in four passes: non-text filtering, paragraph normalization, sentence normalization, and word normalization. As far as we know, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing the tasks on the basis of Support Vector Machines (SVM) are also proposed, and the features used in the models are defined. Experimental results indicate that the proposed SVM-based methods can significantly outperform the baseline methods for email cleaning. The proposed method has been applied to term extraction, a typical text mining task. Experimental results show that the accuracy of term extraction can be significantly improved by using the data cleaning method.

Keywords: Text Mining, Data Cleaning, Email Processing, Statistical Learning
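
The four-pass cascade described in the abstract above can be sketched in Python as follows. This is a rough illustration only: the paper uses SVM-based models for the passes, whereas this sketch swaps in simple hand-written rules so that it stays self-contained. All function names and heuristics (quoted-reply lines starting with `>`, a `--` signature delimiter, blank lines as paragraph breaks) are assumptions made for the example.

```python
import re


def filter_non_text(lines):
    """Pass 1: drop quoted-reply lines and everything from the signature on."""
    kept = []
    for line in lines:
        if line.strip() == "--":  # assumed signature delimiter
            break
        if line.lstrip().startswith(">"):  # assumed quoted-reply marker
            continue
        kept.append(line)
    return kept


def normalize_paragraphs(lines):
    """Pass 2: join hard-wrapped lines; blank lines separate paragraphs."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def normalize_sentences(paragraph):
    """Pass 3: collapse repeated end punctuation, then split into sentences."""
    paragraph = re.sub(r"([.!?])\1+", r"\1", paragraph)
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def normalize_words(sentence):
    """Pass 4: restore the capital letter at the start of the sentence."""
    return sentence[:1].upper() + sentence[1:]


def clean_email(raw):
    """Run the four passes in order; returns paragraphs as lists of sentences."""
    lines = filter_non_text(raw.splitlines())
    return [
        [normalize_words(s) for s in normalize_sentences(p)]
        for p in normalize_paragraphs(lines)
    ]


raw = (
    "> old quoted text\n"
    "hello team,\n"
    "the build is\n"
    "broken again!!!\n"
    "\n"
    "please take a look.\n"
    "--\n"
    "Bob\n"
)
print(clean_email(raw))
```

Each pass consumes the output of the previous one, so an error made early (for example, a body line misjudged as a signature) cannot be corrected later. That is the limitation of cascading that a jointly trained model is meant to avoid.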

Uploaded: March 24, 2008

[Journal article] A Mixture Model for Expert Finding

Jing Zhang, Jie Tang (唐杰), Liu Liu, and Juanzi Li

Abstract

This paper addresses the issue of identifying persons with expertise on a given topic. Traditional methods usually estimate the relevance between the query and the support documents of candidate experts using, for example, a language model. However, the language model lacks the ability to identify semantic knowledge, so some relevant experts cannot be found because the query terms do not occur in their support documents. In this paper, we propose a mixture model based on Probabilistic Latent Semantic Analysis (PLSA) to estimate a hidden semantic theme layer between the terms and the support documents. The hidden themes are used to capture the semantic relevance between the query and the experts. We evaluate our mixture model in a real-world system, ArnetMiner. Experimental results indicate that the proposed model outperforms the language models.
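
The role of the hidden theme layer in the abstract above can be illustrated with a small Python sketch. It assumes the standard PLSA-style decomposition, in which the probability of a query term given a document is a sum over themes of P(term | theme) times P(theme | document). The exact form of the paper's mixture model may differ. The theme names, document names, and all probabilities below are made up for the example; a real system would estimate them from data, typically with EM.

```python
import math

# Hypothetical, hand-picked probabilities for illustration only.
P_TERM_GIVEN_THEME = {
    "ml": {"learning": 0.5, "classifier": 0.4, "database": 0.1},
    "db": {"learning": 0.1, "classifier": 0.1, "database": 0.8},
}
P_THEME_GIVEN_DOC = {
    "doc_a": {"ml": 0.9, "db": 0.1},
    "doc_b": {"ml": 0.2, "db": 0.8},
}


def query_likelihood(query_terms, doc):
    """P(query | doc) as a product over terms of sum_z P(term|z) * P(z|doc)."""
    score = 1.0
    for term in query_terms:
        score *= sum(
            P_TERM_GIVEN_THEME[theme].get(term, 0.0) * p_theme
            for theme, p_theme in P_THEME_GIVEN_DOC[doc].items()
        )
    return score


for doc in ("doc_a", "doc_b"):
    print(doc, query_likelihood(["learning"], doc))
```

The score depends only on the themes a support document is associated with, so a document can receive a non-zero score for the query term "learning" even if that word never literally occurs in it. A purely term-matching language model without smoothing would assign zero in that case.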

Uploaded: March 24, 2008

[Journal article] A Constraint-Based Probabilistic Framework for Name Disambiguation

Duo Zhang, Jie Tang (唐杰), Juanzi Li, and Kehong Wang

Abstract

This paper is concerned with the problem of name disambiguation. By name disambiguation, we mean distinguishing persons with the same name. It is a critical problem in many knowledge management applications. Although much research has been conducted, the problem remains unresolved and has become even more serious, in particular with the popularity of Web 2.0. Previously, name disambiguation was often undertaken in either a supervised or an unsupervised fashion. This paper first gives a constraint-based probabilistic model for semi-supervised name disambiguation. Specifically, we focus on investigating the problem in an academic researcher social network (http://arnetminer.org). The framework combines constraints and Euclidean distance learning, and allows the user to refine the disambiguation results. Experimental results on the researcher social network show that the proposed framework significantly outperforms a baseline method using an unsupervised hierarchical clustering algorithm.

Keywords: Name Disambiguation, Social Network Analysis, Digital Library, Semi-supervised Clustering
