Sciencepaper Online (中国科技论文在线)

Uploaded: March 24, 2008

[Journal article] Using Bayesian decision for ontology mapping

Jie Tang (唐杰), Juanzi Li, Bangyong Liang, Xiaotong Huang, Yi Li, Kehong Wang

Web Semantics: Science, Services and Agents on the World Wide Web 4 (2006) 243–262

Ontology mapping is key to achieving interoperability across ontologies. In the Semantic Web environment, ontologies are usually distributed and heterogeneous, so the mappings between them must be found before data can be processed across them. Many efforts have been made to automate the discovery of ontology mappings, but some problems remain. In this paper, ontology mapping is formalized as a decision-making problem, so that discovering the optimal mapping is cast as finding the decision with minimal risk. An approach called Risk Minimization based Ontology Mapping (RiMOM) is proposed, which automates the discovery of 1:1, n:1, 1:null, and null:1 mappings. Using normalization and NLP techniques, the problem of instance heterogeneity in ontology mapping is resolved to a certain extent. To deal with name conflicts in the mapping process, a thesaurus and statistical techniques are used. Experimental results indicate that the proposed method significantly outperforms the baseline methods and also improves on existing methods.

Ontology mapping, Semantic web, Bayesian decision, Ontology interoperability
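The idea of casting mapping discovery as a minimal-risk decision can be illustrated with a small sketch. This is not the RiMOM implementation: the posterior values, the candidate names, and both loss functions (`zero_one_loss`, `cautious_loss`) are hypothetical and chosen only to show how the decision rule behaves.

```python
from typing import Callable, Dict

def min_risk_decision(posterior: Dict[str, float],
                      loss: Callable[[str, str], float]) -> str:
    """Pick the candidate whose expected loss under the posterior is smallest."""
    def risk(action: str) -> float:
        # Expected loss of choosing `action`, averaged over the possible true targets.
        return sum(loss(action, truth) * p for truth, p in posterior.items())
    return min(posterior, key=risk)

def zero_one_loss(action: str, truth: str) -> float:
    # Every wrong decision costs the same; the rule reduces to "most probable target".
    return 0.0 if action == truth else 1.0

def cautious_loss(action: str, truth: str) -> float:
    # Hypothetical asymmetric loss: a wrong mapping costs more than a wrong "NULL".
    if action == truth:
        return 0.0
    return 1.0 if action == "NULL" else 3.0

# Hypothetical posteriors over target entities for one source entity.
# "NULL" stands for "no counterpart in the target ontology" (a 1:null mapping).
print(min_risk_decision({"Person": 0.6, "Organization": 0.3, "NULL": 0.1}, zero_one_loss))
print(min_risk_decision({"Person": 0.4, "Organization": 0.35, "NULL": 0.25}, cautious_loss))
```

With the symmetric loss the most probable candidate is chosen. With the asymmetric loss and a flatter posterior, declining to map becomes the lower-risk decision.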

Uploaded: March 24, 2008

[Journal article] Tree-structured Conditional Random Fields for Semantic Annotation

Jie Tang (唐杰), Mingcai Hong, Juanzi Li, and Bangyong Liang

Abstract

The large volume of web content needs to be annotated by ontologies (a task called semantic annotation), and our empirical study shows that strong dependencies exist across different types of information; that is, identifying one kind of information can help identify another. Conditional Random Fields (CRFs) are the state-of-the-art approach for modeling such dependencies to improve annotation. However, because information on a Web page is not necessarily laid out linearly, previous linear-chain CRFs have limitations in semantic annotation. This paper is concerned with semantic annotation of hierarchically dependent data (hierarchical semantic annotation). We propose a Tree-structured Conditional Random Field (TCRF) model to better incorporate dependencies across hierarchically laid-out information. Methods for model-parameter estimation and annotation in TCRFs are also proposed. Experimental results indicate that the proposed TCRFs significantly outperform the existing linear-chain CRF model for hierarchical semantic annotation.

Uploaded: March 24, 2008

[Journal article] Social Network Extraction of Academic Researchers

Jie Tang (唐杰), Duo Zhang, and Limin Yao

Abstract

This paper addresses the extraction of an academic researcher social network. By researcher social network extraction, we aim at finding, extracting, and fusing the ‘semantic’-based profiling information of a researcher from the Web. Previously, social network extraction was often undertaken separately in an ad hoc fashion. This paper first formalizes the entire problem. Specifically, it identifies the ‘relevant documents’ on the Web using a classifier. It then proposes a unified approach to researcher profiling using Conditional Random Fields (CRFs). It integrates publications from existing bibliography datasets and, for this integration, proposes a constraint-based probabilistic model for name disambiguation. Experimental results on an online system show that the unified approach to researcher profiling significantly outperforms baseline methods that use rule learning or classification. Experimental results also indicate that our name disambiguation method performs better than a baseline method using unsupervised learning. The methods have been applied to expert finding, and experiments show that they significantly improve its accuracy.

Uploaded: March 24, 2008

[Journal article] Email Data Cleaning

Jie Tang (唐杰), Hang Li, Yunbo Cao, Zhaohui Tang

Abstract

This paper addresses the issue of ‘email data cleaning’ for text mining. Many text mining applications need to take emails as input. Email data is usually noisy, so it must be cleaned before mining. Several products offer email cleaning features; however, the types of noise they can eliminate are restricted. Despite the importance of the problem, email cleaning has received little attention in the research community, and a thorough and systematic investigation of the issue is therefore needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization, which makes it independent of any specific text mining process. A cascaded approach is proposed that cleans an email in four passes: non-text filtering, paragraph normalization, sentence normalization, and word normalization. As far as we know, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing these tasks on the basis of Support Vector Machines (SVMs) are also proposed, and the features used in the models are defined. Experimental results indicate that the proposed SVM-based methods significantly outperform the baseline methods for email cleaning. The proposed method has been applied to term extraction, a typical text mining task, and experimental results show that the data cleaning method significantly improves the accuracy of term extraction.

Text Mining, Data Cleaning, Email Processing, Statistical Learning
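The four-pass cascade described in the abstract can be sketched as a simple pipeline. The paper's passes are SVM-based; the version below swaps in hand-written rules purely to show how the passes chain together. All function names and rules are assumptions made for illustration, not the authors' method.

```python
import re
from typing import List

def filter_non_text(lines: List[str]) -> List[str]:
    # Assumed rule: stop at a "-- " signature delimiter and drop quoted-reply lines.
    kept = []
    for line in lines:
        if line.rstrip() == "--":
            break
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return kept

def normalize_paragraphs(lines: List[str]) -> List[str]:
    # Assumed rule: blank lines separate paragraphs; other line breaks are joined.
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

def normalize_sentences(paragraph: str) -> List[str]:
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def normalize_words(sentence: str) -> str:
    # Assumed rule: collapse repeated '!' or '?' and runs of whitespace.
    sentence = re.sub(r"([!?])\1+", r"\1", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

def clean_email(raw: str) -> List[List[str]]:
    """Run the four passes in order; return paragraphs as lists of sentences."""
    lines = filter_non_text(raw.splitlines())
    return [[normalize_words(s) for s in normalize_sentences(p)]
            for p in normalize_paragraphs(lines)]

raw = ("Hi Bob,\nthe build is\nbroken again!!!  Please check.\n\n"
       "> old quoted text\nThanks\n-- \nAlice\nACME Corp\n")
print(clean_email(raw))
```

Because each pass consumes only the output of the previous one, a rule-based pass can be replaced with a learned classifier without changing the rest of the pipeline.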

Uploaded: March 24, 2008

[Journal article] Chapter I Information Extraction: Methodologies and Applications

Jie Tang (唐杰), Mingcai Hong, Duo Zhang, Juanzi Li, Bangyong Liang

Abstract

This chapter is concerned with the methodologies and applications of information extraction. Information is hidden in the large volume of Web pages, so useful information must be extracted from Web content, a task called information extraction. In information extraction, given a sequence of instances, we identify and pull out a subsequence of the input that represents the information we are interested in. In recent years, there has been a rapid expansion of activity in the information extraction area, and many methods have been proposed for automating the extraction process. However, due to the heterogeneity and lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. In this chapter, we investigate the problems of information extraction and survey existing methodologies for solving them. Several real-world applications of information extraction are introduced, and emerging challenges are discussed.
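The notion of pulling a subsequence of interest out of an input sequence can be made concrete with a short sketch. It assumes each token already carries a label; in practice the labels would come from a trained extraction model. The tokens, the labels, and the helper `extract_spans` are hypothetical.

```python
from typing import List

def extract_spans(tokens: List[str], labels: List[str], target: str) -> List[str]:
    """Return each maximal run of consecutive tokens labeled `target`, joined by spaces."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == target:
            current.append(token)
        elif current:
            # The run has ended; emit it as one extracted subsequence.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# Hypothetical input sequence with per-token labels ("O" marks tokens of no interest).
tokens = ["Call", "Alice", "Smith", "tomorrow"]
labels = ["O", "NAME", "NAME", "O"]
print(extract_spans(tokens, labels, "NAME"))
```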
