Xiaojun Wan (万小军), Scholar Homepage - Sciencepaper Online (中国科技论文在线)

Xiaojun Wan

Name: Xiaojun Wan (万小军)
Discipline: Computer Applications

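Biography

Xiaojun Wan was born in 1979. He received a B.S. from the Department of Information Management, Peking University, in July 2000; an M.S. from the Department of Computer Science and Technology, Peking University, in July 2003; and a Ph.D. from the School of Information Science and Technology, Peking University, in July 2006. He joined the Institute of Computer Science and Technology, Peking University, as an assistant researcher in 2006 and was promoted to associate researcher in August 2007. He is a member of ACM and ACL, has served on the program committees of the ACL2008 Summarization Track and the WWW2008 Poster Track, and reviews for the international journals TALIP, KAIS, and Information Sciences. He has published more than 20 papers as first author, including papers at the international conferences SIGIR, AAAI, IJCAI, ACL, COLING, CIKM, and WWW and in the international journals Information Retrieval, Information Processing & Management, and Knowledge and Information Systems, and he has filed 10 invention patent applications.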

Publications

Uploaded: 2008-03-24

[Journal article] Single Document Summarization with Document Expansion

Xiaojun Wan and Jianwu Yang

Abstract

Existing methods for single document summarization usually make use of only the information contained in the specified document. This paper proposes the technique of document expansion to provide more knowledge to help single document summarization. A specified document is expanded to a small document set by adding a few neighbor documents close to the document, and then a graph-ranking based algorithm is applied to the expanded document set to extract sentences from the single document, making use of both the within-document relationships between sentences of the specified document and the cross-document relationships between sentences of all documents in the document set. The experimental results on the DUC2002 dataset demonstrate the effectiveness of the proposed approach based on document expansion. The cross-document relationships between sentences in the expanded document set are validated to be very important for single document summarization.
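
The graph-ranking step described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the PageRank-style update, the `cross_weight` factor for cross-document links, the damping value, and the toy similarity matrix are all assumptions chosen for illustration.

```python
# Sketch: PageRank-style sentence ranking over an expanded document set.
# Function names, weights, and data are illustrative assumptions.

def rank_sentences(sim, doc_of, cross_weight=1.0, damping=0.85, iters=100):
    """Return one saliency score per sentence.

    sim[i][j] -- symmetric similarity between sentences i and j
    doc_of[i] -- id of the document that sentence i belongs to
    """
    n = len(sim)
    # Affinity matrix: cross-document links are scaled by cross_weight.
    aff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = sim[i][j]
            if doc_of[i] != doc_of[j]:
                w *= cross_weight
            aff[i][j] = w
    # Row-normalize so each non-empty row sums to 1.
    for row in aff:
        total = sum(row)
        if total > 0:
            for j in range(n):
                row[j] /= total
    # Power iteration: s_i = d * sum_j s_j * aff[j][i] + (1 - d) / n
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            damping * sum(scores[j] * aff[j][i] for j in range(n))
            + (1 - damping) / n
            for i in range(n)
        ]
    return scores


def summarize(sim, doc_of, target_doc, k=1, **kw):
    """Pick the k best-scoring sentences that belong to target_doc."""
    scores = rank_sentences(sim, doc_of, **kw)
    own = [i for i in range(len(sim)) if doc_of[i] == target_doc]
    return sorted(own, key=lambda i: scores[i], reverse=True)[:k]


# Toy data: sentences 0-1 form the target document "d0",
# sentences 2-3 come from a neighbor document "d1".
sim = [
    [0.0, 0.1, 0.8, 0.7],
    [0.1, 0.0, 0.1, 0.0],
    [0.8, 0.1, 0.0, 0.5],
    [0.7, 0.0, 0.5, 0.0],
]
doc_of = ["d0", "d0", "d1", "d1"]
top = summarize(sim, doc_of, "d0", k=1)
```

In this toy data, sentence 0 is strongly linked to the neighbor document's sentences, so the cross-document links raise its score above that of sentence 1. Only sentences from the target document are eligible for the summary.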

Uploaded: 2008-03-24

[Journal article] Towards an Iterative Reinforcement Approach for Simultaneous Document Summarization and Keyword Extraction

Xiaojun Wan, Jianwu Yang, Jianguo Xiao

Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 552–559, Prague, Czech Republic, June 2007

Abstract

Though both document summarization and keyword extraction aim to extract concise representations from documents, these two tasks have usually been investigated independently. This paper proposes a novel iterative reinforcement approach to simultaneously extracting the summary and keywords from a single document under the assumption that the summary and keywords of a document can be mutually boosted. The approach can naturally make full use of the reinforcement between sentences and keywords by fusing three kinds of relationships between sentences and words, either homogeneous or heterogeneous. Experimental results show the effectiveness of the proposed approach for both tasks. The corpus-based approach is validated to work almost as well as the knowledge-based approach for computing word semantics.

Uploaded: 2008-03-24

[Journal article] Person Resolution in Person Search Results: WebHawk

Xiaojun Wan, Jianfeng Gao, Mu Li, Binggong Ding

CIKM’05, October 31-November 5, 2005, Bremen, Germany

Abstract

Finding information about people on the Web using a search engine is difficult because there is a many-to-many mapping between person names and specific persons (i.e. referents). This paper describes a person resolution system, called WebHawk. Given a list of pages obtained by submitting a person query to a search engine, WebHawk facilitates person search in three steps. First, a filter removes those pages that contain no information about any person. Second, a cluster groups the remaining pages into different clusters, each for one specific person. To make the resulting clusters more meaningful, an extractor is used to induce query-oriented personal information from each page. Finally, a namer generates an informative description for each cluster so that users can find any specific person easily. The architecture of WebHawk is presented, and the four components are discussed in detail, with a separate evaluation of each component presented where appropriate. A user study shows that WebHawk complements most existing search engines and successfully improves users’ experience of person search on the Web.

Keywords: Person Resolution, Person Search, Clustering, Junk Filtering

Uploaded: 2008-03-24

[Journal article] Using Proportional Transportation Distances for Measuring Document Similarity

Xiaojun Wan and Jianwu Yang

M. Lalmas et al. (Eds.): ECIR 2006, LNCS 3936, pp. 25–36, 2006.

Abstract

A novel document similarity measure based on the Proportional Transportation Distance (PTD) is proposed in this paper. The proposed measure improves on the previously proposed similarity measure based on optimal matching by allowing many-to-many matching between subtopics of documents. After documents are decomposed into sets of subtopics, the Proportional Transportation Distance is employed to evaluate the similarity between sets of subtopics for two documents by solving a transportation problem. Experiments on TDT-3 data demonstrate its good ability to measure document similarity and also its high robustness, i.e. it does not rely on the underlying document decomposition algorithm as heavily as the optimal matching based measure does.

Uploaded: 2008-03-24

[Journal article] Manifold-Ranking Based Topic-Focused Multi-Document Summarization

Xiaojun Wan, Jianwu Yang and Jianguo Xiao

Abstract

Topic-focused multi-document summarization aims to produce a summary biased to a given topic or user profile. This paper presents a novel extractive approach to this summarization task based on manifold-ranking of sentences. The manifold-ranking process can naturally make full use of both the relationships among all the sentences in the documents and the relationships between the given topic and the sentences. A ranking score is obtained for each sentence in the manifold-ranking process to denote the biased information richness of the sentence. Then a greedy algorithm is employed to impose a diversity penalty on each sentence. The summary is produced by choosing the sentences with both high biased information richness and high information novelty. Experiments on DUC2003 and DUC2005 are performed, and the ROUGE evaluation results show that the proposed approach can significantly outperform both the top performing systems in the DUC tasks and baseline approaches.
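
The two stages named in this abstract, manifold-ranking followed by a greedy diversity penalty, can be sketched roughly as follows. The update rule f = αSf + (1 − α)y is the standard manifold-ranking iteration. The penalty formula, the values of `alpha` and `omega`, and the toy affinity matrix are assumptions for illustration and may differ from what the paper uses.

```python
import math

# Sketch only: parameter values, helper names, and data are hypothetical.

def normalize(W):
    """Symmetric normalization S = D^{-1/2} W D^{-1/2}."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [
        [W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def manifold_rank(S, y, alpha=0.6, iters=200):
    """Iterate f <- alpha * S f + (1 - alpha) * y until (near) convergence."""
    n = len(S)
    f = list(y)
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f


def select_diverse(f, S, candidates, k, omega=8.0):
    """Greedy selection: after each pick, penalize sentences similar to it."""
    current = {i: f[i] for i in candidates}
    chosen = []
    while current and len(chosen) < k:
        best = max(current, key=current.get)
        chosen.append(best)
        del current[best]
        for j in current:
            current[j] -= omega * S[j][best] * f[best]
    return chosen


# Node 0 is the topic description; nodes 1-3 are sentences.
# Sentences 1 and 2 are near-duplicates that both match the topic.
W = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.9, 0.1],
    [0.8, 0.9, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
y = [1.0, 0.0, 0.0, 0.0]  # only the topic node carries prior score
S = normalize(W)
f = manifold_rank(S, y)
summary = select_diverse(f, S, candidates=[1, 2, 3], k=2)
no_penalty = select_diverse(f, S, candidates=[1, 2, 3], k=2, omega=0.0)
```

With the penalty switched off, the two near-duplicate sentences are both selected. With the penalty on, the second pick shifts to the less redundant sentence, which is the trade-off between biased information richness and novelty that the abstract describes.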

Uploaded: 2008-03-24

[Journal article] A novel document similarity measure based on earth mover’s distance

Xiaojun Wan

Information Sciences 177 (2007) 3718–3730

Abstract

In this paper we propose a novel measure based on the earth mover’s distance (EMD) to evaluate document similarity by allowing many-to-many matching between subtopics. First, each document is decomposed into a set of subtopics, and then the EMD is employed to evaluate the similarity between two sets of subtopics for two documents by solving the transportation problem. The proposed measure is an improvement of the previous OM-based measure, which allows only one-to-one matching between subtopics. Experiments have been performed on the TDT3 dataset to evaluate existing similarity measures and the results show that the EMD-based measure outperforms the optimal matching (OM) based measure and all other measures. In addition to the TextTiling algorithm, the sentence clustering algorithm is adopted for document decomposition, and the experimental results show that the proposed EMD-based measure does not rely on the document decomposition algorithm and thus it is more robust than the OM-based measure.

Keywords: Document similarity measure, Document similarity search, Earth mover’s distance, TextTiling, Subtopic structure
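
For reference, the transportation problem behind the earth mover's distance is usually written as below. This is the standard EMD formulation, not a transcription from the paper. The symbols are assumptions: subtopic sets $P=\{(p_i, w_{p_i})\}$ and $Q=\{(q_j, w_{q_j})\}$, ground distances $d_{ij}$ between subtopics, and flows $f_{ij}$. The paper's own choice of weights and ground distance may differ.

```latex
\begin{aligned}
\min_{f}\quad & \sum_{i}\sum_{j} f_{ij}\, d_{ij} \\
\text{s.t.}\quad & f_{ij} \ge 0, \\
& \sum_{j} f_{ij} \le w_{p_i} \quad \text{for all } i, \\
& \sum_{i} f_{ij} \le w_{q_j} \quad \text{for all } j, \\
& \sum_{i}\sum_{j} f_{ij} = \min\Bigl(\sum_{i} w_{p_i},\ \sum_{j} w_{q_j}\Bigr), \\[4pt]
\mathrm{EMD}(P, Q) \;=\; & \frac{\sum_{i}\sum_{j} f^{*}_{ij}\, d_{ij}}{\sum_{i}\sum_{j} f^{*}_{ij}}
\end{aligned}
```

Here $f^{*}$ denotes an optimal flow. Because a single subtopic may send flow to several subtopics of the other document, the matching is many-to-many, in contrast to the one-to-one optimal matching mentioned in the abstract.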

Uploaded: 2008-03-24

[Journal article] Towards a unified approach to document similarity search using manifold-ranking of blocks

Xiaojun Wan, Jianwu Yang, Jianguo Xiao

Information Processing and Management xxx (2007) xxx–xxx

Abstract

Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document in a text corpus or on the Web. Most existing approaches to similarity search first compute the pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. Cosine), and then rank the documents by the similarity scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (i.e. a block of coherent text about a subtopic) to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach can make full use of the intrinsic global manifold structure of the document blocks by propagating the ranking scores between the blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are respectively employed to segment text documents and web pages into blocks. Then, each block is assigned with a ranking score by the manifold-ranking algorithm. Lastly, a document gets its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach can significantly improve the retrieval performances over baseline approaches. Document block is validated to be a better unit than the whole document in the manifold-ranking process.

Keywords: Document similarity search, Web similarity search, Manifold-ranking, Document segmentation, Web page segmentation

Uploaded: 2008-03-24

[Journal article] CollabSum: Exploiting Multiple Document Clustering for Collaborative Single Document Summarizations

Xiaojun Wan, Jianwu Yang and Jianguo Xiao

Abstract

Almost all existing methods summarize each document separately, with no interaction between documents, under the assumption that the documents are independent of each other. This paper proposes a novel framework called CollabSum for collaborative single document summarizations by making use of mutual influences of multiple documents within a cluster context. In this study, CollabSum is implemented by first employing a clustering algorithm to obtain appropriate document clusters and then exploiting a graph-ranking based algorithm for collaborative document summarizations within each cluster. Both the within-document and cross-document relationships between sentences are incorporated in the algorithm. Experiments on the DUC2001 and DUC2002 datasets demonstrate the encouraging performance of the proposed approach. Different clustering algorithms have been investigated, and we find that the summarization performance relies positively on the quality of the document clusters.

Keywords: CollabSum, Single document summarization, Collaborative summarization, Graph-ranking algorithm

Uploaded: 2008-03-24

[Journal article] Using Proportional Transportation Similarity with Learned Element Semantics for XML Document Clustering

Xiaojun Wan, Jianwu Yang

Abstract

[...] XML document similarity by taking into account the semantics between XML elements. The motivation of the proposed approach is to overcome the problems of “under-contribution” and “over-contribution” existing in previous work. The element semantics are learned in an unsupervised way and the Proportional Transportation Similarity is proposed to evaluate XML document similarity by modeling the similarity calculation as a transportation problem. Clustering experiments are performed on three ACM SIGMOD data sets and the results show the favorable performance of the proposed approach.

Keywords: XML document clustering, Proportional Transportation Similarity

Uploaded: 2008-03-24

[Journal article] CM-PMI: Improved Web-based Association Measure with Contextual Label Matching

Xiaojun Wan

Abstract

WebPMI is a popular web-based association measure to evaluate the semantic similarity between two queries (i.e. words or entities) by leveraging search results returned by search engines. This paper proposes a novel measure named CM-PMI to evaluate query similarity at a finer granularity than WebPMI, under the assumption that a query is usually associated with more than one aspect and two queries are deemed semantically related if their associated aspect sets are highly consistent with each other. CM-PMI first extracts contextual labels from search results to represent the aspects of a query, and then uses the optimal matching method to assess the consistency between the aspects of two queries. Experimental results on the benchmark Miller-Charles dataset demonstrate the effectiveness of the proposed CM-PMI measure. Moreover, we further fuse WebPMI and CM-PMI to obtain improved results.

Keywords: Association measure, Web mining, CM-PMI
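
As background for the WebPMI baseline mentioned in this abstract, the sketch below computes pointwise mutual information from search-engine page counts. It does not implement CM-PMI itself. The function name, the `min_cooccurrence` cutoff, the assumed index size, and the hit counts are all hypothetical values chosen for illustration.

```python
import math

# Sketch of a page-count based PMI score; all numbers are made up.

def web_pmi(hits_p, hits_q, hits_pq, total_pages, min_cooccurrence=5):
    """PMI of queries P and Q estimated from page counts.

    hits_p, hits_q -- page counts for each query alone
    hits_pq        -- page count for the conjunctive query "P AND Q"
    total_pages    -- assumed number of pages indexed by the search engine
    """
    # Treat very rare co-occurrence as noise and report no association.
    if hits_pq <= min_cooccurrence or hits_p == 0 or hits_q == 0:
        return 0.0
    p_joint = hits_pq / total_pages
    p_p = hits_p / total_pages
    p_q = hits_q / total_pages
    return math.log2(p_joint / (p_p * p_q))


score = web_pmi(hits_p=1000, hits_q=2000, hits_pq=500, total_pages=1_000_000)
rare = web_pmi(hits_p=1000, hits_q=2000, hits_pq=3, total_pages=1_000_000)
```

A score above zero means the two queries co-occur more often than independence would predict. As the abstract describes, CM-PMI instead compares the sets of contextual labels extracted for each query, which a single co-occurrence count cannot capture.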