
Xiaojun Wan (万小军)


Journal article

Towards a unified approach to document similarity search using manifold-ranking of blocks

Xiaojun Wan (万小军), Jianwu Yang, Jianguo Xiao

Information Processing and Management xxx (2007) xxx–xxx


Abstract / Description

Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document from a text corpus or the Web. Most existing approaches first compute a pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. cosine), and then rank the documents by those scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (a block being a span of coherent text about one subtopic) to re-rank a small set of documents initially retrieved by an existing retrieval function. The proposed approach exploits the intrinsic global manifold structure of the document blocks by propagating ranking scores between blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are employed to segment text documents and web pages, respectively, into blocks. Then, each block is assigned a ranking score by the manifold-ranking algorithm. Finally, each document obtains its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach significantly improves retrieval performance over baseline approaches. The document block is also shown to be a better unit than the whole document in the manifold-ranking process.

Keywords: Document similarity search; Web similarity search; Manifold-ranking; Document segmentation; Web page segmentation
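
The pipeline outlined in the abstract (score propagation between blocks on a weighted graph, then fusion of block scores per document) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the blocks are already segmented (TextTiling and VIPS are not reproduced here), uses cosine similarity as the edge weight, applies the standard manifold-ranking update f ← αSf + (1 − α)y with S = D^(-1/2) W D^(-1/2), and fuses with a simple maximum. The abstract does not state the fusion rule or the value of α, so `fuse`, `alpha=0.5`, and the toy blocks are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def manifold_rank(blocks, query_idx, alpha=0.5, iters=100):
    """Iterate f <- alpha * S f + (1 - alpha) * y over the block graph."""
    n = len(blocks)
    # Affinity matrix W with a zero diagonal (no self-loops).
    W = [[cosine(blocks[i], blocks[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    # Symmetric normalisation S = D^(-1/2) W D^(-1/2); isolated nodes get zeros.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j])
          if deg[i] > 0 and deg[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    # y marks the query blocks as the score sources.
    y = [1.0 if i in query_idx else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

def fuse(block_scores, block_doc):
    """Assumed fusion rule: a document's score is the max of its block scores."""
    doc_scores = {}
    for score, doc in zip(block_scores, block_doc):
        doc_scores[doc] = max(doc_scores.get(doc, 0.0), score)
    return doc_scores

# Toy data (hypothetical): one query block and one block each for docs A, B, C.
blocks = [
    {"apple": 1.0, "fruit": 1.0},   # query block
    {"apple": 1.0, "pie": 1.0},     # doc A: shares "apple" with the query
    {"pie": 1.0, "recipe": 1.0},    # doc B: linked to the query only via doc A
    {"car": 1.0, "engine": 1.0},    # doc C: unrelated to every other block
]
block_doc = ["Q", "A", "B", "C"]

scores = fuse(manifold_rank(blocks, query_idx={0}), block_doc)
ranking = sorted((d for d in scores if d != "Q"),
                 key=lambda d: scores[d], reverse=True)
print(ranking)
```

In this toy graph, doc B shares no term with the query, yet it still receives a positive score through its link to doc A, while the disconnected doc C receives none. This is the kind of global structure that a purely pairwise similarity measure would miss.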

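[Disclaimer] All content below was uploaded by [Xiaojun Wan (万小军)] on [March 24, 2008, 15:02:59]; copyright belongs to the original author. This article represents only the author's personal views and is unrelated to this website. This website remains neutral regarding the statements, views, and judgments in the text and provides no express or implied warranty as to the accuracy, reliability, or completeness of its content. Readers should treat it as reference only and assume full responsibility themselves.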
