Design and Implementation of an In-Site Web Crawler
First published: 2009-09-07
Abstract: Search engines have always focused on improving the user experience, which is reflected in three aspects: accuracy, completeness, and speed. In technical terms, these are precision, recall, and search speed (that is, search time). Search speed is the easiest of the three to achieve: once a system answers in under 1 second, visitors can hardly tell how fast it is, so it is not discussed here. For a Chinese search engine, accuracy means that the information the user wants must appear within the first few dozen results; this is a matter of web page ranking and is also outside the scope of this paper. Completeness means that no important results are missed and that the newest web pages can be found. This requires the search engine to have a powerful web page collector, usually called a “web crawler” or “web spider”. This paper presents a design and implementation method for an in-site web crawler. Following a defined strategy, the crawler fetches various network resources from the Internet, parses the fetched pages correctly, and stores them in a local web page library for retrieval. Such a crawler provides the resource base for a topic-based, fourth-generation search engine.
Keywords: web crawler; crawler downloader; crawler updater
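The abstract describes the crawler only at a high level: fetch pages, parse them, and store them in a local page library. The following is a minimal sketch of that loop, not the paper's implementation. It assumes a breadth-first strategy, treats "in-site" as "same host as the start URL", and uses an in-memory dict as a stand-in for the page library. The names `LinkExtractor`, `fetch_page`, and `crawl_site` are hypothetical and chosen for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Default downloader (assumption: pages decode as UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_site(start_url, fetch=fetch_page, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a dict mapping each visited URL to its HTML; the dict
    stands in for the local page library (an assumption of this sketch).
    """
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    library = {}
    while queue and len(library) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        library[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Keep only unseen links that stay on the same host.
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return library
```

Passing the downloader in as the `fetch` parameter keeps the traversal logic testable without network access. A real crawler would also need politeness delays, robots.txt handling, and persistent storage, none of which are shown here.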