Design and Implementation of an In-Site Web Crawler

Ma Dong; Cui Jingjing; Wang Longjiang; Wang Yueping


First published: 2009-09-07

Ma Dong ¹ Cui Jingjing ¹ Wang Longjiang ¹ Wang Yueping ¹

1. School of Computer Science and Technology, China University of Mining and Technology

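Abstract: Search engines have always focused on improving the user experience, which is reflected in three aspects: accuracy, completeness, and speed. In technical terms, these are precision, recall, and search speed (that is, search latency). Search speed is the easiest of the three to achieve: once a system answers in under 1 second, visitors can hardly tell how fast it is, so it is not discussed here. For a Chinese search engine, “accuracy” means that the information the user wants must appear within the first few dozen results; this involves web page ranking and is also outside the scope of this paper. “Completeness” means that no important results are missed and that the newest pages can be found, which requires a powerful web page collector, usually called a “web crawler” or “web spider”. This paper presents a design and implementation of an in-site web crawler, which crawls a wide range of web resources from the Internet according to a given strategy, parses the crawled pages correctly, and stores them in a local page library for retrieval. Such a crawler lays the resource foundation for a topic-based fourth-generation search engine.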

Keywords: web crawler; crawler downloader; crawler updater
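
For reference, the two quality measures named in the abstract have standard information-retrieval definitions. The symbols below are assumptions introduced here, not notation from the paper: $R$ is the set of pages relevant to a query and $A$ is the set of pages the engine returns.

```latex
\[
\text{precision} = \frac{|R \cap A|}{|A|}, \qquad
\text{recall} = \frac{|R \cap A|}{|R|}
\]
```

Under these definitions, a crawler that misses pages can only shrink $R \cap A$, which is why the abstract ties completeness (recall) to the quality of the page collector.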


The Design and Implementation of web crawler in web-station

Ma Dong ¹ Cui Jingjing ¹ Wang Longjiang ¹ Wang Yueping ²

1. School of Computer Science and Technology, China University of Mining and Technology
2. School of Mines, China University of Mining and Technology

Abstract: Search engines have always focused on improving the user experience, which is reflected in three aspects: accuracy, completeness, and speed. In technical terms, these are the precision ratio, the recall ratio, and the search speed (i.e. search time). Search speed is the easiest of the three to achieve: visitors can hardly judge the speed of a system whose search time is below 1 second, so it is not discussed here. For a Chinese search engine, accuracy means that the information the user wants must appear within the first few dozen search results. This concerns web page ranking, so it is not discussed in this paper either. For completeness, no important result may be missed and the latest web pages must be found. This requires the search engine to have a powerful web page collector, usually called a “web crawler” or “web spider”. This paper presents a method for designing and implementing an in-site web crawler. Following a certain strategy, it crawls various network resources from the Internet, parses the crawled web pages correctly, and stores them in a local web page library for searching. This kind of web crawler provides the resource base for realizing a topic-based, fourth-generation search engine.

Keywords: web crawler; crawler downloader; crawler updater
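
The abstract describes the crawler only at a high level (crawl by a strategy, parse each page, store it in a local page library). As a rough illustration, and not the authors' implementation, the following minimal Python sketch assumes a breadth-first strategy restricted to the host of the start URL. The names `crawl_site`, `http_fetch`, `LinkExtractor`, and `page_library` are hypothetical, and a plain dictionary stands in for the local page library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_fetch(url):
    """Hypothetical default downloader: page text, or None on failure."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return None


def crawl_site(start_url, fetch=http_fetch, max_pages=100):
    """Breadth-first crawl limited to the host of start_url.

    Returns a dict mapping each fetched URL to its HTML; the dict is an
    assumed stand-in for the local page library.
    """
    site_host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    page_library = {}

    while queue and len(page_library) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be downloaded
        page_library[url] = html

        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop #fragments.
            absolute, _fragment = urldefrag(urljoin(url, href))
            parsed = urlparse(absolute)
            if parsed.scheme not in ("http", "https"):
                continue  # ignore mailto:, javascript:, and similar links
            if parsed.netloc != site_host:
                continue  # stay inside the site
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return page_library
```

Passing the downloader in as the `fetch` argument keeps the traversal logic separate from network access, so the crawl order and the same-host filter can be exercised against an in-memory set of pages. A real crawler would also need politeness delays, robots.txt handling, and an update policy, none of which are sketched here.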

Citation

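Ma Dong, Cui Jingjing, Wang Longjiang, et al. Design and Implementation of an In-Site Web Crawler [EB/OL]. Beijing: Sciencepaper Online (中国科技论文在线) [2009-09-07]. https://www.paper.edu.cn/releasepaper/content/200909-160.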


Paper No.: 200909-160