Study and Application of Web Crawler Algorithm Based on Heritrix

FAN Xianshuang; LIU Dongfei



First published: 2010-12-03

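FAN Xianshuang ¹
FAN Xianshuang (b. 1985), male, master's student; research field: computer networks
LIU Dongfei ¹

1. School of Computer Science and Technology, Wuhan University of Technology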

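Abstract: This paper first introduces the web crawlers used in search engines and analyzes in detail the system architecture of the open-source web crawler Heritrix. On this basis, it proposes designing dedicated parsers that parse the pages of specific websites in order to achieve customized crawling. Then, by eliminating the effect of the robots.txt file on individual processors and by introducing the ELFHash algorithm, it achieves efficient, multi-threaded crawling of Web resources. Finally, a comparison of page-crawling speed before and after the improvement, together with an analysis of the number of pages crawled in the same amount of time, verifies that the improved crawler performs noticeably better.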

Keywords: computer application; web crawler; Heritrix; ELFHash algorithm


Study And Application Of Web Crawler Algorithm Based On Heritrix

FAN Xianshuang ¹
LIU Dongfei ¹

1. School of Computer Science and Technology, Wuhan University of Technology

Abstract: This paper first introduces the web crawler in search engines. Based on a detailed analysis of the system architecture of the open-source web crawler Heritrix, it proposes the design of a dedicated parser that parses a particular web site to achieve customized crawling. Then, by eliminating the impact of the robots.txt file on individual processors and introducing the ELFHash algorithm, it achieves efficient, multi-threaded crawling of web resources. Finally, a comparison of page-crawling speed before and after the improvement, and an analysis of the number of pages crawled in the same length of time, verify that the performance of the improved crawler has increased noticeably.

Keywords: computer application; web crawler; Heritrix; ELFHash algorithm
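
For readers unfamiliar with the ELFHash algorithm named in the abstract, the following is a minimal sketch of the classic ELF string hash in Java. The abstract does not say how the hash is wired into Heritrix, so the `queueIndex` helper, the class name `ElfHashSketch`, and the idea of mapping a URL key to one of a fixed number of work queues are assumptions made only for illustration; they are not the authors' implementation and not part of the Heritrix API.

```java
// Illustrative sketch only: class and method names are hypothetical.
final class ElfHashSketch {

    // Classic ELF hash over the characters of a string, using 32-bit arithmetic.
    static int elfHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            // Shift the running hash left by four bits and add the next character.
            h = (h << 4) + key.charAt(i);
            // Isolate the top four bits of the 32-bit value.
            int g = h & 0xF0000000;
            if (g != 0) {
                // Fold the top four bits back into the lower part of the hash.
                h ^= g >>> 24;
            }
            // Clear the top four bits, so the result is never negative.
            h &= ~g;
        }
        return h;
    }

    // Assumption: spread URL keys over a fixed number of work queues so that
    // several crawler threads can each be given a share of the work.
    static int queueIndex(String urlKey, int queueCount) {
        return elfHash(urlKey) % queueCount;
    }
}
```

Under this assumption, a call such as `ElfHashSketch.queueIndex(url, 8)` would pick one of eight queues for a URL. Because the hash depends on the whole string rather than only on the host name, URLs from a single large site could end up in different queues.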


FAN Xianshuang, LIU Dongfei. Study and Application of Web Crawler Algorithm Based on Heritrix [EB/OL]. Beijing: Sciencepaper Online (中国科技论文在线) [2010-12-03]. https://www.paper.edu.cn/releasepaper/content/201012-53.


Paper number: 201012-53