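Research and Application of a Web Crawler Algorithm Based on Heritrix
First published: 2010-12-03
Abstract: This paper first introduces the web crawlers used in search engines and analyzes in detail the system architecture of the open-source web crawler Heritrix. On that basis, it proposes designing a dedicated parser that parses the pages of a specific website in order to achieve customized crawling. It then eliminates the effect of the robots.txt file on individual processors and introduces the ELFHash algorithm to achieve efficient, multi-threaded crawling of web resources. Finally, by comparing page-crawling speed before and after the improvement, and by analyzing the number of pages crawled in the same amount of time, it verifies that the improved crawler performs noticeably better.
Keywords: computer applications; web crawler; Heritrix; ELFHash algorithm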
Study and Application of a Web Crawler Algorithm Based on Heritrix
Abstract: This paper first introduces the web crawlers used in search engines. Based on a detailed analysis of the system architecture of the open-source web crawler Heritrix, it proposes a dedicated parser that parses a particular website to achieve customized crawling. Then, by eliminating the impact of the robots.txt file on individual processors and by introducing the ELFHash algorithm, it achieves efficient, multi-threaded crawling of web resources. Finally, a comparison of page-crawling speed before and after the improvement, together with an analysis of the number of pages crawled in the same period of time, verifies that the performance of the improved crawler has increased noticeably.
Keywords: computer application; web crawler; Heritrix; ELFHash algorithm
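
The abstract names the ELFHash algorithm without showing it. As a rough illustration only, the following is a minimal sketch of the classic ELF string hash in Java (the language Heritrix is written in). The class name `ElfHashSketch`, the method names, and the idea of reducing the hash modulo a fixed number of buckets to spread URLs across work queues are assumptions made for this example; the abstract does not state how the paper wires the hash into Heritrix.

```java
// Minimal sketch of the classic ELF hash (hypothetical class and method names).
class ElfHashSketch {

    // Computes the ELF hash of a string. The result always fits in 28 bits.
    static long elfHash(String s) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            // Shift in the next character.
            hash = (hash << 4) + s.charAt(i);
            // If any of the top four bits (of a 32-bit value) are set,
            // fold them back into the low bits and then clear them.
            long high = hash & 0xF0000000L;
            if (high != 0) {
                hash ^= (high >> 24);
            }
            hash &= ~high;
        }
        return hash;
    }

    // Assumption for illustration: map a URL string to one of `buckets` queues.
    static int bucketFor(String url, int buckets) {
        return (int) (elfHash(url) % buckets);
    }

    public static void main(String[] args) {
        String url = "http://example.com/index.html"; // placeholder URL
        System.out.println(elfHash(url));
        System.out.println(bucketFor(url, 100));
    }
}
```

Hashing the full URL string rather than grouping by host name would spread the pages of a single site across many buckets, which is one plausible way to keep multiple crawl threads busy on one site; whether the paper does exactly this is not stated in the abstract.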