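Design and Implementation of a Website-Based Spider System
First published: 2008-09-11
Abstract: With the growth of the Internet and the increasing richness of online information, traditional information processing has extended into the Internet domain. Processing information on the Internet often requires first downloading Web pages scattered across the Internet to local storage for further processing. This is the core function of the Web page collection tool discussed here, the web crawler (spider) system. This paper presents a site-based crawler system: each site is assigned one crawler thread, and multiple crawlers work in parallel to collect pages. Because one crawler thread crawls one site, the system can crawl exactly the sites users care about, which makes the crawler more user-oriented; because the threads work concurrently, crawling efficiency improves. The paper also gives the crawler's detailed workflow, the structure of the URL library, and the related algorithms.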
Design and Implementation of a Website-Based Spider System
Abstract: With the growth of the Internet and the increasing abundance of information on the Web, the Internet has become a new arena for traditional information processing. Before processing Web information, people often download the distributed Web pages to local storage for further processing, which is the core function of the information-gathering system (spider system) described in this paper. This paper introduces a website-based spider system that assigns one spider thread to each website and runs multiple spider threads in parallel to retrieve web pages. It also provides the detailed design and implementation of the system.
Keywords: Search Engine; Information Retrieval; Spider
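The one-thread-per-site arrangement described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `crawl_site` and `crawl_sites`, the breadth-first visiting order, the regex-based link extraction, the `max_pages` limit, and the injected `fetch` callable (which returns a page's HTML or `None`) are all assumptions made for illustration. The abstract specifies only that each site gets its own spider thread and that the threads run in parallel.

```python
import re
import threading
from collections import deque
from urllib.parse import urljoin, urlparse

# Simplistic link extractor (assumption): only double-quoted href attributes.
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl_site(seed, fetch, max_pages=50):
    """Crawl a single site breadth-first, never leaving the seed's host."""
    host = urlparse(seed).netloc
    seen = {seed}            # URLs already queued for this site
    queue = deque([seed])    # per-site frontier of URLs still to fetch
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:     # fetch failed or page missing: skip it
            continue
        pages[url] = html
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)
            # Keep only links that stay on the same site.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def crawl_sites(seeds, fetch, max_pages=50):
    """Start one spider thread per seed site and wait for all of them."""
    results = {}
    lock = threading.Lock()

    def worker(seed):
        pages = crawl_site(seed, fetch, max_pages)
        with lock:           # guard the shared results dictionary
            results[seed] = pages

    threads = [threading.Thread(target=worker, args=(s,)) for s in seeds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; a real crawler would supply a function that downloads the URL and would likely add politeness delays and error handling.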