Research on Topic Web Crawlers and Their Implementation in C#
First published: 2008-11-19
Abstract: Starting from a comparison of the requirements and implementation mechanisms of general-purpose web crawlers and topic web crawlers, this paper studies how several page-crawling strategies differ in performance and discusses which strategies and algorithms are better suited to a topic web crawler, chiefly the Fish-Search and Shark-Search algorithms. Drawing on a study of the crawler implementation process, the techniques involved, and the efficiency of different crawling schemes, it then proposes an architecture and method for implementing a topic web crawler and describes how to implement this crawler in C#. The crawler can run as multiple cooperating processes or across multiple machines, and it takes web server load and robots.txt into account while still achieving high crawling efficiency. It can be used in a variety of data and information systems, including vertical (focused) search engines and topic-specific data collection systems.
Keywords: topic crawler; C#; Web mining
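To make the Fish-Search idea named in the abstract concrete, the following is a minimal C# sketch of its ordering rule only, not the paper's implementation. It is a simplified reading of the algorithm: links found on a relevant page receive the full depth budget and move to the front of the queue, while links found on an irrelevant page receive a reduced budget and go to the back, so unproductive branches die out. The width limit and potential scores of the full algorithm are omitted, and nothing is downloaded. The names `FishSearchSketch`, `Crawl`, `links`, `isRelevant`, and `initialDepth` are hypothetical, and the in-memory link map stands in for real page fetching and link extraction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified sketch of Fish-Search ordering over an in-memory link graph.
public static class FishSearchSketch
{
    // Returns URLs in the order they would be visited.
    // links:        assumed map from a URL to the links found on that page
    // isRelevant:   assumed stand-in for the topic relevance test
    // initialDepth: depth budget D given to the seed and to children of relevant pages
    public static List<string> Crawl(
        string seed,
        IReadOnlyDictionary<string, string[]> links,
        Func<string, bool> isRelevant,
        int initialDepth)
    {
        var frontier = new LinkedList<(string Url, int Depth)>();
        var seen = new HashSet<string> { seed };
        var visited = new List<string>();
        frontier.AddFirst((seed, initialDepth));

        while (frontier.Count > 0)
        {
            var (url, depth) = frontier.First.Value;
            frontier.RemoveFirst();
            visited.Add(url);

            // A page whose budget is exhausted is visited but not expanded.
            if (depth <= 0 || !links.TryGetValue(url, out var children))
                continue;

            bool relevant = isRelevant(url);
            // Relevant pages reset the budget; irrelevant pages spend one unit of it.
            int childDepth = relevant ? initialDepth : depth - 1;

            // Keep only links that have not been queued before.
            var fresh = new List<string>();
            foreach (var child in children)
                if (seen.Add(child)) fresh.Add(child);

            if (relevant)
            {
                // Children of relevant pages jump the queue, keeping their page order.
                for (int i = fresh.Count - 1; i >= 0; i--)
                    frontier.AddFirst((fresh[i], childDepth));
            }
            else
            {
                foreach (var child in fresh)
                    frontier.AddLast((child, childDepth));
            }
        }
        return visited;
    }
}
```

A caller would pass a seed URL, the link map, and a relevance predicate, for example `FishSearchSketch.Crawl("a", links, relevantSet.Contains, 2)`. Shark-Search keeps the same frontier structure but, roughly speaking, replaces the binary relevance test with a continuous score, so a sorted priority queue would take the place of the front/back insertion used here.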