Research and Design of a Distributed Crawler Supporting AJAX
First published: 2015-12-21
Abstract: With the arrival of the Web 2.0 era, dynamic technologies represented by AJAX have been widely adopted. Traditional crawlers cannot retrieve the complete content of pages on websites that use AJAX asynchronous communication and dynamic loading, which severely limits what they can collect. In addition, as websites gain features and grow in scale, the crawl speed and other performance characteristics of a single-machine crawler no longer meet requirements. To address these problems, this paper studies and designs a distributed crawler that supports AJAX. It embeds a browser kernel to parse and execute JavaScript/AJAX and thereby obtain more page content, and it uses a distributed architecture to extend computing resources and improve the crawler's efficiency and stability.
Keywords: Web Crawler; Distributed Technology; AJAX; WebKit