Design and Implementation of a Customizable Crawler Engine in a Content Aggregation Subsystem
First published: 2017-12-04
Abstract: Under Web 2.0, new media businesses are no longer confined to producing media material themselves; they often obtain it by using a crawler engine to crawl large numbers of media resource websites. This requires the crawler engine to have an elastic architecture. To meet the business requirements of crawling data from media websites, this paper discusses in depth the techniques for building a customizable crawler engine. Borrowing the idea of a rule engine, vertical crawling of a media website can be implemented with a concise rule file alone, which gives the crawler engine good extensibility and flexibility and effectively reduces the cost of developing and maintaining crawlers.
Keywords: web crawler; rule engine; new media business