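Design and Implementation of a Vertical Web Crawler Based on Automated Information Extraction Technology
First published: 2014-11-03
Abstract: As the search scope of vertical search engines expands, completing data crawling tasks automatically and efficiently has become an important problem. Most current web crawlers rely on manually defined rules to extract data, which is inefficient. This paper first designs and optimizes a framework for a web crawler with automated information extraction. For the page-crawling problem, it analyzes the open-source web crawling framework Scrapy in detail and proposes an optimization scheme; for automated information extraction, it analyzes the automatic template generation algorithm RoadRunner and proposes an optimization scheme; for crawling Ajax pages, it analyzes the Ajax crawling tool Scrapyjs. Finally, the crawler is tested in terms of crawling efficiency and extraction accuracy, and the test results and analysis are presented.
Keywords: information extraction; RoadRunner; web crawler; Scrapy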
Web Crawler Based on Automatic Information Extraction Technology
Abstract: With the expansion of the scope of vertical search engines, how to accomplish data crawling tasks automatically and efficiently has become an important issue. Most web crawlers use manually defined rules to extract data, which is inefficient. This paper first presents the framework design and optimization of a web crawler based on automatic information extraction technology. It then analyzes the open-source web crawler framework Scrapy in detail and gives an optimization for the crawling problem, analyzes the automatic template generation algorithm RoadRunner and gives an optimization for the automatic information extraction problem, and analyzes the Ajax crawling tool Scrapyjs for the problem of crawling Ajax pages. Finally, the performance of the web crawler is tested for crawling efficiency and extraction accuracy, and the results are presented.
Keywords: information extraction; RoadRunner; web crawler; Scrapy