Design and Implementation of a Vertical Web Crawler Based on Automatic Information Extraction Technology

ZHANG Jianyu; WANG Hongbo




First published: 2014-11-03

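ZHANG Jianyu ¹
ZHANG Jianyu (b. 1989), male, master's student. Main research interest: computer applications.
WANG Hongbo ¹
WANG Hongbo (b. 1975), male, associate professor. Main research interests: cloud computing, data center networks, and related topics.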

1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876

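Abstract: As the search scope of vertical search engines expands, completing data crawling tasks automatically and efficiently has become an important problem. Most web crawlers today rely on manually defined rules for data extraction, which is inefficient. This paper first designs and optimizes a framework for a web crawler with automatic information extraction. It then analyzes the open-source crawler framework Scrapy in detail and proposes optimizations for the crawling problem; analyzes the automatic template generation algorithm RoadRunner and proposes optimizations for the automatic information extraction problem; and analyzes the Ajax crawling tool Scrapyjs for the problem of crawling Ajax pages. Finally, the crawler is tested on crawling efficiency and extraction accuracy, and the test results and their analysis are presented.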

Keywords: information extraction; RoadRunner; web crawler; Scrapy


Web Crawler based on automatic information extraction technology

ZHANG Jianyu ¹
WANG Hongbo ¹

1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts & Telecommunications, Beijing 100876

Abstract: With the expansion of the scope of vertical search engines, how to accomplish data crawling tasks automatically and efficiently has become an important issue. Most web crawlers use manually defined rules to complete the data extraction work, which is inefficient. This paper carries out a framework design and optimization for a web crawler based on automatic information extraction technology. It then analyzes the details of the open-source web crawler framework Scrapy and gives optimizations for the crawling problem, analyzes the automatic template generation algorithm RoadRunner and gives optimizations for the automatic information extraction problem, and analyzes the Ajax crawling tool Scrapyjs for the problem of crawling Ajax pages. Finally, the performance of this web crawler is tested on crawling efficiency and extraction accuracy, and the results and analysis are presented.

Keywords: information extraction; RoadRunner; web crawler; Scrapy
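
To illustrate the idea behind automatic template generation mentioned in the abstract, the following is a minimal sketch and not the paper's method. It does not implement RoadRunner itself, which also infers optional and repeated structures. It assumes two pages generated from the same template, aligns their token sequences with Python's standard `difflib.SequenceMatcher`, and treats matching tokens as template and mismatching tokens as data fields. The names `tokenize`, `infer_template`, and `FIELD` are hypothetical and chosen only for this example.

```python
import re
from difflib import SequenceMatcher

FIELD = "#FIELD#"  # hypothetical placeholder marking a variable data slot


def tokenize(html):
    """Split HTML into a flat list of tags and non-empty text runs."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def infer_template(page_a, page_b):
    """Align two pages; equal tokens form the template, differences are fields.

    Simplifying assumption: both pages share one template and differ only
    in field values (no optional or repeated blocks).
    """
    a, b = tokenize(page_a), tokenize(page_b)
    template, fields_a, fields_b = [], [], []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Tokens shared by both pages belong to the template.
            template.extend(a[i1:i2])
        else:
            # Any mismatch is treated as one data field per page.
            template.append(FIELD)
            fields_a.append(" ".join(a[i1:i2]))
            fields_b.append(" ".join(b[j1:j2]))
    return template, fields_a, fields_b
```

Given two product pages that differ only in title and price, the sketch would mark those two positions as fields and return the values found on each page. A crawler could then reuse the inferred template on further pages instead of relying on hand-written extraction rules.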


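ZHANG Jianyu, WANG Hongbo. Design and Implementation of a Vertical Web Crawler Based on Automatic Information Extraction Technology [EB/OL]. Beijing: Sciencepaper Online (中国科技论文在线) [2014-11-03]. https://www.paper.edu.cn/releasepaper/content/201411-26.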

No.4615714100933814****


Paper No.: 201411-26