Design and Implementation of an Extraction-Failure Warning Scheme for Web Crawlers
First published: 2015-12-07
Abstract: With the development and spread of the Internet, it has become an indispensable part of daily life. As the volume of data on the Internet grows, vertical aggregation for specific domains has emerged, and vertical aggregation of video content is one important case. This paper studies a problem that crawler systems for video content aggregation frequently encounter: a source website is redesigned, its page layout changes, and the crawler's extraction stops working. The paper proposes a scheme for detecting such redesigns. The scheme records data while the crawler system runs and consolidates that data after the run finishes. It provides targeted handling for three symptoms of failed crawling and extraction: missing series, missing episodes, and incomplete field extraction. Experiments show that the scheme is effective to some extent.
Keywords: crawler system; extraction failure; page revision; warning scheme
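
The abstract names three symptoms that the scheme counts after a crawl: missing series, missing episodes, and incomplete field extraction. The following is a minimal sketch of how such a post-run comparison might look, not the paper's implementation. The `CrawlRun` record, the `check_run` function, the comparison against a previous run, and both threshold values are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CrawlRun:
    """Statistics recorded during one crawl of a source site (hypothetical layout)."""
    series_episodes: dict  # series id -> number of episodes extracted
    field_fill: dict       # field name -> share of records with a non-empty value


def check_run(baseline, current, series_drop=0.2, field_drop=0.3):
    """Compare the current run with an earlier one and return (kind, detail) warnings.

    series_drop and field_drop are illustrative thresholds, not values from the paper.
    """
    warnings = []

    # Missing series: too many series seen in the earlier run are absent now.
    missing = set(baseline.series_episodes) - set(current.series_episodes)
    total = len(baseline.series_episodes)
    if total and len(missing) / total > series_drop:
        warnings.append(("missing_series", sorted(missing)))

    # Missing episodes: a series is still present but its episode count shrank.
    for series_id in sorted(current.series_episodes):
        old_count = baseline.series_episodes.get(series_id)
        new_count = current.series_episodes[series_id]
        if old_count is not None and new_count < old_count:
            warnings.append(("missing_episodes", (series_id, old_count, new_count)))

    # Incomplete fields: the fill rate of a field dropped sharply.
    for name in sorted(baseline.field_fill):
        old_rate = baseline.field_fill[name]
        new_rate = current.field_fill.get(name, 0.0)
        if old_rate - new_rate > field_drop:
            warnings.append(("incomplete_field", (name, old_rate, new_rate)))

    return warnings
```

A run that matches its baseline yields no warnings. A redesign of the source site would typically show up as several warnings of one kind at once, which is what distinguishes it from ordinary content churn.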