A Deep Web Collection Method for Mobile Application Stores Based on Category Keyword Search
First published: 2017-07-21
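Abstract: With the rapid development of the mobile Internet and its entry into the era of big data, demand for mobile application data analysis has grown markedly, which in turn places higher requirements on mobile application information collection. Because the number of applications is so large, mobile application stores currently expose only part of their application information on static pages reachable through hyperlinks, while hiding a large amount of information in the Deep Web behind query forms. As a result, existing crawling strategies achieve a low completeness rate for the application information they collect. To address this challenge, this paper proposes a collection method based on application category keyword search, which uses an incremental crawling strategy to improve the completeness rate and completion efficiency of information collection from mobile application stores. First, a vertical crawler obtains application information from the category pages that can be reached by following links. Then the TF-IDF algorithm extracts keywords representing each application category from application names and descriptions. Finally, a keyword-query-based collection method performs incremental crawling. Experiments on 10 mobile application stores covering more than 10 categories show that the method achieves a high completeness rate and high collection efficiency.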
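The keyword extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each category's application names and descriptions are concatenated into a single document, uses a plain tf × log(N/df) weighting, and tokenizes with a simple regular expression. The function names `tokenize` and `category_keywords` and the sample data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens are enough for this illustration.
    return re.findall(r"[a-z]+", text.lower())


def category_keywords(apps_by_category, top_k=3):
    """Rank terms per category by TF-IDF, treating each category as one document."""
    # Term counts per category (names and descriptions merged together).
    term_counts = {
        cat: Counter(tok for text in texts for tok in tokenize(text))
        for cat, texts in apps_by_category.items()
    }
    n_docs = len(term_counts)

    # Document frequency: in how many categories does each term occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    keywords = {}
    for cat, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[cat] = ranked[:top_k]
    return keywords


# Hypothetical sample data standing in for crawled Surface Web records.
sample = {
    "games": ["Puzzle Quest a fun puzzle game", "Racing game for fast cars"],
    "finance": ["Budget tracker for money", "Stock market app for money and budget"],
}
top_terms = category_keywords(sample, top_k=2)
```

Terms that occur in every category (here "for") receive a weight of zero, so they never outrank category-specific terms. This is the property that makes TF-IDF a reasonable choice for picking query keywords.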
A Deep Web Collection Method for Mobile Application Stores Based on Category Keyword Queries
Abstract: With the rapid development of the mobile Internet, which has entered the era of big data, the demand for analysis of mobile app information has become increasingly apparent, raising the bar for mobile app information collection. Because the number of applications is so large, almost all third-party app stores show only a small part of their applications on pages reachable through hyperlinks, while the majority of the information is hidden in Deep Web databases behind query forms. Existing crawling strategies cannot meet this demand. To address these challenges, this paper proposes a collection method based on category keyword queries to improve the completeness and completion efficiency of information collection from mobile app stores. First, a vertical crawler collects information from the Surface Web. Next, the TF-IDF algorithm extracts keywords that represent each application category from application names and descriptions. Finally, Deep Web information is crawled incrementally through keyword queries. Results show that this collection method effectively improves information completeness and collection efficiency.
Keywords: Deep Web; TF-IDF algorithm; incremental crawling
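The incremental crawling step can be sketched as a loop that submits each category keyword to a store's search form and keeps only the applications not already collected. This is an illustrative sketch, not the paper's crawler: the `search` callable stands in for a real query-form request, and `incremental_crawl`, `make_fake_search`, and the in-memory store are hypothetical names and data.

```python
def incremental_crawl(seed_apps, keywords, search):
    """Extend an existing collection with apps found through keyword queries.

    seed_apps: dict mapping app id -> record (e.g. from the Surface Web crawl).
    keywords:  iterable of query keywords (e.g. top TF-IDF terms per category).
    search:    callable taking a keyword and yielding (app_id, record) pairs;
               assumed here to wrap the store's search form.
    """
    collected = dict(seed_apps)  # copy so the seed collection is not mutated
    new_per_keyword = {}
    for keyword in keywords:
        added = 0
        for app_id, record in search(keyword):
            if app_id not in collected:  # keep only previously unseen apps
                collected[app_id] = record
                added += 1
        new_per_keyword[keyword] = added
    return collected, new_per_keyword


def make_fake_search(store):
    # Hypothetical stand-in for a store's query form: substring match on names.
    def search(keyword):
        for app_id, name in store.items():
            if keyword in name.lower():
                yield app_id, name
    return search


# Hypothetical store: app 1 is reachable by links, the rest only by search.
store = {1: "Puzzle Quest", 2: "Puzzle Mania", 3: "Budget Pro", 4: "Hidden Budget Puzzle"}
seed = {1: "Puzzle Quest"}
collected, gains = incremental_crawl(seed, ["puzzle", "budget"], make_fake_search(store))
```

Recording how many new applications each keyword contributes makes it possible to stop querying once the gains fall off, which is one plausible way to reason about completion efficiency.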