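Research on a High-Utility Itemset Mining Algorithm Based on an Improved Dataset Structure
First published: 2019-11-20
Abstract: High-utility itemset mining (HUIM) is one of the important tasks in data mining. Compared with frequent itemset mining (FIM), HUIM weighs both quantity and profit to find suitable itemsets, rather than considering quantity alone, and so applies to a wider range of scenarios. One-phase HUIM algorithms based on the utility-list structure are currently among the most effective, because they can mine high-utility itemsets (HUIs) directly without generating candidates. However, creating and maintaining many utility-list structures consumes substantial time and memory, especially on large, dense datasets. To address this problem, this paper proposes a new algorithm, efficient high-utility itemset mining based on a novel data structure (EIM-DS). In EIM-DS, the dataset is reorganized with a new data structure, which makes it possible to mine all high-utility itemsets effectively while reducing memory use during mining. The algorithm also introduces two new pruning strategies, extension-set pruning and local TWU pruning, which considerably shrink the search space. Results on dense and sparse datasets show that EIM-DS runs in less time and consumes less memory.
Keywords: data mining, high-utility data mining, pattern mining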
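The contrast between frequency and utility drawn in the abstract can be illustrated with a minimal sketch. The toy database, the profit table, and the function names below are assumptions made for illustration only; they are not taken from the paper and do not represent the EIM-DS data structure.

```python
from typing import Dict, Iterable, List

Transaction = Dict[str, int]  # item -> purchased quantity (hypothetical layout)

# Hypothetical unit profits and transactions, chosen only for illustration.
PROFIT: Dict[str, int] = {"a": 5, "b": 2, "c": 1}
DB: List[Transaction] = [
    {"a": 1, "b": 2},
    {"b": 4, "c": 3},
    {"a": 2, "c": 1},
    {"b": 1, "c": 2},
]


def support(itemset: Iterable[str], db: List[Transaction]) -> int:
    """Number of transactions containing every item of the itemset (the FIM view)."""
    items = set(itemset)
    return sum(1 for t in db if all(i in t for i in items))


def utility(itemset: Iterable[str], db: List[Transaction],
            profit: Dict[str, int]) -> int:
    """Sum of quantity * unit profit over the transactions containing the itemset."""
    items = set(itemset)
    return sum(
        sum(t[i] * profit[i] for i in items)
        for t in db
        if all(i in t for i in items)
    )


if __name__ == "__main__":
    for itemset in (["a"], ["b"], ["b", "c"]):
        print(itemset, support(itemset, DB), utility(itemset, DB, PROFIT))
```

In this toy data, item b appears in more transactions than item a, yet a yields the higher utility, which is the kind of itemset a purely frequency-based method would rank lower.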
EIM-DS: Efficient high-utility itemset mining based on a novel data structure
Abstract: High-utility itemset mining (HUIM) is an important task in data mining. Compared to frequent itemset mining (FIM), HUIM considers both quantity and profit, rather than frequency alone, to reveal the most profitable products. One-phase HUIM algorithms based on the utility-list structure have been shown to be among the most efficient, since they can mine high-utility itemsets (HUIs) without generating candidates. However, storing itemset information in utility lists is time-consuming and memory-consuming, especially on dense datasets with long transactions. To address this problem, this paper proposes a novel HUIM algorithm called efficient high-utility itemset mining based on a novel data structure (EIM-DS). In EIM-DS, a novel data structure is designed by reorganizing the transaction database in order to find all HUIs effectively and reduce memory usage during the depth-first search. Based on this data structure, the paper proposes the extension utility and the local TWU utility and uses them as upper bounds, which greatly reduce the search space in both width and depth, especially on dense datasets. Experimental results on dense and sparse benchmark datasets show that the proposed EIM-DS outperforms state-of-the-art algorithms for mining HUIs in terms of running time and memory usage.
Keywords: data mining, high-utility itemset mining, pattern mining
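As background for the upper bounds mentioned in the abstract, the following minimal sketch shows the classic, global transaction-weighted utility (TWU) bound used for pruning in HUIM. It is not the extension utility or the local TWU utility of EIM-DS, whose definitions are not given in the abstract. The toy data, the threshold, and all function names are assumptions made for illustration.

```python
from typing import Dict, List

Transaction = Dict[str, int]  # item -> purchased quantity (hypothetical layout)

# Hypothetical unit profits and transactions, chosen only for illustration.
PROFIT: Dict[str, int] = {"a": 5, "b": 2, "c": 1}
DB: List[Transaction] = [
    {"a": 1, "b": 2},
    {"b": 4, "c": 3},
    {"a": 2, "c": 1},
    {"b": 1, "c": 2},
]


def transaction_utility(t: Transaction, profit: Dict[str, int]) -> int:
    """Total utility of one transaction: sum of quantity * unit profit."""
    return sum(q * profit[i] for i, q in t.items())


def twu(item: str, db: List[Transaction], profit: Dict[str, int]) -> int:
    """TWU of an item: sum of the utilities of all transactions containing it.

    No itemset containing the item can have a utility above this value,
    so it is a safe upper bound for pruning.
    """
    return sum(transaction_utility(t, profit) for t in db if item in t)


def promising_items(db: List[Transaction], profit: Dict[str, int],
                    min_util: int) -> List[str]:
    """Keep only the items whose TWU reaches the minimum utility threshold."""
    items = {i for t in db for i in t}
    return sorted(i for i in items if twu(i, db, profit) >= min_util)


if __name__ == "__main__":
    print({i: twu(i, DB, PROFIT) for i in sorted(PROFIT)})
    print(promising_items(DB, PROFIT, min_util=22))
```

With the assumed threshold of 22, item a is discarded before any search begins, because no itemset containing it can reach the threshold. Tighter bounds of the kind the abstract describes would prune more than this global bound does.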