An Improved K-means Algorithm Based on the Spark Platform
First published: 2017-12-05
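Summary: K-means is a classic clustering algorithm, but it has two shortcomings: the number of clusters K and the initial cluster centers must be specified manually, and the serial implementation performs poorly on massive data sets. To address both, a canopy-Kmeans algorithm is proposed. It runs the canopy algorithm as a preprocessing step for K-means to obtain the initial cluster centers and the value of K, and it is parallelized on the Spark framework so that Spark's in-memory computing can be fully exploited to improve clustering efficiency. Experiments show that canopy-Kmeans improves on both the traditional serial K-means algorithm and the unimproved parallel algorithm in accuracy and efficiency.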
An Improved K-means Algorithm Based on Spark
Abstract: The classical K-means algorithm requires the number of clusters K and the initial cluster centers to be specified manually, and its serial form performs poorly on massive data. To address these problems, a canopy-Kmeans algorithm is proposed. The algorithm introduces the canopy algorithm as a preprocessing step for K-means to obtain the initial cluster centers and the value of K, and uses the Spark framework to parallelize the computation. It takes full advantage of Spark's in-memory computing and improves clustering efficiency. Experiments show that the canopy-Kmeans algorithm achieves higher accuracy and efficiency than the traditional serial K-means algorithm and the unmodified parallel algorithm.
Keywords: clustering algorithm, K-means algorithm, parallelization, Spark
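The idea of using canopy as a preprocessing step for K-means can be illustrated with a minimal serial sketch. This is not the authors' Spark implementation: the function names, the thresholds `t1` (loose) and `t2` (tight), and the choice of taking each canopy's mean as its center are all assumptions made for illustration, and the parallelization on Spark is omitted.

```python
import math


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def canopy_centers(points, t1, t2):
    """Derive initial centers (and thereby K) with a simple canopy pass.

    Assumed thresholds: t1 > t2. Points within t1 of a seed form its canopy;
    points within t2 of the seed are dropped from the candidate list.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        # Canopy members: every point within the loose threshold t1.
        members = [p for p in points if math.dist(seed, p) <= t1]
        centers.append(mean_point(members))
        # Points within the tight threshold t2 can no longer seed a canopy.
        remaining = [p for p in remaining if math.dist(seed, p) > t2]
    return centers


def kmeans(points, centers, max_iter=20):
    """Plain Lloyd iterations starting from the supplied centers."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        updated = [mean_point(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, clusters


def canopy_kmeans(points, t1, t2, max_iter=20):
    """Canopy pass to choose K and initial centers, then K-means."""
    initial = canopy_centers(points, t1, t2)
    return kmeans(points, initial, max_iter)
```

In this sketch K is never passed in; it is simply the number of canopies found, so the result depends on how `t1` and `t2` are chosen for the data at hand.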