An Improved LDA Model for Microblogs
First published: 2012-12-07
Abstract: Microblog posts are short, so their text representations are high-dimensional and sparse, and topic models have been widely studied for microblog text clustering. The Author Topic Model (ATM), an effective extension of the popular topic model Latent Dirichlet Allocation (LDA), has also been used for microblog text mining. Applied to microblogs, however, ATM has two drawbacks: first, the words in a document can only be generated by a single author; second, it does not take into account the internal structural information inherent in microblog text. To address these two problems, ATM is improved and a new model, User and Link extended LDA (ULLDA), is proposed. Experiments on the NLPIR dataset confirm that the improved model can be applied effectively to microblog text mining and that its performance improves on that of ATM.
Keywords: data mining; Latent Dirichlet Allocation (LDA); Gibbs sampling