Research on Topic Feature Extension Based on Wikipedia
First published: 2018-04-04
Abstract: Automatic text classification is an important research direction in natural language processing and plays an important role in data mining and information retrieval. To address the excessively high dimensionality of the feature space in the traditional vector space model, and the limited feature representation of text classification algorithms based on the LDA topic model, this paper proposes a Wikipedia-based topic feature extension method that reduces dimensionality while strengthening the representation of a text's topic features. Extending text features at the topic-feature level not only lowers the dimensionality of the text features and improves classification efficiency, but also improves classification performance. Classification experiments on the 20Newsgroup and NSF datasets show that the method achieves good classification results.
Keywords: computer application technology; text classification; LDA; Wikipedia; feature extension