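Research on Automatic Chinese Web Page Classification and the Design and Implementation of a Classification Algorithm
First published: 2007-09-19
Abstract: This paper reviews the development and current state of research on automatic Chinese web page classification, and explains that the web page classification studied here uses automatic text classification methods. It identifies the difficulties and most prominent problems of web page classification. For the classification algorithm, the paper combines KNN, the most accurate classifier under the vector space model (VSM) framework, with Rocchio, the fastest, to design a Rocchio-KNN classification algorithm that first filters categories with the Rocchio method and then refines the classification with KNN. Experiments show that, while maintaining a certain level of classification accuracy, this method greatly improves classification efficiency and can meet the demands of real-time processing of large-scale sample sets. Finally, the paper describes the preliminary work and the system architecture for automatic Chinese web page classification.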
Automatic Chinese Web Page Classification
Abstract: This paper reviews the development and current research status of automatic Chinese web page classification techniques, and clarifies that automatic Chinese web page classification is based on automatic text classification. For the classification algorithm, a Rocchio-KNN algorithm is designed that combines the Rocchio algorithm, the fastest method in the vector space model (VSM), with the KNN algorithm, the most precise method in VSM. Experimental results show that Rocchio-KNN is close to Rocchio in speed and close to KNN in precision. The method greatly improves processing speed while maintaining a certain level of precision, and can therefore meet the demands of real-time processing of large sample sets. Finally, the preparatory work and the framework of the automatic Chinese web page classification system are discussed.
Keywords: Web page content extraction, Automatic text classification, Automatic classification algorithms
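The two-stage idea stated in the abstract (Rocchio for category filtering, then KNN for refinement) can be illustrated with a minimal sketch. This is not the paper's implementation. The abstract does not give the similarity measure, the number of categories kept after filtering, the value of k, or the voting rule, so cosine similarity, `top_classes`, `k`, and similarity-weighted voting are all assumptions here. The Rocchio stage is also simplified to plain class centroids, without negative-example weighting. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroids(train):
    """Simplified Rocchio prototypes: the mean vector of each class."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, label in train:
        for term, weight in vec.items():
            sums[label][term] += weight
        counts[label] += 1
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}


def rocchio_knn(doc, train, protos, top_classes=2, k=3):
    """Classify doc: filter classes by centroid, then run KNN inside them."""
    # Stage 1 (assumed form of the Rocchio filter): keep only the classes
    # whose centroids are most similar to the document.
    ranked = sorted(protos, key=lambda c: cosine(doc, protos[c]), reverse=True)
    keep = set(ranked[:top_classes])
    # Stage 2: KNN restricted to training samples of the surviving classes,
    # so far fewer similarities are computed than in plain KNN.
    neighbours = sorted(
        ((cosine(doc, vec), label) for vec, label in train if label in keep),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, label in neighbours:
        votes[label] += sim  # similarity-weighted vote (an assumption)
    return votes.most_common(1)[0][0] if votes else ranked[0]


# Toy training set of (term-weight vector, category) pairs; purely illustrative.
train = [
    ({"football": 2.0, "match": 1.0}, "sports"),
    ({"goal": 1.0, "match": 2.0}, "sports"),
    ({"stock": 2.0, "market": 1.0}, "finance"),
    ({"bank": 1.0, "market": 2.0}, "finance"),
    ({"film": 2.0, "actor": 1.0}, "entertainment"),
]
protos = centroids(train)
label = rocchio_knn({"match": 1.0, "goal": 1.0}, train, protos)
print(label)
```

In this sketch the speed-up comes from stage 2 comparing the document only against samples from the `top_classes` surviving categories. A smaller `top_classes` makes the method behave more like Rocchio, and a larger one more like plain KNN.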