Research and Implementation of a Forum Information Extraction Method Based on Pattern Recognition
First published: 2013-09-23
Abstract: Drawing on the characteristics of public sentiment mining, this paper proposes and implements a method for extracting the main text from forum web pages. The method has two parts: pattern recognition and text extraction. Pattern recognition first builds a DOM tree for the page and labels each node with its subtree (child) number, then removes the navigation bar, and finally locates the main-text segment and extracts its pattern recursively. Text extraction then applies the pattern obtained by pattern recognition to retrieve the main text of other pages. The method requires no human intervention, reaches an extraction accuracy of 98% on forums, and is highly efficient. A forum information extraction tool built on this method satisfies the requirements of a public sentiment mining system.
Keywords: forum information extraction, pattern recognition, public sentiment mining
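The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the link-density rule for removing navigation, the "longest direct text" rule for finding the main text, the (tag, class) form of the pattern, and every name below (Node, TreeBuilder, strip_navigation, learn_pattern, extract, SAMPLE, OTHER) are assumptions made for illustration only.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children, self.text = [], ""
        # Child number: position among the parent's children at build time.
        # Recorded here, but not used by the simplified pattern below.
        self.index = len(parent.children) if parent else 0
        if parent:
            parent.children.append(self)

class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (hypothetical helper, stdlib only)."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        if tag not in VOID:
            self.cur = node
    def handle_endtag(self, tag):
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent          # tolerate unclosed inner tags
        if n is not self.root:
            self.cur = n.parent
    def handle_data(self, data):
        self.cur.text += data.strip()

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def text_len(node):
    return sum(len(n.text) for n in walk(node))

def link_len(node):
    if node.tag == "a":
        return text_len(node)
    return sum(link_len(c) for c in node.children)

def strip_navigation(node, max_link_ratio=0.6):
    # Assumed heuristic: a subtree whose text is mostly link text is navigation.
    kept = []
    for child in node.children:
        total = text_len(child)
        if total and link_len(child) / total > max_link_ratio:
            continue
        strip_navigation(child, max_link_ratio)
        kept.append(child)
    node.children = kept

def pattern_of(node):
    # Recursively climb to the root, collecting one (tag, class) step per level.
    if node.parent is None:
        return []
    return pattern_of(node.parent) + [(node.tag, node.attrs.get("class"))]

def learn_pattern(html):
    root = parse(html)
    strip_navigation(root)
    main = max(walk(root), key=lambda n: len(n.text))  # assumed main-text rule
    return pattern_of(main)

def extract(node, pattern):
    # Follow the pattern from the root and collect the text of every match.
    if not pattern:
        return [node.text]
    tag, cls = pattern[0]
    found = []
    for child in node.children:
        if child.tag == tag and child.attrs.get("class") == cls:
            found.extend(extract(child, pattern[1:]))
    return found

SAMPLE = """<html><body>
<div class="nav"><a href="/">Home</a><a href="/f">Forum</a></div>
<div class="post"><p class="msg">First post body, long enough to win.</p></div>
<div class="post"><p class="msg">Reply.</p></div>
</body></html>"""

OTHER = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<div class="post"><p class="msg">Another thread, first post.</p></div>
<div class="post"><p class="msg">Another reply.</p></div>
</body></html>"""

pattern = learn_pattern(SAMPLE)          # stage 1: pattern recognition
posts = extract(parse(OTHER), pattern)   # stage 2: text extraction
print(pattern)
print(posts)
```

Because the pattern is learned once from a sample page and then replayed on other pages of the same forum, the second stage is a single tree walk per page, which is consistent with the abstract's emphasis on efficiency and on running without human intervention.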