Research and Implementation of a Forum Information Extraction Method Based on Pattern Recognition

Wang Huanzhan; Xin Yang

First published: 2013-09-23

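Wang Huanzhan ¹
Wang Huanzhan (b. 1989), male, master's degree; research interests: network and information security, content and system security
Xin Yang ¹
Xin Yang (b. 1977), male, associate professor; research interests: mobile communication network security, computer network security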

1. School of Computer Science, Beijing University of Posts and Telecommunications

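Abstract: Based on the characteristics of public sentiment mining, this paper proposes and implements a method for extracting main-text information from forum web pages. The method has two parts: pattern recognition and text extraction. Pattern recognition first builds a DOM tree for the page and annotates it with subtree numbers, then removes the navigation bar, and finally locates the main-text segment and recursively extracts the pattern. Text extraction uses the pattern produced by pattern recognition to retrieve the main text. The method requires no human intervention, can reach an extraction accuracy of 98% on forums, and is highly efficient. A forum information extraction tool built on this method meets the needs of a public sentiment mining system.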

Keywords: forum information extraction; pattern recognition; public sentiment mining

Research and implementation of forum information extraction based on pattern recognition

Wang Huanzhan ¹
Xin Yang ¹

1. School of Computer Science, Beijing University of Posts and Telecommunications

Abstract: Based on the characteristics of public sentiment mining, this paper proposes and implements an approach to extracting the main text from forum pages. The approach has two parts: pattern recognition and text extraction. In the first part, we build a DOM tree for the page and annotate it with subtree numbers, then remove the navigation section, and finally locate the main text and derive the pattern recursively. In the second part, we use the pattern to extract the main text of other pages. The approach requires no human intervention, can reach an extraction accuracy of 98% on forums, and is efficient. A forum information extraction tool built on this approach meets the needs of a public sentiment mining system.

Keywords: forum information extraction, pattern recognition, public sentiment mining
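
The abstract describes the pipeline only at a high level, so the following Python sketch is an illustration under stated assumptions, not the authors' implementation. It builds a simplified DOM tree with numbered children, drops navigation-like subtrees, locates the main-text node, derives a pattern recursively as a root-to-node path, and reuses that pattern on a second page. The link-text ratio used to detect navigation, the "longest own text" rule for the main text, and every name here (`Node`, `learn_pattern`, `extract`, the sample pages) are hypothetical choices made for this example; the abstract does not specify them.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One DOM element; `index` is its number among its parent's children."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        self.index = len(parent.children) if parent else 0
        if parent:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (assumes reasonably well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_len(node, inside_link=False):
    """Return (total text length, text length inside <a>) for a subtree."""
    inside_link = inside_link or node.tag == "a"
    total = len(node.text)
    linked = total if inside_link else 0
    for child in node.children:
        sub_total, sub_linked = text_len(child, inside_link)
        total, linked = total + sub_total, linked + sub_linked
    return total, linked


def remove_navigation(node, max_link_ratio=0.5):
    """Drop subtrees whose text is mostly link text (assumed heuristic)."""
    kept = []
    for child in node.children:
        total, linked = text_len(child)
        if total and linked / total > max_link_ratio:
            continue
        remove_navigation(child, max_link_ratio)
        kept.append(child)
    node.children = kept  # stored child numbers are left unchanged


def find_main_text(node):
    """Return the node whose own text is longest (assumed heuristic)."""
    best = node
    for child in node.children:
        candidate = find_main_text(child)
        if len(candidate.text) > len(best.text):
            best = candidate
    return best


def pattern_of(node):
    """Recursively build the path of (tag, child number) from the root."""
    if node.parent is None:
        return []
    return pattern_of(node.parent) + [(node.tag, node.index)]


def learn_pattern(html):
    root = parse(html)
    remove_navigation(root)
    return pattern_of(find_main_text(root))


def extract(html, pattern):
    """Follow a learned pattern in another page; None if it does not match."""
    node = parse(html)
    for tag, index in pattern:
        node = next((c for c in node.children
                     if c.index == index and c.tag == tag), None)
        if node is None:
            return None
    return node.text


# Hypothetical pages sharing one forum template.
SAMPLE_PAGE = ("<html><body><div><a href='/'>Home</a><a href='/f'>Forum</a></div>"
               "<div><h1>Title</h1><p>This is the first post body text.</p></div>"
               "</body></html>")
OTHER_PAGE = ("<html><body><div><a href='/'>Home</a></div>"
              "<div><h1>Other</h1><p>Another post body.</p></div></body></html>")

pattern = learn_pattern(SAMPLE_PAGE)
print(pattern)
print(extract(OTHER_PAGE, pattern))
```

Because child numbers are assigned before navigation is removed, the learned path stays valid on uncleaned pages from the same template, so the extraction step only has to walk the path. That is one plausible reading of why a learned pattern would make extraction fast; the paper itself would be needed to confirm the actual design.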

Citation

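Wang Huanzhan, Xin Yang. Research and Implementation of a Forum Information Extraction Method Based on Pattern Recognition [EB/OL]. Beijing: Sciencepaper Online (中国科技论文在线) [2013-09-23]. https://www.paper.edu.cn/releasepaper/content/201309-329.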

Paper ID	201309-329
Paper title	Research and Implementation of a Forum Information Extraction Method Based on Pattern Recognition