Micro-blog noise filtering and topic detection
Abstract: To address the large volume of advertising messages and other noise posts in micro-blogs, this paper proposes a user-classification filtering mechanism based on the C4.5 decision tree classification algorithm and a scoring filter based on characteristic values. Exploiting the real-time nature of micro-blog text and the timeliness of micro-blog topics, the paper also proposes a similarity calculation method based on a time parameter. Experimental results show that, compared with the traditional approach, the method improves both the accuracy and the efficiency of noise filtering and topic detection.
Key words:
- noise filtering
- C4.5 decision tree
- characteristic value
- similarity calculation
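
The abstract does not give the formula for the time-parameter similarity, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes a bag-of-words cosine similarity scaled by an exponential decay on the time gap between a post and a topic; the function names, the `half_life_hours` parameter, and the decay form are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def time_weighted_similarity(post_terms, post_time, topic_terms, topic_time,
                             half_life_hours=24.0):
    """Text similarity damped by the time gap (assumed form, not from the paper).

    post_time and topic_time are Unix timestamps in seconds. The weight
    halves every `half_life_hours` of separation, so an old topic attracts
    new posts less strongly than a recent one with the same wording.
    """
    gap_hours = abs(post_time - topic_time) / 3600.0
    decay = 0.5 ** (gap_hours / half_life_hours)
    text_sim = cosine_similarity(Counter(post_terms), Counter(topic_terms))
    return text_sim * decay
```

With this form, a post whose terms match a topic exactly scores close to 1 when both carry the same timestamp and about half that when they are one half-life apart. A real system would tune the decay against labelled data.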