Time-Series based Dataset Selection Method for Effective Text Classification

A Time-Series-Based Dataset Selection Technique for Efficient Document Classification

  • 채영훈 (Department of Big Data Science, University of Science and Technology)
  • 정도헌 (Korea Institute of Science and Technology Information)
  • Received : 2016.11.07
  • Accepted : 2016.12.19
  • Published : 2017.01.28


As Internet technology advances, the volume of data on the web is increasing sharply. Many studies have investigated incremental learning as a way to classify such growing data effectively. Web documents contain time-series information such as the publication date, and reflecting this information in classification should make classification more effective. In this study, we analyze how word usage varies over time and propose an efficient classification method that divides the dataset based on this time-series analysis. For the experiments, we collected 1 million online news articles that include time-series information. We divided the dataset and classified it using SVM and Naïve Bayes. Classification performance improved for both models, showing that reflecting time-series information can improve classification performance.
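The idea of dividing a dataset by time before classifying can be illustrated with a minimal sketch. The abstract does not specify how the dataset is divided, so splitting by publication year is an assumption here, and the toy articles, the helper names (`split_by_period`, `train_nb`, `predict_nb`), and the add-one smoothing are all hypothetical choices made for illustration. The sketch uses a small hand-written multinomial Naïve Bayes rather than the authors' actual SVM and Naïve Bayes setup.

```python
import math
from collections import Counter, defaultdict


def split_by_period(docs, period_of):
    """Group (date, text, label) records by a period key such as the year."""
    buckets = defaultdict(list)
    for date, text, label in docs:
        buckets[period_of(date)].append((text, label))
    return dict(buckets)


def train_nb(samples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore words unseen in this period's training data
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus: the word "apple" drifts from food to tech topics.
docs = [
    ("2005-03-01", "apple harvest orchard fruit", "food"),
    ("2005-06-10", "apple pie recipe fruit", "food"),
    ("2005-09-12", "new processor chip released", "tech"),
    ("2015-02-01", "apple phone launch event", "tech"),
    ("2015-04-20", "apple watch chip released", "tech"),
    ("2015-08-05", "fruit harvest pie recipe", "food"),
]

# Assumption: the period is the publication year (first four characters).
subsets = split_by_period(docs, lambda date: date[:4])
models = {period: train_nb(samples) for period, samples in subsets.items()}

for period in sorted(models):
    print(period, predict_nb(models[period], "apple"))
```

On this toy data, the model trained on the 2005 subset labels "apple" as food while the 2015 model labels it as tech, which is the kind of time-dependent word usage that a single model trained on the pooled data would blur.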


SVM; Naïve Bayes; Time-Series Analysis; Machine Learning; Classification

