Time-Series based Dataset Selection Method for Effective Text Classification

Chae, Yeonghun; Jeong, Do-Heon

doi:10.5392/JKCA.2017.17.01.039

The Journal of the Korea Contents Association (한국콘텐츠학회논문지)

Volume 17 Issue 1 / Pages 39-49 / 2017 / 1598-4877 (pISSN) / 2508-6723 (eISSN)

The Korea Contents Association (한국콘텐츠학회)

효율적인 문헌 분류를 위한 시계열 기반 데이터 집합 선정 기법

Chae, Yeonghun (Department of Big Data Science, University of Science and Technology, UST) ;
Jeong, Do-Heon (Korea Institute of Science and Technology Information, KISTI)

Received : 2016.11.07
Accepted : 2016.12.19
Published : 2017.01.28

https://doi.org/10.5392/JKCA.2017.17.01.039

Abstract

As Internet technology advances, the volume of data on the web is increasing sharply, and many studies have investigated incremental learning as a way to classify this growing data efficiently. Most web documents carry time-series information such as a publication date, and reflecting this information in classification could make classification more effective. In this study, we analyze how the words in web documents vary over time and propose an efficient classification method that divides the dataset based on that time-series analysis. For the experiments, we collected 1 million online news articles together with their time-series information. We divided the dataset and classified it using SVM and Naïve Bayes classifiers, and classification performance improved with each model. These results show that reflecting time-series information can improve classification performance.

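As Internet technology advances, online data is growing rapidly, and research is under way on learning from this growing data efficiently through incremental machine learning techniques. Most online documents include time-series information such as a posting or publication date, and reflecting this information in classification would enable efficient classification. In this study, we analyzed the time-series changes of vocabulary appearing in web documents and propose an efficient classification learning method that partitions the dataset based on the analyzed time-series information. For experiments and validation, we collected 1 million online news articles together with their time-series information. Using the collected data, we partitioned the dataset and ran experiments with Naïve Bayes and SVM classifiers, and observed performance improvements of up to 2.02 and 2.32 percentage points, respectively, over training on the entire dataset. This study confirms that reflecting time-series changes in vocabulary in classification can improve classification performance.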
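The dataset-division idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy articles, the fixed one-year window, and the helper names `select_by_period`, `train_nb`, and `predict_nb` are hypothetical, and the abstract does not state how the authors chose their time periods or configured their classifiers. The sketch trains a small multinomial Naive Bayes model once on the full dataset and once on a recent time slice, to show how a word whose usage shifts over time can be classified differently.

```python
import math
from collections import Counter, defaultdict
from datetime import date

def select_by_period(docs, start, end):
    """Keep only documents published within [start, end] (hypothetical helper)."""
    return [d for d in docs if start <= d["date"] <= end]

def train_nb(docs):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for d in docs:
        tokens = d["text"].lower().split()
        class_counts[d["label"]] += 1
        word_counts[d["label"]].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the most probable label, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in text.lower().split():
            if tok in vocab:  # words unseen in training are skipped
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy corpus (invented): "apple" drifts from a food word to a tech word.
docs = [
    {"date": date(2010, 3, 1), "text": "apple harvest orchard", "label": "food"},
    {"date": date(2010, 6, 1), "text": "apple pie recipe", "label": "food"},
    {"date": date(2011, 2, 1), "text": "apple juice apple farm", "label": "food"},
    {"date": date(2011, 5, 1), "text": "phone network chip", "label": "tech"},
    {"date": date(2016, 1, 10), "text": "apple phone launch", "label": "tech"},
    {"date": date(2016, 4, 2), "text": "apple chip phone", "label": "tech"},
    {"date": date(2016, 5, 20), "text": "fruit pie recipe", "label": "food"},
]

# Assumed selection rule: train only on articles from the most recent year.
recent = select_by_period(docs, date(2016, 1, 1), date(2016, 12, 31))

full_model = train_nb(docs)
recent_model = train_nb(recent)

print(predict_nb(full_model, "apple"))    # food
print(predict_nb(recent_model, "apple"))  # tech
```

On this invented data the full-dataset model is dominated by the older usage of "apple", while the time-sliced model follows the recent usage. This only illustrates the general motivation; it says nothing about the size of the improvement reported in the paper.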

