Stock Price Prediction by Utilizing Category Neutral Terms: Text Mining Approach

  • Lee, Minsik (Department of Information and Industrial Engineering, Yonsei University) ;
  • Lee, Hong Joo (Department of Business Administration, Catholic University of Korea)
  • Received : 2017.04.09
  • Accepted : 2017.05.29
  • Published : 2017.06.30

Abstract

Because the stock market is driven by traders' expectations, many studies have tried to predict stock price movements by analyzing text data from various sources. This research covers not only the relationship between text data and stock price fluctuations but also the trading of stocks based on news articles and social media responses. As in other text mining applications, studies that predict stock price movements construct a term-document matrix and apply classification algorithms to it. Because documents contain many words, it is better to build the term-document matrix from the words that contribute most to classification than from every word. Words that occur too rarely or carry too little importance are typically removed on the basis of frequency. Words can also be selected by measuring how much each one contributes to classifying a document correctly. The conventional approach has therefore been to collect all the documents to be analyzed and then select the words that influence the classification. In this study, we instead analyze the documents for each individual stock and select as neutral words those that are irrelevant to every category, that is, words that appear in the news for both rising and falling prices. We then extract the words surrounding each selected neutral word and use them to generate the term-document matrix. The underlying idea is that the presence of a neutral word is itself only weakly related to the stock movement, whereas the words around it are more likely to affect the stock price. The resulting term-document matrix is then applied to an algorithm that classifies stock price movements. Specifically, we first removed stop words and selected neutral words for each stock, and then excluded from the selected words those that also appear in news articles about other stocks.
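The word selection steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `window` size, the `min_docs` threshold, and the toy tokens are all hypothetical, and documents are assumed to be already tokenized and labeled `"up"` or `"down"`.

```python
from collections import Counter


def select_neutral_terms(docs_by_label, stop_words=frozenset()):
    """Return the terms that occur in documents of every label."""
    vocab_per_label = []
    for docs in docs_by_label.values():
        vocab = set()
        for tokens in docs:
            vocab.update(t for t in tokens if t not in stop_words)
        vocab_per_label.append(vocab)
    # A term is treated as neutral when it is present under all labels.
    return set.intersection(*vocab_per_label) if vocab_per_label else set()


def drop_terms_common_elsewhere(neutral_terms, other_stock_docs, min_docs=1):
    """Exclude selected terms found in at least `min_docs` documents about other stocks.

    The threshold is an assumption; the abstract does not state one.
    """
    doc_freq = Counter(t for tokens in other_stock_docs for t in set(tokens))
    return {t for t in neutral_terms if doc_freq[t] < min_docs}


def context_terms(tokens, neutral_terms, window=2):
    """Collect non-neutral words within `window` positions of a neutral term."""
    picked = []
    for i, tok in enumerate(tokens):
        if tok in neutral_terms:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        if any(tokens[j] in neutral_terms for j in range(lo, hi) if j != i):
            picked.append(tok)
    return picked


# Toy data for one stock, plus one article about a different stock.
stock_docs = {
    "up": [["chip", "demand", "surges", "today"]],
    "down": [["chip", "demand", "slumps", "today"]],
}
other_docs = [["car", "exports", "today"]]

neutral = select_neutral_terms(stock_docs)
neutral = drop_terms_common_elsewhere(neutral, other_docs)
features = context_terms(stock_docs["up"][0], neutral, window=1)
```

In this toy run, `today` is dropped because it also occurs in the other stock's article, and the words kept as features are those adjacent to the remaining neutral terms. A real pipeline would need a tokenizer suited to the language of the news articles.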
We collected four months of news articles on the 10 stocks with the largest market capitalization from an online news portal. The stock market was open for a total of 80 days during this period (2016/02/01 ~ 2016/05/31); we used the news articles from the first 60 days (three months) as the training set and applied the model to the articles from the remaining 20 days (one month) to predict the next day's stock price movement. We built the models with SVM, Boosting, and Random Forest. The proposed neutral-word-based algorithm showed better classification performance than the word selection method based on sparsity. In summary, this study predicted stock price movements by collecting and analyzing news articles on the top 10 stocks by market capitalization. We used classification models built on the term-document matrix and compared the existing sparsity-based word extraction method with the proposed method of removing words from the term-document matrix. The proposed method differs from the existing one in that it uses not only the news articles for the stock in question but also news about other stocks to determine which words to extract. In other words, it removes not only the words that appear in both the rising and falling categories but also the words that commonly appear in the news for other stocks. The proposed method achieved higher prediction accuracy. This study has limitations: the prediction task was restricted to classifying rises and falls, and the experiment covered only ten stocks, which do not represent the entire stock market. In addition, it is difficult to show investment performance because the direction of a price movement and the rate of return may differ. Future research should therefore use more stocks and predict returns through trading simulation.
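The evaluation setup described above (a time-ordered 60/20 split, a term-document matrix, next-day up/down labels, and accuracy) can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the labeling rule based on closing prices is an assumption that the abstract does not spell out, and the classifier itself (SVM, Boosting, or Random Forest) is deliberately left to whichever library is used.

```python
def next_day_labels(closing_prices):
    """Label day t 'up' if the close on day t+1 is higher than on day t, else 'down'.

    Assumption: the abstract does not define how a rise or fall is measured.
    """
    return ["up" if nxt > cur else "down"
            for cur, nxt in zip(closing_prices, closing_prices[1:])]


def chronological_split(items, n_train):
    """Split a date-sorted list into the first n_train items and the rest."""
    return items[:n_train], items[n_train:]


def term_document_matrix(docs, vocabulary):
    """Build a matrix with one row per document and one column per term (raw counts)."""
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in docs:
        row = [0] * len(vocabulary)
        for tok in tokens:
            j = index.get(tok)
            if j is not None:  # words outside the selected vocabulary are ignored
                row[j] += 1
        matrix.append(row)
    return matrix


def accuracy(predicted, actual):
    """Return the fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# 80 trading days split into 60 training days and 20 test days, in date order.
trading_days = list(range(80))
train_days, test_days = chronological_split(trading_days, n_train=60)
```

Splitting by date rather than at random keeps every test day later than every training day, which matches the goal of predicting the next day's movement from past news.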

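Because the stock market moves on traders' expectations about companies and market conditions, studies have sought to predict stock price movements by analyzing text data from various sources. Because the aim is to predict stock price movements, research has examined not merely the rise and fall of prices but also trading based on news articles or social media responses and the returns that result. Like text mining approaches in other fields, studies that predict stock price movements have constructed a term-document matrix and applied it to classification algorithms. Because documents contain many words, the words that contribute strongly to classifying documents into categories should be selected rather than building the term-document matrix from every word. Words with too low a frequency or importance are removed on the basis of term frequency. Words are also selected according to their contribution, measured as the degree to which a word helps classify documents correctly. The basic approach to constructing a term-document matrix has been to collect all the documents to be analyzed and to select and use the words that influence classification. In this study, we analyze the documents for each individual stock and select as neutral words the words that appear under both rises and falls for that stock. We extract the words that appear around the selected neutral words and use them to generate the term-document matrix. This starts from the idea that the neutral words themselves have little relation to stock price movements and that the words surrounding them have more influence on a rise in the stock price. The generated term-document matrix is then applied to an algorithm that classifies whether the stock price rises or falls. We first selected neutral words for each stock and then additionally excluded, from the selected words, those that also appear frequently for other stocks. Through an online news portal, we collected four months of news articles on the 10 stocks with the largest market capitalization. We built the classification models using three months of news articles as training data and applied the remaining one month of news articles to the models to predict the next day's stock price movement. The neutral-word algorithm proposed in this study showed better classification performance than word selection based on sparsity.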
