Investigating Opinion Mining Performance by Combining Feature Selection Methods with Word Embedding and BOW (Bag-of-Words)

  • Eo, Kyun Sun (어균선) (SKK Business School, Sungkyunkwan University) ;
  • Lee, Kun Chang (이건창) (Global Business Administration/Dept of Health Sciences & Technology, SAIHST, Sungkyunkwan University)
  • Received : 2018.11.13
  • Accepted : 2019.02.20
  • Published : 2019.02.28

Abstract

Over the past decade, the growth of the Web has led to an explosive increase in data. Feature selection is an important step in extracting valuable information from such large volumes of data. This study proposes a novel opinion mining model that combines feature selection (FS) methods with two text representations: word embeddings (Word2vec) and BOW (bag-of-words). The FS methods adopted are CFS (correlation-based feature selection) and IG (information gain). To identify the best-performing FS method, several classifiers were compared: LR (logistic regression), NN (neural network), NBN (naive Bayesian network), RF (random forest), RS (random subspace), and ST (stacking). Empirical results on the electronics and kitchen datasets showed that the LR and ST classifiers combined with IG applied to BOW features yielded the best opinion mining performance. Results on the laptop and restaurant datasets showed that the RF classifier using IG applied to Word2vec features performed best.

Over the past ten years, the development of the Web has generated data at an explosive rate. In data mining, the step of separating meaningless data from large volumes of data and extracting the valuable data plays an important role. This study proposes an opinion mining model for sentiment analysis that applies text representation methods and feature selection methods. The representation methods used are bag-of-words (BOW) and Word2vec. The feature selection methods used are correlation-based feature selection and information gain. The classifiers used are logistic regression, neural network, naive Bayesian network, random forest, random subspace, and stacking. In the empirical analysis on the electronics and kitchen datasets, logistic regression and stacking with information gain applied to bag-of-words features showed high performance. On the laptop and restaurant datasets, random forest with information gain applied to Word2vec features was the best-performing combination. These results indicate that such combinations can improve performance when building opinion mining models.
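To make the BOW-plus-IG combination described in the abstract concrete, the following is a minimal sketch in plain Python. It scores each term by the information gain of the binary feature "term occurs in the review" with respect to the sentiment label, keeps the top-k terms, and builds bag-of-words count vectors over them. The toy reviews, the whitespace tokenizer, and all function names are illustrative assumptions; this section does not state the paper's actual tooling, preprocessing, or parameter settings, so this should not be read as the authors' implementation.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document' w.r.t. the labels.

    docs is a list of token sets, aligned with labels.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
                + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional

def select_top_terms(texts, labels, k):
    """Return the k (score, term) pairs with the highest information gain."""
    docs = [set(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocab]
    # Sort by descending IG; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]

def bow_vector(text, terms):
    """Bag-of-words count vector restricted to the selected terms."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in terms]

# Hypothetical toy reviews, invented for illustration only.
texts = [
    "great battery great screen",
    "poor battery",
    "great value",
    "poor screen poor value",
]
labels = ["pos", "neg", "pos", "neg"]

selected = select_top_terms(texts, labels, k=2)
selected_terms = [term for _, term in selected]
features = [bow_vector(t, selected_terms) for t in texts]
```

The reduced vectors in `features` would then be passed to a classifier such as logistic regression or random forest. On this toy data the sentiment-bearing words separate the classes perfectly, while words like "battery" appear equally in both classes and receive zero gain, which is the pruning effect feature selection is meant to provide.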

Keywords

Fig. 1. Word2vec

Fig. 2. Procedures

Table 1. Opinion mining studies

Table 2. BOW results

Table 3. WE results
