Semantic Visualization of Dynamic Topic Modeling

  • Yeon, Jinwook (Graduate School of Business IT, Kookmin University)
  • Boo, Hyunkyung (Graduate School of Business IT, Kookmin University)
  • Kim, Namgyu (Graduate School of Business IT, Kookmin University)
  • Received : 2021.12.08
  • Accepted : 2022.01.20
  • Published : 2022.03.31

Abstract

Research on unstructured data analysis has been actively conducted as information and communication technology has developed. In particular, topic modeling is a representative technique for discovering core topics in massive text data. Early topic modeling studies focused only on topic discovery. As the field matured, studies began to examine how topics change over time. Accordingly, interest is growing in dynamic topic modeling, which handles changes in the keywords that constitute a topic. Dynamic topic modeling identifies major topics from the data of the initial period and tracks the change and flow of topics by using the topic information of the previous period to derive topics in subsequent periods. However, the results of dynamic topic modeling are difficult to understand and interpret. Traditional results simply reveal changes in keywords and their rankings, and this information is insufficient to show how the meaning of a topic has changed. In this study, we therefore propose a method that visualizes topics by period while reflecting the meaning of the keywords in each topic. We also propose a method for intuitively interpreting changes in topics and the relationships among topics.

The method for visualizing topics by period is as follows. In the first step, dynamic topic modeling is performed to derive the top keywords of each period and their weights from the text data. In the second step, we obtain vectors for the top keywords of each topic from a pre-trained word embedding model and perform dimension reduction on the extracted vectors. We then form a semantic vector for each topic by computing the weighted sum of its keyword vectors, using the topic weight of each keyword. In the third step, we visualize the semantic vector of each topic using matplotlib and analyze the relationships among the topics based on the visualized result. To interpret how a topic changes, we identify, from the dynamic topic modeling result, the top five rising keywords and the top five descending keywords for each period. Many existing topic visualization studies visualize the keywords of each topic; our approach differs in that it attempts to visualize each topic itself.

To evaluate the practical applicability of the proposed methodology, we performed an experiment on 1,847 abstracts of artificial intelligence-related papers, divided into three periods (2016-2017, 2018-2019, 2020-2021). We selected seven topics based on the coherence score and used a pre-trained Word2vec word embedding model trained on Wikipedia, an Internet encyclopedia. Based on the proposed methodology, we generated a semantic vector for each topic and, by reflecting the meaning of keywords, visualized and interpreted the topics by period. The experiments confirmed that the rise and fall of a keyword's topic weight can be used to interpret the semantic change of the corresponding topic and to grasp the relationships among topics. In this study, to overcome the limitations of dynamic topic modeling results, we used word embedding and dimension reduction techniques to visualize topics by period. The results are meaningful in that they broaden the scope of topic understanding through the visualization of dynamic topic modeling results. The study also makes an academic contribution in that it lays a foundation for follow-up studies that use various word embedding and dimensionality reduction techniques to improve the performance of the proposed methodology.
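
The identification of rising and descending keywords described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that "rising" and "descending" refer to the change in a keyword's topic weight between two consecutive periods, and that a keyword absent from a period has weight 0. The function name `keyword_shifts` and the sample weights are hypothetical.

```python
def keyword_shifts(prev_weights, curr_weights, top_n=5):
    """Return (rising, descending) keywords of one topic between two periods.

    prev_weights / curr_weights map keyword -> topic weight in each period.
    """
    vocab = set(prev_weights) | set(curr_weights)
    # Assumption: a keyword missing from a period counts as weight 0.0.
    delta = {
        kw: curr_weights.get(kw, 0.0) - prev_weights.get(kw, 0.0)
        for kw in vocab
    }
    # Largest increases first; ties are broken alphabetically for determinism.
    rising = sorted(
        (kw for kw in vocab if delta[kw] > 0),
        key=lambda kw: (-delta[kw], kw),
    )[:top_n]
    # Largest decreases first.
    descending = sorted(
        (kw for kw in vocab if delta[kw] < 0),
        key=lambda kw: (delta[kw], kw),
    )[:top_n]
    return rising, descending


# Hypothetical keyword weights of a single topic in two consecutive periods.
period_1 = {"learning": 0.30, "data": 0.25, "model": 0.20, "robot": 0.15, "ethics": 0.10}
period_2 = {"learning": 0.35, "data": 0.10, "model": 0.22, "robot": 0.08, "privacy": 0.25}

rising, descending = keyword_shifts(period_1, period_2)
print(rising)      # ['privacy', 'learning', 'model']
print(descending)  # ['data', 'ethics', 'robot']
```

Ranking by weight difference is only one reading of the abstract; ranking by change in keyword rank would be an equally plausible variant.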

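Attempts to create useful knowledge by analyzing massive amounts of text data have been steadily increasing, and in particular, research that uses topic modeling to discover issues in a variety of fields is being actively conducted. Early topic modeling focused on the discovery of topics itself, but research has gradually evolved toward examining how topics change over time. Interest is growing in particular in dynamic topic modeling, which accommodates changes in the content of a topic itself, that is, in the keywords that constitute the topic. However, dynamic topic modeling has limitations: its analysis results are hard to understand intuitively, and it cannot show how changes in keywords affect the meaning of a topic. To overcome these limitations, this paper presents an approach that uses dynamic topic modeling and word embedding to interpret changes in topics and relationships among topics intuitively. Specifically, from the dynamic topic modeling results, we derive the top keywords of each topic in each period together with their topic weights and normalize the weights; we extract the vector of each topic keyword using a pre-trained word embedding model; and we compute, for each topic, the weighted sum of its keyword vectors so that the meaning of the topic is expressed as a vector. We then visualize the resulting semantic vectors on a two-dimensional plane to represent and interpret how topics change and how they relate to one another. To evaluate the practical applicability of the proposed methodology, we conducted an experiment on 1,847 papers related to 'artificial intelligence' published on DBpia from 2016 to 2021, and the results confirmed that the proposed methodology makes it possible to grasp intuitively how various topics change over time.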
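
The construction of a topic's semantic vector as a weighted sum of keyword vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The toy two-dimensional vectors stand in for pre-trained Word2vec vectors after dimension reduction, keywords missing from the embedding are skipped before normalization, and the names `topic_vector` and `toy_embeddings` are hypothetical.

```python
def topic_vector(keyword_weights, embeddings):
    """Weighted sum of keyword vectors, with weights normalized to sum to 1."""
    # Assumption: keywords without an embedding are dropped before normalizing.
    known = {kw: w for kw, w in keyword_weights.items() if kw in embeddings}
    total = sum(known.values())
    if total <= 0:
        raise ValueError("no keyword with positive weight has an embedding")
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for kw, w in known.items():
        share = w / total  # normalized topic weight of this keyword
        for i, x in enumerate(embeddings[kw]):
            vec[i] += share * x
    return vec


# Toy 2-D vectors standing in for reduced pre-trained word vectors.
toy_embeddings = {
    "learning": (1.0, 0.0),
    "robot": (0.0, 1.0),
    "data": (1.0, 1.0),
}

# Hypothetical keyword weights of one topic in two periods.
topic_period_1 = {"learning": 3.0, "robot": 1.0}
topic_period_2 = {"learning": 1.0, "robot": 1.0, "data": 2.0}

print(topic_vector(topic_period_1, toy_embeddings))  # [0.75, 0.25]
print(topic_vector(topic_period_2, toy_embeddings))  # [0.75, 0.75]
```

Each resulting point could then be drawn on a two-dimensional plane, for example with matplotlib's `pyplot.scatter`, so that the movement of a topic from one period to the next becomes visible. When the reduction is a linear projection and the weights sum to one, reducing the keyword vectors before or after the weighted sum yields the same point; this does not hold for nonlinear techniques such as t-SNE.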
