Topic Automatic Extraction Model based on Unstructured Security Intelligence Report

비정형 보안 인텔리전스 보고서 기반 토픽 자동 추출 모델

  • Hur, YunA (Department of Computer Science and Engineering, Korea University) ;
  • Lee, Chanhee (Department of Computer Science and Engineering, Korea University) ;
  • Kim, Gyeongmin (Department of Computer Science and Engineering, Korea University) ;
  • Lim, HeuiSeok (Department of Computer Science and Engineering, Korea University)
  • 허윤아 (고려대학교 컴퓨터학과) ;
  • 이찬희 (고려대학교 컴퓨터학과) ;
  • 김경민 (고려대학교 컴퓨터학과) ;
  • 임희석 (고려대학교 컴퓨터학과)
  • Received : 2019.04.24
  • Accepted : 2019.06.20
  • Published : 2019.06.28


As cyber attack methods become more sophisticated, incidents such as security breaches and international crimes are increasing. To predict and respond to these attacks, the characteristics, methods, and types of attack techniques must be identified. To this end, many security companies publish security intelligence reports so that attack patterns can be identified quickly and further damage prevented. However, the reports that each company distributes are unstructured, and the number of published reports keeps growing. In this paper, we propose a method for extracting structured data from unstructured security intelligence reports. We also propose an automatic intelligence report analysis system that divides a large volume of reports into subgroups by topic, making report analysis more effective and efficient.
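As an illustration of dividing reports into subgroups by topic, the following is a minimal sketch and not the authors' implementation. The abstract does not name the topic model, so the sketch assumes a bag-of-words representation and latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling. The stop-word list, the function names (`tokenize`, `lda_gibbs`, `group_by_topic`), the hyperparameters, and the four toy report snippets are all made up for the example.

```python
import random
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "via", "with", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic distributions."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic proportions for each document.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + alpha * num_topics
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return theta

def group_by_topic(theta):
    """Map each dominant topic index to the list of document indices it covers."""
    groups = {}
    for d, dist in enumerate(theta):
        top = max(range(len(dist)), key=dist.__getitem__)
        groups.setdefault(top, []).append(d)
    return groups

# Made-up snippets standing in for extracted report text.
reports = [
    "Ransomware encrypts files and demands bitcoin payment",
    "Phishing email steals credentials via fake login page",
    "New ransomware variant encrypts network shares and demands payment",
    "Phishing campaign targets bank customers with fake email",
]
docs = [tokenize(r) for r in reports]
theta = lda_gibbs(docs, num_topics=2)
groups = group_by_topic(theta)
```

Here `theta[d]` holds the estimated topic proportions of report `d`, and `groups` maps each dominant topic to the reports assigned to it. With a corpus this small the grouping depends heavily on the seed and hyperparameters, so the sketch shows the mechanics rather than a meaningful result.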


Security;Intelligence Report;Analysis;Topic Modeling;Classification

Fig. 1. Example of a problematic PDF file when extracting text

Fig. 2. Topic Modeling based on Security Intelligence Report

Fig. 3. Result of applying the topic model to a test document

Table 1. Text obtained when a PDF document is simply extracted as text

Table 2. The same PDF document extracted with the method developed in this work

Table 3. Topics by bag-of-words

Table 4. Satisfaction evaluation questions for the Security Intelligence Report Topic Automatic Extraction Model

Table 5. Satisfaction with the Security Intelligence Report Topic Automatic Extraction Model


Supported by: Korea Creative Content Agency (KOCCA)

