Comparison of System Call Sequence Embedding Approaches for Anomaly Detection

  • Lee, Keun-Seop / 이근섭 (Dept. of Knowledge Information Engineering, Graduate School of Ajou University)
  • Park, Kyungseon / 박경선 (Dept. of Knowledge Information Engineering, Graduate School of Ajou University)
  • Kim, Kangseok / 김강석 (Dept. of Cyber Security, Ajou University)
  • Received : 2021.12.27
  • Accepted : 2022.02.20
  • Published : 2022.02.28

Abstract

As security becomes increasingly intelligence-driven, more studies are applying the varied information generated by information security systems to AI-based anomaly detection. In this study, to convert log-like time-series data into numerical feature vectors, we transformed the public ADFA system call data using the CBOW and Skip-gram inference methods of the deep learning-based Word2Vec model and a statistical method based on co-occurrence frequency. We experimented with a range of embedding vectors by varying the vector dimension, the sequence length, and the window size. We then fed the vectors generated by each embedding model into a GRU-based anomaly detection model to compare and evaluate both the detection performance and the embedding methods themselves. Compared with the statistical model, the inference-based Skip-gram model maintained more stable performance without being biased toward a particular window size or sequence length, and was more effective at turning each event in the sequence data into an embedding vector.

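The roles of sequence length and window size in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the overlapping-subsequence splitting, and the made-up system call numbers are all assumptions for illustration, and it shows only the co-occurrence counting side. The same (center, context) pairs are what a Skip-gram model would be trained on; Word2Vec training and the GRU detector are not shown.

```python
from collections import Counter


def split_into_subsequences(trace, seq_len):
    """Cut one system call trace into overlapping subsequences of length seq_len.

    Overlapping (stride 1) splitting is an assumption, not taken from the paper.
    """
    return [trace[i:i + seq_len] for i in range(len(trace) - seq_len + 1)]


def context_pairs(sequence, window):
    """Yield (center, context) pairs for calls within `window` positions of the center."""
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, sequence[j]


def cooccurrence_counts(sequence, window):
    """Count how often each (center, context) pair occurs in the sequence."""
    return Counter(context_pairs(sequence, window))


def cooccurrence_vector(counts, call, vocab):
    """Represent one system call as its row of co-occurrence counts over the vocabulary."""
    return [counts[(call, other)] for other in vocab]


# Hypothetical trace of system call numbers (not real ADFA data).
trace = [6, 11, 45, 33, 6, 11]
vocab = sorted(set(trace))

subsequences = split_into_subsequences(trace, seq_len=4)
counts = cooccurrence_counts(trace, window=1)
vector_for_6 = cooccurrence_vector(counts, 6, vocab)
vector_for_11 = cooccurrence_vector(counts, 11, vocab)
```

In this sketch, a larger `window` makes each count vector reflect more distant neighbors, while `seq_len` controls how much of a trace a downstream sequence model would see at once. These are the two settings the abstract reports varying.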

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT: Ministry of Science and ICT) (No. NRF-2019R1F1A1059036).
