An Approach to Applying Multiple Linear Regression Models by Interlacing Data in Classifying Similar Software

  • Lim, Hyun-il (Dept. of Computer Engineering, Kyungnam University)
  • Received : 2020.09.15
  • Accepted : 2021.03.26
  • Published : 2022.04.30

Abstract

The development of information technology is bringing many changes to everyday life, and machine learning can be used to solve a wide range of real-world problems. Analyzing and using data are essential steps in applying machine learning to such problems. As a method of processing data for machine learning, we propose an approach that applies multiple linear regression models, trained on interlaced data, to the task of classifying similar software. Linear regression is widely used in estimation problems to model the relationship between input and output data. In our approach, multiple linear regression models are generated by training on interlaced feature data. A combination of these models is then used as the prediction model for classifying similar software. Experiments were performed to compare the proposed approach with conventional linear regression, and the results show that the proposed method classifies similar software more accurately than the conventional model. We expect that the proposed approach can be applied to various kinds of classification problems to improve on the accuracy of conventional linear regression.
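The abstract does not spell out how the data are interlaced or how the models are combined, so the following Python sketch shows only one plausible reading, not the author's implementation. It assumes that interlacing means taking every k-th feature at a different starting offset, that each subset trains its own linear regression model, and that the combined prediction is the plain average of the sub-model outputs. The function names, the toy data, the 0.5 threshold, and the gradient-descent fitting routine are all hypothetical choices made for illustration.

```python
import random

def interlace(features, k, offset):
    """Take every k-th feature starting at `offset` (assumed interlacing scheme)."""
    return features[offset::k]

def fit_linear(X, y, lr=0.1, epochs=2000):
    """Least-squares fit of y ~ w.x + b by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def train_interlaced(X, y, k):
    """Train one linear model per interlaced feature subset."""
    return [fit_linear([interlace(x, k, o) for x in X], y) for o in range(k)]

def predict(models, x):
    """Average the sub-model outputs (assumed combination rule)."""
    k = len(models)
    scores = []
    for offset, (w, b) in enumerate(models):
        sub = interlace(x, k, offset)
        scores.append(sum(wj * xj for wj, xj in zip(w, sub)) + b)
    return sum(scores) / k

def is_similar(models, x, threshold=0.5):
    """Classify a feature vector as 'similar' when the averaged score passes the threshold."""
    return predict(models, x) >= threshold

# Hypothetical toy data: each vector holds six feature scores for a pair of
# programs; high values are labeled similar (1.0), low values dissimilar (0.0).
rng = random.Random(0)
X, y = [], []
for _ in range(20):
    X.append([rng.uniform(0.8, 1.0) for _ in range(6)])
    y.append(1.0)
    X.append([rng.uniform(0.0, 0.2) for _ in range(6)])
    y.append(0.0)

models = train_interlaced(X, y, k=3)
print(is_similar(models, [0.9] * 6), is_similar(models, [0.1] * 6))
```

With k = 3 and six features, each sub-model sees two features, so no single model depends on the whole feature vector. Whether this improves accuracy on real software features is what the paper's experiments evaluate, and the toy data here say nothing about that.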
