Density-based Outlier Detection for Very Large Data

Korean title: Density-Based Outlier Detection for the Analysis of Very Large Data

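  • Seung Kim (Department of Industrial Engineering, Seoul National University of Technology) ;
  • Nam-Wook Cho (Department of Industrial and Information Systems Engineering, Seoul National University of Technology) ;
  • Suk-Ho Kang (Department of Industrial Engineering, Seoul National University of Technology)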
  • Received : 2010.03.30
  • Accepted : 2010.05.13
  • Published : 2010.06.30

Abstract

Density-based outlier detection methods such as LOF (Local Outlier Factor) identify an outlying observation by using the density of its surrounding space. Despite the several advantages of density-based outlier detection, its computational complexity has been one of the major barriers to its application. In this paper, we present an LOF algorithm that reduces the computation time of density-based outlier detection. The proposed method adopts kd-tree indexing and an approximate k-nearest neighbor (ANN) search algorithm. A set of experiments was conducted to examine the performance of the proposed algorithm. The results show that the proposed method can effectively detect local outliers in reduced computation time.
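
The combination described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_kdtree`, `knn`, and `lof_scores` and the `eps` parameter are hypothetical, the tree is a plain median-split kd-tree, and the approximate search simply prunes a branch when it cannot contain a point closer than the current k-th distance divided by (1 + eps). The sketch also assumes each point uses exactly k neighbors (ties are not expanded) and that the data contain no more than k identical points.

```python
import heapq
import math


def build_kdtree(points, idxs=None, depth=0):
    """Build a kd-tree as nested tuples (point_index, axis, left, right)."""
    if idxs is None:
        idxs = list(range(len(points)))
    if not idxs:
        return None
    axis = depth % len(points[0])
    idxs = sorted(idxs, key=lambda i: points[i][axis])
    mid = len(idxs) // 2  # median split on the current axis
    return (idxs[mid], axis,
            build_kdtree(points, idxs[:mid], depth + 1),
            build_kdtree(points, idxs[mid + 1:], depth + 1))


def knn(points, tree, q, k, eps=0.0):
    """Return the k neighbors of point q (q excluded) as sorted (dist, index).

    With eps > 0 the search is approximate: the k-th returned distance is at
    most (1 + eps) times the true k-th nearest-neighbor distance.
    """
    heap = []  # max-heap of the best k so far, stored as (-dist, index)

    def visit(node):
        if node is None:
            return
        i, axis, left, right = node
        if i != q:
            d = math.dist(points[q], points[i])
            if len(heap) < k:
                heapq.heappush(heap, (-d, i))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, i))
        diff = points[q][axis] - points[i][axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Visit the far side only if it could still hold a close enough point.
        if len(heap) < k or abs(diff) * (1 + eps) < -heap[0][0]:
            visit(far)

    visit(tree)
    return sorted((-nd, i) for nd, i in heap)


def lof_scores(points, k, eps=0.0):
    """Compute an LOF score for every point using kd-tree neighbor search."""
    tree = build_kdtree(points)
    nbrs = [knn(points, tree, q, k, eps) for q in range(len(points))]
    kdist = [nb[-1][0] for nb in nbrs]  # k-distance of each point
    lrd = []  # local reachability density
    for nb in nbrs:
        reach = [max(kdist[o], d) for d, o in nb]  # reachability distances
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [sum(lrd[o] for _, o in nb) / (len(nb) * lrd[p])
            for p, nb in enumerate(nbrs)]


# Toy data (made up for illustration): a 5 x 5 grid plus one isolated point.
pts = [(float(x), float(y)) for x in range(5) for y in range(5)] + [(10.0, 10.0)]
scores = lof_scores(pts, k=3, eps=0.5)
print(max(range(len(pts)), key=scores.__getitem__))  # index of the top score
```

On this toy data the isolated point should receive the largest score, while points inside the grid stay near 1. Setting `eps=0.0` gives an exact neighbor search; larger values trade accuracy of the neighbor lists for fewer visited branches.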
