Density-based Outlier Detection for Very Large Data

Kim, Seung;Cho, Nam-Wook;Kang, Suk-Ho;

Journal of the Korean Operations Research and Management Science Society (한국경영과학회지)

Volume 35 Issue 2 / Pages 71-88 / 2010 / 1225-1119 (pISSN) / 2733-4759 (eISSN)

The Korean Operations Research and Management Science Society (한국경영과학회)

Korean title: 대용량 자료 분석을 위한 밀도기반 이상치 탐지 (Density-Based Outlier Detection for Large-Scale Data Analysis)

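Kim, Seung (Department of Industrial Engineering, Seoul National University of Technology);
Cho, Nam-Wook (Department of Industrial and Information Systems Engineering, Seoul National University of Technology);
Kang, Suk-Ho (Department of Industrial Engineering, Seoul National University of Technology)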

Received : 2010.03.30
Accepted : 2010.05.13
Published : 2010.06.30

Abstract

Density-based outlier detection methods such as LOF (Local Outlier Factor) identify an outlying observation by using the density of the space surrounding it. Despite the advantages of the density-based approach, its computational complexity has been a major barrier to practical application. In this paper, we present an LOF algorithm that reduces the computation time of density-based outlier detection by adopting kd-tree indexing and an approximate k-nearest neighbor (ANN) search algorithm. A set of experiments was conducted to examine the performance of the proposed algorithm. The results show that the proposed method can effectively detect local outliers in reduced computation time.
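
As a rough illustration of the approach described above, the following Python sketch computes LOF scores with a kd-tree and a (1 + eps)-approximate k-nearest-neighbor search. It is not the authors' implementation and does not use the ANN library: the function names (build_kdtree, search, lof_scores), the parameters k and eps, and the synthetic data are assumptions made for illustration only. The sketch also assumes that the data contain no duplicate points and more than k observations.

```python
import heapq
import math
import random


def build_kdtree(points, indices, depth=0):
    """Return a nested tuple (point_index, axis, left, right), or None."""
    if not indices:
        return None
    axis = depth % len(points[0])          # cycle through the coordinates
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2                # median split keeps the tree balanced
    return (indices[mid], axis,
            build_kdtree(points, indices[:mid], depth + 1),
            build_kdtree(points, indices[mid + 1:], depth + 1))


def search(node, points, query, k, eps, heap, skip):
    """Collect up to k neighbors of `query` in `heap` as (-distance, index)."""
    if node is None:
        return
    i, axis, left, right = node
    if i != skip:                          # a point is not its own neighbor
        d = math.dist(points[i], query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))
    diff = query[axis] - points[i][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    search(near, points, query, k, eps, heap, skip)
    # Approximation step: visit the far side only if it could hold a point
    # closer than (current k-th distance) / (1 + eps). eps = 0 is exact search.
    if len(heap) < k or abs(diff) * (1.0 + eps) < -heap[0][0]:
        search(far, points, query, k, eps, heap, skip)


def lof_scores(points, k=5, eps=0.0):
    """LOF score per point; values well above 1 suggest a local outlier."""
    tree = build_kdtree(points, list(range(len(points))))
    neighbors = []                         # per point: sorted (distance, index)
    for i, p in enumerate(points):
        heap = []
        search(tree, points, p, k, eps, heap, skip=i)
        neighbors.append(sorted((-nd, j) for nd, j in heap))
    k_dist = [nb[-1][0] for nb in neighbors]
    lrd = []                               # local reachability density
    for nb in neighbors:
        reach = [max(k_dist[j], d) for d, j in nb]
        lrd.append(len(reach) / sum(reach))
    return [sum(lrd[j] for _, j in nb) / (len(nb) * lrd[i])
            for i, nb in enumerate(neighbors)]


# Synthetic data: 200 points in the unit square plus one isolated point.
random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
data.append((5.0, 5.0))
scores = lof_scores(data, k=10, eps=0.5)
top = max(range(len(scores)), key=scores.__getitem__)
print("index with the largest LOF:", top, "score:", round(scores[top], 2))
```

Setting eps to 0 gives an exact search; larger values prune more of the tree at the cost of neighbors that may be up to (1 + eps) times farther than the true k-th neighbor, which is the trade-off between accuracy and computation time that the abstract refers to.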
