Illegal Cash Accommodation Detection Modeling Using Ensemble Size Reduction

A Reduced Ensemble Model for Detecting Illegal Cash Accommodation with Credit Cards

  • Received : 2010.02.10
  • Accepted : 2010.03.07
  • Published : 2010.03.30

Abstract

We apply an ensemble approach to detection modeling of illegal cash accommodation (ICA), a well-known type of fraudulent credit card use in Far East nations that has not been addressed in the academic literature. The performance of a fraud detection model (FDM) suffers from the imbalanced data problem, which can be remedied to some extent by using an ensemble of many classifiers. It is generally accepted that an ensemble of classifiers is more accurate than a single classifier, provided there is diversity in the ensemble. Furthermore, recent research shows that it may be better to combine a selected subset of classifiers than all of the classifiers at hand. For effective detection of ICA, we adopt an ensemble size reduction technique that prunes the ensemble of all classifiers using accuracy and diversity measures. Diversity in an ensemble manifests itself as disagreement or ambiguity among its members. The data imbalance intrinsic to FDM affects our approach to ICA detection in two ways. First, we suggest a training procedure with over-sampling methods to obtain diverse training data sets. Second, we use variants of the accuracy and diversity measures that focus on the fraud class. We also calculate the diversity measure dynamically, using Forward Addition and Backward Elimination. In our experiments, Neural Networks, Decision Trees and Logit Regressions serve as the base models for the ensemble members, and the performance of homogeneous ensembles is compared with that of heterogeneous ensembles. The experimental results show that the reduced-size ensemble is, on average over the data sets tested, as accurate as the non-pruned version, which brings benefits in application efficiency and reduced ensemble complexity.

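An ensemble approach was used to develop a detection model for illegal cash accommodation. Illegal cash accommodation affects the profits and losses of domestic credit card companies and has recently been spreading internationally, yet it has not been studied academically. A fraud detection model (FDM) has difficulty achieving good performance because of the data imbalance problem, and ensembles that combine multiple models have been proposed as an alternative. It is already accepted that an ensemble outperforms a single model if the diversity of its member models is guaranteed, and recent studies indicate that selecting only suitable base models for the ensemble is preferable to using all of the trained base models. In this paper, we use a reduced ensemble technique for effective detection of illegal cash accommodation, selecting the base models that join the ensemble by means of accuracy and diversity measures. Diversity means disagreement or ambiguity among the base models that make up the ensemble, and we focused on two aspects in view of the data imbalance problem inherent in FDM. First, we describe an over-sampling method for the minority class and a suitable training method for securing diversity when the training data are drawn. Second, focusing on the minority class, we modify existing diversity measures into effective ones and introduce dynamic diversity calculation by forward addition and backward elimination to evaluate the base models that will join the ensemble. The learning algorithms used in the experiments were neural networks, decision trees, and logit regression, and performance was evaluated for both homogeneous and heterogeneous ensembles. In the experiments, the reduced ensemble showed no difference in performance from the ensemble containing all base models for the illegal cash accommodation detection model. Because the reduced ensemble lowers the complexity of ensemble construction and eases implementation, it was shown to be a promising modeling approach for FDM as well.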
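
The abstract names forward addition with a dynamically computed diversity measure but gives no formulas, so the following is only a minimal sketch under stated assumptions, not the paper's procedure. It assumes that fraud-class accuracy is recall on the fraud cases, that diversity is the pairwise disagreement rate on the fraud cases, and that the two are mixed with an equal weight. The function names, the `weight` parameter, and the toy predictions are all hypothetical.

```python
def fraud_recall(pred, y):
    """Share of fraud cases (label 1) that a classifier flags as fraud."""
    fraud = [i for i, label in enumerate(y) if label == 1]
    return sum(pred[i] == 1 for i in fraud) / len(fraud)


def fraud_disagreement(pred_a, pred_b, y):
    """Share of fraud cases on which two classifiers give different labels."""
    fraud = [i for i, label in enumerate(y) if label == 1]
    return sum(pred_a[i] != pred_b[i] for i in fraud) / len(fraud)


def forward_addition(preds, y, size, weight=0.5):
    """Greedily grow a reduced ensemble of `size` members.

    preds maps a model name to its 0/1 predictions on a validation set.
    The diversity term is recomputed against the current ensemble at every
    step, which is the sense of "dynamic" assumed here.
    """
    names = sorted(preds)  # fixed order so ties break deterministically
    # Seed with the single model that is most accurate on the fraud class.
    selected = [max(names, key=lambda n: fraud_recall(preds[n], y))]
    while len(selected) < size:
        def score(name):
            accuracy = fraud_recall(preds[name], y)
            diversity = sum(
                fraud_disagreement(preds[name], preds[s], y) for s in selected
            ) / len(selected)
            return weight * accuracy + (1 - weight) * diversity

        candidates = [n for n in names if n not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy validation set: five fraud cases followed by three normal cases.
y = [1, 1, 1, 1, 1, 0, 0, 0]
preds = {
    "nn":    [1, 1, 1, 1, 0, 0, 0, 0],
    "tree":  [1, 1, 1, 0, 0, 0, 1, 0],
    "logit": [0, 1, 1, 0, 1, 0, 0, 0],
}
selected = forward_addition(preds, y, size=2)
print(selected)  # ['nn', 'logit']
```

In this toy run "tree" and "logit" have the same fraud recall, but "logit" disagrees with "nn" on more fraud cases, so it is added second. Backward elimination would run the same scoring in reverse, starting from all base models and repeatedly dropping the lowest-scoring member.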
