A Classifier Capable of Handling Incomplete Data Set

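  • Jong Chan Lee, 이종찬 (Department of Internet, Chungwoon University)
  • Won Don Lee, 이원돈 (Computer Science, Division of Electrical, Information and Communication Engineering, Chungnam National University)
  • Published: 2010.01.30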

Abstract

This paper introduces a classification algorithm for learning problems in which the data set is incomplete, that is, some instances are missing variable values or a class value. The algorithm uses a data expansion method based on weighted values and probabilistic techniques. It works by extending a classifier designed to find the optimal projection plane according to Fisher's formula; to this end, several equations are derived from the procedure applied to the expanded data. To evaluate the proposed algorithm, one variable in the data set is selected, the ratio of missing to non-missing values in that variable is varied, and the resulting measurements are compared over repeated runs. For an objective evaluation, the result obtained on the data set without missing values is also compared with that of C4.5, a widely used knowledge acquisition tool in machine learning.
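
The abstract does not reproduce the paper's equations, so the following Python sketch is only an illustration of the general idea and should not be read as the authors' algorithm. It assumes one plausible form of data expansion: each instance with a missing value in a chosen variable is replaced by one copy per observed value of that variable, weighted by that value's empirical frequency. A two-class Fisher direction is then computed from the weighted instances. The function names `expand_missing` and `weighted_fisher_direction`, the choice of weights, and the toy data are all assumptions made for this example; a missing class value is not handled.

```python
import numpy as np

def expand_missing(X, y, col):
    """Replace each row whose `col` entry is NaN by one weighted copy per
    observed value of that column (weight = empirical frequency).
    Assumed expansion scheme, for illustration only."""
    observed = X[~np.isnan(X[:, col]), col]
    values, counts = np.unique(observed, return_counts=True)
    probs = counts / counts.sum()
    rows, labels, weights = [], [], []
    for xi, yi in zip(X, y):
        if np.isnan(xi[col]):
            for v, p in zip(values, probs):
                filled = xi.copy()
                filled[col] = v
                rows.append(filled)
                labels.append(yi)
                weights.append(p)
        else:
            rows.append(xi)
            labels.append(yi)
            weights.append(1.0)
    return np.array(rows), np.array(labels), np.array(weights)

def weighted_fisher_direction(X, y, w):
    """Two-class Fisher direction Sw^{-1} (m1 - m0) using instance weights."""
    n_features = X.shape[1]
    Sw = np.zeros((n_features, n_features))
    means = []
    for c in (0, 1):
        Xc, wc = X[y == c], w[y == c]
        m = np.average(Xc, axis=0, weights=wc)   # weighted class mean
        d = Xc - m
        Sw += (d * wc[:, None]).T @ d            # weighted within-class scatter
        means.append(m)
    return np.linalg.solve(Sw, means[1] - means[0])

# Toy data (hypothetical): variable 1 is missing in two instances.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, np.nan], [1.0, 1.0],
              [3.0, 1.0], [4.0, 0.0], [3.0, np.nan], [4.0, 1.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_exp, y_exp, w_exp = expand_missing(X, y, col=1)
direction = weighted_fisher_direction(X_exp, y_exp, w_exp)
projections = X_exp @ direction   # 1-D scores used to separate the classes
```

The weights of the copies made from one incomplete instance sum to 1, so the expanded set carries the same total weight as the original data. The evaluation described in the abstract could be imitated by masking a growing fraction of one column with `np.nan` before calling `expand_missing`.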
