Incomplete data handling technique using decision trees

  • Lee, Jong Chan (Dept. of Computer Engineering, Chungwoon University)
  • Received : 2021.05.16
  • Accepted : 2021.08.20
  • Published : 2021.08.28

Abstract

This paper discusses how to handle incomplete data that contain missing values. Handling a missing value optimally means deriving, from the information contained in the training data, an estimate that is as close as possible to the original value and substituting that estimate for the missing value. To achieve this, the method uses the decision tree that is built while the classifier learns to classify the data. Specifically, the decision tree is obtained by training the C4.5 classifier on only the complete records, i.e., the training records that contain no missing values. Each node of the tree holds information about a classification attribute; nodes closer to the root carry more of that information, and each leaf node defines a classification region through its path from the root. The average of the data events classified into each region is also recorded there. An event containing a missing value is fed into this decision tree and traverses it according to the information at each node until it reaches the region closest to the event. The average recorded in that region is taken as the estimate of the missing value, which completes the compensation process.
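As a rough illustration of the procedure described above, the sketch below trains a decision tree on the complete records only, records the attribute means of the training records that reach each leaf (region), and then routes an incomplete record through the tree to fill its missing values with the means of the region it lands in. This is only an approximation under stated assumptions: scikit-learn's CART-style DecisionTreeClassifier stands in for C4.5, the rule of following the child with more training samples when the split attribute itself is missing is assumed (the abstract does not spell out the traversal rule), and the function names are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_tree_and_leaf_means(X_complete, y_complete):
        # Train only on records without missing values, then record the
        # attribute means of the training records falling into each leaf.
        tree = DecisionTreeClassifier(random_state=0).fit(X_complete, y_complete)
        leaf_ids = tree.apply(X_complete)        # leaf node index per record
        leaf_means = {leaf: X_complete[leaf_ids == leaf].mean(axis=0)
                      for leaf in np.unique(leaf_ids)}
        return tree, leaf_means

    def impute(row, tree, leaf_means):
        # Walk the tree with a possibly incomplete record. When the split
        # attribute is missing, follow the child with more training samples
        # (assumed rule). At the leaf, replace the missing attributes with
        # the means recorded for that region.
        t = tree.tree_
        node = 0
        while t.children_left[node] != -1:       # -1 marks a leaf node
            left, right = t.children_left[node], t.children_right[node]
            value = row[t.feature[node]]
            if np.isnan(value):
                node = left if t.n_node_samples[left] >= t.n_node_samples[right] else right
            else:
                node = left if value <= t.threshold[node] else right
        filled = row.copy()
        missing = np.isnan(filled)
        filled[missing] = leaf_means[node][missing]
        return filled

    # Usage: fit on complete records, then compensate a record with missing values.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + X[:, 2] > 0).astype(int)
    tree, leaf_means = fit_tree_and_leaf_means(X, y)
    print(impute(np.array([0.5, np.nan, -1.2, np.nan]), tree, leaf_means))

The per-leaf means here are attribute-wise averages over the complete training records in each region, which is one concrete reading of "the average of classified data events is recorded in each region"; a class-conditional average would be a natural variant.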

Acknowledgement

This paper was supported by the academic research project of Chungwoon University in 2021 (No. 2021-49).

References

  1. J. Han, J. Pei & M. Kamber. (2011). Data Mining: Concepts and Techniques. Waltham : Elsevier.
  2. R. Kohavi & J. R. Quinlan. (2002). Data mining tasks and methods: Classification: Decision-tree discovery, Handbook of data mining and knowledge discovery. New York : Oxford University Press, 267-276.
  3. T. Delavallade & T. H. Dang. (2007). Using Entropy to Impute Missing Data in a Classification Task. IEEE International Fuzzy Systems Conference. (pp. 1-6). DOI : 10.1109/FUZZY.2007.4295430
  4. A. Sportisse, C. Boyer, A. Dieuleveut & J. Josse. (2020). Debiasing Averaged Stochastic Gradient Descent to handle missing values. 34th Conference on Neural Information Processing Systems. (pp. 1-11). Vancouver.
  5. T. F. Johnson, N. J. B. Isaac, A. Paviolo & M. Gonzalez-Suarez. (2020). Handling missing values in trait data. Global Ecology & Biogeography, 1-12. DOI : 10.1111/geb.13185
  6. S. Huang & C. Cheng. (2020). A Safe-Region Imputation Method for Handling Medical Data with Missing Values. Symmetry, 12(11), 1792. DOI : 10.3390/sym12111792
  7. J. You, X. Ma, D. Y. Ding, M. Kochenderfer & J. Leskovec. (2020). Handling Missing Data with Graph Representation Learning. arXiv preprint arXiv:2010.16418.
  8. J. R. Quinlan. (1993). C4.5 : Programs for Machine Learning. San Mateo : Morgan Kaufmann.
  9. J. C. Lee, D. H. Seo, C. H. Song & W. D. Lee. (2007). FLDF based Decision Tree using Extended Data Expression. The 6th Conference on Machine Learning & Cybernetics. (pp. 3478-3483).
  10. D. Kim, D. Lee & W. D. Lee. (2006). Classifier using Extended Data Expression. In 2006 IEEE Mountain Workshop on Adaptive and Learning Systems. (pp. 154-159). DOI : 10.1109/SMCALS.2006.250708
  11. J. C. Lee. (2018). Application Examples Applying Extended Data Expression Technique to Classification Problems. Journal of the Korea Convergence Society, 9(12), 9-15. DOI : 10.15207/JKCS.2018.9.12.009
  12. J. C. Lee. (2019). Deep Learning Model for Incomplete Data. Journal of the Korea Convergence Society, 10(2), 1-6. DOI : 10.15207/JKCS.2019.10.2.001
  13. J. C. Lee & W. D. Lee. (2010). Classifier handling incomplete data. Journal of the Korea Institute of Information and Communication Engineering, 14(1), 53-62. DOI : 10.6109/jkiice.2010.14.1.053
  14. J. C. Lee. (2021). A data extension technique to handle incomplete data. Journal of the Korea Convergence Society, 12(2), 7-13. DOI : 10.15207/JKCS.2021.12.2.007
  15. Center for Machine Learning and Intelligent Systems, University of California, Irvine. (2020). UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets.php