Association-based Unsupervised Feature Selection for High-dimensional Categorical Data

  • Lee, Changki (College of Business Administration, Dongguk University)
  • Jung, Uk (College of Business Administration, Dongguk University)
  • Received: 2019.06.17
  • Accepted: 2019.07.01
  • Published: 2019.09.30

Abstract

Purpose: Advances in information technology have made high-dimensional categorical data easy to collect and use. The purpose of this study is to propose a novel method for selecting relevant categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is considered relevant if it is related to the other variables; following this definition, the goodness-to-pick measure computes the normalized conditional entropy of a variable with respect to the other variables. (2) The second step finds the relevant feature subset of the original variable set by deciding whether each variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experiments showed that the proposed feature selection method generally yielded better classification performance than using no feature selection on high-dimensional categorical data, especially as the number of irrelevant categorical variables increased. Moreover, as the number of irrelevant categorical variables with imbalanced categorical values grew, the accuracy gap between the proposed method and the comparison methods widened. Conclusion: The experimental results confirmed that the proposed method consistently produces high classification accuracy on high-dimensional categorical data, making it a promising tool for high-dimensional settings.
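The goodness-to-pick measure in step (1) can be made concrete with a small sketch. The Python code below is a minimal illustration, assuming the measure averages the normalized conditional entropy H(X | Y) / H(X) of each variable X over every other variable Y, so that scores near 0 indicate strong association with the rest of the data (relevance) and scores near 1 indicate independence. The paper's exact normalization, aggregation, and relevance threshold may differ; the function names and the cutoff in the usage comment are hypothetical.

```python
import numpy as np
import pandas as pd

def entropy(x: pd.Series) -> float:
    """Shannon entropy (base 2) of a categorical variable."""
    p = x.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

def conditional_entropy(x: pd.Series, y: pd.Series) -> float:
    """H(X | Y): expected entropy of X within each category of Y."""
    h = 0.0
    for cat, p_y in y.value_counts(normalize=True).items():
        h += p_y * entropy(x[y == cat])
    return h

def goodness_to_pick(df: pd.DataFrame) -> pd.Series:
    """For each variable X, average the normalized conditional entropy
    H(X | Y) / H(X) over all other variables Y.  Scores near 0 mean X
    is well predicted by the other variables (relevant); scores near 1
    mean X is essentially independent of them (irrelevant)."""
    scores = {}
    for col in df.columns:
        hx = entropy(df[col])
        if hx == 0.0:  # a constant column carries no information
            scores[col] = 1.0
            continue
        others = [c for c in df.columns if c != col]
        scores[col] = np.mean(
            [conditional_entropy(df[col], df[o]) / hx for o in others]
        )
    return pd.Series(scores)

# Hypothetical usage: keep variables whose score falls below a cutoff.
# df = pd.read_csv("categorical_data.csv", dtype="category")
# relevant = goodness_to_pick(df).loc[lambda s: s < 0.8].index.tolist()
```

In this reading, step (2) corresponds to thresholding these scores to form the relevant subset, and step (3) would further drop variables whose information is already captured by an already selected variable.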

Acknowledgement

Supported by: National Research Foundation of Korea (NRF)
