Association-based Unsupervised Feature Selection for High-dimensional Categorical Data

  • Lee, Changki (College of Business Administration, Dongguk University)
  • Jung, Uk (College of Business Administration, Dongguk University)
  • Received: 2019.06.17
  • Accepted: 2019.07.01
  • Published: 2019.09.30

Abstract

Purpose: Advances in information technology have made high-dimensional categorical data easy to collect and use. The purpose of this study is to propose a novel method for selecting relevant categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is considered relevant if it is related to the other variables; following this definition, the goodness-to-pick measure computes the normalized conditional entropy of a variable with respect to the other variables. (2) The second step finds the relevant feature subset of the original variable set by deciding whether each variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experiments showed that the proposed feature selection method generally yielded better classification performance than using no feature selection on high-dimensional categorical data, especially as the number of irrelevant categorical variables increased. Moreover, as the number of irrelevant categorical variables with imbalanced categorical values grew, the accuracy gap between the proposed method and the comparison methods widened. Conclusion: The experimental results confirmed that the proposed method consistently produces high classification accuracy on high-dimensional categorical data, making it a promising tool for high-dimensional settings.
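The goodness-to-pick measure in step (1) can be made concrete with a small sketch. The Python code below is a minimal illustration, assuming the measure averages the normalized conditional entropy H(X | Y) / H(X) of each variable X over every other variable Y, so that scores near 0 indicate strong association with the rest of the data (relevance) and scores near 1 indicate independence. The paper's exact normalization, aggregation, and relevance threshold may differ; the function names and the cutoff in the usage comment are hypothetical.

```python
import numpy as np
import pandas as pd

def entropy(x: pd.Series) -> float:
    """Shannon entropy (base 2) of a categorical variable."""
    p = x.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

def conditional_entropy(x: pd.Series, y: pd.Series) -> float:
    """H(X | Y): expected entropy of X within each category of Y."""
    h = 0.0
    for cat, p_y in y.value_counts(normalize=True).items():
        h += p_y * entropy(x[y == cat])
    return h

def goodness_to_pick(df: pd.DataFrame) -> pd.Series:
    """For each variable X, average the normalized conditional entropy
    H(X | Y) / H(X) over all other variables Y.  Scores near 0 mean X
    is well predicted by the other variables (relevant); scores near 1
    mean X is essentially independent of them (irrelevant)."""
    scores = {}
    for col in df.columns:
        hx = entropy(df[col])
        if hx == 0.0:  # a constant column carries no information
            scores[col] = 1.0
            continue
        others = [c for c in df.columns if c != col]
        scores[col] = np.mean(
            [conditional_entropy(df[col], df[o]) / hx for o in others]
        )
    return pd.Series(scores)

# Hypothetical usage: keep variables whose score falls below a cutoff.
# df = pd.read_csv("categorical_data.csv", dtype="category")
# relevant = goodness_to_pick(df).loc[lambda s: s < 0.8].index.tolist()
```

In this reading, step (2) corresponds to thresholding these scores to form the relevant subset, and step (3) would further drop variables whose information is already captured by an already selected variable.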

Acknowledgement

Supported by: National Research Foundation of Korea (NRF)
