• Title/Summary/Keyword: categorical variable

Two-stage imputation method to handle missing data for categorical response variable

  • Jong-Min Kim;Kee-Jae Lee;Seung-Joo Lee
    • Communications for Statistical Applications and Methods / v.30 no.6 / pp.577-587 / 2023
  • Conventional categorical data imputation techniques, such as mode imputation, often suffer from overestimation. If the variable has too many categories, the multinomial logistic regression imputation method may be infeasible due to computational limitations. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify significant variables for the target categorical variable. In the second stage, we use these selected variables as predictors, applying logistic regression to impute missing data in binary variables, polytomous regression to impute missing data in categorical variables, and predictive mean matching to impute missing data in quantitative variables. Through analysis of both asymmetric and non-normal simulated and real data, we demonstrate that the two-stage imputation method outperforms imputation methods lacking variable selection, as evidenced by accuracy measures. In the analysis of real survey data, we also demonstrate that the proposed two-stage imputation method surpasses the current imputation approach in terms of accuracy.
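
The overestimation issue attributed to mode imputation in the abstract above can be illustrated with a minimal sketch. The toy data and the helper names `mode_impute` and `share` are hypothetical and are not taken from the paper; the sketch does not implement the paper's two-stage method.

```python
from collections import Counter


def mode_impute(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def share(values, category):
    """Proportion of entries equal to `category`."""
    return sum(v == category for v in values) / len(values)


# Hypothetical toy data: 10 responses, 4 of them missing (None).
responses = ["A", "A", "A", "B", "B", "C", None, None, None, None]
observed = [v for v in responses if v is not None]
imputed = mode_impute(responses)

observed_share = share(observed, "A")  # share of "A" among observed values
imputed_share = share(imputed, "A")    # share of "A" after mode imputation
print(observed_share, imputed_share)
```

Because every missing value is assigned to the modal category, the share of that category grows after imputation, which is one way to read the overestimation the abstract mentions.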

Application of GLIM to the Binary Categorical Data

  • Sok, Yong-U
    • Journal of the military operations research society of Korea / v.25 no.2 / pp.158-169 / 1999
  • This paper is concerned with the application of generalized linear interactive modelling (GLIM) to binary categorical data. To analyze categorical data given by a contingency table, a common approach is to find a good-fitting loglinear model. When the contingency table has a response variable, we can fit a logit model to find a good-fitting loglinear model. For a given $2^4$ contingency table with a binary response variable, we show the process of fitting a loglinear model by fitting a logit model using GLIM and SAS, and then we estimate parameters to interpret the nature of the associations implied by the model.

On the Categorical Variable Clustering

  • Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society / v.7 no.2 / pp.219-226 / 1996
  • The basic objective of cluster analysis is to discover natural groupings of items or variables. In general, variable clustering has been conducted using similarity measures between variables that have binary characteristics. We propose a variable clustering method for variables that have more than two categories, ordered in some sense. We also consider some measures of association as similarities between variables. A numerical example is included.

On the clustering of huge categorical data

  • Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society / v.21 no.6 / pp.1353-1359 / 2010
  • The basic objective of cluster analysis is to discover natural groupings of items. In general, clustering is conducted based on a similarity (or dissimilarity) matrix or on the original input data. Various measures of similarity between objects have been developed. In this paper, we consider the clustering of a huge real categorical data set that describes the time-location-activity patterns of Korean people. Some useful similarity measures for the categorical variables in this data set are developed and adopted. Hierarchical and nonhierarchical clustering methods are applied to the data set, which is huge and consists of many categorical variables.

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data (고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법)

  • Lee, Changki;Jung, Uk
    • Journal of Korean Society for Quality Management / v.47 no.3 / pp.537-552 / 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is relevant if it has relationships with other variables. Following this definition of relevant variables, the goodness-to-pick measure calculates the normalized conditional entropy with other variables. (2) The second step finds the relevant feature subset from the original variable set. This step decides whether a variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than no feature selection in high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. In addition, as the number of irrelevant categorical variables with imbalanced categorical values increases, the difference in accuracy between the proposed method and the existing methods being compared also increases. Conclusion: According to the experimental results, we confirmed that the proposed method consistently produces high classification accuracy rates in high-dimensional categorical data. Therefore, the proposed method is a promising approach for high-dimensional settings.
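
The abstract above does not spell out how the conditional entropy is normalized, so the following is only a sketch under one common assumption: H(X | Y) is divided by H(X), giving 0 when Y fully determines X and 1 when the two variables are independent. The function names and toy data are hypothetical, and this is not claimed to be the paper's goodness-to-pick measure.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X | Y), computed as H(X, Y) - H(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)


def normalized_conditional_entropy(xs, ys):
    """H(X | Y) / H(X); assumed normalization, returns 0.0 when H(X) is 0."""
    hx = entropy(xs)
    if hx == 0:
        return 0.0
    return conditional_entropy(xs, ys) / hx


# Hypothetical toy variables.
x = ["a", "a", "b", "b"]
y = ["u", "u", "v", "v"]  # y determines x exactly
z = ["u", "v", "u", "v"]  # z carries no information about x

nce_xy = normalized_conditional_entropy(x, y)
nce_xz = normalized_conditional_entropy(x, z)
print(nce_xy, nce_xz)
```

Under this assumed normalization, a variable whose value is close to 1 against every other variable would have no relationship with the rest of the data, which matches the abstract's notion of an irrelevant variable.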

Unequal Size, Two-way Analysis of Variance for Categorical Data

  • Chung, Han-Yong
    • Journal of the Korean Statistical Society / v.5 no.1 / pp.29-34 / 1976
  • Techniques for the analysis of variance of quantitative variables are well developed. But when the variable is categorical, we must switch to a completely different set of techniques. R.J. Light and B.H. Margolin presented one such technique for categorical data in their paper, where there are G unordered experimental groups and I unordered response categories.

Comparison of Data Mining Classification Algorithms for Categorical Feature Variables (범주형 자료에 대한 데이터 마이닝 분류기법 성능 비교)

  • Sohn, So-Young;Shin, Hyung-Won
    • IE interfaces / v.12 no.4 / pp.551-556 / 1999
  • In this paper, we compare the performance of three data mining classification algorithms (neural network, decision tree, logistic regression) in consideration of various characteristics of categorical input and output data. A $2^{4-1} \times 3$ fractional factorial design is used to simulate the comparison situation, where the factors are (1) the categorical ratio of input variables, (2) the complexity of the functional relationship between the output and input variables, (3) the size of randomness in the relationship, (4) the categorical ratio of an output variable, and (5) the classification algorithm. The experimental results indicate the following: the decision tree performs better than the others when the relationship between output and input variables is simple, while logistic regression is better in the opposite case; and the neural network appears to be a better choice than the others when the randomness in the relationship is relatively large. We also use a Taguchi design to improve the practicality of our results by treating the relationship between the output and input variables as a noise factor. As a result, the classification accuracy of the neural network and the decision tree turns out to be higher than that of logistic regression when the categorical proportion of the output variable is even.
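
The experimental layout described above, a half fraction of four two-level factors crossed with the three algorithms, can be sketched as follows. The generator D = A*B*C is a standard choice for a half fraction and is an assumption here, since the abstract does not state which generator the authors used; the coded levels -1/+1 and the function names are likewise hypothetical.

```python
from itertools import product

ALGORITHMS = ["neural network", "decision tree", "logistic regression"]


def half_fraction():
    """2^(4-1) design: full factorial in A, B, C with D set to A*B*C.

    The generator D = A*B*C is an assumed, conventional choice.
    """
    return [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]


def crossed_design():
    """Cross each of the 8 half-fraction runs with the 3 algorithms."""
    return [run + (alg,) for run in half_fraction() for alg in ALGORITHMS]


design = crossed_design()
print(len(design))  # number of experimental conditions
for row in design[:3]:
    print(row)
```

The half fraction needs 8 settings of the four two-level factors instead of 16, so crossing it with three algorithms gives 24 conditions rather than 48.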

A multivariate latent class profile analysis for longitudinal data with a latent group variable

  • Lee, Jung Wun;Chung, Hwan
    • Communications for Statistical Applications and Methods / v.27 no.1 / pp.15-35 / 2020
  • In behavioral research, significant attention has been paid to the stage-sequential process for multiple latent class variables. We explore the stage-sequential process of multiple latent class variables using multivariate latent class profile analysis (MLCPA). A latent profile variable, representing the stage-sequential process in MLCPA, is formed by a set of repeatedly measured categorical response variables. This paper proposes an extended MLCPA that explains the association between the latent profile variable and the latent group variable in the form of a two-dimensional contingency table. We applied the extended MLCPA to the National Longitudinal Survey on Youth 1997 (NLSY97) data to investigate the association between the developmental progression of depression and substance use behaviors among adolescents who experienced authoritarian parental styles in their youth.

Optimal Process Condition for Products with Multi-Categorical Ordinal Quality Characteristic (다범주 순서형 품질특성을 갖는 제품의 최적 공정조건 결정에 관한 연구)

  • Kim Sang-Cheol;Yun Won-Young;Chun Young-Rok
    • Journal of Korean Society for Quality Management / v.32 no.3 / pp.109-125 / 2004
  • This paper deals with an optimal process control problem in the production of hull structural steel plate with a high defective rate. The main quality characteristic (dependent variable) is the internal quality (defect) of the plates, which depends on process parameters (independent variables). The dependent variable has three ordered categories, and there are 35 independent variables (29 continuous variables and 6 categorical variables). In this paper, we first determine the main factors and develop a mathematical model relating the predicted probabilities of internal quality to the main factors. Second, we find the optimal process condition of the main factors through analysis of variance (ANOVA) using simulation. We consider three models to obtain the main factors and the optimal process condition: linear, quadratic, and error models.

F0 Extrema Timing of HL and LH in North Kyungsang Korean: Evidence from a Mimicry Task

  • Kim, Jung-Sun
    • Phonetics and Speech Sciences / v.4 no.3 / pp.43-49 / 2012
  • This paper describes the categorical effects of pitch accent contrasts in a mimicry task. It focuses specifically on how fundamental frequency (f0) variation reflects phonological contrasts for speakers of two distinct varieties of Korean (i.e., North Kyungsang and South Cholla). The results showed that, in a mimicry task using synthetic speech continua, there was a categorical effect in f0 peak timing for North Kyungsang speakers, but the timing of f0 peaks and valleys in the responses of South Cholla speakers was more variable, presenting a gradient or non-categorical effect. Evidence of categorical effects appeared as a shift of f0 peak times along an acoustic continuum for North Kyungsang speakers. The range of the shift of f0 valley times was much narrower than that of f0 peak times. The degree of the shift near the middle of the continuum varied across individual mimicry responses. However, the categorical structure in mimicry responses regarding the clustering of f0 peak points was more significant for North Kyungsang speakers than for South Cholla speakers. Additionally, the findings of the current study imply that the location of f0 peak times depends on individuals' imitative (or cognitive) abilities.