• Title, Summary, Keyword: variable selection

Search Results: 754

A New Variable Selection Method Based on Mutual Information Maximization by Replacing Collinear Variables for Nonlinear Quantitative Structure-Property Relationship Models

  • Ghasemi, Jahan B.; Zolfonoun, Ehsan
    • Bulletin of the Korean Chemical Society, v.33 no.5, pp.1527-1535, 2012
  • Selection of the most informative molecular descriptors from the original data set is a key step in the development of quantitative structure activity/property relationship models. Recently, mutual information (MI) has gained increasing attention in feature selection problems. This paper presents an effective mutual information-based feature selection approach, named mutual information maximization by replacing collinear variables (MIMRCV), for nonlinear quantitative structure-property relationship models. The proposed variable selection method was applied to three different QSPR datasets: soil degradation half-lives of 47 organophosphorus pesticides, GC-MS retention times of 85 volatile organic compounds, and water-to-micellar cetyltrimethylammonium bromide partition coefficients of 62 organic compounds. The results revealed that using MIMRCV as the feature selection method improves the predictive quality of the developed models compared to conventional MI-based variable selection algorithms.
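The core idea (rank candidates by their mutual information with the response, and let a higher-MI variable displace any collinear one already chosen) can be sketched in a few lines. This is a minimal illustration, not the authors' exact MIMRCV algorithm; the histogram MI estimator, bin count, and correlation cutoff are all assumptions:

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Histogram (plug-in) estimate of the mutual information of two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def select_replacing_collinear(X, y, k, collinear_thresh=0.95, bins=8):
    """Rank variables by MI with y (descending). Walking down the ranking,
    a candidate collinear with an already-kept variable is dropped; since
    the kept one ranks higher, the higher-MI member of each collinear
    group is the one retained."""
    mis = [mutual_info(X[:, j], y, bins) for j in range(X.shape[1])]
    order = np.argsort(mis)[::-1]
    selected = []
    for j in order:
        if any(abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) > collinear_thresh
               for s in selected):
            continue
        selected.append(int(j))
        if len(selected) == k:
            break
    return selected
```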

A Study on Unbiased Methods in Constructing Classification Trees

  • Lee, Yoon-Mo; Song, Moon Sup
    • Communications for Statistical Applications and Methods, v.9 no.3, pp.809-824, 2002
  • We propose two methods that separate the variable selection step from the split-point selection step, called the CHITES and F&CHITES algorithms. They adopt some of the best characteristics of CART, CHAID, and QUEST. In the first step, the variable that is most significant for predicting the target class values is selected. In the second step, an exhaustive search is applied to find the split point based on the variable selected in the first step. We compared the proposed methods with CART and QUEST in terms of variable selection bias and power, error rates, and training times. The proposed methods are not only unbiased in the null case, but also powerful for selecting correct variables in non-null cases.
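The two-step split can be sketched as follows, assuming a chi-square test on quartile-binned predictors for step one and a Gini-impurity scan for step two (the exact tests used by CHITES/F&CHITES may differ):

```python
import numpy as np
from scipy.stats import chi2_contingency

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float((p ** 2).sum())

def choose_split(X, y, qs=(0.25, 0.5, 0.75)):
    """Step 1: pick the variable whose quartile-binned contingency table with
    the class is most significant (smallest chi-square p-value).
    Step 2: exhaustive search over split points of that variable only."""
    classes = np.unique(y)
    pvals = []
    for j in range(X.shape[1]):
        binned = np.digitize(X[:, j], np.quantile(X[:, j], qs))
        table = np.array([[np.sum((binned == b) & (y == c)) for c in classes]
                          for b in np.unique(binned)])
        pvals.append(chi2_contingency(table)[1])
    j = int(np.argmin(pvals))
    xs = np.sort(np.unique(X[:, j]))
    cuts = (xs[:-1] + xs[1:]) / 2          # midpoints between adjacent values

    def impurity(c):
        left, right = y[X[:, j] <= c], y[X[:, j] > c]
        return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

    return j, min(cuts, key=impurity)
```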

A Variable Selection Procedure for K-Means Clustering

  • Kim, Sung-Soo
    • The Korean Journal of Applied Statistics, v.25 no.3, pp.471-483, 2012
  • One of the most important problems in cluster analysis is the selection of variables that truly define cluster structure, while eliminating noisy variables that mask such structure. Brusco and Cradit (2001) present the VS-KM (variable-selection heuristic for K-means clustering) procedure for selecting true variables for K-means clustering based on the adjusted Rand index. This procedure starts with a fixed number of clusters in K-means and adds variables sequentially based on the adjusted Rand index. This paper presents an updated procedure combining VS-KM with the automated K-means procedure provided by Kim (2009). This automated variable selection procedure for K-means clustering calculates the cluster number and initial cluster centers whenever a new variable is added, and adds a variable based on the adjusted Rand index. Simulation results indicate that the proposed procedure is very effective at selecting true variables and at eliminating noisy variables. The implemented R programs (nvarkm.r and vnvarkm.r) can be obtained at http://faculty.knou.ac.kr/sskim/.
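A much-simplified forward pass in the spirit of VS-KM might look like this: cluster on each variable alone, seed with the pair of variables whose single-variable partitions agree most (by adjusted Rand index), then add any variable whose own partition agrees with k-means on the current subset. The ARI threshold and seeding rule are assumptions for illustration, not the Brusco-Cradit heuristic itself:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def forward_vskm(X, k, ari_thresh=0.3, random_state=0):
    """Forward variable selection for k-means guided by the adjusted Rand index."""
    n, p = X.shape
    km = lambda cols: KMeans(n_clusters=k, n_init=10,
                             random_state=random_state).fit_predict(X[:, cols])
    parts = [km([j]) for j in range(p)]        # one partition per variable
    # seed with the most mutually consistent pair of single-variable partitions
    best, pair = -np.inf, (0, 1)
    for i in range(p):
        for j in range(i + 1, p):
            a = adjusted_rand_score(parts[i], parts[j])
            if a > best:
                best, pair = a, (i, j)
    selected = list(pair)
    # add variables whose own partition agrees with the current-subset partition
    for j in range(p):
        if j not in selected and \
           adjusted_rand_score(parts[j], km(selected)) > ari_thresh:
            selected.append(j)
    return sorted(selected)
```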

A procedure for simultaneous variable selection, variable transformation and outlier identification in linear regression

  • Seo, Han Son; Yoon, Min
    • The Korean Journal of Applied Statistics, v.33 no.1, pp.1-10, 2020
  • We propose a unified approach to variable selection, variable transformation, and outlier identification in the linear model. The procedure includes a sequential method for outlier detection and a least trimmed squares estimator for variable transformation, and uses all-possible-subsets regression for model selection. Real data analyses and simulation results are provided to show the efficiency of the methods in terms of correct variable selection and the fit of the estimated model.
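One ingredient of such a procedure, all-possible-subsets regression, can be sketched directly. Here each subset is fitted by OLS and scored by BIC; the transformation and outlier steps are omitted, and the scoring criterion is an assumption for illustration:

```python
import numpy as np
from itertools import combinations

def best_subset_bic(X, y):
    """Fit OLS on every subset of predictors and return the subset
    (as a tuple of column indices) with the smallest BIC."""
    n, p = X.shape
    best_bic, best_S = np.inf, ()
    for r in range(p + 1):
        for S in combinations(range(p), r):
            # design matrix: intercept plus the chosen columns
            Z = np.column_stack([np.ones(n)] + [X[:, j] for j in S])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = float(((y - Z @ beta) ** 2).sum())
            bic = n * np.log(rss / n) + (len(S) + 1) * np.log(n)
            if bic < best_bic:
                best_bic, best_S = bic, S
    return best_S
```

Exhaustive search is exponential in p, so this is only practical for a modest number of candidate variables.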

Discretization Method Based on Quantiles for Variable Selection Using Mutual Information

  • Cha, Woon-Ock; Huh, Moon-Yul
    • Communications for Statistical Applications and Methods, v.12 no.3, pp.659-672, 2005
  • This paper evaluates the discretization of continuous variables for selecting relevant variables in supervised learning using mutual information. Three discretization methods are considered: MDL, Histogram, and 4-Intervals. The discretization and variable subset selection process is evaluated according to classification accuracies on six real data sets from the UCI repository. Results show that the quantile-based 4-Intervals discretization method is robust and efficient for the variable selection process. We also visually evaluate the appropriateness of the selected subsets of variables.
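Quantile-based discretization followed by an MI ranking is straightforward to sketch. This is a generic illustration (quartile cut points and a plug-in MI estimator), not the paper's exact 4-Intervals implementation:

```python
import numpy as np

def discretize_quantiles(x, k=4):
    """Cut a continuous variable into k intervals at its sample quantiles."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(x, edges)

def mi_with_class(x_disc, y):
    """Plug-in mutual information between a discrete variable and class labels."""
    mi = 0.0
    for a in np.unique(x_disc):
        for b in np.unique(y):
            pxy = np.mean((x_disc == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x_disc == a) * np.mean(y == b)))
    return mi
```

Variables can then be ranked by `mi_with_class(discretize_quantiles(x), y)` and the top-scoring subset kept.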

A Study on the Predictors of Criteria on Clothing Selection

  • Shin, Jeong-Won; Park, Eun-Joo
    • Journal of the Korean Society of Costume, v.13, pp.123-134, 1989
  • The purpose of this study was to identify variables that predict the criteria used in clothing selection. Relationships among clothing selection criteria and psychological, lifestyle, and demographic variables were tested with Pearson's correlation coefficients and one-way ANOVA. Predictors of the selection criteria were identified by regression. Consumers were classified into several benefit segments by their selection criteria, and the characteristics of each segment were identified by multiple discriminant analysis. Data were obtained from 593 women living in Pusan through self-administered questionnaires. The results of the study were as follows. 1. Relationships between clothing selection criteria and related variables: 1) The variables important to the selection criteria were "down-to-earth vs. sophisticated", "traditional vs. modern", "conventional vs. different", "conscientious vs. expedient", need for exhibitionism, need for sex, and fashion/appearance. 2) The most important factor in the selection criteria was comfort, which differed significantly across ages. 3) Respondents of higher socioeconomic status made more appearance-oriented selections. 2. Predictors of clothing selection criteria: Several important predictors emerged, including lifestyle, need, and self-image; the fashion/appearance lifestyle variable was especially important. 3. Segmentation by clothing selection criteria: Four groups were identified, namely a practical-oriented group, an appearance-oriented group, a practical-and-appearance-oriented group, and an indifferent group. The significant discriminating variables were the fashion/appearance factor, need for exhibitionism, and need for sex. The results of this study can be used by enterprises to analyze consumers and to build clothing advertisement strategies.


Variable selection in Poisson HGLMs using h-likelihood

  • Ha, Il Do; Cho, Geon-Ho
    • Journal of the Korean Data and Information Science Society, v.26 no.6, pp.1513-1521, 2015
  • Selecting relevant variables for a statistical model is very important in regression analysis. Recently, variable selection methods using a penalized likelihood have been widely studied in various regression models. The main advantage of these methods is that they select important variables and estimate the regression coefficients of the covariates simultaneously. In this paper, we propose a simple procedure based on a penalized h-likelihood (HL) for variable selection in Poisson hierarchical generalized linear models (HGLMs) for correlated count data. To this end, we consider three penalty functions (LASSO, SCAD, and HL) and derive the corresponding variable-selection procedures. The proposed method is illustrated with a practical example.
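As a rough illustration of penalized-likelihood variable selection for count data, the sketch below fits a LASSO-penalized Poisson regression with fixed effects only, by proximal gradient descent; the paper's h-likelihood approach additionally profiles random effects for correlated counts, which is omitted here:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_poisson(X, y, lam, lr=0.05, iters=3000):
    """Proximal gradient descent on the negative Poisson log-likelihood
    (1/n) * sum(exp(x_i'b) - y_i * x_i'b) plus a LASSO penalty lam * |b|_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        mu = np.exp(X @ beta)          # fitted Poisson means
        grad = X.T @ (mu - y) / n      # gradient of the negative log-likelihood
        beta = soft_threshold(beta - lr * grad, lr * lam)
    return beta
```

The soft-thresholding step is what zeroes out coefficients, so selection and estimation happen simultaneously, as the abstract describes.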

Analysis of Client Propensity in Cyber Counseling Using Bayesian Variable Selection

  • Pi, Su-Young
    • International Journal of Fuzzy Logic and Intelligent Systems, v.6 no.4, pp.277-281, 2006
  • Cyber counseling, one of the consultation formats best suited to the information society, enables people to reveal their mental agonies and private problems anonymously, since it does not require a face-to-face interview between counsellor and client. However, few cyber counseling centers provide high-quality, trustworthy service, although their number has increased rapidly. This paper is therefore intended to enable appropriate consultation for each client by analyzing client propensity using Bayesian variable selection. Bayesian variable selection is superior to stepwise regression for finding a regression model: stepwise regression, which has generally been used to analyze individual propensity in linear regression models, is inefficient because its inherent defects make it hard to select a proper model. Based on the case databases of current web-based cyber counseling centers, we analyze clients' propensities using Bayesian variable selection to enable individually targeted counseling and to activate cyber counseling programs.

Variable Selection and Outlier Detection for Automated K-means Clustering

  • Kim, Sung-Soo
    • Communications for Statistical Applications and Methods, v.22 no.1, pp.55-67, 2015
  • An important problem in cluster analysis is the selection of variables that define cluster structure while eliminating noisy variables that mask that structure; outlier detection is likewise a fundamental task for cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The procedure consists of three processes: (i) automatically calculating the cluster number and initial cluster centers whenever a new variable is added, (ii) identifying outliers for each cluster based on the variables used, and (iii) selecting variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering-based and a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective at selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.
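The distance-based half of such a hybrid outlier rule can be sketched as flagging points far from their own cluster center; the mean-plus-z-standard-deviations cutoff below is a hypothetical simplification, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_outliers(X, k, z=4.0, random_state=0):
    """Flag points whose distance to their own cluster center exceeds
    mean + z * std of the within-cluster distances."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    # distance of each point to the center of its assigned cluster
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    flags = np.zeros(len(X), dtype=bool)
    for c in range(k):
        m = km.labels_ == c
        flags[m] = d[m] > d[m].mean() + z * d[m].std()
    return flags
```

A clustering-based check (e.g. whether a point forms a singleton cluster when k is increased) could then be combined with this distance rule to form a hybrid criterion.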

An Additive Sparse Penalty for Variable Selection in High-Dimensional Linear Regression Model

  • Lee, Sangin
    • Communications for Statistical Applications and Methods, v.22 no.2, pp.147-157, 2015
  • We consider a sparse high-dimensional linear regression model. Penalized methods using LASSO or non-convex penalties have been widely used for variable selection and estimation in high-dimensional regression models. In penalized regression, the selection and prediction performances depend on which penalty function is used. For example, LASSO is known to have good prediction performance but tends to select more variables than necessary. In this paper, we propose an additive sparse penalty for variable selection that combines the LASSO and minimax concave (MCP) penalties, designed to retain the good properties of both. We develop an efficient algorithm to compute the proposed estimator by combining a concave-convex procedure with a coordinate descent algorithm. Numerical studies show that the proposed method has better selection and prediction performance than other penalized methods.
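The two penalty ingredients are easy to write down: LASSO grows linearly in |beta|, while MCP flattens out beyond gamma * lambda, which is what reduces its bias on large coefficients. The convex-combination weighting below is a hypothetical way to blend them and is not claimed to be the paper's exact additive penalty:

```python
import numpy as np

def lasso_pen(beta, lam):
    """LASSO penalty: linear in |beta|."""
    return lam * np.abs(beta)

def mcp_pen(beta, lam, gamma=3.0):
    """Minimax concave penalty: quadratic taper, constant beyond gamma*lam."""
    b = np.abs(beta)
    return np.where(b <= gamma * lam,
                    lam * b - b ** 2 / (2 * gamma),
                    0.5 * gamma * lam ** 2)

def additive_pen(beta, lam, alpha=0.5, gamma=3.0):
    """Hypothetical blend: alpha * LASSO + (1 - alpha) * MCP."""
    return alpha * lasso_pen(beta, lam) + (1 - alpha) * mcp_pen(beta, lam, gamma)
```

Near zero the blend behaves like LASSO (encouraging exact zeros); for large coefficients the MCP part stops growing, so the shrinkage bias is smaller than pure LASSO.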