• Title/Summary/Keyword: high dimensional data sets


Extended High Dimensional Clustering using Iterative Two Dimensional Projection Filtering (반복적 2차원 프로젝션 필터링을 이용한 확장 고차원 클러스터링)

  • Lee, Hye-Myeong;Park, Yeong-Bae
    • The KIPS Transactions:PartD / v.8D no.5 / pp.573-580 / 2001
  • Large volumes of high dimensional data contain a significant amount of noise owing to their own sparsity, which makes high dimensional clustering more difficult. CLIP was developed as a clustering algorithm that supports the characteristics of high dimensional data. CLIP is based on incremental one dimensional projection on each axis and finds product sets of the clusters found in each dimension. These product sets contain all high dimensional clusters, but they may also contain noise. In this paper, we propose an extended CLIP algorithm that refines the product sets that contain clusters. We remove high dimensional noise by iteratively applying two dimensional projections to the product sets already found by CLIP. To evaluate the performance of the extended algorithm, we demonstrate its effectiveness through a series of experiments on synthetic data sets.

  • PDF

An SVD-Based Approach for Generating High-Dimensional Data and Query Sets (SVD를 기반으로 한 고차원 데이터 및 질의 집합의 생성)

  • 김상욱
    • The Journal of Information Technology and Database / v.8 no.2 / pp.91-101 / 2001
  • Previous research on the performance evaluation of multidimensional indexes has typically used synthetic data sets distributed uniformly or normally over multidimensional space. However, recent research has shown that these kinds of data sets hardly reflect the characteristics of multimedia database applications. In this paper, we discuss issues in generating high dimensional data and query sets to resolve this problem. We first identify the features of data and query sets that are appropriate for fairly evaluating the performance of multidimensional indexes, and then propose HDDQ_Gen (High-Dimensional Data and Query Generator), which satisfies such features. HDDQ_Gen supports the following features: (1) clustered distributions, (2) various object distributions in each cluster, (3) various cluster distributions, (4) various correlations among different dimensions, (5) query distributions depending on data distributions. Using these features, users are able to control the distribution characteristics of data and query sets. Our contribution is important in that HDDQ_Gen provides a benchmark environment for evaluating multidimensional indexes correctly.

  • PDF
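A minimal sketch of two of the features listed in the abstract above, clustered distributions (1) and query distributions that depend on the data distribution (5), may help make the idea concrete. It is not HDDQ_Gen itself: the function names, the Gaussian cluster shape, the round-robin cluster assignment, and the `spread` and `jitter` parameters are all assumptions made for illustration, and features (2) to (4) are not covered.

```python
import random


def generate_clustered_data(n_points, n_dims, n_clusters, spread=0.05, seed=0):
    """Draw points from Gaussian clusters whose centers lie in the unit cube.

    Hypothetical helper: the Gaussian shape and `spread` are assumptions.
    """
    rng = random.Random(seed)
    centers = [[rng.random() for _ in range(n_dims)] for _ in range(n_clusters)]
    points = []
    for i in range(n_points):
        center = centers[i % n_clusters]  # assign points to clusters round-robin
        points.append([rng.gauss(c, spread) for c in center])
    return centers, points


def generate_queries(points, n_queries, jitter=0.01, seed=1):
    """Draw query points near existing data points so queries follow the data."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n_queries):
        base = rng.choice(points)  # dense regions are picked more often
        queries.append([rng.gauss(x, jitter) for x in base])
    return queries


centers, data = generate_clustered_data(n_points=1000, n_dims=16, n_clusters=4)
queries = generate_queries(data, n_queries=50)
print(len(data), len(data[0]), len(queries))  # prints: 1000 16 50
```

Because each query is a perturbed copy of a randomly chosen data point, dense regions of the data receive proportionally more queries, which is the property that uniformly generated query sets lack.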

Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data

  • Mehmood, Tahir;Rasheed, Zahid
    • Communications for Statistical Applications and Methods / v.22 no.6 / pp.575-587 / 2015
  • Advances in data collection techniques produce high dimensional data sets, in which discrimination is an important and commonly encountered problem that is crucial to resolve when the data are heterogeneous (a non-common variance-covariance structure across classes). An example is the classification of microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important for studying evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (a common variance-covariance structure for all classes), which is not guaranteed in most high dimensional data sets. We introduce regularized elimination in partial least squares coupled with QDA (rePLS-QDA) for parsimonious variable selection and classification of high dimensional heterogeneous data sets, based on the recently introduced regularized elimination for variable selection in partial least squares (rePLS) and the heterogeneous classification procedure quadratic discriminant analysis (QDA). The proposed and existing methods are compared on a simulated data set; in addition, the proposed procedure is applied to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory and ranges from 89.1% to 100% on test data. Interesting codon/bi-codon usages and their mutual interactions that influence the respective habitat preferences are identified. The proposed method also produced results that concur with known biological characteristics, which will help researchers better understand the divergence of species.
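To illustrate why a heterogeneous variance structure matters for classification, the following sketch implements quadratic discriminant analysis in the simplest possible setting, a single variable with a separate variance per class. It is not the rePLS-QDA procedure from the abstract: the function names and the toy data are assumptions, and no variable selection step is included.

```python
import math


def fit_qda_1d(samples_by_class):
    """Estimate a mean, variance, and prior for each class from 1-D samples."""
    total = sum(len(xs) for xs in samples_by_class.values())
    params = {}
    for label, xs in samples_by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
        params[label] = (mean, var, len(xs) / total)
    return params


def qda_score(x, mean, var, prior):
    """Log prior plus Gaussian log density, with shared constants dropped."""
    return math.log(prior) - 0.5 * math.log(var) - (x - mean) ** 2 / (2.0 * var)


def predict(x, params):
    """Assign x to the class with the highest quadratic discriminant score."""
    return max(params, key=lambda label: qda_score(x, *params[label]))


# Toy data (assumed): both classes share the same mean but differ in variance.
train = {
    "narrow": [-0.2, -0.1, 0.0, 0.1, 0.2],
    "wide": [-4.0, -2.0, 0.0, 2.0, 4.0],
}
params = fit_qda_1d(train)
print(predict(0.05, params), predict(3.0, params))  # prints: narrow wide
```

In this toy example the two classes have identical means and equal priors, so a rule that assumes a common variance would score them equally everywhere; the class-specific variance term is what separates them.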

Similarity Measure Design on High Dimensional Data

  • Nipon, Theera-Umpon;Lee, Sanghyuk
    • Journal of the Korea Convergence Society / v.4 no.1 / pp.43-48 / 2013
  • A similarity measure for high dimensional data is designed. The similarity between high dimensional data is considered by analysing neighbor information with respect to the data sets. The result can be applied to big data, because big data has multiple characteristics compared with a simple data set; the analysis of high dimensional data can therefore serve as a preliminary study for big data. The high dimensional data analysis is also compared with the conventional similarity measure. The traditional similarity measure on overlapped data is illustrated, and its application to non-overlapped data is carried out. Its usefulness is established by mathematical proof and verified by calculating the similarity for an artificial data example.

Flow Visualization Model Based on B-spline Volume (비스플라인 부피에 기초한 유동 가시화 모델)

  • 박상근;이건우
    • Korean Journal of Computational Design and Engineering / v.2 no.1 / pp.11-18 / 1997
  • Scientific volume visualization addresses the representation, manipulation, and rendering of volumetric data sets, providing mechanisms for looking closely into structures and understanding their complexity and dynamics. In the past several years, a tremendous amount of research and development has been directed toward algorithms and data modeling methods for scientific data visualization, but there has been very little work on developing a mathematical volume model that feeds this visualization. In flow visualization especially, a volume model has long been required as guidance for displaying the very large amounts of data resulting from numerical simulations. In this paper, we focus on the mathematical representation of volumetric data sets and on the method of extracting meaningful information from the derived volume model. For this purpose, a B-spline volume is extended to a high dimensional trivariate model, which is called a flow visualization model in this paper. Two three-dimensional examples are presented to demonstrate the capabilities of this model.

  • PDF

Feature reduction for classifying high dimensional data sets using support vector machine (고차원 데이터의 분류를 위한 서포트 벡터 머신을 이용한 피처 감소 기법)

  • Ko, Seok-Ha;Lee, Hyun-Ju
    • Proceedings of the IEEK Conference / 2008.06a / pp.877-878 / 2008
  • We suggest a feature reduction method for classifying mouse function data sets, which integrate several biological data sets represented as high dimensional vectors. To increase classification accuracy and decrease computational overhead, it is important to reduce the dimension of the features. To do this, we employed a Hybrid Huberized Support Vector Machine with the kernels used for a kernel logistic regression method. Compared with a support vector machine, this approach shows better accuracy with useful features for each mouse function.

  • PDF

Canonical Correlation Biplot

  • Park, Mi-Ra;Huh, Myung-Hoe
    • Communications for Statistical Applications and Methods / v.3 no.1 / pp.11-19 / 1996
  • Canonical correlation analysis is a multivariate technique for identifying and quantifying the statistical relationship between two sets of variables. Like most multivariate techniques, the main objective of canonical correlation analysis is to reduce the dimensionality of the dataset. It would be particularly useful if high dimensional data could be represented in a low dimensional space. In this study, we construct statistical graphs for paired sets of multivariate data. Specifically, plots of the observations as well as the variables are proposed. We discuss the geometric interpretation and goodness-of-fit of the proposed plots. We also provide a numerical example.

  • PDF

Dimension reduction for high-dimensional data via mixtures of common factor analyzers - an application to tumor classification

  • Baek, Jang-Sun
    • Journal of the Korean Data and Information Science Society / v.19 no.3 / pp.751-759 / 2008
  • Mixtures of factor analyzers (MFA) are useful for modeling the distribution of high-dimensional data in a much lower dimensional space where the number of observations is very large relative to their dimension. Mixtures of common factor analyzers (MCFA) can further reduce the number of parameters in the specification of the component covariance matrices when the number of classes is not small. Moreover, the factor scores of MCFA can be displayed in a low-dimensional space to distinguish the groups. We propose the factor scores of MCFA as new low-dimensional features for the classification of high-dimensional data. Compared with conventional dimension reduction methods such as principal component analysis (PCA) and canonical covariates (CV), the proposed factor scores were shown to yield higher correct classification rates on three real data sets when used in parametric and nonparametric classifiers.

  • PDF

High-Dimensional Clustering Technique using Incremental Projection (점진적 프로젝션을 이용한 고차원 글러스터링 기법)

  • Lee, Hye-Myung;Park, Young-Bae
    • Journal of KIISE:Databases / v.28 no.4 / pp.568-576 / 2001
  • Most clustering algorithms tend to degenerate rapidly in high dimensional spaces. Moreover, high dimensional data often contain a significant amount of noise, which makes the algorithms even less effective. It is therefore necessary to develop algorithms adapted to the structure and characteristics of high dimensional data. In this paper, we propose a clustering algorithm, CLIP, that uses projection. CLIP is designed to overcome efficiency and/or effectiveness problems in high dimensional clustering. It is based on clustering on each one dimensional subspace, but we use incremental projection to recover high dimensional clusters and to reduce the computational cost significantly at the same time. To evaluate the performance of CLIP, we demonstrate its efficiency and effectiveness through a series of experiments on synthetic data sets.

  • PDF
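The abstract above describes clustering on one dimensional subspaces combined with incremental projection, without giving details. The sketch below shows one way such a scheme could work and should not be read as the authors' CLIP algorithm: the histogram-based density test, the `bins` and `min_count` parameters, the assumption that coordinates lie in [0, 1), and all function names are illustrative choices.

```python
def bin_of(value, bins):
    """Histogram bin index of a coordinate assumed to lie in [0, 1)."""
    return min(int(value * bins), bins - 1)


def dense_intervals(values, bins, min_count):
    """Return (lo_bin, hi_bin) runs of adjacent bins holding >= min_count values."""
    counts = [0] * bins
    for v in values:
        counts[bin_of(v, bins)] += 1
    runs, start = [], None
    for b in range(bins + 1):  # the extra step closes a run that ends at the last bin
        dense = b < bins and counts[b] >= min_count
        if dense and start is None:
            start = b
        elif not dense and start is not None:
            runs.append((start, b - 1))
            start = None
    return runs


def incremental_projection(points, bins=10, min_count=3, axis=0):
    """Project onto one axis at a time, keeping only points in dense intervals."""
    if not points:
        return []
    if axis == len(points[0]):
        return [points]  # survived every axis: report as one cluster
    clusters = []
    for lo, hi in dense_intervals([p[axis] for p in points], bins, min_count):
        kept = [p for p in points if lo <= bin_of(p[axis], bins) <= hi]
        clusters.extend(incremental_projection(kept, bins, min_count, axis + 1))
    return clusters


# Toy 3-D data (assumed): two tight clusters plus four scattered noise points.
cluster_a = [(0.11, 0.12, 0.13), (0.12, 0.14, 0.11), (0.13, 0.11, 0.15),
             (0.14, 0.13, 0.12), (0.15, 0.15, 0.14)]
cluster_b = [(0.81, 0.82, 0.83), (0.82, 0.84, 0.81), (0.83, 0.81, 0.85),
             (0.84, 0.83, 0.82), (0.85, 0.85, 0.84)]
noise = [(0.12, 0.85, 0.45), (0.83, 0.13, 0.55),
         (0.45, 0.45, 0.45), (0.55, 0.12, 0.83)]
clusters = incremental_projection(cluster_a + cluster_b + noise)
print([len(c) for c in clusters])  # prints: [5, 5]
```

Each later projection only examines the points that survived the earlier ones, so noise points that share a dense interval with a cluster on one axis are dropped as soon as they fall outside it on another, and the amount of data handled shrinks with every step.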

A Feature Vector Selection Method for Cancer Classification

  • Yun, Zheng;Keong, Kwoh-Chee
    • Proceedings of the Korean Society for Bioinformatics Conference / 2005.09a / pp.23-28 / 2005
  • The high dimensionality and insufficiency of gene expression profiles and proteomic profiles make feature selection a critical step in efficiently building accurate models for cancer problems based on such data sets. In this paper, we use a method called the Discrete Function Learning algorithm to find discriminatory feature vectors based on information theory. The target feature vectors contain all or most of the information (in terms of entropy) of the class attribute. Two data sets are selected to validate our approach: one leukemia subtype gene expression data set and one ovarian cancer proteomic data set. The experimental results show that our method generalizes well when applied to these insufficient and high-dimensional data sets. Furthermore, the obtained classifiers are highly understandable and accurate.

  • PDF
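The criterion stated in the abstract above, a feature vector that contains all of the class attribute's information in terms of entropy, can be illustrated with a small exhaustive search. This is a simplified sketch rather than the Discrete Function Learning algorithm itself: the function names, the `max_size` and `tol` parameters, the brute-force search order, and the toy data are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(rows, feature_idx, classes):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for feature subset X and class Y."""
    x = [tuple(row[i] for i in feature_idx) for row in rows]
    joint = list(zip(x, classes))
    return entropy(x) + entropy(classes) - entropy(joint)


def select_features(rows, classes, max_size=2, tol=1e-9):
    """Return the first smallest feature subset whose information covers H(Y)."""
    target = entropy(classes)
    for size in range(1, max_size + 1):
        for subset in combinations(range(len(rows[0])), size):
            if mutual_information(rows, subset, classes) >= target - tol:
                return subset
    return None  # no subset up to max_size determines the class


# Toy data (assumed): the class is the XOR of features 0 and 2; feature 1 is irrelevant.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1),
        (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1)]
classes = [a ^ c for a, _, c in rows]
print(select_features(rows, classes))  # prints: (0, 2)
```

No single feature carries any information about the class in this example, yet features 0 and 2 together determine it completely, which is why the search has to score feature vectors rather than individual features.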