Journal of the Korean Data and Information Science Society
Korean Data and Information Science Society
Comparison and analysis of multiple testing methods for microarray gene expression data
Seo, Sumin ; Kim, Tae Houn ; Kim, Jaehee ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 971~986
DOI : 10.7465/jkdi.2014.25.5.971
When thousands of hypotheses are tested simultaneously, the probability of rejecting at least one true hypothesis increases, creating a severe multiplicity problem. To address this, researchers have proposed a range of multiple testing methods that control the family-wise error rate (FWER), the false discovery rate (FDR) or the false nondiscovery rate (FNR) as the error measure, combined with various test statistics. In this article, we discuss the Bonferroni (1960), Holm (1979), Benjamini and Hochberg (1995) and Benjamini and Yekutieli (2001) procedures based on T statistics, modified T statistics or local-pooled-error (LPE) statistics. We also consider the Sun and Cai (2007) procedure based on Z statistics. These procedures are compared in a simulation study and applied to Arabidopsis microarray gene expression data to identify differentially expressed genes.
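To make the difference between FWER and FDR control concrete, the following is a minimal Python sketch of three of the procedures named in the abstract above (Bonferroni, Holm, and Benjamini and Hochberg), applied to a list of p-values. It is not the authors' code: the function names and the example p-values are hypothetical, and the T, modified T and LPE statistics that would produce the p-values are not shown.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (single-step FWER control)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value with alpha / (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank is 0-based, so the divisor is m - rank
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: find the largest k with p_(k) <= k * alpha / m, reject the k smallest."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Hypothetical p-values, for illustration only.
example_pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(example_pvals)),
      sum(holm(example_pvals)),
      sum(benjamini_hochberg(example_pvals)))
```

On these example values the FDR-controlling procedure rejects more hypotheses than the two FWER-controlling ones, which is the usual trade-off between the two error criteria.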
Derivation of benchmark dose lower limit of lead for ADHD based on a longitudinal cohort data set
Kim, Byung Soo ; Kim, Daehee ; Ha, Mina ; Kwon, Ho-Jang ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 987~998
DOI : 10.7465/jkdi.2014.25.5.987
The primary purpose of this paper is to derive a benchmark dose lower limit (BMDL) of lead for attention deficit/hyperactivity disorder (ADHD) based on a longitudinal cohort data set referred to as the CHEER data set. The CHEER cohort was recently recruited by the Ministry of Environment of South Korea to investigate the effect of the environment on children's health. We first confirm the correlation of ADHD with the blood lead level using a linear mixed effect model. We report, from the longitudinal characteristics of the CHEER data, that ADHD scores tend to show "regression to the mean". A dose-response curve of blood lead level, with ADHD as the end point, is derived, and from this curve several BMDLs are derived under corresponding assumptions on the benchmark region.
Microarray data analysis using relative hierarchical clustering
Woo, Sook Young ; Lee, Jae Won ; Jhun, Myoungshic ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 999~1009
DOI : 10.7465/jkdi.2014.25.5.999
Hierarchical clustering analysis makes it easy to explore massive microarray data and to understand biological phenomena through a dendrogram. However, because hierarchical clustering algorithms consider only absolute similarity, it is difficult for them to capture relative dissimilarity, which takes into account not only the distance between a pair of clusters but also how distant they are from the rest of the clusters. In this study, we introduce the relative hierarchical clustering method proposed by Mollineda and Vidal (2000) and compare the hierarchical clustering method with the relative hierarchical method using simulated data and real data in various situations. The quality of the two hierarchical methods was evaluated using the percentage of incorrectly grouped points (PIGP), homogeneity and separation.
Major SNP identification for oleic acid and marbling score which are associated with Korean cattle
Oh, Dong-Yep ; Yeo, Jung-Sou ; Lee, Jea-Young ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1011~1024
DOI : 10.7465/jkdi.2014.25.5.1011
This study aims to identify the relationship between unsaturated fatty acids, which are indicators of beef flavor, and unsaturated fatty acid biosynthetic enzymes, which are associated with SNPs in the SCD, SREBPs,
, FABP4, FASN and LPL in the Hanwoo population. For the analysis of fatty acids in Hanwoo, we used Hanwoo steers (n
Efficient strategy for the genetic analysis of related samples with a linear mixed model
Lim, Jeongmin ; Sung, Joohon ; Won, Sungho ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1025~1038
DOI : 10.7465/jkdi.2014.25.5.1025
The linear mixed model has often been used for genetic association analysis with family-based samples. The correlation matrix for family-based samples is constructed from kinship coefficients and assumes that parental phenotypes are independent and that the correlation between parent and offspring is the same as that between siblings. However, there are, for instance, positive correlations between parental heights, which indicates that this assumption on the correlation matrix is often violated. Statistical validity and power are affected by the appropriateness of the assumed variance-covariance matrix, and in this paper we provide a linear mixed model with a flexible variance-covariance matrix. Our results show that the proposed method is usually more efficient than existing approaches, and its application to a genome-wide association study of body mass index illustrates its practical value in real data analysis.
Independence tests using coin package in R
Kim, Jinheum ; Lee, Jung-Dong ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1039~1055
DOI : 10.7465/jkdi.2014.25.5.1039
The distribution of a test statistic under a null hypothesis depends on the unknown distribution of the data and thus is unknown as well. Conditional tests replace the unknown null distribution by the conditional null distribution, that is, the distribution of the test statistic given the observed data. This approach is known as the permutation test and was developed by Fisher (1935). A theoretical framework for permutation tests was given by Strasser and Weber (1999). The coin package developed by Hothorn et al. (2006, 2008) implements a unified approach to conditional inference via the generic independence test. Because convenient functions for the most prominent problems are available, users will rarely have to call the extremely flexible generic procedure directly. In this article we briefly review the underlying theory from Strasser and Weber (1999) and explain how to transform the data to perform the generic independence test. Finally, we illustrate the approach with a few real data sets.
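The idea of a conditional (permutation) null distribution described in the abstract above can be sketched in a few lines. The example below is a plain-Python illustration, not the coin package API and not the authors' code: it enumerates every reassignment of the pooled observations to two groups and uses the absolute difference in group means as the test statistic. The function name and the data are hypothetical.

```python
from itertools import combinations


def exact_permutation_pvalue(x, y):
    """Exact two-sided permutation p-value for a difference in group means.

    The null distribution is the distribution of the statistic over all ways
    of assigning the pooled observations to groups of the observed sizes,
    i.e. the distribution conditional on the observed data.
    """
    pooled = list(x) + list(y)
    n_x, n_y = len(x), len(y)
    total = sum(pooled)

    def statistic(sum_x):
        # Absolute difference between the two group means.
        return abs(sum_x / n_x - (total - sum_x) / n_y)

    observed = statistic(sum(x))
    n_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_x):
        n_splits += 1
        # A small tolerance guards against floating-point ties.
        if statistic(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_splits


# Hypothetical data: two completely separated groups of size 3.
print(exact_permutation_pvalue([1, 2, 3], [4, 5, 6]))
```

Full enumeration is only feasible for small samples; for larger ones the conditional distribution is usually approximated by Monte Carlo resampling or by an asymptotic result.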
An extension of multifactor dimensionality reduction method for detecting gene-gene interactions with the survival time
Oh, Jin Seok ; Lee, Seung Yeoun ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1057~1067
DOI : 10.7465/jkdi.2014.25.5.1057
Many genetic variants have been identified as associated with complex diseases such as hypertension, diabetes and cancers through genome-wide association studies (GWAS). However, a serious missing heritability problem remains, since the proportion explained by genetic variants from GWAS is small, less than 10~15%. Gene-gene interaction studies may help explain the missing heritability because most complex disease mechanisms involve more than a single SNP, that is, multiple SNPs or gene-gene interactions. This paper focuses on gene-gene interactions with a survival phenotype by extending the multifactor dimensionality reduction (MDR) method to the accelerated failure time (AFT) model. The standardized residual from the AFT model is used as a residual score for classifying multi-locus genotypes into high and low risk groups, and the MDR algorithm is then applied. We call this method AFT-MDR and compare its power with those of Surv-MDR and Cox-MDR in simulation studies. A real data set of Korean leukemia patients is also analyzed. We found that the power of AFT-MDR is greater than that of Surv-MDR and comparable with that of Cox-MDR, but is very sensitive to the censoring fraction.
Functional clustering for clubfoot data: A case study
Lee, Miae ; Lim, Johan ; Park, Chungun ; Lee, Kyeong Eun ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1069~1077
DOI : 10.7465/jkdi.2014.25.5.1069
A clubfoot is a congenital deformity in which the foot is internally rotated at the ankle. In this paper, we cluster the curves of relative differences between regular and operated feet. Since these curves are irregular and sparsely sampled, general clustering models cannot be applied. We therefore apply the clustering model for sparsely sampled functional data by James and Sugar (2003) and estimate its parameters using the EM algorithm. The number of clusters is determined by the distortion function (Sugar and James, 2003), and two clusters of curves are found.
Efficient variable selection method using conditional mutual information
Ahn, Chi Kyung ; Kim, Donguk ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1079~1094
DOI : 10.7465/jkdi.2014.25.5.1079
In this paper, we study efficient gene selection methods using conditional mutual information. We suggest gene selection methods using conditional mutual information based on semiparametric methods that use the multivariate normal distribution and the Edgeworth approximation. We compare the suggested methods with other methods, namely the mutual information filter, SVM-RFE, and the gene selection method of Cai et al. (2009) (MIGS-original), in SVM classification. These experiments show that gene selection methods using conditional mutual information based on semiparametric methods perform better than the mutual information filter. Furthermore, they take far less computing time than the method of Cai et al. (2009) while achieving similar performance.
Symbolic tree based model for HCC using SNP data
Lee, Tae Rim ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1095~1106
DOI : 10.7465/jkdi.2014.25.5.1095
Symbolic data analysis (SDA) extends data mining and exploratory data analysis to knowledge mining. We suggest an SDA tree model for clinical and genomic data based on this new knowledge-mining approach. By applying SDA to large-scale genomic SNP data, we obtain correlations and show that the hidden structure of HCC data can be understood. We confirm the validity of applying SDA to a tree-structured progression model and to quantifying clinical laboratory data and SNP data for the early diagnosis of HCC. The proposed model provides a representative model for HCC survival time and its causal association with SNP gene data. We fit a simple, easily interpretable tree-structured survival model reduced from large clinical and genomic data under the new statistical theory of knowledge mining with SDA.
Recent developments of constructing adjacency matrix in network analysis
Hong, Younghee ; Kim, Choongrak ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1107~1116
DOI : 10.7465/jkdi.2014.25.5.1107
In this paper, we review recent developments in network analysis using graph theory, and introduce ongoing research areas with relevant theoretical results. Specifically, we introduce basic graph notation and the conditional and marginal approaches to constructing the adjacency matrix. We also introduce the Marcenko-Pastur law, the Tracy-Widom law, the white Wishart distribution, and the spiked distribution. Finally, we discuss the relationship between degrees and eigenvalues for the detection of hubs in a network.
Analysis of recurrent event data with incomplete observation gaps using piecewise models
Kim, Yang-Jin ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1117~1125
DOI : 10.7465/jkdi.2014.25.5.1117
In a longitudinal study, subjects can experience the same type of event repeatedly. There may also be intermittent dropouts, resulting in repeated observation gaps during which no recurrent events are observed. Furthermore, when such observation gaps are incomplete because their termination times are unknown, ordinary approaches result in biased estimates. In this study, we investigate the effect of ignoring observation gaps and propose methods to overcome this problem. To estimate the distribution of the unknown termination times, an interval-censored mechanism is applied and two cases are considered. Simulation studies are carried out to evaluate the performance of the proposed method. Conviction data of young drivers with several suspensions are analyzed to illustrate the suggested approach.
Adjusting sampling bias in case-control genetic association studies
Seo, Geum Chu ; Park, Taesung ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1127~1135
DOI : 10.7465/jkdi.2014.25.5.1127
Genome-wide association studies (GWAS) are designed to discover genetic variants such as single nucleotide polymorphisms (SNPs) that are associated with human complex traits. Although there is increasing interest in applying GWAS methodologies to population-based cohorts, many published GWAS have adopted a case-control design, which raises an issue of sampling bias in both case and control samples. Because of unequal selection probabilities between cases and controls, the samples are not representative of the population that they are purported to represent. Therefore, non-random sampling in a case-control study can potentially lead to inconsistent and biased estimates of SNP-trait associations. In this paper, we propose inverse-probability-of-sampling weights based on disease prevalence to eliminate case-control sampling bias in estimation and testing for association between SNPs and quantitative traits. We apply the proposed method to data from the Korea Association Resource project and show that the standard estimators applied to the weighted data yield unbiased estimates.
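The weighting idea in the abstract above can be illustrated with a small Python sketch. The abstract does not give the exact form of the weights, so the version below is an assumption: each subject is weighted by the ratio of its group's share in the population (set by the disease prevalence) to its share in the sample, which is one common way to build inverse-probability-of-sampling weights. The function names and the toy data are hypothetical.

```python
def sampling_weights(is_case, prevalence):
    """Weight = population share of the subject's group / sample share of that group.

    Assumed form of the weights, not taken from the paper:
    cases get prevalence / (sample case fraction) and
    controls get (1 - prevalence) / (sample control fraction).
    """
    n = len(is_case)
    n_case = sum(1 for c in is_case if c)
    if n_case == 0 or n_case == n:
        raise ValueError("the sample must contain both cases and controls")
    case_fraction = n_case / n
    w_case = prevalence / case_fraction
    w_control = (1 - prevalence) / (1 - case_fraction)
    return [w_case if c else w_control for c in is_case]


def weighted_mean(values, weights):
    """Weighted mean of a quantitative trait."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy example: cases are oversampled (50% of the sample, 10% of the population).
is_case = [True, True, False, False]
trait = [10.0, 10.0, 0.0, 0.0]
weights = sampling_weights(is_case, prevalence=0.1)
print(sum(trait) / len(trait))        # unweighted sample mean
print(weighted_mean(trait, weights))  # weighted mean
```

In this toy example the unweighted mean reflects the oversampling of cases, while the weighted mean matches the value implied by the assumed prevalence (0.1 × 10 + 0.9 × 0). The same weights could be passed to a weighted regression of the trait on SNP genotype.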
Contemporary review on the bifurcating autoregressive models: Overview and perspectives
Hwang, S.Y. ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1137~1149
DOI : 10.7465/jkdi.2014.25.5.1137
Since the bifurcating autoregressive (BAR) model was developed by Cowan and Staudte (1986) to analyze cell lineage data, a great deal of research has been devoted to BAR and its generalizations. Based mainly on the author's work, this paper presents a contemporary review of the BAR model in terms of an overview and perspectives. Specifically, the bifurcating structure is extended to multi-cast tree and branching tree structures. The AR(1) time series model of Cowan and Staudte (1986) is generalized to tree-structured random processes. Branching correlations between individuals sharing the same parent are introduced and discussed. Various methods for estimating parameters and the related asymptotics are also reviewed. The paper thus aims to give a contemporary overview of the BAR model, providing some perspectives on future work in this area.
A modified partial least squares regression for the analysis of gene expression data with survival information
Lee, So-Yoon ; Huh, Myung-Hoe ; Park, Mira ;
Journal of the Korean Data and Information Science Society, volume 25, issue 5, 2014, Pages 1151~1160
DOI : 10.7465/jkdi.2014.25.5.1151
In DNA microarray studies, the number of genes far exceeds the number of samples and the gene expression measures are highly correlated. Partial least squares regression (PLSR) is a popular method for dimension reduction, and several studies have shown it to be useful for classifying microarray data. In this study, we suggest a modified version of partial least squares regression to analyze gene expression data with survival information. The method is designed as a new gene selection method using PLSR with an iterative procedure for imputing censored survival times. The mean square error of prediction criterion is used to determine the dimension of the model. To visualize the data, plots of variables superimposed with samples are used. The method is applied to two microarray data sets, both containing survival times. The results show that the proposed method works well for interpreting gene expression microarray data.