Estimation of Gini-Simpson index for SNP data

  • Kang, Joonsung (Department of Information Statistics, Gangneung-Wonju National University)
  • Received : 2017.09.29
  • Accepted : 2017.11.01
  • Published : 2017.11.30

Abstract

We consider high-dimensional low sample size (HDLSS) genomic sequences whose response categories have no ordering. When constructing an appropriate test statistic for this model, the classical multivariate analysis of variance (MANOVA) approach may not be useful because the number of parameters is very large and the sample size is very small. For these reasons, we present a pseudo-marginal model based on the Gini-Simpson index, estimated via a Bayesian approach. In view of the small sample size, we consider the permutation distribution generated by all $n!$ equally likely permutations of the pooled sample observations across the $G$ groups (of sizes $n_1, \ldots, n_G$). We simulate data and apply the false discovery rate (FDR) and positive false discovery rate (pFDR) procedures, with the proposed test statistics, to the data. We also analyze real SARS data and compute the FDR and pFDR. By exact conditional permutation theory, the FDR and pFDR procedures, together with the associated test statistics for each gene, control the FDR and pFDR, respectively, at any level $\alpha$ for the set of p-values.
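As a rough illustration of the ingredients named in the abstract, the following Python sketch computes a Gini-Simpson index for one categorical position, obtains a permutation p-value for a between-group difference, and applies a step-up FDR rule to a set of p-values. Several choices here are assumptions and not the paper's method: the index is the simple plug-in estimate $1 - \sum_c \hat{p}_c^2$ rather than a Bayesian estimate, the test statistic (pooled index minus the size-weighted within-group index) is a hypothetical stand-in, the permutation distribution is sampled by Monte Carlo instead of enumerating all $n!$ permutations, and the FDR step uses the Benjamini-Hochberg rule. All function names are illustrative.

```python
import random
from collections import Counter


def gini_simpson(labels):
    """Plug-in Gini-Simpson index 1 - sum_c p_c^2 (assumed estimator, not Bayesian)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((k / n) ** 2 for k in counts.values())


def between_group_stat(groups):
    """Hypothetical statistic: pooled index minus size-weighted within-group index."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    within = sum(len(g) / n * gini_simpson(g) for g in groups)
    return gini_simpson(pooled) - within


def permutation_pvalue(groups, n_perm=2000, seed=0):
    """Monte Carlo approximation to the permutation p-value (not full enumeration)."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    observed = between_group_stat(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-split the shuffled pooled sample into groups of the original sizes.
        start, perm_groups = 0, []
        for s in sizes:
            perm_groups.append(pooled[start:start + s])
            start += s
        # Small tolerance guards against floating-point ties.
        if between_group_stat(perm_groups) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value is under its threshold
    return sorted(order[:cutoff])
```

In use, one would compute `permutation_pvalue` separately for each position (gene) and pass the resulting list to `benjamini_hochberg`; with very small groups the full permutation set could be enumerated instead of sampled.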
