A Study on Gene Search Using Test for Interval Data

  • 이성건 (Department of Statistics, Sungshin Women's University)
  • Received : 2018.11.20
  • Accepted : 2018.12.20
  • Published : 2018.12.31


The methylation score is expressed as a proportion of the methylation status data obtained from repeated sequencing, so it takes a value between 0 and 1. Applying the t-test directly to compare methylation scores between populations therefore violates the normality assumption. In addition, because the score can vary with the number of sequencing repetitions used to generate it, a method that accounts for this uncertainty is also needed. In this paper, we introduce symbolic data analysis and the interval K-S (Kolmogorov-Smirnov) test, which treat each observation as an interval that carries its uncertainty rather than as a single numerical value. The Beta distribution, rather than the normal distribution, is used when converting the scores into interval data, so that the characteristics of the methylation score are reflected. The properties of the proposed method were examined using sequencing data from actual patients and normal subjects. While the t-test can only test for a difference in location, the interval K-S statistic was found to test not only the location parameter but also heterogeneity of the distribution function.
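
The abstract does not give the exact construction, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes each site is summarized by a count of methylated reads out of a total, that the interval is taken as the mean plus or minus one standard deviation of a Beta distribution with a uniform prior, and that the empirical distribution function spreads each interval uniformly over its range, as is common in symbolic data analysis. The function names (`beta_interval`, `interval_ecdf`, `interval_ks`) and the `prior` parameter are hypothetical and do not come from the paper.

```python
from math import sqrt


def beta_interval(methylated, total, prior=1.0):
    """Interval [mean - sd, mean + sd] of Beta(methylated + prior, unmethylated + prior).

    Assumption: a one-standard-deviation interval; the result is clipped to [0, 1].
    """
    a = methylated + prior
    b = (total - methylated) + prior
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return max(0.0, mean - sd), min(1.0, mean + sd)


def interval_ecdf(intervals, x):
    """Empirical distribution function at x, treating each interval as uniform."""
    total = 0.0
    for lo, hi in intervals:
        if x >= hi:
            total += 1.0                      # interval lies entirely at or below x
        elif x > lo:
            total += (x - lo) / (hi - lo)     # fraction of the interval below x
    return total / len(intervals)


def interval_ks(sample_a, sample_b):
    """Largest absolute gap between the two interval-based distribution functions.

    For intervals of positive width both functions are piecewise linear, so the
    maximum gap is attained at one of the interval endpoints.
    """
    grid = sorted({end for interval in sample_a + sample_b for end in interval})
    return max(abs(interval_ecdf(sample_a, x) - interval_ecdf(sample_b, x)) for x in grid)


# Illustrative counts only (methylated reads, total reads); not data from the study.
patients = [beta_interval(m, n) for m, n in [(8, 10), (45, 50), (17, 20)]]
normals = [beta_interval(m, n) for m, n in [(2, 10), (10, 50), (3, 20)]]
print(interval_ks(patients, normals))
```

In this sketch a site sequenced fewer times gets a wider interval, which is one way the dependence on the number of sequencing repetitions can enter the analysis. A significance threshold for the statistic is not shown because the abstract does not state how it is obtained.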


Supported by : Sungshin Women's University