Evaluation of the classification method using ancestry SNP markers for ethnic group

  • Received : 2018.01.10
  • Accepted : 2018.12.13
  • Published : 2019.01.31


Various probabilistic methods have been proposed for using interpopulation allele frequency differences to infer the ethnic group of a DNA specimen. The choice of statistical method is critical because the accuracy of the classification results varies from method to method. For ancestry classification, we propose a new ancestry evaluation method that estimates a combined ethnicity index, and we compare its performance with various classical classification methods using two real data sets. We selected 13 single nucleotide polymorphisms (SNPs) that are useful for the inference of ethnic origin. These SNPs were analyzed by restriction fragment mass polymorphism assay, and the resulting genotypes were used to classify individuals among ethnic groups. We genotyped 400 individuals from four ethnic groups (100 African-American, 100 Caucasian, 100 Korean, and 100 Mexican-American) for the 13 SNPs, whose allele frequencies differed among the four ethnic groups. Additionally, we applied our new method to HapMap SNP genotypes for 1,011 samples from four populations (African, European, East Asian, and Central-South Asian). Our proposed method yielded the highest accuracy among the statistical classification methods compared. Our ethnic group classification system, based on the analysis of ancestry informative SNP markers, can provide a useful statistical tool for identifying ethnic groups.
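
As a rough illustration of how interpopulation allele frequency differences can be used to infer the group of a specimen, the following Python sketch computes a simple genotype likelihood per population and normalizes it into posterior probabilities. It is not the combined ethnicity index proposed here, which the abstract does not specify. It assumes Hardy-Weinberg proportions, independent SNPs, and equal priors. The population labels, the allele frequencies in FREQS, and the function names are all hypothetical.

```python
import math

# Hypothetical reference-allele frequencies for three SNPs in three populations.
# These values are illustrative only and are NOT taken from the study.
FREQS = {
    "pop_A": [0.90, 0.20, 0.65],
    "pop_B": [0.15, 0.75, 0.60],
    "pop_C": [0.50, 0.50, 0.10],
}


def genotype_log_likelihood(genotype, freqs):
    """Log-likelihood of a genotype given one population's allele frequencies.

    genotype: reference-allele counts (0, 1, or 2), one per SNP.
    Assumes Hardy-Weinberg proportions and independence across SNPs.
    """
    total = 0.0
    for count, p in zip(genotype, freqs):
        q = 1.0 - p
        prob = {0: q * q, 1: 2.0 * p * q, 2: p * p}[count]
        total += math.log(prob)
    return total


def classify(genotype, pop_freqs=FREQS):
    """Return (most probable population, posterior probabilities), equal priors."""
    log_liks = {
        pop: genotype_log_likelihood(genotype, freqs)
        for pop, freqs in pop_freqs.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_liks.values())
    weights = {pop: math.exp(ll - top) for pop, ll in log_liks.items()}
    norm = sum(weights.values())
    posteriors = {pop: w / norm for pop, w in weights.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


if __name__ == "__main__":
    best, posteriors = classify([2, 0, 2])
    print(best, posteriors)
```

With real data, the frequencies would be estimated from genotyped reference samples, and frequencies of exactly 0 or 1 would need smoothing before taking logarithms.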


Supported by : National Research Foundation (NRF)

