Survey on Nucleotide Encoding Techniques and SVM Kernel Design for Human Splice Site Prediction

  • Bari, A.T.M. Golam (Department of Computer Engineering, Kyung Hee University)
  • Reaz, Mst. Rokeya (Department of Computer Engineering, Kyung Hee University)
  • Choi, Ho-Jin (Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST))
  • Jeong, Byeong-Soo (Department of Computer Engineering, Kyung Hee University)
  • Received : 2012.12.12
  • Accepted : 2012.12.31
  • Published : 2012.12.31


Splice site prediction in a DNA sequence is a basic search problem: finding the exon/intron and intron/exon boundaries. Removing the introns and joining the exons together forms the mRNA sequence, which is the input to the translation process. Splicing is therefore a necessary step in the central dogma of molecular biology. The main task of splice site prediction is to find every GT- and AG-ended candidate sequence and then to identify which of those candidates are true splice sites and which are false. In this paper, we survey research on splice site prediction based on the support vector machine (SVM). These works differ mainly in their nucleotide encoding technique and their SVM kernel selection. Some methods encode the DNA sequence sparsely, whereas others encode it probabilistically. The encoded sequences serve as input to the SVM, which classifies them using its learned model. Classification accuracy depends largely on selecting a kernel suited to sequence data and on selecting the kernel parameters. We examine each encoding technique and group the techniques by similarity. We then discuss kernels and their parameter selection. This survey provides a basic understanding of encoding approaches and proper SVM kernel selection for splice site prediction.
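
The candidate search and the sparse encoding mentioned above can be illustrated with a minimal sketch. It is not taken from any of the surveyed methods: the function names `candidate_sites` and `sparse_encode`, the 4-bit code assigned to each nucleotide, and the toy sequence are all assumptions made for illustration. The resulting flat vectors are the kind of input an SVM could then classify.

```python
# Assumed 4-bit sparse code: one position per nucleotide (illustrative choice).
SPARSE_CODE = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def candidate_sites(seq, dinucleotide):
    """Return every index at which the dinucleotide (e.g. "GT" or "AG") starts."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]


def sparse_encode(window):
    """Concatenate the 4-bit code of each nucleotide into one flat feature vector."""
    vector = []
    for base in window.upper():
        vector.extend(SPARSE_CODE[base])
    return vector


# Toy sequence, invented for this example only.
sequence = "CAGGTAAGTCAGGT"

donor_candidates = candidate_sites(sequence, "GT")     # [3, 7, 12]
acceptor_candidates = candidate_sites(sequence, "AG")  # [1, 6, 10]

# Encode a 4-nucleotide window starting at the first donor candidate.
first = donor_candidates[0]
features = sparse_encode(sequence[first:first + 4])    # window "GTAA" -> 16 values
```

Every candidate found this way is only a possible splice site; deciding which ones are true is the classification step left to the SVM. A probabilistic encoding would replace the fixed 0/1 values with quantities estimated from training sequences, which this sketch does not attempt.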


Supported by : National Research Foundation (NRF)


