A Comprehensive Review of Emerging Computational Methods for Gene Identification
Yu, Ning; Yu, Zeng; Li, Bing; Gu, Feng; Pan, Yi;
Gene identification is at the center of genomic studies. Although the first phase of the Encyclopedia of DNA Elements (ENCODE) project has been claimed to be complete, the annotation of functional elements is far from finished. Computational methods for gene identification continue to play important roles in this area and in related problems. Much work has been done in this area, and a plethora of computational methods and approaches have been developed. Many review papers have summarized these methods and related work, but most of them focus on the methodologies from a particular aspect or perspective. In contrast to these existing reviews, this paper aims to comprehensively summarize the mainstream computational methods in gene identification and to provide a short, concise technical reference for future studies. Moreover, this review sheds light on the emerging trends and cutting-edge techniques that are expected to lead research in this field in the future.
Keywords: Cloud Computing; Comparative Methods; Deep Learning; Fourier Transform; Gene Identification; Gene Prediction; Hidden Markov Model; Machine Learning; Protein-Coding Region; Support Vector Machine
 Cited by
"Investigating Apache Hama: a bulk synchronous parallel computing framework," The Journal of Supercomputing, vol. 73, no. 9, p. 4190, 2017.