Global Sequence Homology Detection Using Word Conservation Probability

Yang, Jae-Seong; Kim, Dae-Kyum; Kim, Jin-Ho; Kim, Sang-Uk

doi:10.4051/ibc.2011.3.4.0014

Interdisciplinary Bio Central

Volume 3 Issue 4 / Pages 14.1-14.9 / 2011 / 2005-8543 (eISSN)

Korean Society for Bioinformatics (한국생명정보학회)

Yang, Jae-Seong (School of Interdisciplinary Bioscience and Bioengineering, Pohang University of Science and Technology);
Kim, Dae-Kyum (Division of Molecular and Life Science, Pohang University of Science and Technology);
Kim, Jin-Ho (Division of Molecular and Life Science, Pohang University of Science and Technology);
Kim, Sang-Uk (School of Interdisciplinary Bioscience and Bioengineering, Pohang University of Science and Technology)

Received: 2011.10.05
Accepted: 2011.10.17
Published: 2011.12.30

https://doi.org/10.4051/ibc.2011.3.4.0014

Abstract

Protein homology detection is a central problem in comparative genomics. Because sequence databases are growing exponentially, fast and efficient homology detection tools are urgently needed. Currently, sequence comparison methods based on local alignment, such as BLAST, are generally used for homology detection because they give a reasonable measure of sequence similarity. However, these methods are poorly suited to measuring overall sequence similarity, especially for eukaryotic genomes, whose sequences often contain many insertions and duplications. These methods also provide no explicit model of speciation, so it is difficult to translate their similarity measures into homology assignments. Here, we present a novel method based on the Word Conservation Score (WCS) to address these limitations. Instead of counting individual amino acids, we adopted the concept of a 'word' to compare sequences. WCS measures overall sequence similarity by comparing word contents, which is much faster than BLAST comparison. Furthermore, WCS can be used to measure the evolutionary distance between homologous sequences. We therefore expect sequence comparison with WCS to be useful for multiple-species comparisons of large genomes. In performance comparisons on protein structural classifications, our method showed a considerable improvement over BLAST. It also found larger micro-syntenic blocks, which consist of orthologs with conserved gene order. By testing on various datasets, we showed that WCS gives a faster and better overall similarity measure than BLAST.
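
The idea of comparing sequences by their word contents rather than residue by residue can be sketched as follows. This is only an illustration of the general word-based approach: the abstract does not give the WCS formula, so the word length `k = 3`, the Jaccard overlap, and the function names `words` and `word_overlap` are all assumptions made for this example and are not the authors' method.

```python
def words(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping length-k words found in seq.

    k = 3 is an assumed word length, not a value taken from the paper.
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def word_overlap(a: str, b: str, k: int = 3) -> float:
    """Jaccard overlap of the word sets of two sequences.

    A stand-in similarity for illustration only; it is NOT the WCS.
    Returns 0.0 when neither sequence contains a full word.
    """
    wa, wb = words(a, k), words(b, k)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


if __name__ == "__main__":
    # Toy peptide strings, invented for the example.
    print(word_overlap("MKTAYIAKQR", "MKTAYIAKQR"))
    print(word_overlap("MKTAYIAKQR", "MKTAYGGGIAKQR"))
```

Because the comparison uses sets of words and no alignment, words shared on either side of an insertion still count toward the overlap, which is the kind of whole-sequence behaviour the abstract contrasts with local alignment.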
