Global Sequence Homology Detection Using Word Conservation Probability

  • Yang, Jae-Seong (School of Interdisciplinary Bioscience and Bioengineering, Pohang University of Science and Technology)
  • Kim, Dae-Kyum (Division of Molecular and Life Science, Pohang University of Science and Technology)
  • Kim, Jin-Ho (Division of Molecular and Life Science, Pohang University of Science and Technology)
  • Kim, Sang-Uk (School of Interdisciplinary Bioscience and Bioengineering, Pohang University of Science and Technology)
  • Received: 2011.10.05
  • Accepted: 2011.10.17
  • Published: 2011.12.30


Protein homology detection is an important problem in comparative genomics. Because sequence databases are growing exponentially, fast and efficient homology detection tools are urgently needed. Currently, sequence comparison methods based on local alignment, such as BLAST, are generally used for homology detection because they give a reasonable measure of sequence similarity. However, these methods are poorly suited to measuring overall sequence similarity, especially for eukaryotic genomes, whose sequences often contain many insertions and duplications. In addition, these methods do not provide explicit models of speciation, so it is difficult to translate their similarity measures into homology detection. Here, we present a novel method based on the Word Conservation Score (WCS) to address these limitations. Instead of counting individual amino acids, we adopt the concept of a 'word' to compare sequences. WCS measures overall sequence similarity by comparing word contents, which is much faster than BLAST comparison. Furthermore, WCS can be used to measure the evolutionary distance between homologous sequences. We therefore expect sequence comparison with WCS to be useful for multiple-species comparisons of large genomes. In performance comparisons on protein structural classifications, our method showed a considerable improvement over BLAST. Our method also found larger micro-syntenic blocks, which consist of orthologs with conserved gene order. By testing on various datasets, we showed that WCS gives a faster and better overall similarity measure than BLAST.
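
The idea of comparing word contents rather than aligning residues can be illustrated with a minimal sketch. The abstract does not give the WCS formula, so the code below is not the authors' score: it substitutes a plain Jaccard overlap of word sets as a stand-in. The function names `words` and `shared_word_fraction`, the word length `k = 3`, and the example sequences are all assumptions made for illustration.

```python
def words(seq, k=3):
    """Return the set of overlapping length-k words in a sequence.

    The word length k is an assumed parameter; the abstract does not
    state which word length WCS uses.
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_word_fraction(a, b, k=3):
    """Jaccard overlap of the word sets of two sequences (0.0 to 1.0).

    This is a stand-in similarity, NOT the Word Conservation Score itself.
    """
    wa, wb = words(a, k), words(b, k)
    if not wa and not wb:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return len(wa & wb) / len(wa | wb)


if __name__ == "__main__":
    # Two hypothetical peptides that differ at a single position.
    print(shared_word_fraction("MKVLAAGIV", "MKVLGAGIV"))
```

Because each sequence is reduced to a set of words once and the comparison is a set intersection, no alignment matrix is built, which is one plausible reason a word-based comparison can run faster than an alignment-based one. A single substitution removes every word that spans the changed position, so the overlap drops quickly as sequences diverge.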

