Genome data mining for everyone

  • Lee, Gir-Won (Department of Bioinformatics, Soongsil University)
  • Kim, Sang-Soo (Department of Bioinformatics, Soongsil University)
  • Published : 2008.11.30

Abstract

The genomic sequences of a huge number of species have been determined. Typically, these genome sequences and the associated annotation data are accessed through Internet-based genome browsers that offer a user-friendly interface. Intelligent use of the data should expedite biological knowledge discovery. Such activity is collectively called data mining and involves queries that can be simple, complex, and even combinational. Various tools have been developed to make genome data mining available to computational and experimental biologists alike. In this mini-review, some tools that have proven successful will be introduced along with examples taken from published reports.
