Developing JSequitur to Study the Hierarchical Structure of Biological Sequences in a Grammatical Inference Framework of String Compression Algorithms

  • Received : 2012.11.01
  • Accepted : 2012.11.16
  • Published : 2012.12.31

Abstract

Grammatical inference methods are expected to find grammatical structures hidden in biological sequences, and studies of grammar may serve as an appropriate tool for theory formation. We have therefore developed JSequitur, which automatically generates the grammatical structure of biological sequences within an inference framework of string compression algorithms. Our original motivation was to find grammatical traits of several cancer genes that can be detected by string compression algorithms. In this research we have not yet found any meaningful traits unique to the cancer genes, but we did observe some interesting relationships among gene length, sequence similarity, the patterns of the generated grammar, and compression rate.
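To make the idea of inferring a grammar through string compression concrete, the following minimal sketch builds a hierarchical grammar from a DNA string by repeatedly replacing the most frequent adjacent pair of symbols with a new rule. This is not the Sequitur algorithm and not JSequitur's implementation; it is a simpler offline digram-replacement scheme in the style of Re-Pair, chosen because it fits in a few lines. All function names, the rule names `R1`, `R2`, ..., and the grammar-size measure are assumptions made for illustration.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last = {}  # index of the last counted occurrence of each pair
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last.get(pair) == i - 1:
            continue  # overlaps the previous occurrence, e.g. in "AAA"
        counts[pair] += 1
        last[pair] = i
    return counts


def replace_digram(seq, pair, symbol):
    """Replace occurrences of `pair` with `symbol`, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def infer_grammar(text):
    """Return (start_sequence, rules) for `text`.

    Terminals are single characters; rule names such as "R1" are
    hypothetical labels with two or more characters, so they cannot
    collide with terminals.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so a new rule would not save space
        name = f"R{len(rules) + 1}"
        rules[name] = pair
        seq = replace_digram(seq, pair, name)
    return seq, rules


def expand(symbols, rules):
    """Expand a symbol sequence back into the original string."""
    return "".join(
        expand(rules[s], rules) if s in rules else s for s in symbols
    )


def grammar_size(start, rules):
    """Assumed size measure: symbols in the start sequence plus all rule bodies."""
    return len(start) + sum(len(body) for body in rules.values())


if __name__ == "__main__":
    dna = "ACGTACGTACGT"
    start, rules = infer_grammar(dna)
    print("start:", start)
    for name, body in rules.items():
        print(name, "->", " ".join(body))
    print("size ratio:", grammar_size(start, rules) / len(dna))
```

Rules that refer to earlier rules form the hierarchy, and the ratio of grammar size to sequence length is one simple proxy for a compression rate. The definition of compression rate used in the study itself may differ.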
