Identifying Variable-Length Palindromic Pairs in DNA Sequences

A Study on Searching for Palindromic Pairs of Various Lengths in DNA Sequences

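  • 김형래 (Korea Employment Information Service, Information Planning Team) ;
  • 정경희 (Kwandong University, Department of Computer Engineering) ;
  • 전도홍 (Kwandong University, Department of Computer Science)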
  • Published : 2007.10.31

Abstract

The emphasis in genome projects has been moving towards sequence analysis in order to extract biological "meaning" (e.g., the evolutionary history of particular molecules or their functions) from the sequence. In particular, palindromic or direct repeats that appear in a sequence have a biophysical meaning, and the problem is to recognize interesting patterns and configurations of words (strings of characters) over complementary alphabets. In this paper, we propose an algorithm to identify variable-length palindromic pairs (longer than a threshold), allowing gaps (a distance between the two words). The algorithm is called the palindrome algorithm (PA) and has O(N) time complexity. A palindromic pair forms a hairpin structure. By composing the collected palindromic pairs, we build n-pair palindromic patterns. In addition, we plot some of the longest pairs as dots on a circle to represent the structure of a DNA sequence. We ran the algorithm over several selected genomes, and the results for E. coli K12 are presented. The genomes contained very long palindromic pair patterns, which would hardly occur in a random sequence.

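Genome project research is increasingly emphasizing sequence analysis aimed at extracting biological meaning (e.g., the evolutionary history of a molecule or its function) from DNA sequences. In particular, complementary or repeated patterns within a DNA sequence carry biological meaning. The problem is to find the interesting patterns and word configurations formed by complementary words. This paper studies an algorithm for finding palindromic pairs of variable length, and gaps are also allowed within such pairs. The algorithm is named the palindrome algorithm and runs in O(N) time. A single palindromic pair takes the form of a hairpin. Using the detected palindromic pairs, we constructed n-pair palindromic patterns. Furthermore, we plotted the longest palindromic pairs found as dots on a circular representation of the DNA sequence to improve visibility. The algorithm was run on several genomes, and the results for E. coli K12 are presented. The experiments revealed that DNA contains long palindromic patterns that would be very unlikely to occur by chance in a random sequence.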
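
To make the notion of a gapped palindromic pair concrete, the following is a minimal Python sketch, not the PA algorithm itself (the abstract does not give its details). It assumes a palindromic pair means a word followed, after at most `max_gap` bases, by its reverse complement, so that the two arms could base-pair into a hairpin. The function names and the parameters `min_len` and `max_gap` are hypothetical, and this brute-force scan does not achieve the O(N) bound claimed for PA.

```python
# Watson-Crick complements; any other symbol (e.g. "N") never matches.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(word):
    """Return the reverse complement of a DNA word over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def find_palindromic_pairs(seq, min_len=4, max_gap=3):
    """Brute-force search for gapped palindromic pairs (illustrative only).

    Returns a list of (left_start, arm_length, gap) tuples: the left arm is
    seq[left_start : left_start + arm_length], and the right arm of the same
    length starts `gap` bases after the left arm ends.
    """
    pairs = []
    n = len(seq)
    for gap in range(max_gap + 1):
        # The left arm ends at index i; the right arm starts at index j.
        for i in range(n - gap - 1):
            j = i + gap + 1
            # Skip boundaries that could still grow inward: the same arms
            # are then reported as part of a longer pair with a smaller gap.
            if gap >= 2 and COMPLEMENT.get(seq[i + 1]) == seq[j - 1]:
                continue
            # Extend both arms outward while the bases stay complementary.
            length = 0
            while (i - length >= 0 and j + length < n
                   and COMPLEMENT.get(seq[i - length]) == seq[j + length]):
                length += 1
            if length >= min_len:
                pairs.append((i - length + 1, length, gap))
    return pairs


# "ACGGT" + "TTT" + reverse_complement("ACGGT") forms one hairpin-like pair.
print(find_palindromic_pairs("ACGGTTTTACCGT", min_len=4, max_gap=3))
# -> [(0, 5, 3)]
```

Here the arms `ACGGT` and `ACCGT` are separated by the three-base gap `TTT`, which would form the loop of the hairpin. Collecting such tuples over a genome is the kind of input from which n-pair patterns could then be composed.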
