Finding Approximate Covers of Strings

Korean title: 문자열의 근사커버 찾기

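  • Jeong Seop Sim (School of Computer Science and Engineering, Seoul National University)
  • Kunsoo Park (School of Computer Science and Engineering, Seoul National University)
  • Sung-Ryul Kim (Researcher, WiseNut, Inc.)
  • Jee-Soo Lee (Department of Computer Science, Korea National Open University)
  • Published: 2002.02.01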

Abstract

Repetitive strings have been studied in such diverse fields as molecular biology and data compression. Some important regularities that have been studied are periods, covers, seeds, and squares. A natural extension of the repetition problems is to allow errors. Among the four notions above, approximate squares and approximate periods have been studied. In this paper, we introduce the notion of approximate covers, which is an approximate version of covers. Given two strings P (|P| = m) and T (|T| = n), we propose an algorithm that finds the minimum distance t such that P is a t-approximate cover of T. The algorithm takes $O(mn)$ time for the edit distance and $O(mn^2)$ time for the weighted edit distance. We also show that the problem of finding a string that is an approximate cover of T with minimum distance is NP-complete.

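Research on repetitive strings has recently been active in many fields. In particular, the need for it has emerged in molecular biology, for example in the analysis of DNA sequences. Periods, covers, seeds, and squares are representative examples of repetitive strings. In approximate string matching as well, repetitive structures such as approximate periods and approximate squares are being studied. This paper presents the notion of approximate covers. Given two strings P and T of lengths m and n, respectively, we present algorithms that find the minimum edit distance at which P is an approximate cover of T in $O(mn)$ time, and the minimum weighted edit distance in $O(mn^2)$ time. We also prove that, when only the string T is given, the problem of finding a string P with the minimum approximate-cover distance for T is NP-complete.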
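To make the problem statement concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a definition the abstract does not spell out: P is a t-approximate cover of T if T can be covered by substrings that may overlap or abut, but leave no gaps, each within distance t of P. It also assumes the unit-cost edit distance. The function names `edit_distances_from` and `min_cover_distance` are hypothetical. The sketch runs in O(mn^2) time by brute force and does not attempt the O(mn) bound stated for the edit distance.

```python
def edit_distances_from(P, T, start):
    """Return d where d[k] is the unit-cost edit distance
    between P and the substring T[start:start + k]."""
    m = len(P)
    n = len(T) - start
    # Row for the empty prefix of P: k insertions are needed.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n  # P[:i] vs. empty substring: i deletions
        for k in range(1, n + 1):
            cost = 0 if P[i - 1] == T[start + k - 1] else 1
            cur[k] = min(prev[k] + 1,         # delete P[i-1]
                         cur[k - 1] + 1,      # insert T[start+k-1]
                         prev[k - 1] + cost)  # match or substitute
        prev = cur
    return prev


def min_cover_distance(P, T):
    """Smallest t such that P is a t-approximate cover of T
    (under the assumed definition in the lead-in)."""
    n = len(T)
    inf = float("inf")
    # dist[h][k] = distance between P and T[h:h + k]
    dist = [edit_distances_from(P, T, h) for h in range(n)]
    # best[j] = smallest t for which P t-approximately covers T[:j]
    best = [inf] * (n + 1)
    best[0] = 0
    for j in range(1, n + 1):
        low = inf  # running minimum of best[h], ..., best[j-1]
        for h in range(j - 1, -1, -1):
            low = min(low, best[h])
            # The last piece is T[h:j]; the earlier pieces must cover
            # a prefix whose length lies between h and j - 1.
            best[j] = min(best[j], max(low, dist[h][j - h]))
    return best[n]


if __name__ == "__main__":
    print(min_cover_distance("aba", "ababa"))   # 0: exact cover
    print(min_cover_distance("abc", "abcabd"))  # 1: "abc" + "abd"
```

Storing all pairwise distances uses O(n^2) memory, which keeps the recurrence easy to read; it is a design choice for clarity, not efficiency.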
