Improved First-Phoneme Searches Using an Extended Burrows-Wheeler Transform

Korean title: Improved Hangul First-Phoneme (Choseong) Search Using the Extended Burrows-Wheeler Transform

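  • Sung-Hwan Kim (김성환; Department of Electrical and Computer Engineering, Pusan National University)
  • Hwan-Gue Cho (조환규; School of Computer Science and Engineering, Pusan National University)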
  • Received : 2014.09.05
  • Accepted : 2014.10.22
  • Published : 2014.12.15

Abstract

First-phoneme (choseong, or initial-consonant) queries are an important feature for improving usability on interfaces where a restricted input environment makes input errors frequent, such as navigation systems and mobile devices. In this paper, we propose a time- and space-efficient data structure for Korean first-phoneme queries: Korean strings are decomposed into phonemes, rearranged into circular strings, and indexed with the extended Burrows-Wheeler Transform. Experiments show that the proposed method handles more types of queries while using less space than previous methods, and that search time improves as the query gets shorter and its proportion of first phonemes gets higher.
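The pipeline described in the abstract (decompose into phonemes, treat strings as circular, index with the extended Burrows-Wheeler Transform) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact phoneme rearrangement, so the sketch uses plain left-to-right decomposition, and it builds the transform by naively sorting all rotations rather than with a succinct index. The names `split_syllable`, `decompose`, `choseong`, and `ebwt` are hypothetical, and input strings are assumed to be non-empty.

```python
# Jamo tables in the order used by the Unicode Hangul syllable block (U+AC00..U+D7A3).
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                      # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"                  # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals (first = none)


def split_syllable(ch):
    """Return (initial, vowel, final) for a precomposed Hangul syllable, else None."""
    idx = ord(ch) - 0xAC00
    if not 0 <= idx < 19 * 21 * 28:
        return None
    return CHO[idx // 588], JUNG[(idx % 588) // 28], JONG[idx % 28]


def decompose(s):
    """Phoneme-wise decomposition; non-Hangul characters pass through unchanged."""
    return "".join("".join(split_syllable(c) or (c,)) for c in s)


def choseong(s):
    """First phoneme (initial consonant) of every syllable in s."""
    return "".join((split_syllable(c) or (c,))[0] for c in s)


def ebwt(strings):
    """Naive extended BWT of a collection of non-empty strings.

    Every rotation of every string is sorted by its infinite periodic
    repetition. Comparing prefixes of length 2 * (longest string) is enough,
    because two periodic words u^w and v^w that agree on |u| + |v| characters
    are identical.
    """
    width = 2 * max(len(s) for s in strings)
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]

    def omega_key(r):
        reps = -(-width // len(r))  # ceiling division
        return (r * reps)[:width]

    rotations.sort(key=omega_key)
    # The transform is the last character of each rotation in sorted order.
    return "".join(r[-1] for r in rotations)
```

For example, `decompose("한글")` gives `"ㅎㅏㄴㄱㅡㄹ"`, `choseong("부산")` gives `"ㅂㅅ"`, and `ebwt(["banana"])` gives `"nnbaaa"`. A practical index would replace the rotation sort with a compressed structure that supports backward search, which is where the space savings reported in the abstract would come from.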

Acknowledgement

Supported by: National Research Foundation of Korea (한국연구재단)
