Sentence Similarity Measurement Method Using a Set-based POI Data Search

집합 기반 POI 검색을 이용한 문장 유사도 측정 기법

  • 고은별 (Department of Multimedia Science, Sookmyung Women's University) ;
  • 이종우 (Department of Multimedia Science, Sookmyung Women's University)
  • Received : 2014.09.30
  • Accepted : 2014.10.23
  • Published : 2014.12.15

Abstract

With growing concern over plagiarism in academic papers and growing interest in intelligent text search services, the demand for measuring the similarity between two sentences is increasing. Sentence similarity measurement has been studied from several directions, including n-gram, edit distance, and LSA, but each of these methods has its own advantages and disadvantages. In this paper, we propose a new sentence similarity measurement method that approaches the problem from another direction. The proposed method builds on the set-based POI data search, which markedly improves search performance over the existing hard matching method when the data contains inversions, omissions, insertions, or revisions of characters. Using this search technique, the similarity between two sentences can be measured more accurately and more quickly. We modified the data loading and text search algorithms of the set-based POI data search and added a word operation algorithm, so that the similarity between two sentences is expressed as a percentage. The experimental results show that the proposed method outperforms both n-gram and the original set-based POI data search in accuracy and speed.
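
The abstract does not give the details of the proposed algorithm, so the following is only a minimal sketch of the general idea it contrasts: comparing two sentences as sets of character n-grams rather than by exact (hard) matching, and reporting the overlap as a percentage. The function names, the use of bigrams, and the Jaccard overlap are assumptions made for illustration; this is not the authors' set-based POI data search.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of text, ignoring case and whitespace."""
    s = "".join(text.lower().split())
    if len(s) < n:
        # Too short to form an n-gram: fall back to the whole string (or nothing).
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def set_similarity_percent(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of the two n-gram sets, expressed as a percentage."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 100.0  # two empty sentences are treated as identical
    return 100.0 * len(grams_a & grams_b) / len(grams_a | grams_b)


def hard_match_percent(a: str, b: str) -> float:
    """Exact string comparison: either a full match or no match at all."""
    return 100.0 if a == b else 0.0


if __name__ == "__main__":
    s1, s2 = "seoul station", "station seoul"  # same words, inverted order
    print(hard_match_percent(s1, s2))
    print(round(set_similarity_percent(s1, s2), 1))
```

Because the comparison works on sets, reordering the words changes only the few n-grams that span a word boundary, so the inverted pair still scores high while hard matching reports no match at all.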

Acknowledgement

Supported by: National Research Foundation of Korea
