A Study on the Development of Search Algorithm for Identifying the Similar and Redundant Research

유사과제파악을 위한 검색 알고리즘의 개발에 관한 연구

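  • 박동진 (Department of Industrial and Systems Engineering, Kongju National University)
  • 최기석 (Korea Institute of Science and Technology Information)
  • 이명선 (Korea Institute of Science and Technology Information)
  • 이상태 (Computing and Information Team, Korea Research Institute of Standards and Science)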
  • Published : 2009.11.28


To avoid redundant investment during project selection, it is necessary to check whether a submitted research topic has already been proposed or carried out at another institution. This check is currently possible through search engines that apply Boolean keyword-matching algorithms to national-scale databases of research results. Although the accuracy and speed of information retrieval have improved, such engines still have fundamental limits caused by keyword matching. This paper examines an implemented TFIDF-based algorithm and reports an experiment in which a search engine retrieves documents that are similar or redundant with respect to a research proposal and ranks them in order of priority. In addition to the generic TFIDF algorithm, feature weighting and K-Nearest Neighbors classification methods are implemented in this algorithm. The test documents are extracted from the NDSL (National Digital Science Library) web directory service.
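
The abstract gives no formulas or code, so the following is only a minimal sketch of the general idea: weight terms by TFIDF, optionally boost selected features, and return the k documents closest to a proposal by cosine similarity. The function names, the smoothed IDF formula, the `boost` dictionary, and the whitespace tokenizer are all assumptions made for illustration, not the authors' implementation. The k nearest neighbours are used here only for ranking, not as the classifier described in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf, boost=None):
    """Sparse TFIDF vector; `boost` (hypothetical) multiplies chosen term weights."""
    boost = boost or {}
    counts = Counter(t for t in tokens if t in idf)  # ignore out-of-corpus terms
    return {t: c * idf[t] * boost.get(t, 1.0) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_similar(proposal, corpus, k=3, boost=None):
    """Return (document index, similarity) for the k documents nearest to the proposal."""
    corpus_tokens = [tokenize(doc) for doc in corpus]
    idf = build_idf(corpus_tokens)
    query_vec = tfidf_vector(tokenize(proposal), idf, boost)
    scored = [
        (cosine(query_vec, tfidf_vector(tokens, idf, boost)), index)
        for index, tokens in enumerate(corpus_tokens)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, stable on ties
    return [(index, score) for score, index in scored[:k]]


if __name__ == "__main__":
    # Toy corpus invented for this example; it is not data from the paper.
    corpus = [
        "keyword search engine for research proposals",
        "protein folding simulation on gpu clusters",
        "tfidf search engine ranking similar research proposals",
    ]
    for index, score in rank_similar("search engine to find similar research proposals", corpus, k=2):
        print(index, round(score, 3))
```

Unlike Boolean keyword matching, this kind of scoring produces a graded ranking, so a document that shares many rare terms with the proposal is listed ahead of one that shares only common terms.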


Similar Redundant Proposal; Search Engine; TFIDF; KNN


