Similar Patent Search Service System using Latent Dirichlet Allocation

잠재 의미 분석을 적용한 유사 특허 검색 서비스 시스템

  • Lim, HyunKeun (Department of Computer Engineering, Paichai University)
  • Kim, Jaeyoon (Department of Computer Engineering, Paichai University)
  • Jung, Hoekyung (Department of Computer Engineering, Paichai University)
  • Received : 2018.05.09
  • Accepted : 2018.06.04
  • Published : 2018.08.31

Abstract

Keyword searching has traditionally been used to find similar patents, and automated classification by machine learning has come into use more recently. Keyword searching analyzes data that has been structured through data refinement. Its accuracy is high for short text, but for long text made up of many words, such as a document, it cannot analyze the meaning contained in the sentences. At the semantic analysis level, automatic classification is an unstructured data analysis method used to classify sentences composed of several words. There have been attempts to find similar documents by combining the two methods. However, because structured and unstructured data are analyzed in different ways, applying both at once raises algorithmic problems. In this paper, we study a method that extracts the keywords implied in a document and uses LDA (Latent Dirichlet Allocation) to classify documents efficiently without human intervention and to find similar patents.

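To search for similar patents, keyword search was used in the past, and automatic classification based on machine learning has been used more recently. Keyword search is a method of analyzing data that has been structured through data refinement; it is accurate when the query is a short text, but it could not analyze the meaning implied in the sentences of a long text made up of many words, such as a document. Automatic classification at the semantic analysis stage is an unstructured data analysis method used to classify sentences made up of several words. There have been attempts to combine the two methods for similar-document search, but because structured and unstructured data are analyzed differently, applying them together ran into algorithmic problems. This paper therefore studies a method that detects the keywords implied in a document and uses Latent Dirichlet Allocation (LDA) to classify documents automatically and efficiently without human intervention and to search for similar patents.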
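The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: fit an LDA topic model to patent texts, then rank other patents by how close their topic mixtures are to a query patent. Everything specific here is an assumption rather than the authors' method: the collapsed Gibbs sampler, the whitespace tokenizer standing in for keyword extraction, cosine similarity as the ranking measure, the hyperparameters `alpha` and `beta`, and the four made-up patent snippets.

```python
import math
import random


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real keyword extraction)."""
    return [w for w in text.lower().split() if len(w) > 2]


def fit_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic mixture per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]              # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens of word w in topic k
    topic_total = [0] * n_topics                            # all tokens in topic k
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Smoothed per-document topic proportions.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return theta


def cosine(u, v):
    """Cosine similarity between two topic-mixture vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(theta, query_index):
    """Rank every other document by topic similarity to the query document."""
    scored = [(cosine(theta[query_index], t), i)
              for i, t in enumerate(theta) if i != query_index]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored]


# Hypothetical patent snippets, for illustration only.
patents = [
    "battery electrode lithium cell charging battery",
    "lithium battery cell electrode cathode",
    "image sensor pixel camera lens",
    "camera lens image pixel sensor optical",
]
docs = [tokenize(p) for p in patents]
theta = fit_lda(docs)
ranking = most_similar(theta, 0)
```

Here `ranking` lists the other patents as `(index, similarity)` pairs, most similar first. Comparing topic mixtures rather than raw keywords is what would let two patents match even when they share few exact terms; a production system would more likely rely on an established topic-modeling library and a proper morphological analyzer for Korean text.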
