Semantic Extension Search for Documents Using Word2vec

A Semantic Extension Search Method for Documents Using Word2vec

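  • 김우주 (Department of Information and Industrial Engineering, Yonsei University)
  • 김동희 (Korea Railroad Research Institute)
  • 장희원 (Department of Information and Industrial Engineering, Yonsei University)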
  • Received : 2016.09.27
  • Accepted : 2016.10.10
  • Published : 2016.10.28

Abstract

Conventional document search relies on keyword-based queries over a vector space model, such as tf-idf. Keyword-based search has a known limitation: it cannot recognize words that are lexically different but semantically equivalent. This paper studies a document search scheme based on document queries. In particular, it represents query documents with centrality vectors instead of tf-idf vectors, combined with the Word2vec method to capture the semantic similarity of the words they contain. This scheme improves document search performance and provides a way to find documents that are not only lexically but also semantically close to a query document.
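The gap between lexical and semantic matching described in the abstract can be illustrated with a small sketch. It is not the paper's method: the abstract does not define its centrality vectors, so the mean of the word vectors is used here as a simple stand-in for a document vector. The three-dimensional word vectors are also hypothetical values picked by hand, where real Word2vec vectors would be learned from a corpus.

```python
import math
from collections import Counter

# Hypothetical hand-picked word vectors (assumption, for illustration only).
# Real Word2vec vectors are learned from a corpus and have many more dimensions.
WORD_VECTORS = {
    "car":        (0.90, 0.10, 0.00),
    "automobile": (0.85, 0.15, 0.00),
    "repair":     (0.10, 0.90, 0.00),
    "fix":        (0.15, 0.85, 0.00),
    "banana":     (0.00, 0.00, 1.00),
    "recipe":     (0.00, 0.10, 0.90),
}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def term_count_vector(tokens, vocabulary):
    """Keyword-style representation: one raw count per vocabulary term."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_embedding(tokens):
    """Stand-in document vector: the mean of the known word vectors."""
    vectors = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    dims = len(next(iter(WORD_VECTORS.values())))
    if not vectors:
        return [0.0] * dims
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


query = ["car", "repair"]
doc_synonym = ["automobile", "fix"]    # same meaning, no shared words
doc_unrelated = ["banana", "recipe"]   # different meaning, no shared words

vocabulary = sorted(set(query) | set(doc_synonym) | set(doc_unrelated))

# Lexical similarity: both documents share no term with the query.
lexical_synonym = cosine(term_count_vector(query, vocabulary),
                         term_count_vector(doc_synonym, vocabulary))
lexical_unrelated = cosine(term_count_vector(query, vocabulary),
                           term_count_vector(doc_unrelated, vocabulary))

# Semantic similarity: word vectors place the synonym document near the query.
semantic_synonym = cosine(mean_embedding(query), mean_embedding(doc_synonym))
semantic_unrelated = cosine(mean_embedding(query), mean_embedding(doc_unrelated))

print("lexical: ", lexical_synonym, lexical_unrelated)
print("semantic:", round(semantic_synonym, 3), round(semantic_unrelated, 3))
```

With these toy values, the term-count vectors cannot tell the synonym document from the unrelated one, since both share no terms with the query. The embedding-based vectors rank the synonym document far above the unrelated one.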

Keywords

Semantic Search; Document Feature Vector; Vector Space Model; Word2vec

Acknowledgement

Supported by: Korea Railroad Research Institute
