A Distance Approach for Open Information Extraction Based on Word Vector

  • Liu, Peiqian (School of Computer Science, Beijing University of Posts and Telecommunications)
  • Wang, Xiaojie (School of Computer Science, Beijing University of Posts and Telecommunications)
  • Received : 2017.05.29
  • Accepted : 2018.01.30
  • Published : 2018.06.30

Abstract

Web-scale open information extraction (Open IE) plays an important role in NLP tasks such as acquiring common-sense knowledge, learning selectional preferences, and automatic text understanding. A large number of Open IE approaches have been proposed in the last decade, and the majority of them are based on supervised learning or dependency parsing. In this paper, we present a novel method for web-scale open information extraction that uses the cosine distance between Google word vectors as the confidence score of an extraction. The proposed method is a purely unsupervised learning algorithm that requires neither hand-labeled training data nor dependency parse features. We also give a mathematically rigorous proof for the new method, based on Bayesian inference and artificial neural network theory. The proof shows that the proposed algorithm is equivalent to maximum likelihood estimation of the joint probability distribution over the elements of a candidate extraction. The proof also suggests, on theoretical grounds, a typical way of using word vectors in other NLP tasks. Experiments on three benchmark datasets show that the distance-based method improves on recently proposed Open IE systems in both effectiveness and efficiency.
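
To make the idea of a distance-based confidence score concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the vectors of an extraction's elements are combined, so the `triple_confidence` function, its averaging rule, and the three-dimensional `TOY_VECTORS` table are all hypothetical stand-ins chosen for illustration; a real system would presumably look up pre-trained word vectors instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical toy embeddings; these values are invented for illustration
# and are NOT taken from any pre-trained model.
TOY_VECTORS = {
    "paris": [0.9, 0.1, 0.3],
    "capital": [0.8, 0.2, 0.4],
    "france": [0.85, 0.15, 0.35],
    "banana": [0.1, 0.9, 0.2],
}


def triple_confidence(arg1, rel, arg2, vectors=TOY_VECTORS):
    """Assumed scoring rule: mean cosine similarity between the relation
    word and each argument. The paper's actual rule may differ."""
    r = vectors[rel]
    sim1 = cosine_similarity(vectors[arg1], r)
    sim2 = cosine_similarity(vectors[arg2], r)
    return 0.5 * (sim1 + sim2)


if __name__ == "__main__":
    plausible = triple_confidence("paris", "capital", "france")
    implausible = triple_confidence("banana", "capital", "france")
    print(plausible > implausible)
```

Under this assumed rule, a candidate whose elements lie close together in the vector space receives a higher score than one containing an unrelated word, and no labeled data or parser is involved, which matches the unsupervised setting described in the abstract.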

Acknowledgement

Supported by : National Natural Science Foundation of China
