Minimally Supervised Relation Identification from Wikipedia Articles

  • Oh, Heung-Seon
  • Jung, Yuchul
  • Received: 2017.12.08
  • Accepted: 2018.08.08
  • Published: 2018.12.30

Abstract

Wikipedia is composed of millions of articles in various languages, each of which describes a particular real-world entity. Since the articles are contributed and edited by a large population of diverse experts with no specific authority, Wikipedia can be seen as a naturally occurring body of human knowledge. In this paper, we propose a method to automatically identify key entities and relations in Wikipedia articles, which can be used for automatic ontology construction. In contrast to previous approaches to entity and relation extraction and/or identification from text, our goal is to capture naturally occurring entities and relations from Wikipedia while minimizing the artificiality often introduced when constructing training and testing data. The titles of the articles and anchored phrases in their text are regarded as entities, and their types are automatically classified with minimal training. We attempt to automatically detect and identify possible relations among the entities using clustering without training data, as opposed to the relation extraction approach, which focuses on improving accuracy in selecting one of several target relations for a given pair of entities. While the relation extraction approach with supervised learning requires a significant amount of annotation effort for a predefined set of relations, our approach attempts to discover relations as they occur naturally. Unlike other work on unsupervised relation identification, in which automatically identified relations are evaluated against correct relations determined a priori by human judges, we attempted to evaluate the appropriateness of the naturally occurring clusters of relations involving person-artifact and person-organization entities, and of their relation names.
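As a rough illustration of the entity step described above (article titles and anchored phrases treated as entities), the following minimal sketch pulls link targets out of raw wikitext and pairs each one with the article title, keeping the preceding sentence text as a crude relation context. It is not the authors' implementation: the function names, the regular expression, the sentence splitter, and the sample text are all assumptions made for illustration, and entity typing and clustering are left out entirely.

```python
import re

# Assumed pattern for wikitext links: [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_entities(title, wikitext):
    """Return the article title followed by each distinct link target, in order."""
    entities = [title]
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if target not in entities:
            entities.append(target)
    return entities


def relation_candidates(title, wikitext):
    """Pair the title entity with each linked entity.

    The third element of each tuple is the sentence text that precedes the
    link, with earlier link markup reduced to its visible text. It is only a
    hypothetical stand-in for the features a clustering step would use.
    """
    candidates = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", wikitext):
        for match in WIKILINK.finditer(sentence):
            target = match.group(1).strip()
            before = WIKILINK.sub(
                lambda m: m.group(2) or m.group(1),
                sentence[:match.start()],
            )
            candidates.append((title, target, before.strip()))
    return candidates


if __name__ == "__main__":
    # Made-up sample wikitext, used only to exercise the functions.
    sample = (
        "Ada Lovelace worked with [[Charles Babbage]] on the "
        "[[Analytical Engine|engine]]. She wrote the first [[algorithm]]."
    )
    print(extract_entities("Ada Lovelace", sample))
    for candidate in relation_candidates("Ada Lovelace", sample):
        print(candidate)
```

Contexts collected this way could then be grouped by an unsupervised clustering method of one's choice; which features and which algorithm the paper uses is not stated in this abstract.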

Keywords

relation identification; Wikipedia mining; unsupervised clustering

Acknowledgement

Supported by: National Research Foundation of Korea (NRF)