Using Ontologies for Semantic Text Mining

Korean title: 시맨틱 텍스트 마이닝을 위한 온톨로지 활용 방안

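  • 유은지 (Graduate School of Business IT, Kookmin University) ;
  • 김정철 (Graduate School of Business IT, Kookmin University) ;
  • 이춘열 (School of Management Information Systems, Kookmin University) ;
  • 김남규 (School of Management Information Systems, Kookmin University)
  • Received : 2012.08.09
  • Accepted : 2012.09.06
  • Published : 2012.09.30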

Abstract

Growing interest in big data analysis with data mining techniques means that many commercial data mining tools now need to include basic text analysis modules. The most essential prerequisite for accurate analysis of text documents is understanding the exact semantics of each term in a document. The main obstacles to this understanding are the homonym and synonym problems, both long-standing problems in natural language processing. Some major text mining tools provide a thesaurus to address them, but a thesaurus cannot resolve complex synonym problems, and it does not address homonyms at all. In this paper, we propose a semantic text mining methodology that uses ontologies to improve the quality of text mining results by resolving the semantic ambiguity caused by homonyms and synonyms. We evaluate the practical applicability of the proposed methodology by performing a classification analysis to predict customer churn using real transactional data and Q&A articles from the "S" online shopping mall in Korea. In the experiments, the prediction model produced by the proposed semantic text mining method outperformed the model produced by traditional text mining on the prediction performance measures of response, captured response, and lift.
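
The idea of resolving synonyms and homonyms through an ontology can be sketched in a few lines of Python. This is only a minimal illustration, not the paper's method: the toy ontology, the concept IDs, the context cue words, and the `normalize` function are all hypothetical, and counting context overlap is a simple stand-in for whatever disambiguation the proposed methodology actually performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """One ontology concept (hypothetical structure, for illustration only)."""
    concept_id: str      # canonical name that replaces the surface term
    labels: frozenset    # surface terms that may denote this concept
    context: frozenset   # terms that tend to co-occur with this sense


# Toy ontology invented for this sketch; it is not taken from the paper.
ONTOLOGY = [
    # Synonyms: three surface terms, one concept.
    Concept("shipping",
            frozenset({"delivery", "shipping", "shipment"}),
            frozenset({"late", "courier", "tracking"})),
    # Homonym: "charge" as a fee versus "charge" as in charging a battery.
    Concept("charge_fee",
            frozenset({"charge"}),
            frozenset({"card", "fee", "payment"})),
    Concept("charge_battery",
            frozenset({"charge"}),
            frozenset({"battery", "phone", "cable"})),
]


def normalize(tokens):
    """Replace each token with an ontology concept ID where one applies.

    Synonyms collapse to a shared concept ID. A homonym is assigned to the
    candidate concept whose context terms overlap most with the other tokens
    in the document; ties fall back to the order of ONTOLOGY.
    """
    token_set = set(tokens)
    normalized = []
    for token in tokens:
        candidates = [c for c in ONTOLOGY if token in c.labels]
        if not candidates:
            normalized.append(token)  # term is not covered by the ontology
            continue
        best = max(candidates, key=lambda c: len(c.context & token_set))
        normalized.append(best.concept_id)
    return normalized
```

With this toy ontology, `delivery` and `shipment` both map to `shipping`, while `charge` maps to `charge_fee` in a document that mentions `card` and to `charge_battery` in one that mentions `battery`. A plain thesaurus lookup could handle the first case but not the second, because it maps a term to the same entry regardless of context.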
