Terminology Recognition System based on Machine Learning for Scientific Document Analysis

A Machine Learning-Based General-Purpose Terminology Recognition System for Analyzing Scientific and Technical Literature

  • Received : 2011.06.27
  • Accepted : 2011.08.17
  • Published : 2011.10.31

Abstract

Terminology recognition is a prerequisite for text mining, information extraction, information retrieval, the semantic web, and question answering, but it has been studied intensively in only a limited range of domains, especially the biomedical domain. Because earlier approaches relied on domain-specific resources, they were difficult to apply to general domains. We propose a domain-independent terminology recognition system based on machine learning that uses a dictionary, syntactic features, and Web search results. Compared with C-value, a widely used related approach based on local domain frequencies, the proposed approach achieved an F-score of 80.8, an improvement of 6.5%. In a second experiment with various combinations of unithood features, the method combined with NGD (Normalized Google Distance) showed the best performance, with an F-score of 81.8. We applied three machine learning methods, logistic regression, C4.5, and SVMs, and obtained the best score with the decision tree method, C4.5.

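Research on recognizing terminology in documents is a prerequisite for work in information retrieval, information extraction, the semantic web, and question answering, and so far it has mostly concentrated on specific domains, especially the biomedical domain. However, because existing studies have relied on statistics specific to a domain or internal to the documents, they have shown limitations in general-purpose terminology recognition. This study therefore proposes a machine learning-based general-purpose terminology recognition method that uses Web search results, dictionaries, and the syntactic patterns of candidate terms. In a comparison with a method that uses local document statistics (C-value), the proposed method achieved an F-score of 80.8%, a 6.5% improvement. In a second experiment incorporating various unithood features, the approach combined with Normalized Google Distance performed best, with an F-score of 81.8%. Logistic regression, C4.5, and SVMs were applied as the machine learning methods; the decision tree method C4.5 performed better overall than SVMs and logistic regression, which generally perform well on binary classification.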
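
The abstract names Normalized Google Distance (NGD) as the unithood feature that gave the best result but does not state how it is computed. As a minimal sketch, the function below follows the standard NGD definition of Cilibrasi and Vitanyi, computed from search-engine hit counts. The function name, the parameter names, and the hit counts in the usage note are illustrative assumptions; the abstract does not say how the authors obtained or scaled the counts.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google Distance computed from hit counts.

    fx, fy -- number of pages containing term x, term y (assumed inputs)
    fxy    -- number of pages containing both x and y
    n      -- total number of indexed pages (an assumed constant)

    Smaller values mean the two terms co-occur more strongly.
    """
    # If any count is zero the logarithm is undefined; treat the
    # terms as maximally distant (a convention chosen for this sketch).
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator
```

For a two-word candidate term, `fx` and `fy` would be the hit counts of the individual words and `fxy` the hit count of pages containing both. With hypothetical counts of 1000, 100, and 100 in an index of one million pages, the distance comes out near 0.25, and a lower distance suggests the words form a more cohesive unit.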
