Detection of Protein Subcellular Localization based on Syntactic Dependency Paths

구문 의존 경로에 기반한 단백질의 세포 내 위치 인식

  • Mi-Young Kim (김미영), School of Computer and Information, Sungshin Women's University (성신여자대학교 컴퓨터정보학부)
  • Published: 2008.08.29

Abstract

A protein's subcellular localization is an essential part of the description of the biomolecular phenomena it takes part in. As the volume of biomolecular reports has grown, much research has applied text mining to detect protein subcellular localization information in documents. Linguistic information, especially syntactic information, has been argued to be useful for identifying the subcellular localizations of proteins of interest. However, previous systems for this task used only shallow syntactic parsers and performed poorly. There thus remains a need to use a full syntactic parser and to apply deeper linguistic knowledge to the analysis of text. To improve detection performance, this paper proposes a three-step method based on a full syntactic dependency parser and, for semantic information, the WordNet thesaurus. In the first step, we construct syntactic dependency paths from each protein to its location candidate and convert these paths into dependency trees. In the second step, we retrieve the root information of the dependency trees. In the final step, we extract syn-semantic patterns from the protein subtrees and location subtrees. From the root and subtree nodes, we extract the syntactic category and syntactic direction as syntactic information, and the WordNet synset offset as semantic information. Using the root information and subtree syn-semantic patterns obtained from the training data, we extract (protein, localization) pairs from the test sentences. Even without biomolecular knowledge, the method performed reasonably in experiments on Medline abstracts: it achieved an F-measure of 74.53% on the training data and 58.90% on the test data, significantly outperforming previous methods by 12-25%.
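The three steps in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the toy parse, the example sentence, the reading of "syntactic direction" as a node's position relative to its head, and the placeholder synset labels (stand-ins for real WordNet synset offsets) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hand-built toy parse of a made-up sentence: "ProteinX is localized in the nucleus".
# token index -> (word, syntactic category, head index; None marks the sentence root)
PARSE: Dict[int, Tuple[str, str, Optional[int]]] = {
    0: ("ProteinX", "N", 2),
    1: ("is", "Aux", 2),
    2: ("localized", "V", None),
    3: ("in", "Prep", 2),
    4: ("the", "Det", 5),
    5: ("nucleus", "N", 3),
}

# Placeholder semantic labels; a real system would look up WordNet synset offsets.
TOY_SYNSETS: Dict[str, str] = {"localized": "syn:localize", "nucleus": "syn:nucleus"}

Features = Tuple[str, str, Optional[str]]


def chain_to_top(node: int, parse) -> List[int]:
    """Token indices from `node` up to the sentence root, inclusive."""
    chain = [node]
    while parse[chain[-1]][2] is not None:
        chain.append(parse[chain[-1]][2])
    return chain


def path_tree(protein: int, location: int, parse) -> Tuple[int, List[int], List[int]]:
    """Steps 1-2: build the dependency path and find its root.

    The root is the lowest common ancestor of the two tokens; each subtree
    lists the nodes strictly below the root, ordered from the root downwards.
    """
    up_p = chain_to_top(protein, parse)
    up_l = chain_to_top(location, parse)
    on_location_chain = set(up_l)
    root = next(n for n in up_p if n in on_location_chain)
    protein_sub = up_p[: up_p.index(root)][::-1]
    location_sub = up_l[: up_l.index(root)][::-1]
    return root, protein_sub, location_sub


def node_features(node: int, parse, synset_of: Callable[[str], Optional[str]]) -> Features:
    """(category, direction, semantic label) for one node.

    Direction is assumed to mean: does the node precede ("left") or follow
    ("right") its head in the sentence; the sentence root gets "root".
    """
    word, category, head = parse[node]
    direction = "root" if head is None else ("left" if node < head else "right")
    return category, direction, synset_of(word)


def syn_semantic_pattern(protein: int, location: int, parse, synset_of):
    """Step 3: root features plus the feature sequences of both subtrees."""
    root, protein_sub, location_sub = path_tree(protein, location, parse)
    feats = lambda n: node_features(n, parse, synset_of)
    return feats(root), tuple(map(feats, protein_sub)), tuple(map(feats, location_sub))


pattern = syn_semantic_pattern(0, 5, PARSE, TOY_SYNSETS.get)
learned_patterns = {pattern}            # patterns collected from training pairs
is_match = pattern in learned_patterns  # exact matching is a simplifying assumption
```

In this sketch a candidate (protein, location) pair from a test sentence would be accepted when its pattern appears among those collected from training data. Exact set membership is only the simplest possible matching rule, and the abstract does not say how the paper compares patterns.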

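Recognizing the subcellular localization of a protein is essential for describing biological phenomena. As the volume of biological literature has grown, many studies have sought to obtain protein subcellular localization information from document text. Earlier papers tried to obtain this information from the syntactic information of sentences, and they argue that linguistic information is useful for recognizing protein subcellular localization. However, previous systems used only partial parsers to obtain syntactic information, and their recall was poor. A full syntactic parser is therefore needed to obtain protein subcellular localization information. Semantic information can also be used to supply further linguistic information. To improve performance in recognizing protein subcellular localization information, this paper proposes a method based on a full syntactic parser and WordNet. In the first step, a syntactic dependency path is constructed from each protein word to its location candidate. In the second step, the root information of the syntactic dependency path is extracted. Finally, syn-semantic patterns of the protein subtree and the location subtree are extracted. From the root and subtrees of the dependency path, the syntactic tag and syntactic direction are extracted as syntactic information, and the semantic tag of each node word is extracted as semantic information. WordNet synonym sets (synsets) are used as semantic tags. Following the root information and subtree syn-semantic patterns extracted from the training data, (protein, location) pairs were extracted from the test data. Without any biological knowledge, the method achieved an F-measure (harmonic mean) of 74.53% on the training data and 58.90% on the test data in experiments using Medline abstracts. This is an improvement of 12-25% over previous methods.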
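For readers unfamiliar with the metric, the F-measure quoted in the abstract is the harmonic mean of precision and recall. The sketch below assumes the balanced form (equal weight on precision and recall), which the abstract does not state explicitly. The input values are illustrative only and are not the paper's precision or recall, which the abstract does not report.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was extracted correctly
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only (not taken from the paper).
print(round(f_measure(0.5, 1.0), 4))  # → 0.6667
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a system with high precision but low recall still scores low.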
