Probabilistic filtering for a biological knowledge discovery system with text mining and automatic inference

  • Received : 2011.11.08
  • Accepted : 2011.11.22
  • Published : 2012.02.29

Abstract

In this paper, we discuss the structure of a biological knowledge discovery system based on text mining and automatic inference. Given a set of biology documents, the system produces a new hypothesis in an integrated manner. The text mining module first extracts 'event' information of predefined types from the documents. The inference module then produces a new hypothesis from the extracted results. Such an integrated system can draw on more up-to-date and more diverse information than other automatic knowledge discovery systems. For such a system to succeed, however, the precision of the text mining module is crucial, because any hypothesis built on even a single false positive is very likely to be erroneous. We therefore propose a probabilistic filtering method that removes false positives from the extraction results. The proposed method outperforms an occurrence-based baseline method.

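This paper addresses a system with an integrated text mining and inference architecture: it automatically extracts molecular-level event information from the biological literature through text mining and automatically infers new biological knowledge from those events. Compared with systems that take as input only information already extracted and registered in databases, a knowledge discovery system with this integrated architecture can use the latest information sooner and can use diverse information beyond predefined formats. On the other hand, because it uses the text mining extraction results as they are, the usefulness of the whole system may degrade considerably depending on the performance of the text mining module. We propose a probabilistic filtering method that effectively removes false positives from the text mining results, with the aim of improving the accuracy and usefulness of the overall knowledge discovery system. The proposed probabilistic filtering method performed better than the occurrence-count-based filtering method used as the baseline.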
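The contrast between occurrence-based and probabilistic filtering can be illustrated with a minimal sketch. The abstract does not specify the probabilistic model, so everything here is an assumption made for illustration: each extracted mention is assumed to carry a confidence score, mentions are assumed independent and combined with a noisy-OR rule, and the event strings, scores, function names, and thresholds are all hypothetical. It is not the method of the paper.

```python
from collections import defaultdict

# Hypothetical extracted mentions: (event, extractor confidence in [0, 1]).
# Events and scores are invented for illustration only.
mentions = [
    ("A phosphorylates B", 0.9),
    ("A phosphorylates B", 0.8),
    ("C binds D", 0.3),
    ("C binds D", 0.2),
    ("C binds D", 0.3),
    ("E regulates F", 0.95),
]


def occurrence_filter(mentions, min_count=2):
    """Baseline: keep events that are mentioned at least min_count times."""
    counts = defaultdict(int)
    for event, _ in mentions:
        counts[event] += 1
    return {event for event, count in counts.items() if count >= min_count}


def probabilistic_filter(mentions, threshold=0.8):
    """Keep events whose noisy-OR probability reaches the threshold.

    Assumes mentions are independent, so the probability that every
    mention of an event is wrong is the product of (1 - confidence).
    """
    p_all_wrong = defaultdict(lambda: 1.0)
    for event, confidence in mentions:
        p_all_wrong[event] *= 1.0 - confidence
    return {event for event, p in p_all_wrong.items() if 1.0 - p >= threshold}


print(sorted(occurrence_filter(mentions)))
print(sorted(probabilistic_filter(mentions)))
```

On this toy input the count-based baseline keeps the frequently mentioned but weakly supported "C binds D" and drops the single high-confidence "E regulates F", while the noisy-OR filter does the opposite. That is the kind of difference a probability-based filter is meant to capture, under the assumptions stated above.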
