Protein Named Entity Identification Based on Probabilistic Features Derived from GENIA Corpus and Medical Text on the Web
Sumathipala, Sagara; Yamada, Koichi; Unehara, Muneyuki; Suzuki, Izumi
Protein named entity identification is an essential first step in extracting information about protein-protein interactions from biomedical literature. In this paper, we explore the use of MEDLINE abstracts for protein name identification and report the results of our experiments. We present a robust and effective approach that classifies biomedical named entities into protein and non-protein classes using a rich set of features: orthographic, keyword, and morphological features, together with the newly introduced Protein-Score features. In experiments on the GENIA corpus using Random Forest, the approach achieves its best results of 92.7% precision, 91.7% recall, and 92.2% F-measure for protein identification, while significantly reducing training and testing time.
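To make the feature types and the reported scores concrete, the following is a minimal sketch and not the authors' implementation. The function names, the particular orthographic cues, and the short Greek-letter list are assumptions chosen for illustration; the abstract does not specify the actual feature definitions. The F-measure function uses the standard harmonic mean of precision and recall, which is consistent with the figures reported above.

```python
import re


def orthographic_features(entity: str) -> dict:
    """Hypothetical surface-form cues for a candidate named entity.

    These cues are illustrative assumptions, not the paper's feature set.
    """
    tokens = entity.split()
    return {
        "has_digit": any(ch.isdigit() for ch in entity),
        "has_hyphen": "-" in entity,
        "has_uppercase": any(ch.isupper() for ch in entity),
        # Spelled-out Greek letters often appear in protein names (assumed cue).
        "has_greek_word": bool(
            re.search(r"\b(alpha|beta|gamma|kappa)\b", entity, re.IGNORECASE)
        ),
        "num_tokens": len(tokens),
    }


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(orthographic_features("NF-kappa B"))
    # Precision and recall as reported in the abstract.
    print(round(100 * f_measure(0.927, 0.917), 1))
```

In a full pipeline, dictionaries like the one returned here would be turned into numeric vectors and passed to a Random Forest classifier together with the keyword, morphological, and Protein-Score features.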
biomedical text mining; named entity recognition; protein named entity; random forest