Reference String Recognition based on Word Sequence Tagging and Post-processing: Evaluation with English and German Datasets

  • Kang, In-Su (Dept. of Computer Science, Kyungsung University)
  • Received : 2018.03.28
  • Accepted : 2018.05.16
  • Published : 2018.05.31

Abstract

Reference string recognition is the task of extracting individual reference strings from the reference section of an academic article, where the section is given as a sequence of reference lines. The task has been addressed with heuristic-based, clustering-based, and classification-based approaches that exploit lexical and layout characteristics of reference lines. Most classification-based methods use sequence labeling, assigning labels either to the sequence of tokens within reference lines or to the sequence of reference lines themselves. Unlike the previous token-level sequence labeling approach, this study assigns different labels to the beginning, intermediate, and terminating tokens of a reference string. Post-processing is then applied to identify reference strings from the predicted beginning and/or terminating tokens. Experimental evaluation on English and German reference string recognition datasets shows that the proposed method achieves a macro-averaged F1 above 94%.
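The post-processing idea can be illustrated with a minimal sketch. It assumes the tagger has already produced one label per token, with hypothetical label names "B" (beginning), "I" (intermediate), and "E" (terminating), and it uses a simple boundary rule chosen for illustration: a new reference string starts at any "B" token or right after any "E" token. The function name, label names, and rule are assumptions and are not taken from the paper.

```python
from typing import List, Optional


def split_references(tokens: List[str], labels: List[str]) -> List[str]:
    """Group tokens into reference strings using B/I/E labels.

    Assumed rule (illustrative only): a new string starts at a token
    labeled "B", or at any token that follows one labeled "E". Either
    cue alone is enough, so one missed boundary label can be recovered
    from the other.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    references: List[str] = []
    current: List[str] = []
    prev_label: Optional[str] = None

    for token, label in zip(tokens, labels):
        starts_new = label == "B" or prev_label == "E"
        if starts_new and current:
            # Close the string collected so far before opening a new one.
            references.append(" ".join(current))
            current = []
        current.append(token)
        prev_label = label

    if current:
        references.append(" ".join(current))
    return references


# Hypothetical tagger output in which the second "B" was missed;
# the preceding "E" still marks the boundary.
tokens = ["Kan,", "M.", "2008.", "Kern,", "R.", "2013."]
labels = ["B", "I", "E", "I", "I", "E"]
print(split_references(tokens, labels))
```

Using both the beginning and terminating cues makes the segmentation tolerant of a single mislabeled boundary token, which is one plausible reading of identifying strings from their "beginning and/or terminating tokens".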
