Automatic Extraction of References for Research Reports using Deep Learning Language Model

  • Received: 2023.05.15
  • Accepted: 2023.06.10
  • Published: 2023.06.30

Abstract


The purpose of this study is to assess how effectively deep learning language models can extract references automatically, so that reference databases for research reports can be built efficiently. Unlike academic journals, research reports vary in format across institutions, which makes automatic reference extraction difficult. We addressed this issue by adding a task that separates references from non-reference phrases to the metadata extraction task commonly used for reference extraction. The study employed datasets containing several types of references: references from the research reports of a particular institution, academic journal references, and a combined set of academic journal references and non-reference text. Two deep learning language models, RoBERTa+CRF and ChatGPT, were trained and compared on metadata extraction, data type categorization, and separation of references from the surrounding text. Both tasks yielded strong results, with maximum F1-scores of 95.41% for metadata extraction and 98.91% for data type categorization and text separation. These findings offer guidance on choosing deep learning language models and dataset types when constructing reference databases for research reports that contain both reference and non-reference text.
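To make the tagging setup concrete, the sketch below shows one plausible form of the RoBERTa+CRF model described in the abstract: each token of a reference string is assigned a metadata label, and a CRF layer decodes the most likely label sequence. This is a minimal illustration rather than the authors' code; the klue/roberta-base checkpoint, the label set, and the pytorch-crf package are assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation) of a RoBERTa encoder with a
# CRF decoding layer for reference metadata tagging. Labels and checkpoint are
# illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from torchcrf import CRF  # pip install pytorch-crf

LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-TITLE", "I-TITLE", "B-YEAR", "I-YEAR"]

class RobertaCrfTagger(nn.Module):
    def __init__(self, model_name="klue/roberta-base", num_labels=len(LABELS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.emit = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(hidden)          # (batch, seq_len, num_labels)
        mask = attention_mask.bool()
        if labels is not None:
            # Training objective: negative log-likelihood of the gold tag sequence
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Inference: Viterbi decoding of the best tag sequence per reference string
        return self.crf.decode(emissions, mask=mask)

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
model = RobertaCrfTagger().eval()
ref = "Lopez, P. (2009). GROBID: Combining automatic bibliographic data recognition..."
batch = tokenizer(ref, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    tags = model(batch["input_ids"], batch["attention_mask"])
print([LABELS[t] for t in tags[0]])  # untrained weights, so labels are arbitrary here
```

In the ChatGPT comparison described in the abstract, the same task would instead be posed as a prompt asking the model to return the labeled fields of each reference string, rather than training a dedicated tagger.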

Keywords

Funding

This study was conducted with support from the Korea Information Society Development Institute (KISDI)'s 2023 Information Resources Operation Project.

References

  1. Ji, Seon-yeong & Choi, Sung-pil (2021). A study on recognition of citation metadata using bidirectional GRU-CRF model based on pre-trained language model. Journal of the Korean Society for Information Management, 38(1), 221-242. https://doi.org/10.3743/KOSIM.2021.38.1.221
  2. Lee, Kangsandajeong, Lee, Hyejin, & Hyun, Mihwan (2022). A study on national R&D report reference technological improvement. Journal of the Korea Convergence Society, 13(1), 31-42. https://doi.org/10.15207/JKCS.2022.13.01.031
  3. Besagni, D., Belaid, A., & Benet, N. (2003). A segmentation method for bibliographic references by contextual tagging of fields. Seventh International Conference on Document Analysis and Recognition, 384-388. https://doi.org/10.1109/ICDAR.2003.1227694
  4. Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359-377. https://doi.org/10.1002/asi.20317
  5. Choi, W., Yoon, H. M., Hyun, M. H., Lee, H. J., Seol, J. W., Lee, K. D., Yoon, Y. J., & Kong, H. (2023). Building an annotated corpus for automatic metadata extraction from multilingual journal article references. PloS one, 18(1), e0280637. https://doi.org/10.1371/journal.pone.0280637
  6. Councill, I., Giles, C., & Kan, M. (2008). ParsCit: an open-source CRF reference string parsing package. LREC, 8, 661-667.
  7. Dai, Z., Wang, X., Ni, P., Li, Y., Li, G., & Bai, X. (2019). Named entity recognition using BERT BiLSTM CRF for Chinese electronic health records. 2019 12th international congress on image and signal processing, biomedical engineering and informatics, 1-5. https://doi.org/10.1109/CISP-BMEI48845.2019.8965823
  8. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. https://doi.org/10.48550/arXiv.1810.04805
  9. Fritzler, A., Logacheva, V., & Kretov, M. (2019). Few-shot classification in named entity recognition task. Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, 993-1000. https://doi.org/10.1145/3297280.3297378
  10. Gonzalez-Gallardo, C., Boros, E., Girdhar, N., Hamdi, A., Moreno, J., & Doucet, A. (2023). Yes but.. Can ChatGPT identify entities in historical documents? https://doi.org/10.48550/arXiv.2303.17322
  11. Hetzner, E. (2008). A simple method for citation metadata extraction using hidden markov models. Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries, 280-284. https://doi.org/10.1145/1378889.1378937
  12. Hollingsworth, B., Lewin, I., & Tidhar, D. (2005). Retrieving hierarchical text structure from typeset scientific articles: a prerequisite for e-science text mining. Proc. of the 4th UK E-Science All Hands Meeting, 67-273.
  13. Hu, Y., Ameer, I., Zuo, X., Peng, X., Zhou, Y., Li, Z., Li, Y., Li, J., Jiang, X., & Xu, H. (2023). Zero-shot Clinical Entity Recognition using ChatGPT. https://doi.org/10.48550/arXiv.2303.16416
  14. Huang, I., Ho, J., Kao, H., & Lin, W. (2004). Extracting citation metadata from online publication lists using BLAST. Advances in Knowledge Discovery and Data Mining: 8th Pacific-Asia Conference, 539-548. https://doi.org/10.1007/978-3-540-24775-3_64
  15. Kim, J., Choi, N., Lim, S., Kim, J., Chung, S., Woo, H., Song, M., & Choi, J. D. (2021). Analysis of Zero-Shot Crosslingual Learning between English and Korean for Named Entity Recognition. Proceedings of the 1st Workshop on Multilingual Representation Learning, 224-237. https://doi.org/10.18653/v1/2021.mrl-1.19
  16. Korea Institute of Science and Technology Information (2022). DeepData-REFMETA Version 1.0. http://doi.org/10.23057/47
  17. Lauscher, A., Ravishankar, V., Vulic, I., & Glavas, G. (2020). From zero to hero: on the limitations of zero-shot cross-lingual transfer with multilingual transformers. https://doi.org/10.48550/arXiv.2005.00633
  18. Liu, X., Chen, H., & Xia, W. (2022). Overview of named entity recognition. Journal of Contemporary Educational Research, 6(5), 65-68. https://doi.org/10.26689/jcer.v6i5.3958
  19. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. https://doi.org/10.48550/arXiv.1907.11692
  20. Lopez, P. (2009). GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. Research and Advanced Technology for Digital Libraries: 13th European Conference, 473-474. https://doi.org/10.1007/978-3-642-04346-8_62
  21. OpenAI (2022). Introducing ChatGPT. Available: https://openai.com/blog/chatgpt/
  22. Park, S., Moon, J., Kim, S., Cho, W. I., Han, J., Park, J., Song, C., Kim, J., Song, Y., Oh, T., Lee, J., Oh, J., Lyu, S., Jeong, Y., Lee, I., Seo, S., Lee, D., Kim, H., Lee, M., Jang, S., Do, S., Kim, S., Lim, K., Lee, J., Park, K., Shin, J., Kim, S., Park, L., Oh, A., Ha, J., & Cho, K. (2021). KLUE: Korean Language Understanding Evaluation. https://doi.org/10.48550/arXiv.2105.09680
  23. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
  24. Rodrigues, A. D., Colavizza, G., & Kaplan, F. (2018). Deep reference mining from scholarly literature in the arts and humanities. Frontiers in Research Metrics and Analytics, 21. https://doi.org/10.3389/frma.2018.00021
  25. Segura-Bedmar, I., Martinez Fernandez, P., & Herrero-Zazo, M. (2013). SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics, 341-350.
  26. Souza, F., Nogueira, R., & Lotufo, R. (2019). Portuguese named entity recognition using BERT-CRF. https://doi.org/10.48550/arXiv.1909.10649
  27. Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P. J., & Bolikowski, L. (2015). CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition, 18, 317-335. https://doi.org/10.1007/s10032-015-0249-8
  28. Van Eck, N. & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523-538. https://doi.org/10.1007/s11192-009-0146-3
  29. Voskuil, K. & Verberne, S. (2021). Improving reference mining in patents with BERT. https://doi.org/10.48550/arXiv.2101.01039
  30. Wang, S., Sun, X., Li, X., Ouyang, R., Wu, F., Zhang, T., Li, J., & Wang, G. (2023). GPT-NER: Named Entity Recognition via Large Language Models. https://doi.org/10.48550/arXiv.2304.10428
  31. Wei, X., Cui, X., Cheng, N., Wang, X., Zhang, X., Huang, S., Xie, P., Xu, J., Chen, Y., Zhang, M., Jiang, Y., & Han, W. (2023). Zero-shot information extraction via chatting with ChatGPT. https://doi.org/10.48550/arXiv.2302.10205
  32. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., & Schmidt, D. (2023). A prompt pattern catalog to enhance prompt engineering with ChatGPT. https://doi.org/10.48550/arXiv.2302.11382
  33. Wu, Y., Huang, J., Xu, C., Zheng, H., Zhang, L., & Wan, J. (2021). Research on named entity recognition of electronic medical records based on RoBERTa and radical-level feature. Wireless Communications and Mobile Computing, 2021, 1-10. https://doi.org/10.1155/2021/2489754
  34. Yang, Y. & Katiyar, A. (2020). Simple and effective few-shot named entity recognition with structured nearest neighbor learning. https://doi.org/10.48550/arXiv.2010.02405
  35. Zhang, X., Zou, J., Le, D. X., & Thoma, G. R. (2011). A structural SVM approach for reference parsing. BMC bioinformatics, 12, 1-7. https://doi.org/10.1186/1471-2105-12-S3-S7