On the Analysis of Natural Language Processing Morphology for the Specialized Corpus in the Railway Domain

  • Won, Jong Un (Artificial Intelligence Railroad Research Department, Korea Railroad Research Institute)
  • Jeon, Hong Kyu (Artificial Intelligence Railroad Research Department, Korea Railroad Research Institute)
  • Kim, Min Joong (Department of Systems Engineering, Ajou University)
  • Kim, Beak Hyun (Artificial Intelligence Railroad Research Department, Korea Railroad Research Institute)
  • Kim, Young Min (Department of Systems Engineering, Ajou University)
  • Received : 2022.10.08
  • Accepted : 2022.10.12
  • Published : 2022.11.30

Abstract

Today, we are exposed to various text-based media such as newspapers, Internet articles, and social networking services (SNS), and the amount of text data we encounter has grown exponentially with the recent spread of Internet access on mobile devices such as smartphones. Extracting useful information from large volumes of text is called text analysis, and with the recent development of artificial intelligence it is performed using technologies such as Natural Language Processing (NLP). For this purpose, morpheme analyzers trained on everyday language have been publicly released and are in wide use. Pre-trained language models, which acquire natural language knowledge through unsupervised learning on large corpora, have recently become a standard component of natural language processing, but conventional morpheme analyzers are of limited use in specialized fields. In this paper, as preliminary work toward developing a natural language analysis model specialized for the railway field, we present a procedure for constructing a railway-domain corpus.
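To make the limitation concrete, the sketch below (not taken from the paper) uses the publicly released KoNLPy toolkit with its Komoran tagger: an off-the-shelf, everyday-language analyzer tends to over-segment railway terms such as 전차선로 ("catenary line"), whereas registering such terms in a user dictionary preserves them as single units. The dictionary file name and the example sentence are illustrative assumptions.

# A minimal sketch, assuming KoNLPy (and its Java dependency) is installed.
# The user-dictionary file and the example sentence are hypothetical.
from konlpy.tag import Komoran

# "Train operation was delayed due to a power outage on the catenary line."
sentence = "전차선로 단전으로 열차 운행이 지연되었다."

# Off-the-shelf analyzer: specialized terms such as 전차선로 may be split
# into unrelated everyday-language morphemes.
general = Komoran()
print(general.pos(sentence))

# With a user dictionary (one tab-separated "term<TAB>tag" entry per line,
# e.g. "전차선로\tNNP"), the domain vocabulary survives segmentation intact.
domain = Komoran(userdic="railway_userdic.txt")  # hypothetical file
print(domain.pos(sentence))

A curated railway-domain corpus is what makes such a term list, and ultimately domain-adaptive pre-training, feasible; constructing that corpus is the preliminary work this paper describes.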

Acknowledgement

This study was supported by a grant from the Korea Railroad Research Institute's major project, "Development of an artificial intelligence support platform for the development of intelligent railway and transportation technologies" (PK2201C1).
