Design and Implementation of Event-driven Real-time Web Crawler to Maintain Reliability

  • Received : 2022.01.03
  • Accepted : 2022.04.20
  • Published : 2022.04.28

Abstract

Real-time systems built on web-crawled data must serve users database records that are identical to the data at the remote source. To achieve this, a web crawler repeatedly sends HTTP (HyperText Transfer Protocol) requests to the remote server to check whether the remote data has changed. This repeated polling places network load on both the crawling server and the remote server and causes problems such as excessive traffic. To address this, we propose a real-time web crawling technique driven by user events. It reduces network overload while preserving reliability, meaning that the crawling server's data stays identical to the data at multiple remote sources. The proposed method runs the crawling process in response to events that request unit data and list data. Experimental results show that the proposed method reduces the network traffic overhead of existing web crawlers while maintaining data reliability. Future work should study how to combine event-based crawling with time-based crawling.

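The event-driven idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `EventDrivenCrawler`, the handlers `on_unit_request` and `on_list_request`, and the callables `fetch_unit` and `fetch_list` are all hypothetical, and a plain dictionary stands in for the remote server. The sketch only assumes what the abstract states, namely that crawling is triggered by user events requesting unit data or list data rather than by a timer.

```python
from typing import Callable, Dict, List


class EventDrivenCrawler:
    """Contacts the remote source only when a user event asks for data."""

    def __init__(
        self,
        fetch_unit: Callable[[str], str],
        fetch_list: Callable[[], List[str]],
    ) -> None:
        # fetch_unit / fetch_list are hypothetical stand-ins for HTTP requests.
        self._fetch_unit = fetch_unit
        self._fetch_list = fetch_list
        self.store: Dict[str, str] = {}  # local copy kept on the crawling server
        self.remote_requests = 0         # counts requests sent to the remote

    def on_unit_request(self, key: str) -> str:
        """User asked for one item: re-crawl it and sync the local copy."""
        self.remote_requests += 1
        fresh = self._fetch_unit(key)
        if self.store.get(key) != fresh:
            self.store[key] = fresh
        return fresh

    def on_list_request(self) -> List[str]:
        """User asked for the listing: re-crawl it and drop vanished items."""
        self.remote_requests += 1
        keys = self._fetch_list()
        for stale in set(self.store) - set(keys):
            del self.store[stale]
        return keys


# Simulated remote site (assumption: a dict stands in for the remote server).
remote = {"post-1": "v1", "post-2": "v1"}
crawler = EventDrivenCrawler(lambda k: remote[k], lambda: sorted(remote))

crawler.on_unit_request("post-1")  # first user event: one remote request
remote["post-1"] = "v2"            # remote changes while nobody is asking
crawler.on_unit_request("post-1")  # next user event picks up the change
```

In this sketch no request is sent between user events, which is where a traffic saving over periodic polling would come from. Items that no user requests are never refreshed, which is one reason a combination with time-based crawling may be worth studying.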

Acknowledgement

This research was supported by the Academic Research Fund of Hoseo University in 2021 (No. 20210434).
