Techniques to Guarantee Real-Time Fault Recovery in Spark Streaming Based Cloud System

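  • 김정호 (Department of Transdisciplinary Studies, Seoul National University) ;
  • 박대동 (Department of Electrical and Computer Engineering, Seoul National University) ;
  • 김상욱 (Department of Electrical and Computer Engineering, Seoul National University) ;
  • 문용식 (Department of Electrical and Computer Engineering, Seoul National University) ;
  • 홍성수 (Department of Electrical and Computer Engineering, Seoul National University)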
  • Received : 2016.11.07
  • Accepted : 2017.01.13
  • Published : 2017.05.15

Abstract

In a real-time cloud environment, the data analysis framework plays a pivotal role. Among existing frameworks, Spark Streaming satisfies the most real-time requirements. However, it does not meet the requirement for second-scale real-time fault recovery. Spark Streaming recovers the last state data before a fault by recomputing it from the cumulative transformation history, called the lineage, that is recorded during normal operation. Fault recovery time therefore increases in proportion to the lineage length and is not guaranteed to stay within a bounded time. In addition, second-scale fault recovery cannot be achieved because reading the initial state data from fault-tolerant storage takes tens of seconds. In this paper, we propose two techniques to solve these problems. We apply the proposed techniques to Spark Streaming 1.6.2. Experimental results show that the fault recovery time is bounded and that the average fault recovery time is reduced by up to 41.57%.
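The relationship between lineage length and recovery time described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions: the function names, the cost parameters, and the idea of truncating the lineage at periodic checkpoints are hypothetical illustrations, and they are not the two techniques proposed in the paper, which the abstract does not detail.

```python
from typing import Optional


def replay_steps(batches_since_start: int,
                 checkpoint_interval: Optional[int] = None) -> int:
    """Number of transformations to recompute after a fault (toy model).

    Without checkpoints the whole lineage is replayed, so the count grows
    with the number of processed batches. With a checkpoint every
    `checkpoint_interval` batches (an assumption of this sketch), only the
    batches since the last checkpoint are replayed.
    """
    if checkpoint_interval is None:
        return batches_since_start
    return batches_since_start % checkpoint_interval


def recovery_time(batches_since_start: int,
                  read_cost: float,
                  step_cost: float,
                  checkpoint_interval: Optional[int] = None) -> float:
    """Recovery time = cost of reading saved state + cost of replaying lineage.

    `read_cost` and `step_cost` are hypothetical values in seconds,
    not measurements from the paper.
    """
    steps = replay_steps(batches_since_start, checkpoint_interval)
    return read_cost + step_cost * steps


if __name__ == "__main__":
    for n in (10, 100, 1000):
        unbounded = recovery_time(n, read_cost=20, step_cost=2)
        bounded = recovery_time(n, read_cost=20, step_cost=2,
                                checkpoint_interval=8)
        print(n, unbounded, bounded)
```

In this model the unbounded case grows linearly with the number of batches, while the checkpointed case never replays more than `checkpoint_interval - 1` steps. The fixed `read_cost` term remains in both cases, which mirrors the abstract's second observation that reading state from fault-tolerant storage alone can prevent second-scale recovery.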

Acknowledgement

Supported by : Institute for Information & communications Technology Promotion (IITP)
