New Z-Cycle Detection Algorithm Using Communication Pattern Transformation for the Minimum Number of Forced Checkpoints

A Checkpoint Z-Cycle Detection Technique That Reduces the Number of Checkpoints Taken by Using Communication Pattern Transformation

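  • Namyoon Woo (School of Computer Engineering, Seoul National University)
  • Heon Young Yeom (School of Computer Engineering, Seoul National University)
  • Taesoon Park (School of Computer Engineering, Sejong University)
  • Published: 2004.12.01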

Abstract

Communication-induced checkpointing (CIC) is a checkpointing technique that provides fault tolerance for distributed systems. Independent checkpoints, which each distributed process takes without coordination, are likely to be useless. Useless checkpoints, which cannot belong to any consistent global checkpoint, can cause cascading rollbacks. To prevent useless checkpoints, CIC forces processes to take additional checkpoints at appropriate moments. The number of these forced checkpoints is the main source of failure-free overhead in CIC. In this paper, we present two new CIC protocols that satisfy the 'No Z-Cycle' (NZC) property. The proposed protocols take fewer forced checkpoints than the existing protocols, at the cost of increased message delay. Our simulation results with synthetic data show that the proposed protocols have lower failure-free overhead than the existing protocols. Additionally, we show that the classical 'index-based checkpointing' protocols are inefficient at constructing a consistent global cut in distributed executions.

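Communication-induced checkpointing is one of the checkpointing techniques that provide fault tolerance for distributed processes. Local checkpoints that each process takes independently, without synchronization, may become useless checkpoints that violate consistency, and they trigger cascading rollbacks of processes. To prevent this, communication-induced checkpointing takes additional forced checkpoints. Because the number of forced checkpoints translates directly into overhead on overall system performance, reducing it is important. This paper proposes two communication-induced checkpointing protocols that satisfy the "no Z-cycle" condition and shows through simulation results that the proposed algorithms incur less overhead than the existing algorithms. In addition, it shows that the existing index-based communication-induced checkpointing protocols are inefficient at finding a consistent global recovery point (consistent global cut).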
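
The link between useless checkpoints and Z-cycles can be made concrete with a small offline check. The sketch below is not one of the protocols proposed in the paper. It only illustrates the standard zigzag-path definition: a checkpoint lies on a Z-cycle if a chain of messages starts after it, ends before it on the same process, and each message in the chain is sent in the same or a later checkpoint interval than the one in which the previous message was received. The `Message` tuple, the function name `on_z_cycle`, and the interval numbering (interval k of a process runs from its checkpoint k to its checkpoint k+1) are assumptions made for this illustration.

```python
from collections import namedtuple

# Hypothetical trace record: which process sent/received the message and in
# which checkpoint interval each event happened (interval k follows checkpoint k).
Message = namedtuple("Message", "sender send_interval receiver recv_interval")


def on_z_cycle(messages, proc, ckpt):
    """Return True if checkpoint `ckpt` of process `proc` lies on a Z-cycle."""
    # A zigzag path must start with a message sent by `proc` after the checkpoint.
    frontier = [m for m in messages
                if m.sender == proc and m.send_interval >= ckpt]
    seen = set(frontier)
    while frontier:
        m = frontier.pop()
        # The path closes into a cycle if it is received by `proc` before the checkpoint.
        if m.receiver == proc and m.recv_interval < ckpt:
            return True
        # Extend the path: the next message is sent by the receiver of `m`
        # in the same or a later interval (it may be sent before `m` arrives).
        for nxt in messages:
            if (nxt not in seen and nxt.sender == m.receiver
                    and nxt.send_interval >= m.recv_interval):
                seen.add(nxt)
                frontier.append(nxt)
    return False


# P2 sends m2 to P1, P1 takes checkpoint 1, then P1 sends m1 to P2.
# P2 receives m1 in the same interval in which it sent m2.
unsafe = [Message(1, 1, 2, 0), Message(2, 0, 1, 0)]
# Same exchange, but P2 takes a forced checkpoint before delivering m1,
# so m1 is received in interval 1 while m2 was sent in interval 0.
safe = [Message(1, 1, 2, 1), Message(2, 0, 1, 0)]

print(on_z_cycle(unsafe, 1, 1))  # True: checkpoint 1 of P1 is useless
print(on_z_cycle(safe, 1, 1))    # False: the forced checkpoint breaks the cycle
```

The second trace shows the role of a forced checkpoint: by checkpointing before it delivers the incoming message, the receiver separates the send and the receive into different intervals, and the zigzag path can no longer be formed.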
